I recently spent an unhealthy number of days troubleshooting performance issues between remote data centers. Good thing I did, too, as I got a friendly reminder about TCP and how latency drives throughput.
We were seeing seemingly inconsistent network issues: some applications and file transfers were slow, some were fast, and some appeared to be slow in only one direction. Jobs that used to run in minutes now took hours. Packet captures showed possible signs of packet loss (DUP ACKs, etc.). Needless to say, we had some troubleshooting ahead of us to either locate the source of packet loss or rule out the network.
First, we verified basic health of the network:
- Validate no congestion on links between endpoints
- Validate interface counters – look for errors, CRCs, drops, etc.
- Validate interface configurations – speed, duplex, MTU, etc.
- Validate QoS stats – Are you seeing an unusually high number of drops in a particular queue?
- Validate network device system resources – CPU, Memory, etc.
- Validate the control plane – Are packets getting punted to the router CPU?
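On a Linux endpoint, a few of the host-side checks above can be scripted. Here's a minimal sketch using iproute2 (router and switch platforms expose the same counters through their own, platform-specific CLI):

```shell
# Interface health at a glance (Linux / iproute2).
ip -s link show     # per-interface RX/TX errors, drops, and overruns
ip link show        # MTU and link state per interface
tc -s qdisc show    # per-qdisc (queue) statistics, including drops
# Speed and duplex are typically checked with: ethtool <interface>
```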
All good so far, so why are we seeing the slowness with file transfers? Throughput on some transfers is as low as 900 KBps. With a 1Gbps link between sites and only 18ms of latency (round-trip time / RTT), we should have no issue with throughput!
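As a sanity check on that claim, the bandwidth-delay product (BDP) tells us how much unacknowledged data must be in flight to keep the link full. A minimal sketch of the arithmetic, using the link speed and RTT above:

```python
# Bandwidth-delay product: the number of bytes that must be "in flight"
# (sent but not yet acknowledged) to keep a link fully utilized.
link_bps = 1_000_000_000   # 1 Gbps link between sites
rtt_s = 0.018              # 18 ms round-trip time

bdp_bytes = link_bps / 8 * rtt_s
print(f"BDP: {bdp_bytes:,.0f} bytes")
# → BDP: 2,250,000 bytes
```

So a single transfer needs roughly a 2.25 MB receive window to fill this link at this RTT, far more than the 65,535-byte maximum an unscaled 16-bit TCP window field can advertise.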
Looking at the packet captures, we learned the TCP window was being advertised at a very small size of 17,520 bytes, and was not scaling.
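Dividing that advertised window by the RTT gives a hard ceiling on a single stream's throughput; a quick sketch with the numbers from the capture:

```python
# TCP can have at most one window of data in flight per round trip,
# so window size / RTT bounds single-stream throughput.
window_bytes = 17_520   # advertised window seen in the capture
rtt_s = 0.018           # 18 ms round-trip time

ceiling_bps = window_bytes / rtt_s
print(f"Throughput ceiling: {ceiling_bps / 1000:.0f} KB/s")
# → Throughput ceiling: 973 KB/s
```

That ceiling lines up with the ~900 KBps we were observing, no matter how fast the underlying link is.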
This is a problem because of this very simple equation: