TCP zero windows

Yet another reason downloads can fail…

Issue :

A large file is being downloaded (e.g. an ISO larger than 500 MB). The download starts off fine but eventually stalls, leaving the file incomplete.

Cause (in this case) :

TCP zero windows sent by the client caused the server to reset the connection.

Troubleshooting :

In Wireshark, apply the following display filter:

(tcp.analysis.flags or tcp.flags.reset==1) and ip.addr==SERVER_IP

where SERVER_IP is the IP address of the server you are downloading from. The filter showed this:


Note how the client is sending the TCP ZeroWindow, and eventually the server sends a RESET packet. The solution in this case was to reduce the TCP window size on the client (effectively slowing down the connection slightly).
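How you reduce the window depends on the client OS, and I won't go into the exact setting here. If you control the client application, one per-socket option is to ask for a smaller receive buffer before connecting, since the advertised window cannot exceed the receive buffer. A minimal sketch, where the helper name and the 16 KB figure are my own assumptions:

```python
import socket


def make_small_window_socket(rcvbuf_bytes: int = 16 * 1024) -> socket.socket:
    """Create a TCP socket with a capped receive buffer.

    The buffer is set before connect() so that it is already in place
    when the window is first advertised during the handshake.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    return sock


if __name__ == "__main__":
    sock = make_small_window_socket()
    # The kernel may round or adjust the value, so read it back.
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    print("requested 16384 bytes, kernel reports", effective)
    sock.close()
```

The value the kernel reports can differ from the one requested, so it is worth reading it back rather than assuming it was applied as-is.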

Theory behind this :

The TCP window size is, in a nutshell, the amount of receive buffer space a host advertises. So, for example, if the window size is 65535, the sending host can send 65535 bytes of data to the receiver before it has to stop and wait for an acknowledgement. The larger the window size, the fewer times the sender has to stop and wait, which speeds things up.
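To put rough numbers on this: if the sender can have at most one window of data in flight per round trip, throughput is capped at about window / RTT. The sketch below just does that arithmetic. The 50 ms round-trip time is an assumed value for illustration, not something measured in this case.

```python
def max_throughput_bytes_per_sec(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when one window is sent per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes / rtt_seconds


if __name__ == "__main__":
    rtt = 0.050  # assumed round-trip time of 50 ms
    for window in (8192, 65535):
        rate = max_throughput_bytes_per_sec(window, rtt)
        print(f"window={window} bytes -> at most {rate * 8 / 1e6:.2f} Mbit/s")
```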

(FYI there is plenty of reading on TCP flow control, such as window scaling, selective acknowledgements and so on…)

Conversely, a smaller window size means the sender must stop more often to wait for an ACK. At the extreme, when the receiver sends a TCP ZeroWindow it is effectively telling the sender:

“My receive buffer is full… wait until I can clear it”

The server waits a while, then checks the window size again (that’s the TCP ZeroWindowProbe), and if the client still has a window size of 0 (that’s the TCP ZeroWindowProbeAck), then it has to wait some more before sending data.

If this goes on long enough, the server will reset the connection.
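The sequence above can be sketched as a toy simulation. This is not how a real TCP stack works internally (real stacks use timers with backoff, and when they give up is implementation-specific). The function name, the tick model and the max_probes limit are all assumptions made for illustration.

```python
def simulate(buffer_size, sender_rate, reads_per_tick, max_probes):
    """Toy model of a receive buffer and zero-window probing.

    reads_per_tick lists how many bytes the client application reads
    from its receive buffer on each tick. Returns (outcome, delivered),
    where outcome is "ok" or "reset".
    """
    buffered = 0    # bytes sitting in the receive buffer
    delivered = 0   # bytes the application has actually read
    probes = 0      # consecutive zero-window probes sent by the server
    for read in reads_per_tick:
        window = buffer_size - buffered  # what the client advertises
        if window == 0:
            probes += 1                  # server can only probe and wait
            if probes > max_probes:
                return "reset", delivered
        else:
            probes = 0
            buffered += min(window, sender_rate)
        drained = min(read, buffered)
        buffered -= drained
        delivered += drained
    return "ok", delivered


if __name__ == "__main__":
    # A reader that keeps up never advertises a zero window.
    print(simulate(4000, 1000, [1000] * 10, max_probes=5))
    # A reader that stalls after three ticks fills the buffer and gets reset.
    print(simulate(4000, 1000, [1000] * 3 + [0] * 20, max_probes=5))
```

In the second run the application stops reading, the buffer fills, the window drops to zero, and after the assumed probe limit the server gives up, which is the same pattern as in the capture.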


TIME_WAIT and “port reuse”

Lately during some support work, a customer raised an interesting case regarding what was referred to as “port reuse”. This led to quite a nice investigation into the effect of the MSL and TIME_WAIT characteristics of TCP. So first we should define these terms and what exactly they mean. Getting an exact definition can be more difficult than expected because the RFC states one thing, but allows vendors the flexibility to change their defaults for better performance. The best I could come up with is…

MSL: Maximum Segment Lifetime.

This is defined quite well in its Wikipedia entry:

“Maximum Segment Lifetime is the time a TCP segment can exist in the internetwork system. It is arbitrarily defined to be 2 minutes long”


TIME_WAIT:

This is a valid TCP state. Imagine the following scenario:

  1. Client opens a connection “A” to Server
  2. Normal TCP operation (3-way handshake, data transfer, ACKs and so on)
  3. Client and server terminate their connection “A” via the use of FIN packets
  4. Client opens another connection “B” to Server
  5. Normal TCP operation
  6. For whatever reason (for example network congestion, latency, or high CPU on intermediate nodes), a delayed TCP packet from connection “A” arrives at the server. Should the server accept this packet? Mark it as a duplicate? Deny it?
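The reason step 6 is a problem at all is that a TCP connection is identified only by its addresses and ports. A small sketch of that, assuming connection “B” happens to reuse exactly the same addresses and ports as “A” (the addresses, ports and names below are made up for illustration):

```python
from typing import NamedTuple


class FourTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class Segment(NamedTuple):
    conn: FourTuple
    seq: int
    payload: bytes


def classify(table: dict, segment: Segment) -> str:
    """Look up which connection the server would match a segment to."""
    return table.get(segment.conn, "no such connection")


if __name__ == "__main__":
    key = FourTuple("192.0.2.10", 50000, "198.51.100.5", 443)
    # Connection "A" has been closed; "B" reuses the same 4-tuple.
    server_table = {key: "B"}
    late_from_a = Segment(conn=key, seq=12345, payload=b"old data")
    print(classify(server_table, late_from_a))
```

The late segment from “A” is matched to “B”, because nothing in the lookup key distinguishes the two connections.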

In order to solve the above problem, we have the TIME_WAIT state. TCP requires that the endpoint that initiates an active close of the connection eventually enters TIME_WAIT. Usually TIME_WAIT = 2MSL. In other words:

If a client or server initiates an active close (using FIN packets), it then waits for 2MSL before allowing the same socket (i.e. the same IP addresses / TCP port numbers) to be used again.

You can intuitively see why this makes sense and why it should avoid the problem described in the steps above. If all TCP packets must “die” after time period “MSL”, then surely waiting for twice that amount of time would mean that no TCP packets from the old connection could possibly still exist on the network. I found a pretty good, if somewhat unclear, diagram over at the University of California webpage:

A couple of points you should notice. As I noted before, only the side that actively closes the TCP connection enters the TIME_WAIT state. You see this illustrated above in the arrow coming from the “ESTABLISHED” state labelled “appl:close”, which presumably means the application closes the connection, and this generates a FIN packet, shown as “send: FIN”. Which side this will be is highly dependent on the application that is being used. It can be either the server or the client that goes into TIME_WAIT.

All well and good. This solves the problem I illustrated before. But it introduces a new problem: starvation of resources. For example, on a terminal server with very low traffic we already see several TIME_WAIT connections, and keep in mind these connections cannot be used again for 2MSL.
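If you want to check this on your own machine, one rough way is to tally connection states from netstat output. The sketch below assumes the state is the last column of each TCP line, as in typical "netstat -an" output. The sample lines are made up for illustration and are not from the server mentioned above.

```python
from collections import Counter

# Made-up sample in the style of "netstat -an" output.
SAMPLE_NETSTAT = """\
tcp        0      0 192.0.2.10:3389     198.51.100.7:51512   ESTABLISHED
tcp        0      0 192.0.2.10:49822    203.0.113.20:443     TIME_WAIT
tcp        0      0 192.0.2.10:49823    203.0.113.20:443     TIME_WAIT
tcp        0      0 192.0.2.10:49824    203.0.113.20:80      TIME_WAIT
tcp        0      0 0.0.0.0:3389        0.0.0.0:*            LISTEN
"""


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state, taking the last column as the state."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0].lower().startswith("tcp"):
            states[fields[-1]] += 1
    return states


if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE_NETSTAT).most_common():
        print(f"{state:12} {count}")
```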

So you see where this is going… on very busy network nodes such as proxies (BlueCoat) we may end up in a situation where we don’t have any more sockets to spare since they are all in the TIME_WAIT state.
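A back-of-the-envelope calculation shows how quickly this can bite. If every closed connection ties up its source port for the whole TIME_WAIT period, the sustainable rate of new connections to a single destination IP and port is roughly the number of usable source ports divided by the TIME_WAIT duration. The port count below (28232, i.e. a 32768 to 60999 ephemeral range) and the durations are assumptions for illustration; your platform's values may well differ.

```python
def max_new_connections_per_sec(source_ports: int, time_wait_seconds: float) -> float:
    """Rough ceiling on new connections per second to one destination."""
    if time_wait_seconds <= 0:
        raise ValueError("time_wait_seconds must be positive")
    return source_ports / time_wait_seconds


if __name__ == "__main__":
    ports = 60999 - 32768 + 1  # assumed ephemeral port range
    for time_wait in (240, 60):  # 2 x 2 minutes, and a shortened timeout
        rate = max_new_connections_per_sec(ports, time_wait)
        print(f"TIME_WAIT={time_wait}s -> about {rate:.1f} new connections/s")
```

This ignores everything else that limits a proxy, but it gives a feel for why shortening TIME_WAIT is tempting on busy nodes.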

The RFC, anticipating (or reacting to) this problem, states:

“When a connection is closed actively, it MUST linger in TIME-WAIT state for a time 2xMSL (Maximum Segment Lifetime). However, it MAY accept a new SYN from the remote TCP to reopen the connection directly from TIME-WAIT state, if it:

(1)  assigns its initial sequence number for the new connection to be larger than the largest sequence number it used on the previous connection incarnation, and

(2)  returns to TIME-WAIT state if the SYN turns out to be an old duplicate”

OK, so focusing on point (1). This says we can reuse the same socket, but only if the SYN packet contains a sequence number which is larger than the one previously used. Simple to follow… but not when you introduce factors like NAT and non-compliant client operating systems. In most of these cases, the same socket is reused (especially with NAT) but the sequence number is not higher than before… so the upstream server or proxy must reset the connection, and you get dropped connections and lots of moaning from users.
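As a rough sketch of the check described above (as I read point (1), not as any particular vendor implements it): sequence numbers are 32-bit and wrap around, so “larger” has to be evaluated modulo 2^32. The function names are hypothetical, and returning “reset” mirrors the behaviour seen in this case rather than anything the RFC mandates.

```python
SEQ_MOD = 2 ** 32


def seq_greater(a: int, b: int) -> bool:
    """True if sequence number a comes after b in 32-bit modular arithmetic."""
    diff = (a - b) % SEQ_MOD
    return 0 < diff < SEQ_MOD // 2


def handle_syn_in_time_wait(syn_seq: int, highest_seq_seen: int) -> str:
    """Decide what to do with a SYN that hits a socket still in TIME_WAIT."""
    if seq_greater(syn_seq, highest_seq_seen):
        return "accept"  # treated as a new incarnation of the connection
    return "reset"       # looks like an old duplicate or non-compliant reuse


if __name__ == "__main__":
    print(handle_syn_in_time_wait(syn_seq=2000, highest_seq_seen=1500))
    print(handle_syn_in_time_wait(syn_seq=1000, highest_seq_seen=1500))
    # Wraparound: 5 comes "after" 2**32 - 10.
    print(handle_syn_in_time_wait(syn_seq=5, highest_seq_seen=2 ** 32 - 10))
```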

The obvious answer would be to ensure everyone uses a sequence number which is higher than the one previously used. Easier said than done. The sheer variety of connections / clients / applications / proxies that need to communicate means you cannot easily keep track of which sequence numbers were used. This is especially true considering that high-traffic networks can have connections numbering in the millions, and the 2MSL time period means all sequence numbers must be recorded for 4 minutes. That would be an architectural headache for any node that tries to do this.

Instead, the RFC makes a liberal statement and says that vendors are allowed to reduce the 2MSL period, so that TIME_WAIT states last a shorter amount of time and the ports can be reused sooner. This makes it easier to work with. Let’s say we now have two network nodes (these could be a client / server pair, or two forwarding proxies, or what have you… any two nodes that terminate connections). Let’s also say that you have a well-designed and well-behaved network in which you rarely see any fragmented or late packets… so the problem I described in the 6 steps earlier is of little concern to you. Let’s finally say that you have a high-volume network which presents you with the TIME_WAIT resource starvation I just described. We’re getting close to an ideal situation where shortening the 2MSL timeout on one side of the connection is a valid solution.

Here is where we hit a snag… it’s not always easy to determine on which node to apply the 2MSL reduction. Ideally, one should observe which sockets are being reset due to port reuse, track which applications generate these sockets, and see which side of the connection initiates the TCP FIN under normal operation. In practice, there is rarely the time or will to do all this. It should, however, be easy to spot which side is resetting the connections.

For example, let’s say that, as in my situation, you have two BlueCoat proxies, one forwarding connections to the other. So we have a downstream proxy sending connections to the upstream proxy. Using a PCAP we observe that the upstream proxy is sending all the RST packets because the downstream proxy is re-using the sockets. While it’s not exact, we can assume to a certain degree that if we:

  1. Leave the 2MSL timeout on the downstream proxy at the normal 4 minute time period, this means that if the client starts a connection closure, there will be a relatively long period until the downstream tries to re-use the ports
  2. Reduce the 2MSL timeout on the upstream proxy. This means that if the server starts the connection closure, there will be a relatively short period of time until the upstream will release the ports for re-use

Combined, the above two steps should ensure that even if we don’t know which side starts the FIN closure, we can say that the downstream will probably not re-use the same ports for a longish time, and the upstream will free up the ports for re-use after a shortish time.

Let me know if you ever ran into this situation and if the above makes sense 🙂