TIME_WAIT and “port reuse”
Lately, during some support work, a customer raised an interesting case regarding what was referred to as “port reuse”. This led to quite a nice investigation into the effect of the MSL and TIME_WAIT characteristics of TCP. So first we should define these terms and what exactly they mean. Getting an exact definition can be more difficult than expected because the RFC states one thing, but allows vendors the flexibility to change their defaults for better performance. The best I could come up with is…
MSL: Maximum Segment Lifetime.
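This is defined quite well in its Wikipedia entry: it is the time a TCP segment can exist in the internetwork system, which RFC 793 arbitrarily defines as 2 minutes.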
TIME_WAIT: This is a valid TCP state. Imagine the following scenario:
- Client opens a connection “A” to Server
- Normal TCP operation (3-way handshake, data transfer, ACKs, and so on)
- Client and server terminate their connection “A” via the use of FIN packets
- Client opens another connection “B” to Server
- Normal TCP operation
- For whatever reason (for example network congestion, latency, or high CPU on intermediate nodes), a delayed TCP packet from connection “A” arrives at the server. Should the server accept this packet? Mark it as a duplicate? Deny it?
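To make the ambiguity concrete, here is a minimal Python sketch of a server that demultiplexes segments purely by the 4-tuple and forgets a connection the moment it closes (i.e. no TIME_WAIT at all). The addresses, ports and the `demultiplex` helper are made up for illustration; real stacks also check sequence numbers, so treat this as a toy model of the steps above rather than how any particular OS behaves.

```python
from typing import NamedTuple, Optional


class FourTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def demultiplex(table: dict, key: FourTuple) -> Optional[str]:
    """Return the name of the connection a segment would be delivered to."""
    return table.get(key)


# Hypothetical addresses and ports, for illustration only.
key = FourTuple("192.0.2.10", 50000, "198.51.100.5", 80)

table = {}
table[key] = "A"  # connection A is established
del table[key]    # A is closed with FINs and forgotten immediately (no TIME_WAIT)
table[key] = "B"  # the client reuses the same source port for connection B

# A delayed segment that was sent on connection A finally arrives.
# It carries the same 4-tuple, so the lookup cannot tell it apart from B's traffic.
delivered_to = demultiplex(table, key)
print(delivered_to)  # prints "B"
```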
In order to solve the above problem, we have the TIME_WAIT state. TCP requires that the endpoint that initiates an active close of the connection eventually enters TIME_WAIT. Usually TIME_WAIT = 2MSL. In other words:
If a client or server initiates an active close (using FIN packets), it must then wait for 2MSL before allowing the same socket pair (i.e. the same combination of IP addresses and TCP port numbers) to be used again.
You can intuitively see why this makes sense and why it should avoid the problem described in the steps above. If all TCP packets must “die” after the time period “MSL”, then surely waiting for twice that amount of time would mean that no TCP packets from the old connection could possibly still exist on the network. I found a pretty good, if somewhat unclear, diagram over at the University of California webpage:
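As a quick sanity check on the arithmetic, here is a small Python sketch. It assumes the 2-minute MSL that RFC 793 uses (vendors often pick something shorter), and the function names are mine, not from any library.

```python
RFC_MSL_SECONDS = 120  # RFC 793's arbitrary MSL of 2 minutes (vendors may differ)


def time_wait_seconds(msl_seconds: float) -> float:
    """TIME_WAIT lasts twice the Maximum Segment Lifetime."""
    return 2 * msl_seconds


def may_reuse(seconds_since_close: float, msl_seconds: float = RFC_MSL_SECONDS) -> bool:
    """True once the full 2MSL period has elapsed since the active close."""
    return seconds_since_close >= time_wait_seconds(msl_seconds)


print(time_wait_seconds(RFC_MSL_SECONDS))  # prints 240
print(may_reuse(90))                       # prints False
print(may_reuse(240))                      # prints True
```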
A couple of points you should notice. As noted before, only the side that actively closes the TCP connection enters the TIME_WAIT state. You see this illustrated above in the arrow coming from the “ESTABLISHED” state labelled “appl:close”, which presumably stands for “the application issues a close”, and which generates a FIN packet, illustrated with “send: FIN”. Which side this will be is highly dependent on the application being used. It can be either the server or the client that goes into TIME_WAIT.
All well and good. This solves the problem I illustrated before. But it introduces a new problem: starvation of resources. For example, on a terminal server with very low traffic we already see several TIME_WAIT connections, and keep in mind these connections cannot be used again for 2MSL.
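If you want to put a number on this for your own machine, one rough approach is to count the states in the text output of `netstat -an`. The Python sketch below assumes the state is the last column of each TCP line (true for plain `netstat -an` on the systems I have seen, but not if you add options that append extra columns). The sample text is made up for illustration and is not taken from a real server.

```python
from collections import Counter

# State names as printed by common netstat implementations.
TCP_STATES = {
    "LISTEN", "LISTENING", "SYN_SENT", "SYN_RECV", "SYN_RECEIVED",
    "ESTABLISHED", "FIN_WAIT1", "FIN_WAIT2", "FIN_WAIT_1", "FIN_WAIT_2",
    "CLOSE_WAIT", "CLOSING", "LAST_ACK", "TIME_WAIT", "CLOSED",
}


def count_tcp_states(netstat_output: str) -> Counter:
    """Count connections per TCP state, skipping lines that do not end in a state."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[-1] in TCP_STATES:
            counts[fields[-1]] += 1
    return counts


# Made-up sample in the style of `netstat -an`.
SAMPLE = """\
  Proto  Local Address      Foreign Address    State
  TCP    10.0.0.5:3389      10.0.0.9:51000     ESTABLISHED
  TCP    10.0.0.5:49712     10.0.0.20:80       TIME_WAIT
  TCP    10.0.0.5:49713     10.0.0.20:80       TIME_WAIT
  TCP    10.0.0.5:49714     10.0.0.20:443      TIME_WAIT
"""

print(count_tcp_states(SAMPLE)["TIME_WAIT"])  # prints 3
```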
So you see where this is going… on very high-activity network nodes such as proxies (BlueCoat) we may end up in a situation where we don’t have any more sockets to spare, since they are all in the TIME_WAIT state.
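A back-of-the-envelope estimate shows how quickly this can bite. The sketch below assumes the node is always the active closer and that all connections go to a single destination IP and port, so every TIME_WAIT entry pins one local port for that destination. The ephemeral port range is an assumption (it varies by OS and configuration), as are the two TIME_WAIT durations.

```python
def max_sustained_rate(ephemeral_ports: int, time_wait_seconds: float) -> float:
    """New connections per second to one destination before ports run out.

    Each closed connection holds its local port for time_wait_seconds, so at a
    steady rate r there are r * time_wait_seconds ports stuck in TIME_WAIT.
    """
    return ephemeral_ports / time_wait_seconds


# Assumed ephemeral range of 32768-60999; check your own OS settings.
ports = 61000 - 32768

print(ports)                                     # prints 28232
print(round(max_sustained_rate(ports, 240), 1))  # prints 117.6
print(round(max_sustained_rate(ports, 60), 1))   # prints 470.5
```

In other words, under these assumptions a full 4-minute TIME_WAIT caps you at a little over a hundred new connections per second towards any one destination.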
RFC 1122 (section 4.2.2.13), anticipating (or reacting to) this problem, states:
“When a connection is closed actively, it MUST linger in TIME-WAIT state for a time 2xMSL (Maximum Segment Lifetime). However, it MAY accept a new SYN from the remote TCP to reopen the connection directly from TIME-WAIT state, if it:
(1) assigns its initial sequence number for the new connection to be larger than the largest sequence number it used on the previous connection incarnation, and
(2) returns to TIME-WAIT state if the SYN turns out to be an old duplicate”
OK, so focusing on point (1). This says we can reuse the same socket, but only if the SYN packet contains a sequence number larger than any previously used. Simple to follow… but not when you introduce factors like NAT and non-compliant client OSes. In most of these cases the same socket is reused (especially with NAT) but the sequence number is not increased… so the upstream server or proxy must reset the connection, and you get dropped connections and lots of moaning from users.
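Here is a minimal Python sketch of that check, following the reading given above: a SYN arriving for a socket in TIME_WAIT is accepted only if its sequence number is “after” the last one seen. TCP sequence numbers are 32-bit and wrap around, so the comparison has to be done modulo 2^32. The function names and the “accept” / “reset” outcomes are mine and only illustrate the idea; actual stacks differ in the details.

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def seq_gt(a: int, b: int) -> bool:
    """True if sequence number a comes after b, allowing for wraparound."""
    return 0 < (a - b) % SEQ_MODULUS < 2 ** 31


def handle_syn_in_time_wait(syn_seq: int, last_seq_seen: int) -> str:
    """Decide what to do with a new SYN that hits a TIME_WAIT socket."""
    return "accept" if seq_gt(syn_seq, last_seq_seen) else "reset"


print(handle_syn_in_time_wait(5000, 4000))      # prints "accept"
print(handle_syn_in_time_wait(4000, 4000))      # prints "reset"
print(handle_syn_in_time_wait(10, 2 ** 32 - 5)) # prints "accept" (wrapped around)
```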
The obvious answer would be to ensure everyone uses a sequence number higher than the one previously used. Easier said than done. The sheer variety of connections / clients / applications / proxies that need to communicate means you cannot easily keep track of which sequence numbers were used. This is especially true considering that high-traffic networks can have connections numbering in the millions, and the 2MSL time period means all sequence numbers must be recorded for 4 minutes. That would be an architectural headache for any node that tries to do this.
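To get a feel for the bookkeeping involved, here is a rough Python estimate of how many closed connections a node would need to remember at any one time. The close rate is a made-up figure for a busy node, not a measurement.

```python
def entries_to_track(closes_per_second: int, retention_seconds: int) -> int:
    """Steady-state number of remembered connections (rate times retention time)."""
    return closes_per_second * retention_seconds


# Hypothetical load: 10,000 connection closes per second, kept for 2MSL = 240 s.
print(entries_to_track(10_000, 240))  # prints 2400000
```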
Instead, the RFC makes a liberal statement and says that vendors are allowed to reduce the 2MSL period, so that TIME_WAIT states last a shorter amount of time and the ports can be reused sooner. This makes it easier to work with. Let’s say we now have two network nodes (these could be a client / server pair, or two forwarding proxies, or what have you… any two nodes that terminate connections). Let’s also say that you have a well-designed and well-behaved network in which you rarely see any fragmented or late packets… so the problem I described in the six steps earlier is of little concern to you. Let’s finally say that you have a high-volume network which presents you with the TIME_WAIT resource starvation I just described above. We’re getting close to an ideal situation where shortening the 2MSL timeout on one side of the connection is a valid solution.
This is where we hit a snag… it’s not always that easy to determine on which node to apply the 2MSL reduction. Ideally, one should observe which sockets are being reset due to port reuse, track which applications generate these sockets, and see which side of the connection initiates the TCP FIN under normal operation. In practice, there is rarely the time or the will to do all this. It should at least be easy to spot which side is resetting the connections.
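Spotting the resetting side can be as simple as counting RST packets per source address in a capture. The Python sketch below works on a hypothetical, already-decoded list of (source IP, TCP flags) pairs, with flags written as letters such as “S”, “SA” or “R”. How you get from a PCAP file to that list depends on your tooling and is not shown here; the addresses are made up.

```python
from collections import Counter


def rst_senders(packets) -> Counter:
    """Count RST packets per source IP.

    packets: iterable of (src_ip, tcp_flags) pairs, where tcp_flags is a string
    of flag letters such as "S", "SA", "A", "R" or "RA" (assumed format).
    """
    return Counter(src for src, flags in packets if "R" in flags)


# Made-up capture: 10.1.1.1 plays the downstream proxy, 10.2.2.2 the upstream.
CAPTURE = [
    ("10.1.1.1", "S"),   # downstream reuses a port
    ("10.2.2.2", "R"),   # upstream resets it
    ("10.1.1.1", "S"),
    ("10.2.2.2", "SA"),
    ("10.1.1.1", "A"),
    ("10.1.1.1", "S"),
    ("10.2.2.2", "RA"),
]

print(rst_senders(CAPTURE).most_common(1))  # prints [('10.2.2.2', 2)]
```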
For example, let’s say that, as in my situation, you have two BlueCoat proxies. One forwards connections to the other, so we have a downstream proxy sending connections to an upstream proxy. Using a PCAP we observe that the upstream proxy is sending all the RST packets because the downstream proxy is re-using the sockets. While it’s not exact, we can reasonably do the following:
- Leave the 2MSL timeout on the downstream proxy at the normal 4-minute time period. This means that if the client side starts the connection closure, there will be a relatively long period until the downstream tries to re-use the ports
- Reduce the 2MSL timeout on the upstream proxy. This means that if the server side starts the connection closure, there will be a relatively short period of time until the upstream releases the ports for re-use
Combined, the above two steps should ensure that even if we don’t know which side starts the FIN closure, we can say that the downstream will probably not re-use the same ports for a longish time, and the upstream will free up the ports for re-use after a shortish time.
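The reasoning can be written down as a tiny Python model. It is deliberately simplistic: only the active closer enters TIME_WAIT, the downstream keeps the standard 240 seconds, and the upstream is reduced to a hypothetical 30 seconds (pick whatever your vendor allows). The function and key names are mine.

```python
def after_close(active_closer: str,
                downstream_time_wait: int = 240,
                upstream_time_wait: int = 30) -> dict:
    """Seconds each proxy spends in TIME_WAIT, depending on who sent the first FIN."""
    if active_closer == "downstream":
        # Downstream sits in TIME_WAIT and will not re-use the port for a long time.
        return {"downstream_waits_before_reuse": downstream_time_wait,
                "upstream_rejects_reuse_for": 0}
    if active_closer == "upstream":
        # Upstream sits in TIME_WAIT, but only briefly, so early re-use rarely collides.
        return {"downstream_waits_before_reuse": 0,
                "upstream_rejects_reuse_for": upstream_time_wait}
    raise ValueError("active_closer must be 'downstream' or 'upstream'")


print(after_close("downstream"))
# prints {'downstream_waits_before_reuse': 240, 'upstream_rejects_reuse_for': 0}
print(after_close("upstream"))
# prints {'downstream_waits_before_reuse': 0, 'upstream_rejects_reuse_for': 30}
```

Either way, the window in which a re-used port can run into a lingering TIME_WAIT entry on the upstream is at most the reduced upstream value.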
Let me know if you ever ran into this situation and if the above makes sense :)