Friday, May 26, 2017

TCP Operations

IPv4 and IPv6 PMTU

As discussed earlier, the MTU (Maximum Transmission Unit) is “the size of the largest network layer protocol data unit that can be communicated in a single network transaction”.
  • PMTU, Path MTU, is the smallest MTU of any link along a specific path (the lowest common denominator).
  • PMTUD, PMTU Discovery, is the process by which a host learns that smallest value. PMTUD exists for both IPv4 and IPv6; the two versions are fairly similar, but do have some differences.
An IPv4 packet that is larger than the MTU of the outgoing link can be subject to fragmentation. This is often not ideal, as fragmentation increases the processing needed for a packet. Remember, a fragmented packet is only reassembled at the destination. If the IPv4 packet has the DF (Don’t Fragment) bit set in the IP header, then, rather than fragmenting, the forwarding device will drop the packet. IPv6 always behaves this way, as IPv6 routers are not allowed to fragment; only the source host may do so.
When a router drops a packet because it is larger than the MTU, the router replies to the source with an ICMP message. Though the type and code are different, the message is similar between ICMP and ICMPv6.
  • ICMP – a Type 3 Code 4 ICMP message (Destination Unreachable, Fragmentation Needed and DF Set) is generated and sent back to the IP source. The MTU that caused the packet to drop is included in the returned data as the Next-hop MTU.
  • ICMPv6 – a Type 2 (Code 0) ICMPv6 message (Packet Too Big) is generated. There are no other codes for ICMPv6 Type 2. Similarly, the MTU is added to the ICMPv6 message and sent to the IP source.
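
The discovery loop described above can be sketched as a small simulation. This is a simplified model, not a packet-level implementation: the function name `discover_pmtu` and the list of per-hop MTUs are hypothetical, and every packet is assumed to have DF set (or to be IPv6), so a hop that cannot forward it reports its own MTU back to the sender.

```python
def discover_pmtu(link_mtus, initial_size):
    """Return (path_mtu, probes) for a path described by its per-hop MTUs.

    Hypothetical model: a hop whose MTU is smaller than the packet drops it
    and reports its own MTU, as the Next-hop MTU field of an ICMP Type 3
    Code 4 message (or the MTU field of ICMPv6 Type 2) would.
    """
    size = initial_size
    probes = []
    while True:
        probes.append(size)
        # Find the first hop that cannot forward a packet of this size.
        blocking = next((mtu for mtu in link_mtus if mtu < size), None)
        if blocking is None:
            return size, probes  # the packet reached the destination
        size = blocking  # the sender lowers its packet size and retries


path_mtu, probes = discover_pmtu([1500, 1400, 1500, 1280], 1500)
print(path_mtu, probes)  # 1280 [1500, 1400, 1280]
```

Each dropped probe costs a round trip to the dropping router, which is one reason a path with several successively smaller MTUs takes longer to settle.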


MSS

Understanding TCP MSS is key to understanding why the PMTUD process works. MSS, or Maximum Segment Size, is a TCP header option that specifies the largest amount of data, in bytes and not counting the TCP or IP headers, that a device can receive in a single TCP segment. This is not a negotiated value; rather, each end host announces its own value independently. If no MSS option is received, the default MSS is 536 bytes (for IPv4). The MSS option is sent only during the TCP handshake, in the SYN and SYN-ACK segments, but a host can lower the size of the segments it actually sends at any time during the lifetime of the TCP connection.
  • The initiating host derives an MSS from its network MTU (the MTU minus the IP and TCP headers) and sends it in the TCP MSS option of the first SYN packet
  • The responding host records this value as the largest segment it may send to the initiator
  • The responding host derives its own MSS from its network MTU and sends it in the TCP MSS option of the SYN-ACK packet
  • At this point, each host knows the largest segment the other can receive, and in practice both limit themselves to the smaller of the two values
  • The final ACK of the handshake completes the TCP connection
  • Once an MSS value is set, it should not be increased
This process lets each end host effectively learn the MTU at the other end of the TCP connection, but does nothing to identify a smaller MTU along the path. This is where the ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages come into play, informing the sending host about a smaller Path MTU. That host then lowers the size of the segments it sends to fit the reported MTU. Because the MSS option is only carried in SYN segments, the far end host is not explicitly notified of the new value; it simply receives smaller segments from then on.
Another way to modify the TCP MSS value, outside of end host communication, is to have an in-path router rewrite the MSS option of in-flight SYN segments (often called MSS clamping). This technique can be used to preemptively lower the MSS to a specific value so that packets are not dropped and no ICMP message has to be generated.
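
The MSS arithmetic in this section can be sketched in a few lines. The function names `mss_from_mtu` and `clamp_mss` are hypothetical, and the header sizes assume no IP or TCP options are present, which is the usual basis for the familiar 1460-byte MSS on a 1500-byte Ethernet MTU.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed header only
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_from_mtu(mtu, ipv6=False):
    """Largest TCP payload that fits in one unfragmented packet of size mtu."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


def clamp_mss(advertised_mss, clamp_to):
    """Model an in-path router clamping the MSS: the value is only lowered."""
    return min(advertised_mss, clamp_to)


client_mss = mss_from_mtu(1500)   # standard Ethernet host
server_mss = mss_from_mtu(9000)   # hypothetical jumbo-frame host

print(client_mss, server_mss)                     # 1460 8960
print(min(client_mss, server_mss))                # 1460, the size both ends use
print(clamp_mss(client_mss, mss_from_mtu(1400)))  # 1360, after clamping for a 1400-byte hop
print(mss_from_mtu(1500, ipv6=True))              # 1440
```

Note that clamping never raises the value: a router that knows about a 1400-byte tunnel in the path lowers a 1460-byte MSS to 1360, but would leave a 1200-byte MSS alone.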


Latency

Latency is commonly measured as round-trip delay time (RTD) or round-trip time (RTT). RTT is the length of time it takes for a signal to be sent from one host to another plus the length of time for the reply. TCP is a connection-oriented protocol, meaning that it strives to ensure successful delivery of information. This is achieved by the use of acknowledgements that segments have arrived. While great for ensuring data delivery, acknowledgements do come with a drawback: once a full window of data is outstanding, an acknowledgement has to be received before more data can be sent. This means that a network path with a high RTT can perform slower due to the need to wait for ACKs.

  • Example: suppose a network path has a maximum bandwidth of 100 Mbps but an RTT of 200 ms, and a host wants to transmit a large file over this path. Sure, the host can send a full window of data at 100 Mbps, but the data takes 100 ms to reach the far end host. The receiving host then acknowledges receipt of the data, and the acknowledgement takes another 100 ms to return. The sender sits idle for most of each round trip, so the file transfer quickly starts to have performance issues.
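
To put rough numbers on this example, the sketch below computes the throughput ceiling when at most one window of data can be sent per round trip. The function name `max_throughput_bps` is hypothetical, and the 65,535-byte window is an assumption (the largest unscaled TCP window); the example in the text does not state a window size.

```python
def max_throughput_bps(window_bytes, rtt_ms):
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 * 1000 / rtt_ms


link_bps = 100_000_000  # 100 Mbps path from the example
rtt_ms = 200            # 200 ms round-trip time from the example
window = 65_535         # assumed: largest window without window scaling

ceiling = max_throughput_bps(window, rtt_ms)
print(f"{ceiling / 1e6:.2f} Mbps")               # 2.62 Mbps
print(f"{ceiling / link_bps:.1%} of the link")   # 2.6% of the link
```

Under these assumptions the transfer uses under 3% of the available bandwidth, no matter how fast the link itself is.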
There are ways to alleviate this problem.

  • Increase the TCP Window size
    • Using the TCP window scaling option, the window can be increased from a 16-bit value to an effective 30-bit value (about 1 GB)
  • The TCP selective acknowledgement (SACK) option helps to identify multiple missed segments, preventing an entire window of data from being held up
  • Switch to a connectionless protocol, like UDP, that does not require acknowledgements
    • A UDP flow will still suffer from the initial delay to reach the end host, but, since UDP doesn’t stop and wait, data can be sent continuously

Windowing (Window Scaling)

TCP Window Size is a 16-bit field (65,535 bytes standard max) in the TCP header (NOT an option). It tells the far end host how much buffer space is allocated for the TCP connection. In other words, one host is telling the other host how much unacknowledged data can be sent. This works both ways. The window size value is adjusted up or down based on performance. If the receive buffers are full, a host can respond with a window size of 0, stopping all further payload from being sent. Once buffer space is freed, the window size can be increased, resuming data transmission. Also, once an entire window of data has been sent, the sending host must wait for an acknowledgement before sending any more data.

On connections with high bandwidth AND high RTT (delay), a 16-bit window fills quickly, leaving the link idle while the sender waits for ACKs. The window scaling TCP option can be used to increase, or scale, the window size from a 16-bit value to an effective 30-bit value (about 1 GB max). The option carries a shift count from 0 to 14, and the 16-bit window field is shifted left by that many bits. This option must be sent by both end hosts at the initiation of the TCP handshake, in the SYN and SYN-ACK packets. The window value is adjusted as normal; however, the bits now represent the new, scaled values.
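
The scaling arithmetic can be sketched as follows, assuming the RFC 7323 limit of 14 for the shift count. The function names `scaled_window` and `shift_for` are hypothetical helpers for illustration, not part of any TCP stack.

```python
MAX_SHIFT = 14  # largest shift count permitted by RFC 7323


def scaled_window(window_field, shift):
    """Effective window in bytes for a 16-bit window field and a shift count."""
    return window_field << shift


def shift_for(target_bytes):
    """Smallest shift count whose maximum window covers target_bytes."""
    for shift in range(MAX_SHIFT + 1):
        if scaled_window(0xFFFF, shift) >= target_bytes:
            return shift
    raise ValueError("target exceeds the largest scalable window")


print(scaled_window(0xFFFF, 0))   # 65535, no scaling
print(scaled_window(0xFFFF, 14))  # 1073725440, just under 1 GiB
print(shift_for(2_500_000))       # 6, enough for a 2.5 MB window
```

Because the shift count is fixed during the handshake, a host that may later need a large window has to offer a sufficient shift in its SYN even if its initial window is small.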

Bandwidth Delay Product

BDP, or Bandwidth Delay Product, refers to the maximum amount of data that can be in transit on a network circuit at any given time. It is the product of a link’s capacity (in bps) and its RTT (Round Trip Time, or latency). A lower BDP is easier for TCP to handle, as this value is also the amount of data that has been sent but not yet acknowledged; the TCP window must be at least as large as the BDP to keep the link full. Paths with a high BDP, known as long-delay paths or long fat networks (LFNs), can create TCP window buffer issues.

  • bps * delay
    • or simply, B*D
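
As a worked instance of B*D, the sketch below uses a 100 Mbps path with a 200 ms RTT. The function names are hypothetical, and the RTT is passed in milliseconds so the arithmetic stays in integers.

```python
def bdp_bits(bandwidth_bps, rtt_ms):
    """Bandwidth delay product in bits (bps multiplied by RTT in seconds)."""
    return bandwidth_bps * rtt_ms // 1000


def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth delay product in bytes."""
    return bdp_bits(bandwidth_bps, rtt_ms) // 8


print(bdp_bits(100_000_000, 200))   # 20000000 bits
print(bdp_bytes(100_000_000, 200))  # 2500000 bytes
print(bdp_bytes(100_000_000, 200) > 65_535)  # True
```

A BDP of 2.5 MB is far larger than the 65,535-byte unscaled TCP window, which is why a path like this needs window scaling to be used efficiently.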

Global Synchronization

Global synchronization is a network phenomenon wherein multiple TCP flows in a high utilization environment experience a congestion event at the same time. Being a well-known and established standard, TCP behaves in a predictable and uniform fashion. In this case, though, this works against TCP, as all the flows use similar, if not the same, techniques to avoid congestion. Sensing packet loss, TCP invokes a “slow-start” algorithm, backing off the sending flows in an attempt to let the network stabilize, then slowly increasing (or ramping up) transmission speed. Because all the flows ramp up at the same time, congestion returns, which triggers the “slow-start” algorithm again and makes all the TCP flows back off once more. If multiple TCP flows begin behaving in this same manner, they can be thought of as synchronized, causing the network to appear to continually stutter.

This phenomenon is generally attributed to the tail drop mechanism used to handle congestion. Tail drop can be thought of as a first in, first out queue for an interface buffer. Once a packet enters the buffer, it is set to be processed. All subsequent packets trying to enter a full buffer are dropped until packets are processed and buffer space becomes available. This queueing mechanism affects each TCP flow evenly, causing each TCP flow to react in the same way. Tail drop is especially vulnerable to very bursty traffic.

Queuing mechanisms thought to reduce the likelihood of global synchronization:

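  • Random Early Detection (RED) – Rather than waiting until the interface buffer is full, RED begins dropping randomly selected packets as the average queue depth grows, with the drop probability rising as the queue fills. Flows with more packets in the buffer are more likely to be hit, and flows back off at different times instead of all at once. The original implementation does not take QoS markings into account when deciding what to drop.
  • Weighted Random Early Detection (WRED) – An extension to standard RED, WRED can assign a queue threshold based on a traffic class. Generally, a QoS marking (DSCP or CoS) is used to add traffic to a class.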
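
A minimal sketch of the drop decision follows, assuming the linear drop curve of the classic RED design: no drops below a minimum threshold, every packet dropped at or above a maximum threshold, and a probability that rises linearly up to `max_p` in between. The class names and threshold values in `WRED_PROFILES` are hypothetical, chosen only to show how WRED applies a different curve per traffic class.

```python
import random


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Drop probability for a given average queue depth (classic RED curve)."""
    if avg_queue < min_th:
        return 0.0  # queue is short: never drop
    if avg_queue >= max_th:
        return 1.0  # queue is too long: drop everything
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# Hypothetical WRED profiles: traffic class -> (min_th, max_th, max_p)
WRED_PROFILES = {
    "best-effort": (20, 40, 0.10),
    "priority": (30, 40, 0.05),
}


def wred_should_drop(traffic_class, avg_queue, rng=random):
    """Randomly decide whether to drop a packet of the given class."""
    min_th, max_th, max_p = WRED_PROFILES[traffic_class]
    return rng.random() < red_drop_probability(avg_queue, min_th, max_th, max_p)


for depth in (10, 30, 40):
    print(depth, red_drop_probability(depth, 20, 40, 0.10))
```

With these example profiles and an average depth of 30 packets, best-effort traffic is dropped about 5% of the time while priority traffic is not yet dropped at all, so lower classes are pushed to back off first.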

TCP Options

TCP options can be included in the TCP header, just before the data payload. The options field starts at bit 160 of the TCP header, immediately after the fixed 20-byte portion. TCP options are identified by a “kind”, similar in meaning to a type.

Interesting TCP options include:

  • Kind 3 – Window Scale
  • Kind 4 and 5 – SACK Permitted and SACK (Selective Acknowledgment)
  • Kind 8 – Timestamps
  • Kind 14 and 15 – TCP Alternate Checksum
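
To show how the kind values are used, here is a minimal sketch of walking the options area of a TCP header. It assumes the standard layout of one kind byte followed by one length byte (which counts both of those bytes), with Kind 0 (End of Option List) and Kind 1 (No-Operation) as single-byte exceptions. The function name `parse_tcp_options` and the sample bytes are made up for illustration.

```python
OPTION_NAMES = {
    2: "MSS",
    3: "Window Scale",
    4: "SACK Permitted",
    5: "SACK",
    8: "Timestamps",
}


def parse_tcp_options(raw):
    """Return a list of (kind, value_bytes) pairs from a TCP options area."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:  # End of Option List
            break
        if kind == 1:  # No-Operation, a single padding byte
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]  # covers the kind and length bytes too
        if length < 2 or i + length > len(raw):
            raise ValueError("malformed option length")
        options.append((kind, raw[i + 2:i + length]))
        i += length
    return options


# Made-up SYN options: MSS 1460, NOP, Window Scale 7, SACK Permitted, padding
syn_options = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7, 4, 2, 0, 0])

for kind, value in parse_tcp_options(syn_options):
    print(OPTION_NAMES.get(kind, f"Kind {kind}"), value.hex())
```

For the sample bytes this lists MSS with value `05b4` (1460), Window Scale with value `07`, and SACK Permitted with an empty value.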