IPv4 and IPv6 PMTU
As discussed earlier, MTU, Maximum Transmission Unit, is “the size of the largest network layer protocol data unit that can be communicated in a single network transaction”.
- PMTU, Path MTU, is the smallest MTU along a specific path, the lowest common denominator of every link in it.
- PMTUD, PMTU Discovery, is a process by which devices can learn the smallest MTU value along a path. PMTUD exists for both IPv4 and IPv6; the two implementations are fairly similar but do have some differences.
An IPv4 packet larger than the MTU of a forwarding device can be subject to fragmentation. This is often not ideal, as fragmentation increases the processing cost of a packet. Remember, a fragmented packet is only reassembled at the destination. If the IPv4 packet has the DF (Don’t Fragment) bit set in the IP header, then, rather than fragmenting, the forwarding device will drop the packet. IPv6 behaves this way by default, since fragmentation by network routers is not allowed at all; only the source host may fragment.
When a router drops a packet because it exceeds the MTU, the router replies to the source with an ICMP message. Though the type and code differ, the message is similar between ICMP and ICMPv6.
- ICMP – a Type 3 Code 4 (Destination Unreachable, Fragmentation Needed) message is generated and sent back to the IP source. The MTU size that caused the drop is included in the returned data as the Next-hop MTU.
- ICMPv6 – a Type 2 Code 0 (Packet Too Big) message is generated; there are no other codes for ICMPv6 Type 2. Similarly, the MTU size is added to the ICMPv6 message and sent to the IP source.
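To see PMTUD at work from a host, the kernel can be told to set the DF bit and to report the path MTU it has learned. Below is a minimal sketch, assuming Linux: the IP_MTU_DISCOVER, IP_PMTUDISC_DO, and IP_MTU socket options are Linux-specific, and the host and port (9, discard) are placeholders.

```python
import socket

def probe_path_mtu(host: str, port: int = 9) -> int:
    # Connect a UDP socket so the kernel associates it with one path.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect((host, port))
    # Set DF on outgoing datagrams: oversized sends now fail with
    # EMSGSIZE instead of being fragmented (Linux-specific option).
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MTU_DISCOVER,
                 socket.IP_PMTUDISC_DO)
    try:
        # May trigger a Type 3 Code 4 ICMP reply from an in-path router.
        s.send(b"x" * 1472)
    except OSError:
        pass  # the kernel already caps sends at the known path MTU
    # Read back the path MTU the kernel currently holds for this route.
    return s.getsockopt(socket.IPPROTO_IP, socket.IP_MTU)
```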
TCP MSS
Understanding TCP MSS is key to understanding why the PMTUD process works. MSS, or Maximum Segment Size, is a parameter in the options field of the TCP header that specifies the largest amount of data, in bytes and not counting the TCP or IP headers, that a device can receive in a single TCP segment. This is not a negotiated value; rather, each end host involved in the communication announces its own. The default MSS value is 536 bytes and is overridden by the TCP MSS option, which can only be carried in the SYN segments that initiate the TCP handshake. The effective send MSS, however, can be lowered at any time during the lifetime of the TCP connection.
- The initiating host derives an MSS from its network MTU (MTU minus the IP and TCP headers) and includes it in the TCP MSS option of the first SYN packet (see the sketch after this list)
- The responding host will, if its own network MTU demands a lower value, advertise that smaller MSS in the TCP MSS options field of the SYN/ACK packet
- At this point, each host knows the largest segment the other end can receive; the smaller of the two values bounds what either side will send
- The final ACK of the handshake completes the TCP connection
- Once an MSS value is set, it should not be increased
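As a quick illustration of the arithmetic behind the advertised value, here is a toy calculation assuming no IP or TCP options (the function name is purely illustrative):

```python
# MSS = interface MTU minus the fixed headers: a 20-byte IPv4 header
# (or 40-byte IPv6 header) plus a 20-byte TCP header.
def mss_from_mtu(mtu: int, ipv6: bool = False) -> int:
    ip_header = 40 if ipv6 else 20
    tcp_header = 20
    return mtu - ip_header - tcp_header

print(mss_from_mtu(1500))             # 1460 on standard Ethernet, IPv4
print(mss_from_mtu(1500, ipv6=True))  # 1440 for IPv6
```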
This process effectively lets each end host learn the MTU of the other end of the TCP connection, but it does nothing to identify a smaller MTU along the path. This is where the ICMP and ICMPv6 unreachable messages come into play, informing one of the end hosts about a smaller Path MTU. The host that receives the message then lowers its effective send MSS to fit the reported value. Because the MSS option is exchanged only during the handshake, the far end is never explicitly notified; it simply begins receiving smaller segments.
Another way to modify the TCP MSS value, outside of end host communication, is to have an in-path router rewrite the MSS option of in-flight SYN segments, a technique commonly called MSS clamping. This can be used to preemptively lower the MSS to a specific value, preventing packets from being dropped and ICMP messages from being generated.
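A rough sketch of what such a router does to each SYN it forwards, assuming the TCP header bytes are already in hand (checksum recomputation and the surrounding packet plumbing are omitted):

```python
import struct

def clamp_mss(tcp_header: bytearray, clamp: int) -> None:
    # Data offset (high nibble of byte 12) gives header length in 32-bit words.
    header_len = (tcp_header[12] >> 4) * 4
    i = 20                              # options begin after the fixed header
    while i < header_len:
        kind = tcp_header[i]
        if kind == 0:                   # End of Option List
            break
        if kind == 1:                   # No-Operation is a single byte
            i += 1
            continue
        length = tcp_header[i + 1]
        if kind == 2 and length == 4:   # MSS option carries a 2-byte value
            mss = struct.unpack_from("!H", tcp_header, i + 2)[0]
            if mss > clamp:
                struct.pack_into("!H", tcp_header, i + 2, clamp)
        i += length
```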
Latency
Latency is also known as round-trip delay time (RTD) or round-trip time (RTT). RTT is the length of time it takes for a signal to be sent from one host to another plus the length of time for the reply. TCP is a connection-oriented protocol, meaning that it strives to ensure successful delivery of information. This is achieved by acknowledging segments as they arrive. While great for ensuring data delivery, acknowledgements come with the drawback of having to be received before more data can be sent. This means that a network path with a high RTT can perform slower due to the need to wait for ACKs.
- Example: suppose a network path has a maximum bandwidth of 100 Mbps but an RTT of 200 ms, and a host wants to transmit a large file over this path. Sure, the host can send a full window of data at 100 Mbps, but it must then wait roughly 100 ms for the data to reach the far end, and another 100 ms for the acknowledgement to return once the TCP window is full. Quickly, the file transfer starts to have performance issues; the calculation below makes the ceiling concrete.
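A back-of-the-envelope calculation with the classic 16-bit window shows why:

```python
# At most one full window can be in flight per round trip, so
# throughput is capped at window / RTT regardless of link speed.
window_bytes = 65_535           # classic 16-bit TCP window maximum
rtt_seconds = 0.200
max_bps = window_bytes * 8 / rtt_seconds
print(f"{max_bps / 1e6:.1f} Mbps")   # ~2.6 Mbps on a 100 Mbps link
```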
There are ways to alleviate this problem.
- Increase the TCP Window size
- Using the TCP window scaling option, the window can be increased from a 16-bit value to an effective 30-bit value (roughly 1 GiB maximum)
- The TCP selective acknowledgement (SACK) option helps identify multiple missed segments, preventing an entire window of data from being held up
- Switch to a connectionless protocol, like UDP, that does not require acknowledgements
- A UDP connection will suffer from the initial delay to reach the end host, but, since UDP doesn’t stop and wait, data can be sent continuously
Windowing (Window Scaling)
https://tools.ietf.org/html/rfc1323 (window scaling RFC)
TCP Window Size is a 16-bit value (65,535 bytes standard max) in the TCP header (NOT an option). It tells the far end host how much buffer space is allocated for the TCP connection; in other words, one host is telling the other how much unacknowledged data may be sent. This works both ways. The window size value is adjusted up or down based on performance. If the receive buffers are full, a host can respond with a window size of 0, stopping all further payload from being sent. Once buffer space is freed, the window size can be increased, resuming data transmission. Also, once an entire window of data has been sent, the sending host must wait for an acknowledgement before sending any more data.
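The advertised window ultimately reflects free receive-buffer space, which on most stacks an application can nudge, as in this small sketch (Linux, for instance, doubles the requested value to leave room for bookkeeping):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask for a 256 KiB receive buffer; a larger buffer lets the stack
# advertise a larger window before it must throttle the sender.
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 262_144)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```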
On connections with high bandwidth AND high RTT (delay), it is easy to fill a small window buffer. The window scaling TCP option can be used to increase, or scale, the window size from a 16-bit value to an effective 30-bit value (a maximum window of 65,535 × 2^14 bytes, roughly 1 GiB). This is an option that must be set by both end hosts at the initiation of the TCP handshake, in the SYN and SYN-ACK packets. The window field is adjusted as normal; however, its value is now left-shifted by the negotiated scale factor.
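The mechanics are just a bit shift; a one-liner shows the maximum:

```python
# The scale factor (at most 14) is carried once in each SYN; every
# later 16-bit window advertisement is left-shifted by that amount.
def effective_window(advertised: int, scale: int) -> int:
    return advertised << scale

print(effective_window(65_535, 14))   # 1,073,725,440 bytes (~1 GiB)
```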
Bandwidth Delay Product
BDP, or Bandwidth Delay Product, refers to the maximum amount of data that can be in flight on a network circuit at any given time. It is the product of a link’s capacity (in bps) and its RTT (Round Trip Time, or latency). The BDP is also the amount of data that can be outstanding, unacknowledged, at once; paths with a high BDP, known as long-delay paths, can create TCP window buffer issues whenever the window is smaller than the BDP. A worked example follows the formula.
- BDP (bits) = bandwidth (bps) × delay (s)
- or simply, B × D
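Returning to the earlier 100 Mbps / 200 ms path as an illustration:

```python
# BDP for a 100 Mbps link with a 200 ms RTT, converted to bytes.
bandwidth_bps = 100e6
rtt_seconds = 0.200
bdp_bytes = bandwidth_bps * rtt_seconds / 8
print(f"{bdp_bytes / 1e6:.1f} MB")   # 2.5 MB must be in flight to fill
                                     # the pipe, far beyond a 64 KB window
```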
Global Synchronization
Global synchronization is a network phenomenon wherein multiple TCP flows in a highly utilized environment experience a congestion event at the same time. Being a well-known and established standard, TCP behaves in a predictable and uniform fashion. In this case, though, that works against TCP, as all the flows perform similar, if not identical, techniques to avoid congestion. Sensing packet loss, each flow invokes a “slow-start” algorithm, backing off its sending rate in an attempt to let the network stabilize, then slowly ramping transmission back up, all at the same time. The simultaneous ramp-up saturates the congested link again, triggering another round of loss and another synchronized back-off. When multiple TCP flows behave in this same manner, they can be thought of as synchronized, causing network throughput to appear to continually stutter.
This phenomenon is generally attributed to the tail drop mechanism used to handle congestion. Tail drop can be thought of as a first in, first out queue for an interface buffer. Once a packet enters the buffer it is set to be processed; all subsequent packets arriving at a full buffer are dropped until queued packets are processed and buffer space becomes available. This queueing mechanism affects each TCP flow evenly, causing each flow to react in the same way. Tail drop is especially vulnerable to very bursty traffic.
Queuing mechanisms intended to reduce the likelihood of global synchronization:
- Random Early Detection (RED) – Rather than waiting for the interface buffer to fill, RED begins dropping packets at random before it is full, with a drop probability that grows as the average queue depth grows (a simplified sketch follows this list). The original implementation does not leverage QoS markings when choosing which packets to drop.
- Weighted Random Early Detection (WRED) – An extension to standard RED, WRED can assign queue thresholds per traffic class; generally a QoS (DSCP or CoS) marking is used to place traffic into a class.
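A simplified sketch of RED's drop decision, with illustrative (not standard) threshold values; real implementations track an exponentially weighted average of the queue depth:

```python
import random

# Below min_th nothing is dropped; above max_th everything is dropped;
# in between, drop probability rises linearly with average queue depth.
def red_should_drop(avg_depth: float, min_th: float = 20.0,
                    max_th: float = 60.0, max_p: float = 0.10) -> bool:
    if avg_depth < min_th:
        return False
    if avg_depth >= max_th:
        return True
    p = max_p * (avg_depth - min_th) / (max_th - min_th)
    return random.random() < p
```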
TCP Options
TCP options can be included in the TCP header, just before the data payload. The options field starts at bit 160, immediately after the 20-byte fixed header. TCP options are identified by a “kind”, similar in meaning to a type; a small decoder follows the list below.
Interesting TCP options include:
- Kind 3 - Window Scale
- Kind 4 and 5 – SACK Permitted and SACK (Selective Acknowledgment)
- Kind 8 – Timestamps
- Kind 14 and 15 – TCP Alternate Checksum
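To make the kind/length layout concrete, here is a sketch that decodes the option bytes of a typical SYN (the hex string is illustrative):

```python
OPTION_NAMES = {0: "EOL", 1: "NOP", 2: "MSS", 3: "Window Scale",
                4: "SACK Permitted", 5: "SACK", 8: "Timestamps",
                14: "Alt Checksum Request", 15: "Alt Checksum Data"}

def decode_options(options: bytes) -> None:
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                 # End of Option List
            break
        if kind == 1:                 # NOP padding, one byte
            i += 1
            continue
        length = options[i + 1]       # total length, kind and length included
        value = options[i + 2:i + length]
        print(OPTION_NAMES.get(kind, f"kind {kind}"), value.hex())
        i += length

# MSS 1460, SACK Permitted, NOP, Window Scale 7, then EOL padding.
decode_options(bytes.fromhex("020405b40402010303070000"))
```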