From van@ee.lbl.gov Mon Mar 14 07:46:23 1994
Message-Id: <9403141543.AA02092@rx7.ee.lbl.gov>
To: end2end-tf@isi.edu
Cc: danzig@catarina.usc.edu
Subject: problems with Arizona's vegas
Date: Mon, 14 Mar 94 07:43:28 PST
From: Van Jacobson
Status: OR

During the e2e meeting I tried to point out that there were a number of serious problems with Arizona's vegas TCP mods. Unfortunately I needed to draw a picture to illustrate what was going on & couldn't do that from 600 miles away. Appended to this message is the picture.

Just to recap, I tried to explain that Arizona's mod to the Fast Retransmit algorithm was a very bad idea because (a) it couldn't possibly work if RTO was set correctly and (b) even if it worked, the gain would be negligible. The effect of (a) is that the algorithm does nothing if the RTT/RTO estimate is correct but generates data very quickly if the estimate gets screwed up. This is exactly the opposite of the behavior one wants in a large-scale Internet.

To see why this is true, the attached postscript file shows the rough chronology of a packet loss. (The scale is approximately right for the T1 NSFNet. Keep the packet spacing the same and move the RTT marks out by a factor of 30 for the T3 NSFNet.) The thick dashed diagonal line at the top of the page is the packet that was lost & the thin dashed diagonal line marks where its ack would have come back. There are a window's worth of packets in transit following the loss and some of them and their acks are shown. Since a packet was lost, all the acks are duplicates.

Arizona claims that making retransmit fire on the first duplicate ack (labeled '1') instead of the third (labeled '3') will make a significant performance difference. This is obviously silly: the cost (in throughput) of the packet loss is proportional to the time it takes the sender to learn it has filled the hole in the receiver's sequence space, which must be at least one RTT (roughly the time from 1RTT to 2RTT in the diagram). If the RTT is R and a packet time is P, the change at best cuts the recovery time from R+3P to R+P, i.e., to a fraction (R+P)/(R+3P) of its old value. For the old T1 backbone, R is about 26 packet times so the theoretical maximum improvement is ~7% and, as we scale up to faster backbones, the expected improvement falls like the inverse of the backbone bandwidth.

One might be tempted to implement Arizona's algorithm in the hope of getting some part of the 7% but a bit of inspection shows that would be a very antisocial thing to do: Arizona decides to send if the retransmit timer has expired when the first dup ack arrives. Notice that the first dup ack arrives essentially 1 RTT after the missing packet was sent, so the only way the retransmit timer could have expired by then is if RTO <= RTT. In other words, Arizona's algorithm sends whenever the retransmit timer is broken.

Since most of the TCP timing code was broken by vegas, it's worth pointing out that whenever there are competing connections sharing a path, transient RTT fluctuations of twice the minimum are completely normal (they just represent other connections starting or restarting after a loss), so it is never reasonable for RTO to be less than 2*RTT. Notice from the picture that the last possible dup ack arrives just before 2*RTT, so there is no variation of Arizona's algorithm that would work with a correctly set timer.
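To put numbers on the 7% figure above, here is a toy C program (mine, not anything from vegas or BSD) that evaluates the bound 1 - (R+P)/(R+3P) for the two backbones, using the 26-packet-time RTT quoted above for the T1 and the factor-of-30 scaling for the T3:

    #include <stdio.h>

    /*
     * Upper bound on the gain from firing fast retransmit on the 1st
     * dup ack instead of the 3rd: recovery takes at least R+3P today
     * and at least R+P with the change, so the improvement is at most
     * 1 - (R+P)/(R+3P). Work in packet times, i.e., P = 1.
     */
    static double max_gain(double r)    /* r = RTT in packet times */
    {
        return 1.0 - (r + 1.0) / (r + 3.0);
    }

    int main(void)
    {
        printf("T1 backbone (R ~ 26 packet times): %.1f%%\n",
               100.0 * max_gain(26.0));
        printf("T3 backbone (R ~ 30x that):        %.2f%%\n",
               100.0 * max_gain(26.0 * 30.0));
        return 0;
    }

Run it and the T1 case comes out at ~6.9% while the T3 case is down around a quarter of a percent -- the inverse-bandwidth falloff mentioned above.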
In passing, it should be noted that the 2*RTT minimum for RTO is necessary but not sufficient -- if your path is `short' but you share a bottleneck with someone whose path is `long', you can see transient fluctuations in RTT of roughly his path length. As I tried to say during the meeting, this means that conservative (i.e., scalable) engineering should guarantee that RTO is never less than 100-200ms. This is basically what BSD does but is something else that Arizona broke.

 - Van
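P.S. For concreteness, here is a sketch of what an RTO computation that respects both floors might look like. The names (compute_rto, RTO_FLOOR_MS) are mine and this is not the actual BSD code -- the srtt + 4*rttvar estimate is just the usual one; the point is the two clamps at the end.

    #include <stdio.h>

    #define RTO_FLOOR_MS 200            /* conservative fixed floor (100-200ms) */

    /* Illustrative sketch, not the actual BSD code: the usual
     * srtt + 4*rttvar estimate followed by the two clamps argued
     * for above. All times are in milliseconds. */
    static long compute_rto(long srtt_ms, long rttvar_ms)
    {
        long rto = srtt_ms + 4 * rttvar_ms;

        if (rto < 2 * srtt_ms)          /* transient 2x RTT fluctuations are */
            rto = 2 * srtt_ms;          /* normal on a shared path */
        if (rto < RTO_FLOOR_MS)         /* `short' paths sharing a bottleneck */
            rto = RTO_FLOOR_MS;         /* with `long' ones need a fixed floor */
        return rto;
    }

    int main(void)
    {
        /* a `short' path: 10ms srtt, small variance -- clamps to the floor */
        printf("rto = %ld ms\n", compute_rto(10, 2));
        return 0;
    }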