This is a continuation of bug 4735. If the wire is heavily overcongested, we still see scenarios where the congestion control fails badly and the RTT to the server increases dramatically.
The scenario happens when the updates are much larger than the BDP, which is easy to simulate with low latency and low bandwidth. As an example, we've tested with 50 ms RTT and 128 kbps bandwidth. We have not explored any realistic scenarios at this point.
The problem seems to be an overly aggressive idle timeout in the congestion handling. TCP resets the congestion window once more than an RTO has passed since the last transmission, which is generally around 2×RTT. We try to do the same, but our approximation is probably too simplistic.
The major flaw is that we only measure the time since our last transmission. That is not the same as TCP's last transmission: there will typically be some buffering between us and the wire, and in the failing scenario this buffering is massive. So we need to estimate when the buffered data actually left the wire, and measure one RTO from that point.
TCP also never sets RTO below 1 second, but we've set our limit to 100 ms. It is unclear whether that was meant as a more cautious value.
Lastly, TCP sets RTO based on current RTT (and its variation), whilst we use the minimum RTT instead.
If you enable congestion control logging you can see this from Xvnc:
> Mon Nov 29 16:25:44 2021
> Congestion: Connection idle for 424 ms, resetting congestion control
> Mon Nov 29 16:25:50 2021
> Congestion: Connection idle for 6334 ms, resetting congestion control
> Mon Nov 29 16:25:52 2021
> Congestion: Connection idle for 492 ms, resetting congestion control
> Mon Nov 29 16:26:05 2021
> Congestion: Connection idle for 565 ms, resetting congestion control
Since there is a steady flow of data this should not happen. There is always data ready to be sent, i.e. we are never truly idle.
Also note that it rarely manages to get a proper measurement in place.
There is some discussion about this upstream: