When the client connects, it tries to check that all the local ports it needs (for local devices) are unused. This check can currently hang if something unresponsive is on one of those ports, because we do a connect() to that port. Unfortunately we don't log anything at this point, so all the user sees is that the client hangs with "Connection to agent 1.2.3.4..." in the status bar. We should at least have a short timeout here, or ideally not use connect() at all. Can't we use bind()?

I saw this on a Windows 10 laptop here where something has gone terribly wrong and lots of connections are stuck in TIME_WAIT. Nothing is listening on the tested port (62596) in this case, but all those dead connections are probably still causing things to lock up somehow.
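A rough Python sketch of the two options mentioned above, for discussion only. The function names, the 127.0.0.1 default, and the 0.5 second timeout are made up for illustration and are not taken from the client code. It has also not been verified that either variant avoids the hang seen on the affected Windows machines.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Check a local port by binding to it instead of connecting.

    Hypothetical helper. SO_REUSEADDR is deliberately not set, so that
    bind() fails when another socket already holds the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Address in use, or not permitted to bind.
            return False
        return True


def port_answers(port, host="127.0.0.1", timeout=0.5):
    """Fallback if connect() has to stay: bound the wait with a timeout.

    Hypothetical helper. Returns True only if a connection was actually
    established; a refused connection or a timeout both return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers both connection refused and timeout.
        return False
```

With the bind() variant nothing is sent to whatever may be sitting on the port, so there is no peer to wait for. The connect() variant still depends on the other side, but it can no longer block for longer than the timeout.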
This just happened again, this time on the Windows 10 machine in the lab. Again, lots of sockets in TIME_WAIT.
This seems to happen regularly for our CEO on his Windows 11 laptop. I've only diagnosed it once, but the symptoms match. Besides the "Connecting to agent..." text, the log also shows that it hangs between killing the previous ssh process and feeding the expected host keys to the new tunnel object. We know this because we currently have bug 8536. This likely means it's the same issue, because not much else happens between those two steps.