At least track now.
We repeatedly get this in the automatic system tests, and we don't really know why.
I got this on a test server now, and the reason is likely this odd connection stuck in a closing state:

> [cendio@lab-210 ~]$ ss -nt | grep 904
> FIN-WAIT-1 0      168    127.0.0.1:904    127.0.0.1:904

It eventually timed out and vsmagent could start properly.
Perhaps we should be setting SO_REUSEADDR? It seems to be a bit of a norm for servers on well-known ports? https://stackoverflow.com/questions/6960219/why-not-using-so-reuseaddr-on-unix-tcp-ip-servers
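For reference, a minimal sketch of what setting the option looks like on a listening socket. The helper name make_listener and the default of an ephemeral port on 127.0.0.1 are assumptions for illustration only; this is not taken from the vsmagent code.

```python
import socket

def make_listener(host="127.0.0.1", port=0):
    """Create a listening TCP socket with SO_REUSEADDR set (hypothetical helper)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() to have any effect on the bind.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))  # port 0 lets the kernel pick a free port
    s.listen(5)
    return s
```

The option generally lets a server rebind its port while old connections are still lingering in TIME-WAIT. Whether it helps with a connection stuck in another closing state is a separate question.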
asyncio seems to be using SO_REUSEADDR by default, so this may be resolved as of bug 8224.
Further digging shows that we've always set SO_REUSEADDR in every version of the code. So that is not the core issue. As such, it is unlikely that bug 8224 resolves anything. Comment 4 is also completely absurd. Both ends of the connection have the same address and port, which should be impossible, as TCP would be unable to know which end is which. I think there is some very odd corner case we are triggering here.
I tried to look at the Linux kernel code to see if any corner cases were apparent, but that code is unfortunately very difficult to follow.

One theoretical scenario is that we have something similar to what happened in bug 3878. I.e. vsmserver manages to allocate an *outgoing* socket on port 904 whilst vsmagent is turned off. Since the reported cases are for just restarting vsmagent, this seems extremely unlikely. Not impossible, though.

To further narrow the race window, any TIME-WAIT sockets lingering around from before vsmagent was stopped will also prevent vsmserver from stealing the port. And there seems to almost constantly be such a socket in place, as vsmserver polls the agent faster than TIME-WAIT sockets are culled.
I don't think there is anything we can do to progress this issue without more information. The problem seems to still exist, at least, as the last occurrence was just a couple of months ago. Fortunately, it was just internal, and we haven't had any customer reports in ages.
(In reply to Pierre Ossman from comment #7)
>
> Comment 4 is also completely absurd. Both ends of the connection have the
> same address and port, which should be impossible, as TCP would be unable to
> know which end is which.
>

Apparently the kernel lets you do this absurdity. Assuming the port is completely free (i.e. no listening or TIME-WAIT sockets), you can do:

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("", 904))
s.connect(("127.0.0.1", 904))

This succeeds for some extremely odd reason. You end up getting a socket connected to itself, despite there not even being any listen() in there!

This socket will prevent vsmagent from starting. And closing the socket is not enough to resolve things, as it will continue to block things in a TIME-WAIT state for a while.

vsmserver could in theory do exactly these steps, assuming ports 905-1023 are all busy, and 904 is not blocked by a previous instance of vsmagent.
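The steps above need root because of port 904. A sketch of the same thing without root, assuming Linux and using a kernel-assigned port instead, could look like this. The function name self_connect and the 5 second timeout are made up for the example.

```python
import socket

def self_connect(host="127.0.0.1"):
    """Bind a TCP socket to a free port and connect it to itself (Linux behaviour)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)  # avoid hanging forever if the platform behaves differently
    s.bind((host, 0))  # port 0: let the kernel pick a free port
    port = s.getsockname()[1]
    # No listen() anywhere; the socket connects to its own address and port.
    s.connect((host, port))
    return s

if __name__ == "__main__":
    s = self_connect()
    print("local:", s.getsockname(), "peer:", s.getpeername())
    s.close()
```

Both ends report the same address and port, matching the ss output in comment 4. Other operating systems may well refuse the connect().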
(In reply to Pierre Ossman from comment #8)
>
> To further narrow the race window, any TIME-WAIT sockets lingering around
> from before vsmagent was stopped will also prevent vsmserver from stealing
> the port. And there seems to almost constantly be such a socket in place, as
> vsmserver polls the agent faster than TIME-WAIT sockets are culled.

This window should not exist. Sockets should linger in TIME-WAIT for at least 60 seconds, but vsmserver polls every 40 seconds. So something more would need to explain why those 20+ remaining seconds aren't enough to keep vsmserver from stealing the port.
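To make the arithmetic explicit, here is a small sketch of the timeline. It assumes each poll's connection is closed right at the poll time and leaves a TIME-WAIT socket on the port from that moment. That is a simplification, and the function name uncovered_gaps is made up for the example.

```python
def uncovered_gaps(poll_interval, time_wait, polls=5):
    """Return (start, end) intervals between consecutive polls with no TIME-WAIT socket.

    Simplified model: poll i happens at i * poll_interval seconds and its
    TIME-WAIT socket exists for time_wait seconds from that moment.
    """
    gaps = []
    for i in range(polls - 1):
        expires = i * poll_interval + time_wait
        next_poll = (i + 1) * poll_interval
        if expires < next_poll:
            gaps.append((expires, next_poll))
    return gaps

if __name__ == "__main__":
    # Values from this bug: 40 second polls, 60 second TIME-WAIT.
    print(uncovered_gaps(40, 60))
    # A hypothetical shorter TIME-WAIT, for comparison only.
    print(uncovered_gaps(40, 30))
```

With 40 second polls and a 60 second TIME-WAIT, the model gives no gaps at all, as each socket overlaps the next poll by 20 seconds. A gap would only show up if the TIME-WAIT period were shorter than the poll interval, or if a poll was skipped.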