Bug 3217 - Invalid "Address already in use" on some systems
Status: NEW
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Agent
Version: 2.1.0
Hardware: PC All
Importance: P2 Normal
Target Milestone: MediumPrio
Assignee: Bugzilla mail exporter
Reported: 2009-08-24 10:34 CEST by Pierre Ossman
Modified: 2023-11-20 16:29 CET

Comment 2 Peter Åstrand cendio 2009-08-25 10:34:47 CEST
At least track now. 
Comment 3 Pierre Ossman cendio 2019-05-14 12:45:23 CEST
We repeatedly get this in the automatic system tests, and we don't really know why.
Comment 4 Pierre Ossman cendio 2023-09-21 09:05:02 CEST
I got this on a test server now, and the reason is likely this odd connection stuck in a closing state:

> [cendio@lab-210 ~]$ ss -nt | grep 904
> FIN-WAIT-1 0      168        127.0.0.1:904     127.0.0.1:904         

It eventually timed out and vsmagent could start properly.
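
A quick way to look for this kind of socket is to scan `ss -nt` output for lines where the local and peer endpoints are identical. The sketch below is only an illustration: the helper name `find_self_connected` is made up here, and it assumes the usual column order of `ss -nt` (state, Recv-Q, Send-Q, local, peer).

```python
def find_self_connected(ss_output: str) -> list:
    """Return lines of `ss -nt` output whose local and peer endpoints are equal."""
    hits = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port [Process]
        if len(fields) < 5 or not fields[1].isdigit():
            continue  # header line or something we do not recognise
        local, peer = fields[3], fields[4]
        if local == peer:
            hits.append(line.strip())
    return hits


# The line quoted above, without the quoting prefix.
sample = "FIN-WAIT-1 0      168        127.0.0.1:904     127.0.0.1:904         "
print(find_self_connected(sample))
```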
Comment 5 Pierre Ossman cendio 2023-09-21 09:06:07 CEST
Perhaps we should be setting SO_REUSEADDR? It seems to be a bit of a norm for servers on well-known ports?

https://stackoverflow.com/questions/6960219/why-not-using-so-reuseaddr-on-unix-tcp-ip-servers
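
For reference, setting the option on a listening socket usually looks something like the sketch below. This is not the vsmagent code; `make_listener` is a hypothetical helper, and port 0 is used so the example runs without root (the real port, 904, is privileged).

```python
import socket


def make_listener(host: str, port: int) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEADDR set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # The option has to be set before bind() to affect the bind check.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


if __name__ == "__main__":
    # Port 0 asks the kernel for any free port.
    listener = make_listener("127.0.0.1", 0)
    print(listener.getsockname())
    listener.close()
```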
Comment 6 Pierre Ossman cendio 2023-10-27 10:43:44 CEST
asyncio seems to be using SO_REUSEADDR by default, so this may be resolved as of bug 8224.
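
One way to check the asyncio default is to start a throwaway server and read the option back from its socket. A minimal sketch, assuming a Unix system (where `reuse_address` defaults to true) and using a hypothetical helper name:

```python
import asyncio
import socket


async def asyncio_sets_reuseaddr() -> bool:
    """Start a throwaway asyncio server and report whether SO_REUSEADDR is set."""
    # The callback is never invoked here, as nothing connects to the server.
    server = await asyncio.start_server(lambda r, w: w.close(), "127.0.0.1", 0)
    try:
        sock = server.sockets[0]
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
    finally:
        server.close()
        await server.wait_closed()


if __name__ == "__main__":
    print(asyncio.run(asyncio_sets_reuseaddr()))
```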
Comment 7 Pierre Ossman cendio 2023-11-20 14:59:58 CET
Further digging shows that we've always set SO_REUSEADDR in every version of the code. So that is not the core issue. As such, it is unlikely that bug 8224 resolves anything.

Comment 4 is also completely absurd. Both ends of the connection have the same address and port, which should be impossible, as TCP would be unable to know which end is which.

I think there is some very odd corner case we are triggering here.
Comment 8 Pierre Ossman cendio 2023-11-20 15:59:50 CET
I tried to look at the Linux kernel source to see if any corner cases were apparent, but that code is unfortunately very difficult to follow.

One theoretical scenario is that we have something similar to what happened in bug 3878. I.e. vsmserver manages to allocate an *outgoing* socket on port 904 whilst vsmagent is turned off.

Since the reported cases are for just restarting vsmagent, this seems extremely unlikely. Not impossible, though.

To further narrow the race window, any TIME-WAIT sockets lingering around from before vsmagent was stopped will also prevent vsmserver from stealing the port. And there seems to almost constantly be such a socket in place, as vsmserver polls the agent faster than TIME-WAIT sockets are culled.
Comment 9 Pierre Ossman cendio 2023-11-20 16:02:58 CET
I don't think there is anything we can do to progress this issue without more information.

The problem seems to still exist, at least, as the last occurrence was just a couple of months ago. Fortunately, it was just internal, and we haven't had any customer reports in ages.
Comment 10 Pierre Ossman cendio 2023-11-20 16:12:31 CET
(In reply to Pierre Ossman from comment #7)
> 
> Comment 4 is also completely absurd. Both ends of the connection have the
> same address and port, which should be impossible, as TCP would be unable to
> know which end is which.
> 

Apparently the kernel lets you do this absurdity. Assuming the port is completely free (i.e. no listening or TIME-WAIT sockets), you can do:

  import socket

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(("", 904))  # no listen() is ever called
  s.connect(("127.0.0.1", 904))

This succeeds for some extremely odd reason. You end up getting a socket connected to itself. Despite there not even being any listen() in there!

This socket will prevent vsmagent from starting. And closing the socket is not enough to resolve things, as it will continue to block things in a TIME-WAIT state for a while.
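
The same effect can be tried without root by letting the kernel pick an unprivileged port. The usual explanation is TCP's simultaneous open: the socket's own SYN arrives back at it while it is in SYN-SENT, and the handshake completes against itself. The sketch below is an illustration only; `self_connect_demo` is a made-up name, and the behaviour is assumed to be Linux-specific enough that the function returns None if the kernel refuses.

```python
import socket


def self_connect_demo():
    """Try to connect a TCP socket to its own bound address and port.

    Returns (local, peer) on success, or None if the connect fails.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)  # do not hang forever on systems that behave differently
    try:
        s.bind(("127.0.0.1", 0))  # kernel picks a free port
        port = s.getsockname()[1]
        try:
            s.connect(("127.0.0.1", port))  # no listen() anywhere
        except OSError:
            return None
        return s.getsockname(), s.getpeername()
    finally:
        s.close()


if __name__ == "__main__":
    print(self_connect_demo())
```

If the connect succeeds, both tuples are identical, which matches the `ss` output in comment 4.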


vsmserver could in theory do exactly these steps, assuming ports 905-1023 are all busy, and 904 is not blocked by a previous instance of vsmagent.
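
How vsmserver actually picks its source port is not shown in this bug. The sketch below only illustrates the kind of descending scan that the "905-1023 are all busy" condition implies, similar in spirit to the classic reserved-port allocation. `bind_descending` and its parameters are hypothetical.

```python
import errno
import socket


def bind_descending(sock: socket.socket, high: int, low: int) -> int:
    """Bind sock to the first free port, scanning from high down to low."""
    for port in range(high, low - 1, -1):
        try:
            sock.bind(("", port))
            return port
        except OSError as exc:
            # Only "port busy" moves us on to the next candidate.
            if exc.errno != errno.EADDRINUSE:
                raise
    raise OSError(errno.EADDRINUSE, "no free port in range")
```

With `high=1023` and `low=904` (which needs root), such a loop would only land on 904 when every port above it is taken, which is why the scenario is so unlikely.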
Comment 11 Pierre Ossman cendio 2023-11-20 16:25:17 CET
(In reply to Pierre Ossman from comment #8)
> 
> To further narrow the race window, any TIME-WAIT sockets lingering around
> from before vsmagent was stopped will also prevent vsmserver from stealing
> the port. And there seems to almost constantly be such a socket in place, as
> vsmserver polls the agent faster than TIME-WAIT sockets are culled.

This window should not exist. Sockets should linger in TIME-WAIT for at least 60 seconds, but vsmserver polls every 40 seconds. So something more would need to explain why those 20+ remaining seconds aren't enough to keep vsmserver from stealing the port.
