3217 – Invalid "Address already in use" on some systems

Status:	NEW

Alias:	None

Product:	ThinLinc
Classification:	Unclassified
Component:	VSM Agent
Version:	2.1.0
Hardware:	PC All

Importance:	P2 Normal
Target Milestone:	MediumPrio
Assignee:	Bugzilla mail exporter

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-08-24 10:34 CEST by Pierre Ossman
Modified:	2023-11-20 16:29 CET
CC List:	0 users

See Also:	217
Acceptance Criteria:

Comment 2 Peter Åstrand cendio

2009-08-25 10:34:47 CEST

At least track this for now.

Comment 3 Pierre Ossman cendio

2019-05-14 12:45:23 CEST

We repeatedly get this in the automatic system tests, and we don't really know why.

Comment 4 Pierre Ossman cendio

2023-09-21 09:05:02 CEST

I got this on a test server now, and the reason is likely this odd connection stuck in a closing state:

> [cendio@lab-210 ~]$ ss -nt | grep 904
> FIN-WAIT-1 0      168        127.0.0.1:904     127.0.0.1:904         

It eventually timed out and vsmagent could start properly.
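
For catching this state in the automatic tests, something like the following could scan /proc/net/tcp for a socket whose local and remote endpoints are identical. This is only a sketch: the function names are made up for illustration, the sample line is synthetic (modelled on the ss output above, not captured from a real system), and the address decoding assumes a little-endian machine.

```python
import socket

# Subset of the kernel's TCP state codes as they appear in /proc/net/tcp
TCP_STATES = {
    "01": "ESTAB",
    "04": "FIN-WAIT-1",
    "05": "FIN-WAIT-2",
    "06": "TIME-WAIT",
    "0A": "LISTEN",
}


def parse_endpoint(field):
    """Decode 'AAAAAAAA:PPPP' into (dotted IPv4 address, port)."""
    addr_hex, port_hex = field.split(":")
    # Assumption: little-endian host, so the address bytes are reversed
    addr = socket.inet_ntoa(bytes.fromhex(addr_hex)[::-1])
    return addr, int(port_hex, 16)


def self_connected(lines, port):
    """Return (endpoint, state) for sockets on `port` connected to themselves."""
    hits = []
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        local = parse_endpoint(fields[1])
        remote = parse_endpoint(fields[2])
        if local == remote and local[1] == port:
            hits.append((local, TCP_STATES.get(fields[3], fields[3])))
    return hits


# Synthetic sample mirroring the ss output: 127.0.0.1:904 <-> 127.0.0.1:904
SAMPLE = [
    "  sl  local_address rem_address   st tx_queue:rx_queue",
    "   0: 0100007F:0388 0100007F:0388 04 000000A8:00000000",
]
found = self_connected(SAMPLE, 904)
```

On a real system the input would come from open("/proc/net/tcp").read().splitlines() instead of SAMPLE.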

Comment 5 Pierre Ossman cendio

2023-09-21 09:06:07 CEST

Perhaps we should be setting SO_REUSEADDR? It seems to be a bit of a norm for servers on well-known ports?

https://stackoverflow.com/questions/6960219/why-not-using-so-reuseaddr-on-unix-tcp-ip-servers
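
For reference, setting the option would look roughly like this. It is a minimal sketch and not the actual vsmagent code; make_listener is a hypothetical helper, and port 0 is used so that the kernel picks a free port instead of 904.

```python
import socket


def make_listener(port, host="127.0.0.1"):
    """Create a listening TCP socket with SO_REUSEADDR enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option has to be set before bind() to have any effect
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


listener = make_listener(0)  # 0 = let the kernel choose a free port
reuse_enabled = listener.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
chosen_port = listener.getsockname()[1]
listener.close()
```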

Comment 6 Pierre Ossman cendio

2023-10-27 10:43:44 CEST

asyncio seems to be using SO_REUSEADDR by default, so this may be resolved as of bug 8224.
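
A quick way to check the asyncio default is to start a server without passing reuse_address and read the option back from the listening socket. This is a sketch that assumes a POSIX system; the handler and function names are illustrative only.

```python
import asyncio
import socket


async def handle(reader, writer):
    # Never called in this check; a server just needs some callback
    writer.close()


async def reuseaddr_set_by_default():
    # reuse_address is deliberately not passed, so the default applies
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    try:
        sock = server.sockets[0]
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
    finally:
        server.close()
        await server.wait_closed()


reuseaddr_default = asyncio.run(reuseaddr_set_by_default())
```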

Comment 7 Pierre Ossman cendio

2023-11-20 14:59:58 CET

Further digging shows that we've always set SO_REUSEADDR in every version of the code. So that is not the core issue. As such, it is unlikely that bug 8224 resolves anything.

Comment 4 is also completely absurd. Both ends of the connection have the same address and port, which should be impossible, as TCP would be unable to know which end is which.

I think there is some very odd corner case we are triggering here.

Comment 8 Pierre Ossman cendio

2023-11-20 15:59:50 CET

I tried to look at the Linux kernel code to see if any corner cases were apparent, but that code is unfortunately very difficult to follow.

One theoretical scenario is that we have something similar to what happened in bug 3878. I.e. vsmserver manages to allocate an *outgoing* socket on port 904 whilst vsmagent is turned off.

Since the reported cases are for just restarting vsmagent, this seems extremely unlikely. Not impossible, though.

To further narrow the race window, any TIME-WAIT sockets lingering around from before vsmagent was stopped will also prevent vsmserver from stealing the port. And there seems to almost constantly be such a socket in place, as vsmserver polls the agent faster than TIME-WAIT sockets are culled.

Comment 9 Pierre Ossman cendio

2023-11-20 16:02:58 CET

I don't think there is anything we can do to progress this issue without more information.

The problem seems to still exist, at least, as the last occurrence was just a couple of months ago. Fortunately, it was just internal, and we haven't had any customer reports in ages.

Comment 10 Pierre Ossman cendio

2023-11-20 16:12:31 CET

(In reply to Pierre Ossman from comment #7)
> 
> Comment 4 is also completely absurd. Both ends of the connection have the
> same address and port, which should be impossible, as TCP would be unable to
> know which end is which.
> 

Apparently the kernel lets you do this absurdity. Assuming the port is completely free (i.e. no listening or TIME-WAIT sockets), you can do:

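  import socket
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(("", 904))  # port 904 is privileged, so this normally needs root
  s.connect(("127.0.0.1", 904))  # connect to the address and port we just bound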

This succeeds for some extremely odd reason. You end up getting a socket connected to itself. Despite there not even being any listen() in there!

This socket will prevent vsmagent from starting. And closing the socket is not enough to resolve things, as it will continue to block things in a TIME-WAIT state for a while.


vsmserver could in theory do exactly these steps, assuming ports 905-1023 are all busy, and 904 is not blocked by a previous instance of vsmagent.
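
The same behaviour can be reproduced without root by letting the kernel pick an unprivileged port. The sketch below assumes Linux; self_connect and can_bind are hypothetical helpers, and the final probe mimics a server that sets SO_REUSEADDR before binding, as vsmagent is said to do.

```python
import errno
import socket


def self_connect(host="127.0.0.1"):
    """Return a TCP socket that is connected to its own address and port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)             # avoid hanging if the connect does not complete
    s.bind((host, 0))           # the kernel picks a free port
    s.connect(s.getsockname())  # connect to the endpoint we just bound
    return s


def can_bind(host, port):
    """Try to bind like a server that sets SO_REUSEADDR first."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        probe.bind((host, port))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False
        raise
    finally:
        probe.close()


s = self_connect()
local, peer = s.getsockname(), s.getpeername()
s.sendall(b"ping")
echo = s.recv(4)                  # whatever is sent comes straight back
blocked = not can_bind(*local)    # is the port unavailable to a new server?
s.close()
```

Here local and peer are the same tuple, and blocked is expected to be true given the behaviour described above. No listen() is involved at any point.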

Comment 11 Pierre Ossman cendio

2023-11-20 16:25:17 CET

(In reply to Pierre Ossman from comment #8)
> 
> To further narrow the race window, any TIME-WAIT sockets lingering around
> from before vsmagent was stopped will also prevent vsmserver from stealing
> the port. And there seems to almost constantly be such a socket in place, as
> vsmserver polls the agent faster than TIME-WAIT sockets are culled.

This window should not exist. Sockets should linger in TIME-WAIT for at least 60 seconds, but vsmserver polls every 40 seconds. So something more would need to explain why those 20+ remaining seconds aren't enough to keep vsmserver from stealing the port.
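
The arithmetic behind those 20 seconds, as a small sketch. It assumes that every poll leaves a TIME-WAIT socket on port 904 and that the time between the poll and the close of its connection is negligible; both are simplifications, not measured behaviour.

```python
TIME_WAIT_SECONDS = 60      # lower bound for TIME-WAIT quoted above
POLL_INTERVAL_SECONDS = 40  # how often vsmserver polls the agent


def time_wait_left(seconds_since_last_poll):
    """Seconds of TIME-WAIT protection remaining after the most recent poll."""
    return max(0, TIME_WAIT_SECONDS - seconds_since_last_poll)


# vsmagent can be stopped anywhere between 0 and 40 seconds after a poll
worst_case = min(time_wait_left(t) for t in range(POLL_INTERVAL_SECONDS + 1))
```

Under these assumptions worst_case is 20, i.e. the port should stay protected for at least 20 seconds after vsmagent is stopped.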
