It seems that the planets can align just right and cause vsmserver to crash and burn instead of just logging when an XML-RPC request times out:

> Traceback (most recent call last):
>   File "/opt/thinlinc/sbin/vsmserver", line 20, in <module>
>     VSMServer ( sys . argv )
>   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 155, in __init__
>     self . loop ( )
>   File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 423, in loop
>     I11I11i1I = self . run_delayed_calls ( )
>   File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 381, in run_delayed_calls
>     oOO0OOOo . func ( * oOO0OOOo . args , ** oOO0OOOo . kw )
>   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 171, in handle_timeout
>     self . parent . log . warning ( "Timeout communicating with socket %s, closing" % str ( self . socket . getpeername ( ) ) )
>   File "/usr/lib64/python2.7/socket.py", line 224, in meth
>     return getattr(self._sock,name)(*args)
> socket.error: [Errno 107] Transport endpoint is not connected

It is unclear if this is the server or client side of an XML-RPC call. The triggering condition, though, is massive load on the system.
This was supposedly fixed back in r25956 on bug 2862. However, we must have missed something, as the issue is still with us. We also failed to reproduce the issue back then, so we likely never got a good fix.

It is very unclear how this can happen. socket.getpeername() only throws that error before the socket is connected, and after socket.shutdown(). However, we never call socket.shutdown(), and self.connected is only set after the socket is verifiably connected. So there is some subtle bug here. It could even be a bug in glibc or the kernel where the low-level socket is in a broken state.
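For reference, a standalone illustration of the error class itself (not the vsmserver code path): getpeername() raises exactly this error on a socket that has no peer.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # A freshly created, never-connected TCP socket has no peer, so this
    # raises socket.error: [Errno 107] Transport endpoint is not connected
    # (ENOTCONN) on Linux.
    s.getpeername()
except socket.error as e:
    print(e)
finally:
    s.close()
```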
As an alternative we can start using self.addr instead. It is set by asyncore and should contain something valid. (And in the worst case it will just contain None, which won't crash.)
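A minimal sketch of what that change could look like, assuming the handler is an asyncore.dispatcher subclass. handle_timeout and the log message come from the traceback above; the class name and logger are made up for illustration.

```python
import asyncore
import logging

log = logging.getLogger("vsmserver")


class XMLRPCHandler(asyncore.dispatcher):
    # Hypothetical stand-in for the real XML-RPC request handler.

    def handle_timeout(self):
        # self.socket.getpeername() can raise socket.error (ENOTCONN) if the
        # underlying socket ends up in a broken state. self.addr is filled in
        # by asyncore when the connection is accepted/established, so it is
        # always safe to format here; in the worst case it is just None.
        log.warning("Timeout communicating with socket %s, closing"
                    % str(self.addr))
        self.close()
```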
(In reply to Pierre Ossman from comment #0)
>
> It is unclear if this is the server or client side of an XML-RPC call. The
> triggering condition though is massive load on the system.

I'm guessing the issue is with the server side of things. The client code seems to avoid this logging.

Perhaps socket.accept() can give us a broken socket under heavy load? However, asyncore has code to detect this, so that still wouldn't explain things. :/
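For context, this is the detection being referred to: asyncore's dispatcher.accept() returns None instead of raising when the underlying accept() fails with EWOULDBLOCK or ECONNABORTED, so a well-behaved handle_accept never hands a broken socket onwards. A rough sketch (the listener class and port are made up):

```python
import asyncore
import socket


class ListenerSketch(asyncore.dispatcher):
    # Hypothetical listener, only to illustrate the None check.

    def __init__(self, port=9000):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.set_reuse_addr()
        self.bind(("", port))
        self.listen(5)

    def handle_accept(self):
        pair = self.accept()
        if pair is None:
            # asyncore already swallowed EWOULDBLOCK/ECONNABORTED here.
            return
        sock, addr = pair
        # ...hand sock over to the real request handler...
```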
Since we cannot find the scenario that triggers this, we'll have to fix it blindly. Completely avoiding socket.getpeername() should be sufficiently safe.
Adjusted now. Test by causing timeouts on incoming XML-RPC handlers.
Tested with build 2048 on Ubuntu 20.04, and everything seems to be working as expected. However, I did not manage to reproduce the issue. Below are the commands I used to cause timeouts, as well as the scenarios I tested.

-- Commands --

The first line disables access from <ip> on <port>, and the second line enables access again.

> sudo iptables -I INPUT -p tcp -s <ip> --dport <port> -j DROP
> sudo iptables -D INPUT -p tcp -s <ip> --dport <port> -j DROP

-- Scenario 1; repeated shorter timeouts --

1. Log in to ThinLinc through the client.
2. On the server, I made a script to alternate between enabling and disabling access from the client's IP, swapping every 5 seconds (a sketch of such a script is included at the end of this comment).
3. I let the script run for 5 minutes while using the client a bit in the meantime.

No odd log messages in vsmserver.log.

-- Scenario 2; one long timeout --

1. Log in to ThinLinc through the client.
2. On the server, disable access from the client's IP.
3. Check that the client is still considered connected, by checking the session status in e.g. tlwebadm.
4. Let the client session sit for 10 minutes.

No odd log messages in vsmserver.log.
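For reference, a rough sketch of the kind of alternation script used in scenario 1, built around the iptables commands above. The client IP and port are placeholders, not values from the actual test setup.

```python
#!/usr/bin/env python
# Toggle the DROP rule for the client's IP every 5 seconds for 5 minutes.
import subprocess
import time

CLIENT_IP = "192.0.2.10"   # placeholder client address
VSM_PORT = "9000"          # placeholder vsmserver port

RULE = ["INPUT", "-p", "tcp", "-s", CLIENT_IP,
        "--dport", VSM_PORT, "-j", "DROP"]

deadline = time.time() + 5 * 60
blocked = False
try:
    while time.time() < deadline:
        # -I inserts the DROP rule (blocks traffic), -D deletes it again.
        action = "-I" if not blocked else "-D"
        subprocess.check_call(["sudo", "iptables", action] + RULE)
        blocked = not blocked
        time.sleep(5)
finally:
    # Make sure access is restored even if the loop is interrupted.
    if blocked:
        subprocess.check_call(["sudo", "iptables", "-D"] + RULE)
```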