7689 – Fatal crash on XML-RPC timeout

Bug 7689 - Fatal crash on XML-RPC timeout

Summary: Fatal crash on XML-RPC timeout

Status:	CLOSED FIXED

Alias:	None

Product:	ThinLinc
Classification:	Unclassified
Component:	VSM Server (show other bugs)
Version:	trunk
Hardware:	PC Unknown

Importance:	P2 Normal
Target Milestone:	4.13.0
Assignee:	Pierre Ossman

URL:
Keywords:	linma_tester, relnotes

Depends on:
Blocks:

Reported:	2021-04-20 13:00 CEST by Pierre Ossman
Modified:	2022-07-07 09:36 CEST (History)
CC List:	1 user (show)

See Also:	7592 7964
Acceptance Criteria:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Pierre Ossman cendio

2021-04-20 13:00:24 CEST

It seems that the planets can align just right and cause vsmserver to crash and burn instead of just logging when an XML-RPC request times out:

> Traceback (most recent call last):
>   File "/opt/thinlinc/sbin/vsmserver", line 20, in <module>
>     VSMServer ( sys . argv )
>   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 155, in __init__
>     self . loop ( )
>   File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 423, in loop
>     I11I11i1I = self . run_delayed_calls ( )
>   File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 381, in run_delayed_calls
>     oOO0OOOo . func ( * oOO0OOOo . args , ** oOO0OOOo . kw )
>   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 171, in handle_timeout
>     self . parent . log . warning ( "Timeout communicating with socket %s, closing" % str ( self . socket . getpeername ( ) ) )
>   File "/usr/lib64/python2.7/socket.py", line 224, in meth
>     return getattr(self._sock,name)(*args)
> socket.error: [Errno 107] Transport endpoint is not connected

It is unclear if this is the server or client side of an XML-RPC call. The triggering condition though is massive load on the system.

Comment 2 Pierre Ossman cendio

2021-04-20 14:06:50 CEST

This was supposedly fixed back in r25956 on bug 2862. However there is something we missed as the issue is still with us. We also failed to reproduce the issue then as well, so we likely never got a good fix.



It is very unclear how this can happen. socket.getpeername() only throws that error before it is connected, and after socket.shutdown(). However we never call socket.shutdown(), and self.connected is only set after the socket is verifiably connected. So there is some subtle bug here. Could even be a bug in glibc or the kernel where the low level socket is in a broken state.

Comment 3 Pierre Ossman cendio

2021-04-20 14:08:08 CEST

As an alternative we can start using self.addr instead. It is set by asyncore and should contain something valid.

(And in a worst case scenario it will just contain None, which won't crash)

Comment 4 Pierre Ossman cendio

2021-04-20 14:17:25 CEST

(In reply to Pierre Ossman from comment #0)
> 
> It is unclear if this is the server or client side of an XML-RPC call. The
> triggering condition though is massive load on the system.

I'm guessing the issue is with the server side of things. The client code seems to avoid this logging.

Perhaps socket.accept() can give us a broken socket under heavy load? However asyncore has code to detect this, so that still wouldn't explain things. :/

Comment 5 Pierre Ossman cendio

2021-04-20 14:31:04 CEST

Since we cannot find the scenario that triggers this we'll have to fix it blindly. Completely avoiding socket.getpeername() should be sufficiently safe.

Comment 7 Pierre Ossman cendio

2021-04-20 14:35:58 CEST

Adjusted now.

Test by causing timeouts on incoming XML-RPC handlers.

Comment 8 Linn cendio

2021-05-06 16:38:14 CEST

Tested with build 2048 on Ubuntu 20.04, and everything seems to be working as expected. However, I did not manage to reproduce the issue. Below are the commands I used to cause timeouts as well as the scenarios I tested.

-- Commands --
The first line disables access from <ip> on <port>, and the second line enables access again.
> sudo iptables -I INPUT -p tcp -s <ip> --dpor <port> -j DROP
> sudo iptables -D INPUT -p tcp -s <ip> --dpor <port> -j DROP

-- Scenario 1; repeated shorter timouts --

1. Log in to ThinLinc through the client.

2. On the server, I made a script to alternate enabling or disabling access from the client's IP. The swap between enabling or disabling occured every 5 seconds.

3. I let the script run for 5 minutes while using the client a bit meanwhile. No odd log messages in vsmserver.log.


-- Scenario 2; one long timeout --

1. Log in to ThinLinc through the client.

2. On the server, disable access from the client's IP.

3. Check that the client still is considered connected, by checking the session status in e.g. tlwebadm. 

4. Let the client session sit for 10 minutes. No odd log message in vsmserver.log.

Note You need to log in before you can comment on or make changes to this bug.