2862 – vsmagent crashes when machine is under heavy load

Bug 2862 - vsmagent crashes when machine is under heavy load

Summary: vsmagent crashes when machine is under heavy load

Status:	CLOSED FIXED

Alias:	None

Product:	ThinLinc
Classification:	Unclassified
Component:	VSM Agent (show other bugs)
Version:	pre-1.0
Hardware:	PC All

Importance:	P2 Major
Target Milestone:	4.0.0
Assignee:	Aaron Sowry

URL:
Keywords:	ossman_tester

Depends on:
Blocks:

Reported:	2008-08-14 11:05 CEST by Pierre Ossman
Modified:	2012-11-28 12:08 CET (History)
CC List:	0 users

See Also:
Acceptance Criteria:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Comment 1 Peter Åstrand cendio

2008-10-07 14:07:12 CEST

We also have bug 2919, but seems to be a bit different.

Comment 2 Peter Åstrand cendio

2008-11-03 15:48:18 CET

From Issue6885, we can also see this on vsmserver:

Traceback (most recent call last):
  File "/opt/thinlinc/sbin/vsmserver", line 22, in ?
    VSMServer ( sys . argv )
  File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 141, in __init__
    self . loop ( )
  File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 430, in loop
    o0o0oOOOo0oo = self . run_delayed_calls ( )
  File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 388, in run_delayed_calls
    I1i1i1 . func ( * I1i1i1 . args , ** I1i1i1 . kw )
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 163, in handle_timeout
    self . parent . log . warning ( "Timeout reading socket %s, closing" % str ( self . socket . getpeername ( ) ) )
  File "<string>", line 1, in getpeername
socket.error: (107, 'Transport endpoint is not connected')

Comment 4 Peter Åstrand cendio

2009-03-17 13:51:06 CET

Difficult problem that happens seldom. Postpone.

Comment 5 Aaron Sowry cendio

2012-08-03 14:33:51 CEST

Having some trouble reproducing this bug easily, however it looks like the problem lies in the logging of handle_timeout() in xmlrpc.py:

def handle_timeout(self, dolog=True):
    """Handle read or write timeout on socket"""
    if self.writable() and dolog:
        self.parent.log.warning("Timeout writing socket %s, closing" % str(self.socket.getpeername()))
    elif dolog:
        self.parent.log.warning("Timeout reading socket %s, closing" % str(self.socket.getpeername()))
    self.close()

If the timeout is due to the socket not being connected, socket.getpeername() will fail. I suppose we should either handle this case more elegantly within the method, or not bother with writing the peername in the logging message.

Comment 6 Aaron Sowry cendio

2012-08-03 16:18:25 CEST

Perhaps we could log socket.getsockname() instead of socket.getpeername()

Comment 7 Aaron Sowry cendio

2012-08-06 12:38:02 CEST

Fixed in r25565.

Comment 8 Aaron Sowry cendio

2012-08-08 10:09:55 CEST

Autotest updated in r25572

Comment 9 Pierre Ossman cendio

2012-10-03 10:50:06 CEST

getsockname() is a rather useless piece of information in most cases as it will just give you a local address of the machine, and the listening port of the server (i.e. things that are static and already well known).

The code should probably report both getsockname() and getpeername(), and changed to handle the fact that they can raise exceptions. Then we'll get useful information when possible.

(side note: to safely get the peer address in all cases, we would need to store it from the accept() call, but that's a lot more work)

Comment 10 Aaron Sowry cendio

2012-10-04 13:18:46 CEST

There are 2 problems here:

1) The function asynchat.writable() can return True even if nothing is connected to the socket. Our xmlrpc.handle_timout() function assumed that if asynchat.writable() is True, then there must be something connected to it. If nothing is connected to the socket, then socket.getpeername() will raise an exception, which is what was causing the initial problem.

2) Our xmlrpc.handle_timout() function assumed that if the socket is writable, then it must be a write timeout, otherwise it must be a read timeout. This isn't necessarily true; in fact, using the current handle_timout() implementation, it doesn't seem possible to tell exactly *what* timed out. The log text has therefore been changed to "communcation" timeout rather than "read/write" timeout.

Fixed in 25956.

Comment 11 Aaron Sowry cendio

2012-10-05 10:33:17 CEST

Autotests updated in r25959

Comment 12 Aaron Sowry cendio

2012-10-05 15:33:35 CEST

(In reply to comment #11)
> Autotests updated in r25959

Fixed again in r25960.

Comment 13 Pierre Ossman cendio

2012-10-17 13:23:20 CEST

Code review. Looks ok now.

Note You need to log in before you can comment on or make changes to this bug.