Bug 8224 - vsm xml-rpc handling relies on deprecated asyncore
Summary: vsm xml-rpc handling relies on deprecated asyncore
Status: REOPENED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Other
Target Milestone: 4.16.0
Assignee: Pierre Ossman
URL:
Keywords: prosaic
Depends on:
Blocks: 7636
Reported: 2023-09-11 13:25 CEST by Pierre Ossman
Modified: 2023-12-01 14:21 CET (History)
3 users

See Also:
Acceptance Criteria:
MUST:
* The XML-RPC code no longer relies on the system asyncore or asynchat modules

SHOULD:
* The XML-RPC code uses asyncio terminology rather than asyncore
* There is no ThinLinc code that is redundant with asyncio functionality
* The XML-RPC code is converted to asyncio models and abstractions (e.g. protocols)


Attachments

Description Pierre Ossman cendio 2023-09-11 13:25:30 CEST
Broken out from bug 7636. See that bug for details.

The VSM code has both server and client handling for XML-RPC. Both need to be converted to use asyncio instead of asyncore.
Comment 12 Pierre Ossman cendio 2023-09-19 11:37:12 CEST
This conversion is a bit problematic as asyncore has a strongly object-oriented design, whilst asyncio uses a mix of functions and objects.

To make this transition easier, we'll probably need to more clearly separate and reorganise some responsibilities in the code. E.g. we can't rely on subclassing and overriding methods as much, so other means of passing information need to be used.

The plan is to make the core xml-rpc code more ThinLinc agnostic and treat it more like an external lib. That should hopefully make things easier to follow.
Comment 34 Pierre Ossman cendio 2023-10-03 10:59:52 CEST
The VSMServer/VSMAgent classes are currently used as a "hub" for everything else. I.e. that's how all other code finds configuration, logging, session database, etc.

This doesn't fit well into asyncio, as you don't have your own subclass for the TCP servers. So we need to reorganise things some other way.

I'm leaning towards moving most things to simple functions and variables around main() instead. I think this fits well with the XML-RPC protocol, as that is function-based, not object-based.
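As a rough illustration of that direction (hypothetical names, not the actual ThinLinc code), dependencies can be passed into the protocol factory from main() instead of being looked up through a subclassed server "hub":

```python
import asyncio


class XMLRPCServerProtocol(asyncio.Protocol):
    """Hypothetical protocol: dependencies are injected, not inherited."""

    def __init__(self, config, handlers):
        self.config = config      # passed in, not found via a subclass hub
        self.handlers = handlers  # method name -> handler function
        self.buffer = b""

    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        self.buffer += data
        # ... parse XML-RPC request, dispatch to self.handlers ...


async def main():
    # Simple variables around main() stand in for the old hub object
    config = {"listen_port": 9000}  # illustrative configuration
    handlers = {}
    loop = asyncio.get_running_loop()
    server = await loop.create_server(
        lambda: XMLRPCServerProtocol(config, handlers),
        "127.0.0.1", config["listen_port"])
    async with server:
        await server.serve_forever()
```

The closure passed to create_server() is the asyncio replacement for overriding methods on a server subclass.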
Comment 48 Adam Halim cendio 2023-10-17 11:33:08 CEST
I tested raising an exception in an xmlrpc callback and noticed a change compared to 4.15.0. Previously, an exception from an xmlrpc callback would start a session when a client connected. Currently, in build 3398, a session is not created when an xmlrpc callback crashes. This was tested by raising an exception in call_unbindports.py::handle_returnvalue() in 4.15.0, and in loginhandler_common.py::_unbind_ports_done() in build 3398.

Also, in 4.15.0, the error traceback message prints ". Marking agent as down.", but the agent is not down. This is probably a bug in 4.15.0.

4.15.0 log:

> 2023-10-16 13:57:06 ERROR vsmserver.session: Unhandled exception trying to unbind ports for display 12 on VSM Agent 127.0.0.1:904: <class 'Exception'> Exception! Traceback (most recent call last):
>             File "/opt/thinlinc/modules/thinlinc/vsm/asyncbase.py", line 104, in ooOOOoOO0
>               obj . handle_read_event ( )
>             File "/usr/lib64/python3.11/asyncore.py", line 425, in handle_read_event
>               self.handle_read()
>             File "/usr/lib64/python3.11/asynchat.py", line 158, in handle_read
>               self.found_terminator()
>             File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 422, in found_terminator
>               self . handle_response ( )
>             File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 454, in handle_response
>               self . handle_returnvalue ( )
>             File "/opt/thinlinc/modules/thinlinc/vsm/call_unbindports.py", line 41, in handle_returnvalue
>               raise Exception("Exception!")
>           Exception: Exception!
>           . Marking agent as down.

Build 3398 log:

> 2023-10-16 13:55:31 ERROR vsmserver: Unhandled event loop exception: Exception in callback LoginHandlerCommon._unbind_ports_done(<Future finis...pccall.py:116>)
> 2023-10-16 13:55:31 ERROR vsmserver: handle: <Handle LoginHandlerCommon._unbind_ports_done(<Future finis...pccall.py:116>) created at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:102>
> 2023-10-16 13:55:31 ERROR vsmserver: ----------------------------------------
> 2023-10-16 13:55:31 ERROR vsmserver: Object created at (most recent call last):
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 74, in <module>
> 2023-10-16 13:55:31 ERROR vsmserver:     iIi1ii1I1iI11 ( )
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 63, in iIi1ii1I1iI11
> 2023-10-16 13:55:31 ERROR vsmserver:     OoO . loop ( )
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/asyncbase.py", line 224, in loop
> 2023-10-16 13:55:31 ERROR vsmserver:     self . run_forever ( )
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 607, in run_forever
> 2023-10-16 13:55:31 ERROR vsmserver:     self._run_once()
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 1914, in _run_once
> 2023-10-16 13:55:31 ERROR vsmserver:     handle._run()
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/events.py", line 80, in _run
> 2023-10-16 13:55:31 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 102, in o0OOOoo00ooOo
> 2023-10-16 13:55:31 ERROR vsmserver:     I1 . set_result ( future . result ( ) )
> 2023-10-16 13:55:31 ERROR vsmserver: ----------------------------------------
> 2023-10-16 13:55:31 ERROR vsmserver: Traceback (most recent call last):
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/events.py", line 80, in _run
> 2023-10-16 13:55:31 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/loginhandler_common.py", line 57, in _unbind_ports_done
> 2023-10-16 13:55:31 ERROR vsmserver:     raise Exception("Exception!")
> 2023-10-16 13:55:31 ERROR vsmserver: Exception: Exception!
> 2023-10-16 13:55:31 ERROR vsmserver: ----------------------------------------
Comment 60 Pierre Ossman cendio 2023-10-26 16:15:50 CEST
asyncio's UNIX socket server apparently doesn't clean up the socket after itself. I've filed an issue here:

https://github.com/python/cpython/issues/111246

I guess we'll have to do it manually for now.
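Until then, the manual cleanup could look roughly like this (a sketch; the socket path and names are illustrative):

```python
import asyncio
import contextlib
import os

SOCKET_PATH = "/tmp/example-vsm.sock"  # illustrative path


async def serve_unix(handle_client):
    server = await asyncio.start_unix_server(handle_client, path=SOCKET_PATH)
    try:
        async with server:
            await server.serve_forever()
    finally:
        # asyncio does not remove the socket file when the server is
        # closed (cpython issue 111246), so unlink it ourselves
        with contextlib.suppress(FileNotFoundError):
            os.unlink(SOCKET_PATH)
```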
Comment 61 Pierre Ossman cendio 2023-10-27 10:15:50 CEST
The existing code for user sockets puts significant effort into detecting if a socket has been replaced. Unfortunately, I cannot find any reasoning as to why. The socket directory is a privileged directory specific to ThinLinc, so there shouldn't be anything else writing stuff there.

I'm opting for not porting all that complexity over and just letting the code assume that the socket will remain in place.
Comment 71 Samuel Mannehed cendio 2023-11-02 13:00:25 CET
While testing bug 5556, we found that exceptions from xmlrpc.ClientProtocol._handle_response() are insufficiently logged. They are simply converted to strings; no traceback is logged.

We applied the following diff:

> M vsm/modules/thinlinc/vsm/xmlrpc.py
> @@ -183,6 +183,7 @@ class ClientProtocol(asyncio.Protocol):
>              self._handle_response()
>  
>      def _handle_response(self):
> +        raise ValueError("WHOOPS")
>          """Decode XMLRPC response and call L{handle_response}, then
>          close the connection.

This only results in log lines similar to these:

> 2023-11-02 11:27:18 WARNING vsmserver: Error checking if agent 127.0.0.1 is alive: WHOOPS
> 2023-11-02 11:27:28 WARNING vsmserver.loadinfo: Error getting load from agent 127.0.0.1: WHOOPS

We probably want more details in the logs here. We haven't tested a similar scenario in the 4.15.0 code, but the handling that prints these lines is new as of commit r40504 for this bug.

A brief review of that commit indicates that the old code printed more details such as tracebacks for these types of exceptions.
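For reference, a sketch of how the traceback could be restored (assuming standard logging; the real ThinLinc logging setup may differ): passing the exception as exc_info makes the logger emit the full traceback instead of just str(exc):

```python
import logging

log = logging.getLogger("vsmserver")


def report_agent_error(agent, exc):
    # str(exc) alone gives only "WHOOPS"; exc_info attaches the traceback
    log.warning("Error checking if agent %s is alive: %s",
                agent, exc, exc_info=exc)
```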
Comment 73 Pierre Ossman cendio 2023-11-13 16:59:02 CET
> SHOULD:
>
> * The XML-RPC code is converted to asyncio models and abstractions (e.g. protocols)

One new asyncio feature is the ability to cancel background tasks. We've rigged up the new code so that dropped XML-RPC connections will try to cancel the running method handler.

However, it's too much work to add this support to all the different handlers at this time. So we filter out the cancel signal and always let the handlers run to completion. If the connection has been dropped, then we simply throw away the result.

This is identical to the previous asyncore behaviour, so we are not worse off than before.

Added bug 8255 for improving this situation.
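The pattern described above can be sketched with asyncio.shield (a hypothetical helper, not the actual ThinLinc code): the handler is shielded from the cancellation, runs to completion, and its result is discarded if the caller was cancelled.

```python
import asyncio


async def run_handler_uncancellable(handler_coro):
    """Run the handler to completion even if the caller is cancelled.

    If the XML-RPC connection drops, the surrounding task is cancelled,
    but the shielded handler keeps running; the result is thrown away.
    """
    inner = asyncio.ensure_future(handler_coro)
    try:
        return await asyncio.shield(inner)
    except asyncio.CancelledError:
        # Connection dropped: let the handler finish, discard the result
        await asyncio.wait([inner])
        raise
```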
Comment 74 Pierre Ossman cendio 2023-11-14 09:29:52 CET
(In reply to Samuel Mannehed from comment #71)
> 
> We probably want more details in the logs here. We haven't tested a similar
> scenario in the 4.15.0 code, but the handling that prints these lines is new
> as of commit r40504 for this bug.
> 
> A brief review of that commit indicates that the old code printed more
> details such as tracebacks for these types of exceptions.

I'm not sure if it's worth the complexity.

The old code caught exceptions from both the core XML-RPC code and the individual method handlers in the same place. The new code separates the two, so we don't have to treat them the same. This means we don't need an extra catch-all exception handler in the XML-RPC code. The downside is that it gets messy if we want to treat different exceptions from the core XML-RPC code in different ways.

We have these cases:

                      Old                 New
===========================================================
XML-RPC I/O Error     Logged, one line    Logged, one line
XML-RPC XML error     Logged, one line    Logged, one line
XML-RPC other error   Logged, traceback   Logged, one line  <=== New bad
Handler I/O Error     Logged, one line    Logged, traceback <=== Old bad
Handler other error   Logged, traceback   Logged, traceback
===========================================================

If we want to get back more of the old behaviour, we would need to check which exception we got back from each XML-RPC call. That would either mean duplicating code at every call site, or adding another layer of abstraction so we can centralise things.

I'm uncertain if that is worth it, given that the core XML-RPC code is fairly small and should see very few changes. We could treat it more like something from the standard library.
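The "another layer of abstraction" option could be a single wrapper that classifies exceptions from each XML-RPC call in one place. A sketch, with illustrative names and exception types:

```python
import asyncio
import logging
import xml.parsers.expat

log = logging.getLogger("vsmserver")


async def call_method_logged(call_coro, what):
    """Classify exceptions from an XML-RPC call in one central place."""
    try:
        return await call_coro
    except (OSError, asyncio.TimeoutError) as e:
        log.warning("%s: %s", what, e)                  # I/O error: one line
    except xml.parsers.expat.ExpatError as e:
        log.warning("%s: malformed XML: %s", what, e)   # XML error: one line
    except Exception as e:
        log.warning("%s: %s", what, e, exc_info=e)      # other: traceback
    return None
```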
Comment 75 Pierre Ossman cendio 2023-11-14 11:00:27 CET
(In reply to Adam Halim from comment #48)
> Tested raising an exception in xmlrpc callback, noticed a change compared to
> 4.15.0. Previously, an exception from an xmlrpc callback would start a
> session when a client connected. Currently, in build 3398, a session is not
> created when an xmlrpc callback crashes. This was tested by raising an
> exception in call_unbindports.py::handle_returnvalue() in 4.15.0, and in
> loginhandler_common.py::_unbind_ports_done() in build 3398.
> 

The two tests were not equivalent, hence the differing results. If I raise an exception in an equivalent place in the new code, I get:

> 2023-11-14 10:58:49 WARNING vsmserver: Error unbinding ports for session lab-210.lkpg.cendio.se:11: Exception!

And it proceeds to connect.
Comment 76 Pierre Ossman cendio 2023-11-14 14:12:21 CET
Everything XML-RPC should be converted now.

> MUST:
> 
>  * The XML-RPC code no longer relies on the system asyncore or asynchat modules
>

Yup. There are no traces left of either asyncore or asynchat.

> SHOULD:
> 
>  * The XML-RPC code uses asyncio terminology rather than asyncore such
>

I think so. We refer to protocols, transports, futures and callbacks now.

>  * The XML-RPC code is no ThinLinc code that is redundant with asyncio functionality

Yes, except for having to "fix" asyncio.TimeoutError because the standard one lacks a description.

>  * The XML-RPC code is converted to asyncio models and abstractions (e.g. protocols)

Mostly. We have two possible remaining changes:

 * Method handlers don't respect the cancel signal (bug 8255)

 * Method handlers should probably have a function API rather than an object one, given that interaction is a simple call and wait for the result

These were deemed too large to do now, though, and are a matter of code clean-up rather than any tangible functionality.
Comment 77 Adam Halim cendio 2023-11-16 14:43:02 CET
I tested adding a fake agent to vsmserver.hconf, which would leave some sockets open after a timeout was triggered.

In my case, getting agent load would time out and produce the following in vsmserver.log:
> /usr/lib64/python3.11/asyncio/base_events.py:1907: ResourceWarning: unclosed <socket.socket fd=7, family=2, type=1, proto=6, laddr=('10.48.2.199', 1022)>
>   handle = self._ready.popleft()
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
> 2023-11-16 11:18:59 WARNING vsmserver.loadinfo: Error getting load from agent 192.0.0.0: Operation timed out
> 2023-11-16 11:18:59 WARNING vsmserver.loadinfo: Marking agent 192.0.0.0 as down
The ResourceWarning is only printed once, but we get the timeout error periodically.

Killing/restarting the server would sometimes dump the following in the log:
> 2023-11-16 11:19:41 INFO vsmserver: Terminating. Have a nice day!
> 2023-11-16 11:19:41 ERROR vsmserver: Task was destroyed but it is pending!
> 2023-11-16 11:19:41 ERROR vsmserver: task: <Task pending name='Task-19' coro=<wait_for() done, defined at /usr/lib64/python3.11/asyncio/tasks.py:426> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/lib64/python3.11/asyncio/base_events.py:427> cb=[iI1iI.<locals>.o0OOOoo00ooOo() at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:100] created at /usr/lib64/python3.11/asyncio/tasks.py:670>
> 2023-11-16 11:19:41 ERROR vsmserver: ----------------------------------------
> 2023-11-16 11:19:41 ERROR vsmserver: Object created at (most recent call last):
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 112, in <module>
> 2023-11-16 11:19:41 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 101, in o0OoOOo00OOO
> 2023-11-16 11:19:41 ERROR vsmserver:     I11I11I . run_forever ( )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 607, in run_forever
> 2023-11-16 11:19:41 ERROR vsmserver:     self._run_once()
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 1914, in _run_once
> 2023-11-16 11:19:41 ERROR vsmserver:     handle._run()
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/events.py", line 80, in _run
> 2023-11-16 11:19:41 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/loadbalancer.py", line 181, in _periodic_update
> 2023-11-16 11:19:41 ERROR vsmserver:     Oo0O00OOoo = vsmxmlrpccall . call_method ( ii1 , "get_load" )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 178, in call_method
> 2023-11-16 11:19:41 ERROR vsmserver:     I1I1iIi = asyncio . ensure_future ( asyncio . wait_for ( II , timeout ) )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/tasks.py", line 649, in ensure_future
> 2023-11-16 11:19:41 ERROR vsmserver:     return _ensure_future(coro_or_future, loop=loop)
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/tasks.py", line 670, in _ensure_future
> 2023-11-16 11:19:41 ERROR vsmserver:     return loop.create_task(coro_or_future)
> sys:1: ResourceWarning: unclosed <socket.socket fd=7, family=2, type=1, proto=6, laddr=('10.48.2.199', 1023)>
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
> 2023-11-16 11:19:41 ERROR vsmserver: Task was destroyed but it is pending!
> 2023-11-16 11:19:41 ERROR vsmserver: task: <Task pending name='Task-22' coro=<BaseSelectorEventLoop.sock_connect() done, defined at /usr/lib64/python3.11/asyncio/selector_events.py:614> wait_for=<Future pending cb=[BaseSelectorEventLoop._sock_write_done(7, handle=<Handle BaseS...events.py:300>)(), Task.task_wakeup()] created at /usr/lib64/python3.11/asyncio/base_events.py:427> cb=[iI.<locals>.O0ooooOoo0O() at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:63] created at /usr/lib64/python3.11/asyncio/tasks.py:670>
> 2023-11-16 11:19:41 ERROR vsmserver: ----------------------------------------
> 2023-11-16 11:19:41 ERROR vsmserver: Object created at (most recent call last):
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 112, in <module>
> 2023-11-16 11:19:41 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 101, in o0OoOOo00OOO
> 2023-11-16 11:19:41 ERROR vsmserver:     I11I11I . run_forever ( )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 607, in run_forever
> 2023-11-16 11:19:41 ERROR vsmserver:     self._run_once()
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 1914, in _run_once
> 2023-11-16 11:19:41 ERROR vsmserver:     handle._run()
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/events.py", line 80, in _run
> 2023-11-16 11:19:41 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 58, in i1Ii1i
> 2023-11-16 11:19:41 ERROR vsmserver:     oOooo0OOO = asyncio . ensure_future (
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/tasks.py", line 649, in ensure_future
> 2023-11-16 11:19:41 ERROR vsmserver:     return _ensure_future(coro_or_future, loop=loop)
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/tasks.py", line 670, in _ensure_future
> 2023-11-16 11:19:41 ERROR vsmserver:     return loop.create_task(coro_or_future)

Adding a fake agent to 4.15.0 does not produce the ResourceWarning, and it seems to have been introduced in r40502.
Comment 80 Linn cendio 2023-11-22 09:43:39 CET
Through testing of error handling in bug 7636 I found this traceback (see bug 7636 comment 36 for original comment).

The traceback in that comment was done on RHEL 9 with server build 3412. The traceback below is from server build 3421, where asyncio has been removed. However, both tracebacks point to the same issue.
> 2023-11-21 13:33:14 ERROR vsmserver: Exception in callback call_method.<locals>.OO(<Future finis...xmlrpc.py:463>) at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:143
> 2023-11-21 13:33:14 ERROR vsmserver: handle: <Handle call_method.<locals>.OO(<Future finis...xmlrpc.py:463>) at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:143 created at /opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py:451>
> 2023-11-21 13:33:14 ERROR vsmserver: ----------------------------------------
> 2023-11-21 13:33:14 ERROR vsmserver: Object created at (most recent call last):
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 112, in <module>
> 2023-11-21 13:33:14 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 101, in o0OoOOo00OOO
> 2023-11-21 13:33:14 ERROR vsmserver:     I11I11I . run_forever ( )
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
> 2023-11-21 13:33:14 ERROR vsmserver:     self._run_once()
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 1897, in _run_once
> 2023-11-21 13:33:14 ERROR vsmserver:     handle._run()
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-11-21 13:33:14 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 451, in i1iiiIii
> 2023-11-21 13:33:14 ERROR vsmserver:     I1II1IIIi . set_exception ( oo0 )
> 2023-11-21 13:33:14 ERROR vsmserver: ----------------------------------------
> 2023-11-21 13:33:14 ERROR vsmserver: Traceback (most recent call last):
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 147, in OO
> 2023-11-21 13:33:14 ERROR vsmserver:     II . set_result ( call . result ( ) )
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 449, in i1iiiIii
> 2023-11-21 13:33:14 ERROR vsmserver:     I1II1IIIi . set_result ( call . result ( ) )
> 2023-11-21 13:33:14 ERROR vsmserver: thinlinc.vsm.xmlrpc.ConnectionClosed: Connection closed unexpectedly
> 2023-11-21 13:33:14 ERROR vsmserver: 
> 2023-11-21 13:33:14 ERROR vsmserver: During handling of the above exception, another exception occurred:
> 2023-11-21 13:33:14 ERROR vsmserver: 
> 2023-11-21 13:33:14 ERROR vsmserver: Traceback (most recent call last):
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-11-21 13:33:14 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 149, in OO
> 2023-11-21 13:33:14 ERROR vsmserver:     II . set_exception ( OO0000 )
> 2023-11-21 13:33:14 ERROR vsmserver: asyncio.exceptions.InvalidStateError: invalid state
> 2023-11-21 13:33:14 ERROR vsmserver: ----------------------------------------
> 2023-11-21 13:33:14 WARNING vsmserver.loadinfo: Error getting load from agent 127.0.0.1: Operation timed out
> 2023-11-21 13:33:14 WARNING vsmserver.loadinfo: Marking agent 127.0.0.1 as down
To trigger this error, I simulated incoming data being too long by setting the allowed max_datalength to -1.
> class ServerProtocol():
>     ....
>     def data_received(self, data):
>         ...
>         self.max_datalength = -1
>         if self._datalength > self.max_datalength:
>             return
It is correct that we get a timeout for this scenario, but we should not get the traceback we see. Reopening.
Comment 81 Pierre Ossman cendio 2023-11-22 09:44:39 CET
Under the same scenario, we also get this if stopping the service:

> 2023-11-22 09:43:08 INFO vsmserver: Got SIGTERM, signaling process to quit
> 2023-11-22 09:43:08 INFO vsmserver: Terminating. Have a nice day!
> 2023-11-22 09:43:08 ERROR vsmserver: Task was destroyed but it is pending!
> 2023-11-22 09:43:08 ERROR vsmserver: task: <Task pending name='Task-8' coro=<wait_for() done, defined at /usr/lib64/python3.9/asyncio/tasks.py:421> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f741022cee0>()] created at /usr/lib64/python3.9/asyncio/base_events.py:429> cb=[iI1iI.<locals>.o0OOOoo00ooOo() at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:101] created at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:183>
> 2023-11-22 09:43:08 ERROR vsmserver: ----------------------------------------
> 2023-11-22 09:43:08 ERROR vsmserver: Object created at (most recent call last):
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 115, in <module>
> 2023-11-22 09:43:08 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 104, in o0OoOOo00OOO
> 2023-11-22 09:43:08 ERROR vsmserver:     iIiI11I . run_forever ( )
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
> 2023-11-22 09:43:08 ERROR vsmserver:     self._run_once()
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 1897, in _run_once
> 2023-11-22 09:43:08 ERROR vsmserver:     handle._run()
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-11-22 09:43:08 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/loadbalancer.py", line 181, in _periodic_update
> 2023-11-22 09:43:08 ERROR vsmserver:     Oo0O00OOoo = vsmxmlrpccall . call_method ( ii1 , "get_load" )
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 183, in call_method
> 2023-11-22 09:43:08 ERROR vsmserver:     I1I1iIi = asyncio . ensure_future ( asyncio . wait_for ( II , timeout ) )
> sys:1: ResourceWarning: unclosed <socket.socket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.48.2.210', 1022), raddr=('10.48.2.210', 904)>
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
> /usr/lib64/python3.9/asyncio/selector_events.py:702: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=7>
>   _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
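The "Task was destroyed but it is pending!" errors indicate tasks that are still pending when the loop goes away. A common mitigation (a sketch, assuming we control loop shutdown) is to cancel and drain all outstanding tasks before the loop closes:

```python
import asyncio


async def shutdown():
    """Cancel outstanding tasks and wait for them before the loop closes."""
    tasks = [t for t in asyncio.all_tasks()
             if t is not asyncio.current_task()]
    for task in tasks:
        task.cancel()
    # return_exceptions=True so the CancelledErrors don't propagate here
    await asyncio.gather(*tasks, return_exceptions=True)
```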
Comment 82 Linn cendio 2023-11-22 16:10:19 CET
(In reply to Linn from comment #80)
> The traceback below is from server build 3421, where asyncio has been removed.
Typo, it is the file asyncbase that has been removed, not asyncio.
Comment 83 Pierre Ossman cendio 2023-11-30 09:15:43 CET
The InvalidStateError traceback is a race between two timeouts where we can end up trying to report both of them. And Futures have protection against trying to set a value twice, hence the crash.

In short, both the client and server end have the same timeout of 40 seconds. The client starts its timeout before the server, but it can be as little as just a couple of milliseconds earlier.

One timeout will invariably be detected first, since the servers are single-threaded. However, after either has been detected, there is a long chain of tasks and manually linked Futures before either timeout can reach the caller. Each step in this chain generally requires another iteration of the event loop. 

Unfortunately, these steps can take long enough that the few milliseconds difference between the timeouts is not enough to keep things from getting intertwined, something the current code is not prepared to handle.

This issue does not¹ appear when using the async/await syntax. That method still has some chain steps, but it does not need to chain Futures manually. That results in the final reporting being done synchronously, in a single event loop iteration. Hence, no risk of interleaving and getting the problem we encounter here.

¹ Theoretically. Haven't actually put together a test case.
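Until a larger restructuring, a defensive guard is to check Future.done() before reporting either timeout, since a Future refuses a second result. A minimal sketch (hypothetical helpers, not the actual ThinLinc code):

```python
import asyncio


def report_result(future, result):
    """Set a result only if a competing timeout hasn't beaten us to it."""
    if not future.done():
        future.set_result(result)


def report_error(future, exc):
    """Set an exception only once; later reports are silently dropped."""
    if not future.done():
        future.set_exception(exc)
```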
Comment 84 Linn cendio 2023-11-30 16:29:58 CET
Started looking into why the ResourceWarning in comment 81 happens.

Commit r40610 seems to have fixed part of this issue - there is no longer a ResourceWarning shown when triggered by the scenario described in comment 80. However, when adding a fake agent to /vsmserver/subclusters/Default/agents (as described in comment 77), the following ResourceWarning is triggered both before and after the fix. Note that tracemalloc is enabled here.

> 2023-11-30 12:30:20 ERROR vsmserver: Task was destroyed but it is pending!
> 2023-11-30 12:30:20 ERROR vsmserver: task: <Task pending name='Task-12' coro=<wait_for() done, defined at /usr/lib64/python3.9/asyncio/tasks.py:421> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f768ff9e730>()] created at /usr/lib64/python3.9/asyncio/base_events.py:429> cb
> =[iI1iI.<locals>.o0OOOoo00ooOo() at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:101] created at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:179>
> 2023-11-30 12:30:20 ERROR vsmserver: ----------------------------------------
> 2023-11-30 12:30:20 ERROR vsmserver: Object created at (most recent call last):
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 116, in <module>
> 2023-11-30 12:30:20 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 105, in o0OoOOo00OOO
> 2023-11-30 12:30:20 ERROR vsmserver:     iIiI11I . run_forever ( )
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
> 2023-11-30 12:30:20 ERROR vsmserver:     self._run_once()
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 1897, in _run_once
> 2023-11-30 12:30:20 ERROR vsmserver:     handle._run()
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-11-30 12:30:20 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/loadbalancer.py", line 181, in _periodic_update
> 2023-11-30 12:30:20 ERROR vsmserver:     Oo0O00OOoo = vsmxmlrpccall . call_method ( ii1 , "get_load" )
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 179, in call_method
> 2023-11-30 12:30:20 ERROR vsmserver:     I1I1iIi = asyncio . ensure_future ( asyncio . wait_for ( II , timeout ) )
> sys:1: ResourceWarning: unclosed <socket.socket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 1023), raddr=('127.0.0.1', 904)>
> Object allocated at (most recent call last):
>   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", lineno 36
>     Oo000ooO0Oooo = socket . socket ( family = ii1iIi1i11i , type = I111IIi11IiI , proto = o0OOOooO00oo )
> /usr/lib64/python3.9/asyncio/selector_events.py:702: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=6>
>   _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
> Object allocated at (most recent call last):
>   File "/usr/lib64/python3.9/asyncio/selector_events.py", lineno 61
>     return _SelectorSocketTransport(self, sock, protocol, waiter,
