Bug 8224 - vsm xml-rpc handling relies on deprecated asyncore
Summary: vsm xml-rpc handling relies on deprecated asyncore
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Other
Version: trunk
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.16.0
Assignee: Pierre Ossman
URL:
Keywords: adaha_tester, relnotes, samuel_tester
Depends on:
Blocks: 7636
Reported: 2023-09-11 13:25 CEST by Pierre Ossman
Modified: 2024-02-05 16:12 CET
CC: 4 users

See Also:
Acceptance Criteria:
MUST:
 * The XML-RPC code no longer relies on the system asyncore or asynchat modules
SHOULD:
 * The XML-RPC code uses asyncio terminology rather than asyncore
 * The XML-RPC code has no ThinLinc code that is redundant with asyncio functionality
 * The XML-RPC code is converted to asyncio models and abstractions (e.g. protocols)


Attachments

Description Pierre Ossman cendio 2023-09-11 13:25:30 CEST
Broken out from bug 7636. See that bug for details.

The VSM code has both server and client handling for XML-RPC. Both need to be converted to use asyncio instead of asyncore.
Comment 12 Pierre Ossman cendio 2023-09-19 11:37:12 CEST
This conversion is a bit problematic, as asyncore has a strongly object-oriented design, whilst asyncio uses a mix of functions and objects.

To make this transition easier, we'll probably need to more clearly separate and reorganise some responsibilities in the code. E.g. we can't rely on subclassing and overriding methods as much, so other means of passing information need to be used.

The plan is to make the core xml-rpc code more ThinLinc agnostic and treat it more like an external lib. That should hopefully make things easier to follow.
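
As a rough illustration of what that separation could look like (a sketch with made-up names, not the actual ThinLinc code): the protocol class gets its collaborators injected instead of them being provided by overriding methods in a ThinLinc-specific subclass.

> import asyncio
>
>
> class XmlRpcServerProtocol(asyncio.Protocol):
>     """Core XML-RPC connection handling, kept ThinLinc agnostic."""
>
>     def __init__(self, dispatch):
>         # The method dispatcher is passed in explicitly rather than
>         # being provided by overriding methods in a subclass.
>         self._dispatch = dispatch
>         self._transport = None
>
>     def connection_made(self, transport):
>         self._transport = transport
>
>     def data_received(self, data):
>         # A real implementation would parse the request here and then
>         # hand the decoded method call to the injected dispatcher.
>         pass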
Comment 34 Pierre Ossman cendio 2023-10-03 10:59:52 CEST
The VSMServer/VSMAgent classes are currently used as a "hub" for everything else. I.e. that's how all other code finds configuration, logging, session database, etc.

This doesn't fit well into asyncio, as you don't have your own subclass for the TCP servers. So we need to reorganise things some other way.

I'm leaning towards moving most things to simple functions and variables around main() instead. I think this fits well with the XML-RPC protocol, as that is function-based, not object-based.
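
A minimal sketch of that shape (illustrative names only, not the actual code): configuration and the session database become plain module-level objects, and main() wires up the asyncio server without any VSMServer "hub" subclass.

> import asyncio
>
> # Hypothetical module-level state that used to be reached via the hub
> config = {"agent_port": 904}
> sessions = {}
>
>
> async def handle_connection(reader, writer):
>     # A handler finds configuration and the session database through
>     # plain names (or explicit arguments), not via a server subclass.
>     data = await reader.read(65536)
>     writer.close()
>
>
> async def main():
>     server = await asyncio.start_server(handle_connection,
>                                         "127.0.0.1", 9000)
>     async with server:
>         await server.serve_forever()
>
>
> if __name__ == "__main__":
>     asyncio.run(main())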
Comment 48 Adam Halim cendio 2023-10-17 11:33:08 CEST
Tested raising an exception in an xmlrpc callback and noticed a change compared to 4.15.0. Previously, an exception from an xmlrpc callback would still start a session when a client connected. Currently, in build 3398, a session is not created when an xmlrpc callback crashes. This was tested by raising an exception in call_unbindports.py::handle_returnvalue() in 4.15.0, and in loginhandler_common.py::_unbind_ports_done() in build 3398.

Also, in 4.15.0, the error traceback message prints ". Marking agent as down.", but the agent is not down. This is probably a bug in 4.15.0.

4.15.0 log:

> 2023-10-16 13:57:06 ERROR vsmserver.session: Unhandled exception trying to unbind ports for display 12 on VSM Agent 127.0.0.1:904: <class 'Exception'> Exception! Traceback (most recent call last):
>             File "/opt/thinlinc/modules/thinlinc/vsm/asyncbase.py", line 104, in ooOOOoOO0
>               obj . handle_read_event ( )
>             File "/usr/lib64/python3.11/asyncore.py", line 425, in handle_read_event
>               self.handle_read()
>             File "/usr/lib64/python3.11/asynchat.py", line 158, in handle_read
>               self.found_terminator()
>             File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 422, in found_terminator
>               self . handle_response ( )
>             File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 454, in handle_response
>               self . handle_returnvalue ( )
>             File "/opt/thinlinc/modules/thinlinc/vsm/call_unbindports.py", line 41, in handle_returnvalue
>               raise Exception("Exception!")
>           Exception: Exception!
>           . Marking agent as down.

Build 3398 log:

> 2023-10-16 13:55:31 ERROR vsmserver: Unhandled event loop exception: Exception in callback LoginHandlerCommon._unbind_ports_done(<Future finis...pccall.py:116>)
> 2023-10-16 13:55:31 ERROR vsmserver: handle: <Handle LoginHandlerCommon._unbind_ports_done(<Future finis...pccall.py:116>) created at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:102>
> 2023-10-16 13:55:31 ERROR vsmserver: ----------------------------------------
> 2023-10-16 13:55:31 ERROR vsmserver: Object created at (most recent call last):
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 74, in <module>
> 2023-10-16 13:55:31 ERROR vsmserver:     iIi1ii1I1iI11 ( )
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 63, in iIi1ii1I1iI11
> 2023-10-16 13:55:31 ERROR vsmserver:     OoO . loop ( )
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/asyncbase.py", line 224, in loop
> 2023-10-16 13:55:31 ERROR vsmserver:     self . run_forever ( )
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 607, in run_forever
> 2023-10-16 13:55:31 ERROR vsmserver:     self._run_once()
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 1914, in _run_once
> 2023-10-16 13:55:31 ERROR vsmserver:     handle._run()
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/events.py", line 80, in _run
> 2023-10-16 13:55:31 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 102, in o0OOOoo00ooOo
> 2023-10-16 13:55:31 ERROR vsmserver:     I1 . set_result ( future . result ( ) )
> 2023-10-16 13:55:31 ERROR vsmserver: ----------------------------------------
> 2023-10-16 13:55:31 ERROR vsmserver: Traceback (most recent call last):
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/events.py", line 80, in _run
> 2023-10-16 13:55:31 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-10-16 13:55:31 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/loginhandler_common.py", line 57, in _unbind_ports_done
> 2023-10-16 13:55:31 ERROR vsmserver:     raise Exception("Exception!")
> 2023-10-16 13:55:31 ERROR vsmserver: Exception: Exception!
> 2023-10-16 13:55:31 ERROR vsmserver: ----------------------------------------
Comment 60 Pierre Ossman cendio 2023-10-26 16:15:50 CEST
asyncio's UNIX socket server apparently doesn't clean up the socket after itself. I've filed an issue here:

https://github.com/python/cpython/issues/111246

I guess we'll have to do it manually for now.
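
The manual cleanup can be as simple as unlinking the socket path when the server shuts down; a minimal sketch, assuming the path is known to the caller (not the actual ThinLinc code):

> import asyncio
> import contextlib
> import os
>
>
> async def serve_user_socket(path, protocol_factory):
>     loop = asyncio.get_running_loop()
>     server = await loop.create_unix_server(protocol_factory, path)
>     try:
>         async with server:
>             await server.serve_forever()
>     finally:
>         # asyncio does not remove the socket file for us (see the
>         # cpython issue above), so unlink it manually on shutdown.
>         with contextlib.suppress(FileNotFoundError):
>             os.unlink(path)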
Comment 61 Pierre Ossman cendio 2023-10-27 10:15:50 CEST
The existing code for user sockets puts significant effort into detecting if a socket has been replaced. Unfortunately, I cannot find any reasoning as to why. The socket directory is a privileged directory specific to ThinLinc, so there shouldn't be anything else writing stuff there.

I'm opting for not porting all that complexity over and just letting the code assume that the socket will remain in place.
Comment 71 Samuel Mannehed cendio 2023-11-02 13:00:25 CET
While testing bug 5556, we found that exceptions from xmlrpc.ClientProtocol._handle_response() are insufficiently logged. They are simply converted to strings; no traceback is logged.

We applied the following diff:

> M vsm/modules/thinlinc/vsm/xmlrpc.py
> @@ -183,6 +183,7 @@ class ClientProtocol(asyncio.Protocol):
>              self._handle_response()
>  
>      def _handle_response(self):
> +        raise ValueError("WHOOPS")
>          """Decode XMLRPC response and call L{handle_response}, then
>          close the connection.

This only results in log lines similar to these:

> 2023-11-02 11:27:18 WARNING vsmserver: Error checking if agent 127.0.0.1 is alive: WHOOPS
> 2023-11-02 11:27:28 WARNING vsmserver.loadinfo: Error getting load from agent 127.0.0.1: WHOOPS

We probably want more details in the logs here. We haven't tested a similar scenario in the 4.15.0 code, but the handling that prints these lines is new as of commit r40504 for this bug.

A brief review of that commit indicates that the old code printed more details such as tracebacks for these types of exceptions.
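
One way to get more detail into the log here (a sketch only, not necessarily how it was eventually handled) is to pass exc_info when logging the caught exception, so the traceback is included rather than just str(exc):

> import logging
>
> log = logging.getLogger("vsmserver.loadinfo")
>
>
> async def update_agent_load(agent, call_method):
>     # call_method is a stand-in for the real XML-RPC client call
>     try:
>         return await call_method(agent, "get_load")
>     except Exception as exc:
>         # exc_info=True makes logging emit the full traceback instead
>         # of only str(exc), which is all "WHOOPS" gives us above.
>         log.warning("Error getting load from agent %s: %s", agent, exc,
>                     exc_info=True)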
Comment 73 Pierre Ossman cendio 2023-11-13 16:59:02 CET
> SHOULD:
>
> * The XML-RPC code is converted to asyncio models and abstractions (e.g. protocols)

One new asyncio feature is the ability to cancel background processes. We've rigged up the new code so that dropped XML-RPC connections will try to cancel the running method handler.

However, it's too much work to add this support to all the different handlers at this time. So we filter out the cancel signal and always let the handlers run to completion. If the connection has been dropped, then we simply throw away the result.

This is identical to the previous asyncore behaviour, so we are not worse off than before.
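
For illustration, one way to "filter out" the cancel with asyncio (a sketch; not necessarily the exact mechanism used in the ThinLinc code) is to run the handler as its own task and shield the await on it:

> import asyncio
>
>
> async def run_handler(handler_coro):
>     # The handler runs as an independent task.  asyncio.shield() means
>     # that cancelling this await (e.g. because the connection dropped)
>     # does not cancel the handler itself; it runs to completion and
>     # its result is simply thrown away.
>     task = asyncio.ensure_future(handler_coro)
>     try:
>         return await asyncio.shield(task)
>     except asyncio.CancelledError:
>         # Connection gone; let the handler finish in the background.
>         return None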

Added bug 8255 for improving this situation.
Comment 74 Pierre Ossman cendio 2023-11-14 09:29:52 CET
(In reply to Samuel Mannehed from comment #71)
> 
> We probably want more details in the logs here. We haven't tested a similar
> scenario in the 4.15.0 code, but the handling that prints these lines is new
> as of commit r40504 for this bug.
> 
> A brief review of that commit indicates that the old code printed more
> details such as tracebacks for these types of exceptions.

I'm not sure if it's worth the complexity.

The old code caught exceptions from both the core XML-RPC code and the individual method handlers in the same place. The new code separates the two, so we don't have to treat them the same. This lets us avoid an extra catch-all exception handler in the XML-RPC code. The downside is that it gets messy if we want to treat different exceptions from the core XML-RPC code in different ways.

We have these cases:

                      Old                 New
===========================================================
XML-RPC I/O Error     Logged, one line    Logged, one line
XML-RPC XML error     Logged, one line    Logged, one line
XML-RPC other error   Logged, traceback   Logged, one line  <=== New bad
Handler I/O Error     Logged, one line    Logged, traceback <=== Old bad
Handler other error   Logged, traceback   Logged, traceback
===========================================================

If we want to get back more of the old behaviour, we would need to check which exception we got back from each XML-RPC call. That would either mean duplicating that check in every place we make calls, or adding another layer of abstraction so we can centralise things.

I'm uncertain if that is worth it, given that the core XML-RPC code is fairly small and should see very few changes. We could treat it more like something from the standard library.
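
If that extra layer were ever added, it could look roughly like this (purely illustrative; XMLError is a stand-in for whatever the core XML-RPC code raises): one central wrapper decides per exception type whether one log line or a full traceback is warranted.

> import logging
>
> log = logging.getLogger("vsmserver")
>
>
> class XMLError(Exception):
>     """Stand-in for the core XML-RPC code's marshalling/parse errors."""
>
>
> async def logged_call(call_method, agent, methodname, *params):
>     try:
>         return await call_method(agent, methodname, *params)
>     except (OSError, XMLError) as exc:
>         # Expected I/O and XML problems: one log line is enough.
>         log.warning("Error calling %s on %s: %s", methodname, agent, exc)
>     except Exception as exc:
>         # Anything unexpected from the core XML-RPC code: keep the
>         # full traceback.
>         log.warning("Error calling %s on %s: %s", methodname, agent, exc,
>                     exc_info=True)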
Comment 75 Pierre Ossman cendio 2023-11-14 11:00:27 CET
(In reply to Adam Halim from comment #48)
> Tested raising an exception in xmlrpc callback, noticed a change compared to
> 4.15.0. Previously, an exception from an xmlrpc callback would start a
> session when a client connected. Currently, in build 3398, a session is not
> created when an xmlrpc callback crashes. This was tested by raising an
> exception in call_unbindports.py::handle_returnvalue() in 4.15.0, and in
> loginhandler_common.py::_unbind_ports_done() in build 3398.
> 

The two tests were not equivalent, hence the differing results. If I raise an exception in an equivalent place in the new code, I get:

> 2023-11-14 10:58:49 WARNING vsmserver: Error unbinding ports for session lab-210.lkpg.cendio.se:11: Exception!

And it proceeds to connect.
Comment 76 Pierre Ossman cendio 2023-11-14 14:12:21 CET
Everything XML-RPC should be converted now.

> MUST:
> 
>  * The XML-RPC code no longer relies on the system asyncore or asynchat modules
>

Yup. There are no traces left of either asyncore or asynchat.

> SHOULD:
> 
>  * The XML-RPC code uses asyncio terminology rather than asyncore
>

I think so. We refer to protocols, transports, futures and callbacks now.

>  * The XML-RPC code has no ThinLinc code that is redundant with asyncio functionality

Yes, except for having to "fix" asyncio.TimeoutError because the standard one lacks a description (see the sketch at the end of this comment).

>  * The XML-RPC code is converted to asyncio models and abstractions (e.g. protocols)

Mostly. We have two possible remaining changes:

 * Method handlers don't respect the cancel signal (bug 8255)

 * Method handlers should probably have a function API rather than an object one, given that interaction is a simple call and wait for the result

These were deemed too large to do now, though, and are a matter of code clean-up rather than any tangible functionality.
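
Regarding the asyncio.TimeoutError "fix" mentioned above, a minimal sketch of what is meant, assuming it is only about attaching a human-readable description (the name is hypothetical):

> import asyncio
>
>
> class DescriptiveTimeoutError(asyncio.TimeoutError):
>     """asyncio raises TimeoutError with no arguments, so str(exc) is
>     empty; this variant always carries a description for the logs."""
>
>     def __init__(self, message="Operation timed out"):
>         super().__init__(message)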
Comment 77 Adam Halim cendio 2023-11-16 14:43:02 CET
I tested adding a fake agent to vsmserver.hconf, which would leave some sockets open after a timeout was triggered.

In my case, getting agent load would time out and produce the following in vsmserver.log:
> /usr/lib64/python3.11/asyncio/base_events.py:1907: ResourceWarning: unclosed <socket.socket fd=7, family=2, type=1, proto=6, laddr=('10.48.2.199', 1022)>
>   handle = self._ready.popleft()
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
> 2023-11-16 11:18:59 WARNING vsmserver.loadinfo: Error getting load from agent 192.0.0.0: Operation timed out
> 2023-11-16 11:18:59 WARNING vsmserver.loadinfo: Marking agent 192.0.0.0 as down
The ResourceWarning is only printed once, but we get the timeout error periodically.

Killing/restarting the server would sometimes dump the following in the log:
> 2023-11-16 11:19:41 INFO vsmserver: Terminating. Have a nice day!
> 2023-11-16 11:19:41 ERROR vsmserver: Task was destroyed but it is pending!
> 2023-11-16 11:19:41 ERROR vsmserver: task: <Task pending name='Task-19' coro=<wait_for() done, defined at /usr/lib64/python3.11/asyncio/tasks.py:426> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/lib64/python3.11/asyncio/base_events.py:427> cb=[iI1iI.<locals>.o0OOOoo00ooOo() at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:100] created at /usr/lib64/python3.11/asyncio/tasks.py:670>
> 2023-11-16 11:19:41 ERROR vsmserver: ----------------------------------------
> 2023-11-16 11:19:41 ERROR vsmserver: Object created at (most recent call last):
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 112, in <module>
> 2023-11-16 11:19:41 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 101, in o0OoOOo00OOO
> 2023-11-16 11:19:41 ERROR vsmserver:     I11I11I . run_forever ( )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 607, in run_forever
> 2023-11-16 11:19:41 ERROR vsmserver:     self._run_once()
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 1914, in _run_once
> 2023-11-16 11:19:41 ERROR vsmserver:     handle._run()
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/events.py", line 80, in _run
> 2023-11-16 11:19:41 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/loadbalancer.py", line 181, in _periodic_update
> 2023-11-16 11:19:41 ERROR vsmserver:     Oo0O00OOoo = vsmxmlrpccall . call_method ( ii1 , "get_load" )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 178, in call_method
> 2023-11-16 11:19:41 ERROR vsmserver:     I1I1iIi = asyncio . ensure_future ( asyncio . wait_for ( II , timeout ) )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/tasks.py", line 649, in ensure_future
> 2023-11-16 11:19:41 ERROR vsmserver:     return _ensure_future(coro_or_future, loop=loop)
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/tasks.py", line 670, in _ensure_future
> 2023-11-16 11:19:41 ERROR vsmserver:     return loop.create_task(coro_or_future)
> sys:1: ResourceWarning: unclosed <socket.socket fd=7, family=2, type=1, proto=6, laddr=('10.48.2.199', 1023)>
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
> 2023-11-16 11:19:41 ERROR vsmserver: Task was destroyed but it is pending!
> 2023-11-16 11:19:41 ERROR vsmserver: task: <Task pending name='Task-22' coro=<BaseSelectorEventLoop.sock_connect() done, defined at /usr/lib64/python3.11/asyncio/selector_events.py:614> wait_for=<Future pending cb=[BaseSelectorEventLoop._sock_write_done(7, handle=<Handle BaseS...events.py:300>)(), Task.task_wakeup()] created at /usr/lib64/python3.11/asyncio/base_events.py:427> cb=[iI.<locals>.O0ooooOoo0O() at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:63] created at /usr/lib64/python3.11/asyncio/tasks.py:670>
> 2023-11-16 11:19:41 ERROR vsmserver: ----------------------------------------
> 2023-11-16 11:19:41 ERROR vsmserver: Object created at (most recent call last):
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 112, in <module>
> 2023-11-16 11:19:41 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 101, in o0OoOOo00OOO
> 2023-11-16 11:19:41 ERROR vsmserver:     I11I11I . run_forever ( )
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 607, in run_forever
> 2023-11-16 11:19:41 ERROR vsmserver:     self._run_once()
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/base_events.py", line 1914, in _run_once
> 2023-11-16 11:19:41 ERROR vsmserver:     handle._run()
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/events.py", line 80, in _run
> 2023-11-16 11:19:41 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 58, in i1Ii1i
> 2023-11-16 11:19:41 ERROR vsmserver:     oOooo0OOO = asyncio . ensure_future (
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/tasks.py", line 649, in ensure_future
> 2023-11-16 11:19:41 ERROR vsmserver:     return _ensure_future(coro_or_future, loop=loop)
> 2023-11-16 11:19:41 ERROR vsmserver:   File "/usr/lib64/python3.11/asyncio/tasks.py", line 670, in _ensure_future
> 2023-11-16 11:19:41 ERROR vsmserver:     return loop.create_task(coro_or_future)

Adding a fake agent to 4.15.0 does not produce the ResourceWarning, and it seems to have been introduced in r40502.
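
The tracebacks above suggest the client pre-creates its socket (the low local port in laddr points at binding a privileged source port) before handing it to asyncio. Under that assumption, a sketch of the kind of cleanup that would avoid the unclosed-socket warning when a connect attempt times out (not the actual fix):

> import asyncio
> import socket
>
>
> async def connect_from_low_port(loop, addr, protocol_factory, timeout):
>     # (binding to a privileged source port is elided here)
>     sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>     sock.setblocking(False)
>     try:
>         await asyncio.wait_for(loop.sock_connect(sock, addr), timeout)
>         return await loop.create_connection(protocol_factory, sock=sock)
>     except (asyncio.TimeoutError, OSError):
>         # Without this, the socket object is leaked and Python emits
>         # "ResourceWarning: unclosed <socket.socket ...>" later.
>         sock.close()
>         raise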
Comment 80 Linn cendio 2023-11-22 09:43:39 CET
Through testing of error handling in bug 7636 I found this traceback (see bug 7636 comment 36 for original comment).

The traceback in that comment was done on RHEL 9 with server build 3412. The traceback below is from server build 3421, where asyncio has been removed. However, both tracebacks point to the same issue.
> 2023-11-21 13:33:14 ERROR vsmserver: Exception in callback call_method.<locals>.OO(<Future finis...xmlrpc.py:463>) at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:143
> 2023-11-21 13:33:14 ERROR vsmserver: handle: <Handle call_method.<locals>.OO(<Future finis...xmlrpc.py:463>) at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:143 created at /opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py:451>
> 2023-11-21 13:33:14 ERROR vsmserver: ----------------------------------------
> 2023-11-21 13:33:14 ERROR vsmserver: Object created at (most recent call last):
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 112, in <module>
> 2023-11-21 13:33:14 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 101, in o0OoOOo00OOO
> 2023-11-21 13:33:14 ERROR vsmserver:     I11I11I . run_forever ( )
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
> 2023-11-21 13:33:14 ERROR vsmserver:     self._run_once()
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 1897, in _run_once
> 2023-11-21 13:33:14 ERROR vsmserver:     handle._run()
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-11-21 13:33:14 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 451, in i1iiiIii
> 2023-11-21 13:33:14 ERROR vsmserver:     I1II1IIIi . set_exception ( oo0 )
> 2023-11-21 13:33:14 ERROR vsmserver: ----------------------------------------
> 2023-11-21 13:33:14 ERROR vsmserver: Traceback (most recent call last):
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 147, in OO
> 2023-11-21 13:33:14 ERROR vsmserver:     II . set_result ( call . result ( ) )
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 449, in i1iiiIii
> 2023-11-21 13:33:14 ERROR vsmserver:     I1II1IIIi . set_result ( call . result ( ) )
> 2023-11-21 13:33:14 ERROR vsmserver: thinlinc.vsm.xmlrpc.ConnectionClosed: Connection closed unexpectedly
> 2023-11-21 13:33:14 ERROR vsmserver: 
> 2023-11-21 13:33:14 ERROR vsmserver: During handling of the above exception, another exception occurred:
> 2023-11-21 13:33:14 ERROR vsmserver: 
> 2023-11-21 13:33:14 ERROR vsmserver: Traceback (most recent call last):
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-11-21 13:33:14 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-21 13:33:14 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 149, in OO
> 2023-11-21 13:33:14 ERROR vsmserver:     II . set_exception ( OO0000 )
> 2023-11-21 13:33:14 ERROR vsmserver: asyncio.exceptions.InvalidStateError: invalid state
> 2023-11-21 13:33:14 ERROR vsmserver: ----------------------------------------
> 2023-11-21 13:33:14 WARNING vsmserver.loadinfo: Error getting load from agent 127.0.0.1: Operation timed out
> 2023-11-21 13:33:14 WARNING vsmserver.loadinfo: Marking agent 127.0.0.1 as down
To trigger this error, I would simulate that incoming data was too long by setting the allowed maxdata to -1.
> class ServerProtocol():
>     ....
>     def data_received(self, data):
>         ...
>         self.max_datalength = -1
>         if self._datalength > self.max_datalength:
>             return
It is correct that we get a timeout for this scenario, but we should not get the traceback we see. Reopening.
Comment 81 Pierre Ossman cendio 2023-11-22 09:44:39 CET
Under the same scenario, we also get this if stopping the service:

> 2023-11-22 09:43:08 INFO vsmserver: Got SIGTERM, signaling process to quit
> 2023-11-22 09:43:08 INFO vsmserver: Terminating. Have a nice day!
> 2023-11-22 09:43:08 ERROR vsmserver: Task was destroyed but it is pending!
> 2023-11-22 09:43:08 ERROR vsmserver: task: <Task pending name='Task-8' coro=<wait_for() done, defined at /usr/lib64/python3.9/asyncio/tasks.py:421> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f741022cee0>()] created at /usr/lib64/python3.9/asyncio/base_events.py:429> cb=[iI1iI.<locals>.o0OOOoo00ooOo() at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:101] created at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:183>
> 2023-11-22 09:43:08 ERROR vsmserver: ----------------------------------------
> 2023-11-22 09:43:08 ERROR vsmserver: Object created at (most recent call last):
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 115, in <module>
> 2023-11-22 09:43:08 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 104, in o0OoOOo00OOO
> 2023-11-22 09:43:08 ERROR vsmserver:     iIiI11I . run_forever ( )
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
> 2023-11-22 09:43:08 ERROR vsmserver:     self._run_once()
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 1897, in _run_once
> 2023-11-22 09:43:08 ERROR vsmserver:     handle._run()
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-11-22 09:43:08 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/loadbalancer.py", line 181, in _periodic_update
> 2023-11-22 09:43:08 ERROR vsmserver:     Oo0O00OOoo = vsmxmlrpccall . call_method ( ii1 , "get_load" )
> 2023-11-22 09:43:08 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 183, in call_method
> 2023-11-22 09:43:08 ERROR vsmserver:     I1I1iIi = asyncio . ensure_future ( asyncio . wait_for ( II , timeout ) )
> sys:1: ResourceWarning: unclosed <socket.socket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.48.2.210', 1022), raddr=('10.48.2.210', 904)>
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
> /usr/lib64/python3.9/asyncio/selector_events.py:702: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=7>
>   _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
Comment 82 Linn cendio 2023-11-22 16:10:19 CET
(In reply to Linn from comment #80)
> The traceback below is from server build 3421, where asyncio has been removed.
Typo, it is the file asyncbase that has been removed, not asyncio.
Comment 83 Pierre Ossman cendio 2023-11-30 09:15:43 CET
The InvalidStateError traceback is a race between two timeouts where we can end up trying to report both of them. And Futures have protection against trying to set a value twice, hence the crash.

In short, both the client and server end have the same timeout of 40 seconds. The client starts its timeout before the server, but it can be as little as just a couple of milliseconds earlier.

One timeout will invariably be detected first, since the servers are single-threaded. However, after either has been detected, there is a long chain of tasks and manually linked Futures before either timeout can reach the caller. Each step in this chain generally requires another iteration of the event loop. 

Unfortunately, these steps can take long enough that the few milliseconds difference between the timeouts is not enough to keep things from getting intertwined. Something the current code is not prepared to handle.

This issue does not¹ appear when using the async/await syntax. That method still has some chain steps, but it does not need to chain Futures manually. That results in the final reporting being done synchronously, in a single event loop iteration. Hence, no risk of interleaving and getting the problem we encounter here.

¹ Theoretically. Haven't actually put together a test case.
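
For illustration, the async/await shape referred to here (a sketch, not the actual code): the timeout and the call result reach the caller through one awaited expression, so there is no separately chained Future for a second error to race against.

> import asyncio
> import logging
>
> log = logging.getLogger("vsmserver.loadinfo")
>
>
> async def get_agent_load(agent, call_method, timeout=40):
>     # call_method is a stand-in for the real XML-RPC client call
>     try:
>         return await asyncio.wait_for(call_method(agent, "get_load"),
>                                       timeout)
>     except asyncio.TimeoutError:
>         log.warning("Error getting load from agent %s: "
>                     "Operation timed out", agent)
>         return None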
Comment 84 Linn cendio 2023-11-30 16:29:58 CET Comment hidden (obsolete)
Comment 88 Linn cendio 2023-12-07 09:53:43 CET
Looked into the ResourceWarning. To clarify: to reproduce it I needed to set self.max_datalength = -1 (see comment 80), and the resource warning is triggered when restarting the vsmserver service. Depending on the Python version being used, the resource warnings look slightly different.

Warning for Python 3.9.16 (RHEL 9), with tracemalloc enabled
> 2023-11-30 12:30:20 ERROR vsmserver: Task was destroyed but it is pending!
> 2023-11-30 12:30:20 ERROR vsmserver: task: <Task pending name='Task-12' coro=<wait_for() done, defined at /usr/lib64/python3.9/asyncio/tasks.py:421> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f768ff9e730>()] created at /usr/lib64/python3.9/asyncio/base_events.py:429> cb
> =[iI1iI.<locals>.o0OOOoo00ooOo() at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:101] created at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:179>
> 2023-11-30 12:30:20 ERROR vsmserver: ----------------------------------------
> 2023-11-30 12:30:20 ERROR vsmserver: Object created at (most recent call last):
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 116, in <module>
> 2023-11-30 12:30:20 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 105, in o0OoOOo00OOO
> 2023-11-30 12:30:20 ERROR vsmserver:     iIiI11I . run_forever ( )
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
> 2023-11-30 12:30:20 ERROR vsmserver:     self._run_once()
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 1897, in _run_once
> 2023-11-30 12:30:20 ERROR vsmserver:     handle._run()
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-11-30 12:30:20 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/loadbalancer.py", line 181, in _periodic_update
> 2023-11-30 12:30:20 ERROR vsmserver:     Oo0O00OOoo = vsmxmlrpccall . call_method ( ii1 , "get_load" )
> 2023-11-30 12:30:20 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 179, in call_method
> 2023-11-30 12:30:20 ERROR vsmserver:     I1I1iIi = asyncio . ensure_future ( asyncio . wait_for ( II , timeout ) )
> sys:1: ResourceWarning: unclosed <socket.socket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 1023), raddr=('127.0.0.1', 904)>
> Object allocated at (most recent call last):
>   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", lineno 36
>     Oo000ooO0Oooo = socket . socket ( family = ii1iIi1i11i , type = I111IIi11IiI , proto = o0OOOooO00oo )
> /usr/lib64/python3.9/asyncio/selector_events.py:702: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=6>
>   _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
> Object allocated at (most recent call last):
>   File "/usr/lib64/python3.9/asyncio/selector_events.py", lineno 61
>     return _SelectorSocketTransport(self, sock, protocol, waiter,
Warning for Python 3.4.10 (SLES 12). The value of tracemalloc had no impact on the logging.
> /usr/lib64/python3.4/asyncio/selector_events.py:576: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=7> warnings.warn("unclosed transport %r" % self, ResourceWarning)
> 2023-12-05 11:27:07 ERROR vsmserver: Task was destroyed but it is pending!
> 2023-12-05 11:27:07 ERROR vsmserver: task: <Task pending coro=<wait_for() running at /usr/lib64/python3.4/asyncio/tasks.py:386> wait_for=<Future pending cb=[Task._wakeup()]> cb=[iI1iI.<locals>.o0OOOoo00ooOo() at /opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py:101]>
Comment 94 Linn cendio 2023-12-08 10:51:34 CET
In order to remove the resource warning, I looked at how asyncio.run() handles closing the loop [1]. This function is the current recommended way of running asynchronous code in Python.

We cancel tasks in the same way as run(), but we do not do any additional logging if the tasks could not be cancelled properly. We also added a timeout so we don't wait forever on tasks that might have hung or otherwise take a long time to shut down. The timeout was chosen to be small enough that it should hardly be noticeable when restarting vsmserver or vsmagent, while still giving tasks a little bit of time to finish.

Some things in asyncio.run() were purposefully left out of our implementation. We do not call loop.shutdown_asyncgens() [2] or loop.shutdown_default_executor() [3]. These functions came in Python 3.6 and 3.9, respectively, and the way we use asyncio we do not use the things they shut down.


1: https://github.com/python/cpython/blob/3.12/Lib/asyncio/runners.py#L64
2: https://python.readthedocs.io/en/stable/library/asyncio-eventloop.html#asyncio.AbstractEventLoop.shutdown_asyncgens
3: https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.shutdown_default_executor
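
A condensed sketch of that shutdown sequence, based on the description above (using the modern module-level asyncio helpers; older Python versions spell these slightly differently, and the exact timeout value here is just a placeholder):

> import asyncio
>
>
> async def shutdown_pending_tasks(timeout=2.0):
>     # Mirror asyncio.run()'s cleanup: cancel whatever is still pending,
>     # then give the tasks a short, bounded time to finish so a service
>     # restart is not held up by a task that hangs.
>     pending = [task for task in asyncio.all_tasks()
>                if task is not asyncio.current_task()]
>     for task in pending:
>         task.cancel()
>     if pending:
>         await asyncio.wait(pending, timeout=timeout)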
Comment 95 Linn cendio 2023-12-08 11:13:52 CET
Tested the resource warning on RHEL 9 (Python 3.9), and the warning no longer shows up in the logs when restarting vsmserver.

Also tested vsmserver by manually adding a future that would take a long time to finish to the list of remaining tasks, and simulated a failed cancel by commenting out the cancel call. The service still restarted quickly. I could also see that the timeout actually was triggered by setting it to 10s and noticing a 10s delay when restarting vsmserver.

With that, the resource warning part of this bug should be ready for testing.
Comment 96 Pierre Ossman cendio 2023-12-08 17:12:42 CET
The issue with racing Futures and "InvalidStateError" should now be resolved. Tested on RHEL 9 where I can no longer provoke the issue.

That should be all the known issues. Resolving.
Comment 97 Pierre Ossman cendio 2023-12-11 09:10:32 CET
There are still some complaints in the race case:

> 2023-12-10 00:00:29 ERROR vsmserver: Future exception was never retrieved
> 2023-12-10 00:00:29 ERROR vsmserver: future: <Future finished exception=ConnectionClosed('Connection closed unexpectedly') created at /opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py:469>
> 2023-12-10 00:00:29 ERROR vsmserver: ----------------------------------------
> 2023-12-10 00:00:29 ERROR vsmserver: Object created at (most recent call last):
> 2023-12-10 00:00:29 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 140, in <module>
> 2023-12-10 00:00:29 ERROR vsmserver:     o0OoOOo00OOO ( )
> 2023-12-10 00:00:29 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 107, in o0OoOOo00OOO
> 2023-12-10 00:00:29 ERROR vsmserver:     iIIiii1iI . run_forever ( )
> 2023-12-10 00:00:29 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
> 2023-12-10 00:00:29 ERROR vsmserver:     self._run_once()
> 2023-12-10 00:00:29 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 1897, in _run_once
> 2023-12-10 00:00:29 ERROR vsmserver:     handle._run()
> 2023-12-10 00:00:29 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-12-10 00:00:29 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-12-10 00:00:29 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmxmlrpccall.py", line 165, in i1I11ii
> 2023-12-10 00:00:29 ERROR vsmserver:     I11I1iI = xmlrpc . call_method ( url , methodname , * params , sock = Oo000ooO0Oooo )
> 2023-12-10 00:00:29 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 469, in call_method
> 2023-12-10 00:00:29 ERROR vsmserver:     I1I11i = asyncio . Future ( )
> 2023-12-10 00:00:29 ERROR vsmserver: ----------------------------------------
> 2023-12-10 00:00:29 ERROR vsmserver: Traceback (most recent call last):
> 2023-12-10 00:00:29 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 455, in iiiIII
> 2023-12-10 00:00:29 ERROR vsmserver:     I1I11i . set_result ( call . result ( ) )
> 2023-12-10 00:00:29 ERROR vsmserver: thinlinc.vsm.xmlrpc.ConnectionClosed: Connection closed unexpectedly
> 2023-12-10 00:00:29 ERROR vsmserver: ----------------------------------------
Comment 98 Linn cendio 2023-12-11 11:11:51 CET
There is a theory that calling cancel a bit more aggressively could get rid of the error seen in comment 97. I made a prototype to see what happened when calling cancel on a future with an exception already set.

On Python 3.6.8 and 3.9.16 (RHEL 8 and RHEL 9), the warning "Future exception was never retrieved" no longer shows up if calling cancel() on the future after the exception was set.

For the same scenario on Python 3.4.10 (SLES 12), the warning shows up regardless of calling cancel() or not. The warning does disappear when calling exception() on the future.

Since SLES 12 is relatively close to reaching EOL, this solution could still be a suitable one, since all target platforms except SLES 12 should benefit from it.
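
For reference, the behaviour described above as a small standalone snippet (not ThinLinc code):

> import asyncio
>
>
> async def demo():
>     fut = asyncio.get_running_loop().create_future()
>     fut.set_exception(RuntimeError("Connection closed unexpectedly"))
>     # On Python 3.6+ this extra cancel() keeps the event loop from later
>     # logging "Future exception was never retrieved"; on Python 3.4
>     # (SLES 12) only retrieving it with fut.exception() has that effect.
>     fut.cancel()
>
>
> asyncio.run(demo())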
Comment 100 Linn cendio 2023-12-12 14:16:36 CET
Tested triggering the race error on RHEL 9 and SLES 12 after being more generous with our cancel() calls, and could see that the race that previously triggered periodically no longer seems to be present on RHEL 9.

On SLES 12, the error still triggers periodically and gives this traceback in the logs. This is the compromise we settled on for this edge case - for Python >= 3.6 this traceback should no longer happen.
> 2023-12-12 12:38:43 ERROR vsmserver: Future exception was never retrieved
> 2023-12-12 12:38:43 ERROR vsmserver: future: <Future finished exception=ConnectionClosed('Connection closed unexpectedly',)>
> 2023-12-12 12:38:43 ERROR vsmserver: ----------------------------------------
> 2023-12-12 12:38:43 ERROR vsmserver: Traceback (most recent call last):
> 2023-12-12 12:38:43 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 455, in iiiIII
> 2023-12-12 12:38:43 ERROR vsmserver:     I1I11i . set_result ( call . result ( ) )
> 2023-12-12 12:38:43 ERROR vsmserver:   File "/usr/lib64/python3.4/asyncio/futures.py", line 274, in result
> 2023-12-12 12:38:43 ERROR vsmserver:     raise self._exception
> 2023-12-12 12:38:43 ERROR vsmserver: thinlinc.vsm.xmlrpc.ConnectionClosed: Connection closed unexpectedly
> 2023-12-12 12:38:43 ERROR vsmserver: ----------------------------------------

With that, this bug is ready for testing.
Comment 102 Adam Halim cendio 2023-12-14 14:53:16 CET
Tested using SLES12 (Python 3.4.10) and GNOME 3.20.2

VerifySessions is run in several scenarios, one of which is whenever the
"Select session" window is displayed on the client.

Here are some scenarios I tested (Client 1 & 2 connect with the same username, max_sessions_per_user>1):

1. Client 1 creates a new session (ID 10) and disconnects.
2. Client 2 connects and the "Select session" window is displayed. Session 10
   has status disconnected.
3. Client 1 attempts to connect again, but this time session 10 is marked
   unreachable (I added a time.sleep(40) in VerifySessions to make it
   unreachable).
4. Client 2 attempts to connect to session 10 which, for Client 2, still has
   status disconnected.
5. Client 2 is greeted with a "Connection Error" window ("The session is in a
   state where the requested action isn't possible"). In vsmserver.log, the 
   following messages are logged:
> 2023-12-14 13:36:28 WARNING vsmserver: Error verifying sessions on agent 127.0.0.1: Operation timed out
> 2023-12-14 13:36:28 INFO vsmserver.session: Session 127.0.0.1:10 for cendio could not be verified. Marking as unreachable.
> 2023-12-14 13:36:28 WARNING vsmserver.session: Tried to reconnect to session 127.0.0.1:10 for cendio with unknown state unreachable
Another scenario tested:

1. Client 1 creates a new session (ID 10) and disconnects.
2. Client 1 connects again and the "Select session" window is displayed.
   Session 10 has status disconnected.
3. Session 10 is killed using tlctl.
4. Client 1 connects to session 10 (that was just killed).
5. A new session is created, with ID 10.

Everything looks good and works as expected.
Comment 103 Pierre Ossman cendio 2023-12-14 16:18:32 CET
Something is broken with the auth checks. I got this crash calling 'get_sessions' on vsmserver:

> 2023-12-14 16:14:08 ERROR vsmserver: ----------------------------------------
> 2023-12-14 16:14:08 ERROR vsmserver: Traceback (most recent call last):
> 2023-12-14 16:14:08 ERROR vsmserver:   File "/usr/lib64/python3.12/asyncio/events.py", line 84, in _run
> 2023-12-14 16:14:08 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-12-14 16:14:08 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 337, in _call_callback
> 2023-12-14 16:14:08 ERROR vsmserver:     self . _response = self . _callback ( methodname , * params )
> 2023-12-14 16:14:08 ERROR vsmserver:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 2023-12-14 16:14:08 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 299, in _handle_method
> 2023-12-14 16:14:08 ERROR vsmserver:     oOooO0O0 = i1IIiiI1III1iI ( clientaddr = O0O0oO00OO0 )
> 2023-12-14 16:14:08 ERROR vsmserver:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 2023-12-14 16:14:08 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 277, in <lambda>
> 2023-12-14 16:14:08 ERROR vsmserver:     lambda ** Oo0o0Oo : OOOoO0oO . is_lowport ( ** Oo0o0Oo )
> 2023-12-14 16:14:08 ERROR vsmserver:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 2023-12-14 16:14:08 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/authoritychecker.py", line 56, in is_lowport
> 2023-12-14 16:14:08 ERROR vsmserver:     I1i , ii1IiIiiII = clientaddr
> 2023-12-14 16:14:08 ERROR vsmserver:     ^^^^^^^^^^^^^^^^
> 2023-12-14 16:14:08 ERROR vsmserver: ValueError: too many values to unpack (expected 2)
> 2023-12-14 16:14:08 ERROR vsmserver: ----------------------------------------
Comment 107 Pierre Ossman cendio 2023-12-15 10:41:41 CET
The crash was because we could now get IPv6 connections, which the XML-RPC code was not prepared for. It should now be fixed and working fine for both IPv4 and IPv6.
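
For reference, the underlying difference (a standalone illustration, not the actual fix): a peer address is a 2-tuple for IPv4 but a 4-tuple for IPv6, so unpacking it into exactly two values breaks on IPv6 connections.

> def peer_host_port(clientaddr):
>     # AF_INET peer addresses are (host, port); AF_INET6 ones are
>     # (host, port, flowinfo, scope_id).  Taking only the first two
>     # fields works for both families.
>     host, port = clientaddr[:2]
>     return host, port
>
>
> assert peer_host_port(("127.0.0.1", 1022)) == ("127.0.0.1", 1022)
> assert peer_host_port(("::1", 1022, 0, 0)) == ("::1", 1022)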
Comment 109 Pierre Ossman cendio 2023-12-18 13:13:31 CET
Verified that 'get_sessions' works as intended. Tested on Fedora 39 using "tlctl session terminate". It does 'get_sessions' both with and without verifying sessions, so both code paths get tested. Tried without filters, and various combinations of filters.

For good measure, I also tested with the agent stopped so that the verification failed. No issues seen there either, and the sessions were reported as unreachable.

Note that this only tests 'get_sessions' via a user socket. It's also available via TCP, but that is just for the automatic tests, and they don't seem to have any complaints.
Comment 110 Adam Halim cendio 2023-12-18 13:29:13 CET
Tested using SLES12 (Python 3.4.10)

To test 'unbind_ports', I ran nc -l 5000 and started a new session:
> 2023-12-14 16:09:56 INFO vsmagent.session: Killing processes using ports in interval 5000-5006 (display 10): nc(12257)
The session was created without any issue, and the nc process was killed.
Comment 111 Samuel Mannehed cendio 2023-12-18 15:58:39 CET
Tested some handlers with build 3447 on Ubuntu 22.04:

GetLoadHandler
--------------

normal:

> 2023-12-14 15:49:06 DEBUG vsmserver: Scheduled load update of 127.0.0.1 at Thu Dec 14 15:49:46 2023

> cendio@ubuntu-22:~$ sudo tlctl load list
> AGENT      USERS  MEM  CPU  RATING
> ==================================
> 127.0.0.1     14  10%   0%   36.30

vsmagent stopped:

> cendio@ubuntu-22:~$ sudo tlctl load list
> AGENT      USERS  MEM  CPU  RATING
> ==================================
> 127.0.0.1      -    -    -    DOWN

> 2023-12-18 15:43:56 DEBUG vsmserver.session: User with uid 0 requested a socket
> 2023-12-18 15:44:19 WARNING vsmserver.loadinfo: Error getting load from agent 127.0.0.1: [Errno 111] Connect call failed ('127.0.0.1', 904)
> 2023-12-18 15:44:19 WARNING vsmserver.loadinfo: Marking agent 127.0.0.1 as down

VerifySessionsHandler
---------------------

normal:

> 2023-12-14 15:50:15 DEBUG vsmserver.session: Doing periodic session verification

vsmagent stopped:

> 2023-12-14 16:04:36 WARNING vsmserver: Error verifying sessions on agent 127.0.0.1: [Errno 111] Connect call failed ('127.0.0.1', 904)
> 2023-12-14 16:04:36 INFO vsmserver.session: Session 127.0.0.1:10 for cendio could not be verified. Marking as unreachable.

removed 'ss' binary:

* Login doesn't work
* Logs:

> 2023-12-18 14:58:32 ERROR vsmagent: Exception when running ss: [Errno 2] No such file or directory: 'ss'

'ss' writes to stderr:

* Login works
* Logs:

> 2023-12-18 15:40:05 DEBUG vsmserver.session: Doing periodic session verification
> 2023-12-18 15:40:05 WARNING vsmagent: ss wrote to stderr: woho

'ss' writes incorrect format:

* Login works
* Logs:

> 2023-12-18 15:42:11 WARNING vsmagent: unable to parse ss line: 'något konstigt'


UnbindPortsHandler
------------------

normal:

> 2023-12-14 15:47:46 DEBUG vsmserver: Requesting VSM Agent 127.0.0.1 to unbind ports for cendio's display 10

ReqSessionHandler
-----------------

normal:

> 2023-12-14 15:47:45 INFO vsmserver.session: User with uid 1000 (cendio) requested a new session
> 2023-12-14 15:47:45 DEBUG vsmserver: User cendio requesting session on cluster 'Default'
> 2023-12-14 15:47:45 DEBUG vsmserver: Contacting VSM Agent 127.0.0.1 to request a new session for cendio
> 2023-12-14 15:47:46 INFO vsmagent.session: Verified connectivity to newly started Xvnc for cendio
> 2023-12-14 15:47:46 INFO vsmserver.session: Session 127.0.0.1:10 created for user cendio

TerminateSessionHandler
-----------------------

normal:

> 2023-12-14 15:53:09 INFO vsmserver.session: User with uid 0 requested termination of 127.0.0.1:10
> 2023-12-14 15:53:17 DEBUG vsmserver.session: Writing active sessions to disk
> 2023-12-14 15:53:17 DEBUG vsmserver.session: Done writing sessions

vsmagent stopped:

> 2023-12-14 16:08:28 DEBUG vsmserver.session: User with uid 0 requested a socket
> 2023-12-14 16:08:28 DEBUG vsmserver.session: User with uid 0 requested a socket
> 2023-12-14 16:08:28 WARNING vsmserver: Error verifying sessions on agent 127.0.0.1: [Errno 111] Connect call failed ('127.0.0.1', 904)
Comment 112 Samuel Mannehed cendio 2023-12-18 16:05:37 CET
Missed adding the error cases for:

UnbindPortsHandler
------------------

Used 'nc -l 5000' to have a process to kill.

removed 'ss' binary:

> 2023-12-18 15:41:02 DEBUG vsmserver: Requesting VSM Agent 127.0.0.1 to unbind ports for cendio's display 10
> 2023-12-18 15:41:02 WARNING vsmagent: ss wrote to stderr: woho

'ss' writes to stderr:

> 2023-12-18 15:49:31 DEBUG vsmserver: Requesting VSM Agent 127.0.0.1 to unbind ports for cendio's display 10
> 2023-12-18 15:49:31 WARNING vsmagent: ss wrote to stderr: woho

'ss' writes in incorrect format:

> 2023-12-18 15:59:51 DEBUG vsmserver: Requesting VSM Agent 127.0.0.1 to unbind ports for cendio's display 10
> 2023-12-18 15:59:51 WARNING vsmagent: unable to parse ss line: 'något konstigt'
Comment 113 Pierre Ossman cendio 2023-12-18 16:09:05 CET
Tested core server XML-RPC handling using Fedora 39:

 * Server timeouts (request line, headers, body)
 * Server excessive data (request line, headers, body)
 * Server invalid data (request line, headers, body)
 * Server missing content-length

All of the above tested using both TCP and Unix socket.

Also tested:

 * Server crashes (crash in method handler, and exception on response future)
 * Server invalid response
Comment 114 Linn cendio 2023-12-19 09:24:48 CET
Tested GetCapabilities handling on RHEL 9:

 * The number of max sessions is set by /vsmserver/max_sessions_per_user in the conf as expected
 * The session chooser that pops up when logging in shows the line "New session...", as long as the max limit of sessions has not yet been reached
   - If the max limit has been reached, the line "New session..." is not shown
Comment 115 Adam Halim cendio 2023-12-19 09:25:27 CET
Noticed a bug while testing handler_sessionchange.py. There are some places where we forgot to remove some code in r40504 which leads to the following crash in some cases:

> 2023-12-19 09:20:12 ERROR vsmserver: Exception in callback ServerProtocol._call_callback(<str>, <str>)
> 2023-12-19 09:20:12 ERROR vsmserver: handle: <Handle ServerProtocol._call_callback(<str>, <str>)>
> 2023-12-19 09:20:12 ERROR vsmserver: ----------------------------------------
> 2023-12-19 09:20:12 ERROR vsmserver: Traceback (most recent call last):
> 2023-12-19 09:20:12 ERROR vsmserver:   File "/usr/lib64/python3.4/asyncio/events.py", line 125, in _run
> 2023-12-19 09:20:12 ERROR vsmserver:     self._callback(*self._args)
> 2023-12-19 09:20:12 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 337, in _call_callback
> 2023-12-19 09:20:12 ERROR vsmserver:     self . _response = self . _callback ( methodname , * params )
> 2023-12-19 09:20:12 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 319, in _handle_method
> 2023-12-19 09:20:12 ERROR vsmserver:     ooO0oOO0 = oOOoO0o . handle ( method , params )
> 2023-12-19 09:20:12 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/handler_sessionchange.py", line 28, in handle
> 2023-12-19 09:20:12 ERROR vsmserver:     self . close ( )
> 2023-12-19 09:20:12 ERROR vsmserver: AttributeError: 'SessionChangeHandler' object has no attribute 'close'
> 2023-12-19 09:20:12 ERROR vsmserver: ----------------------------------------
Comment 116 Pierre Ossman cendio 2023-12-19 10:20:07 CET
Found another issue. If there is an exception when making the XML-RPC call, it isn't reported back to the caller properly:

> 2023-12-19 10:19:22 ERROR vsmserver: Traceback (most recent call last):
> 2023-12-19 10:19:22 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 97, in _write_request
> 2023-12-19 10:19:22 ERROR vsmserver:     ii1Ii1I = dumps ( params , methodname ) . encode ( self . encoding )
> 2023-12-19 10:19:22 ERROR vsmserver:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 2023-12-19 10:19:22 ERROR vsmserver:   File "/usr/lib64/python3.12/xmlrpc/client.py", line 981, in dumps
> 2023-12-19 10:19:22 ERROR vsmserver:     data = m.dumps(params)
> 2023-12-19 10:19:22 ERROR vsmserver:            ^^^^^^^^^^^^^^^
> 2023-12-19 10:19:22 ERROR vsmserver:   File "/usr/lib64/python3.12/xmlrpc/client.py", line 514, in dumps
> 2023-12-19 10:19:22 ERROR vsmserver:     dump(v, write)
> 2023-12-19 10:19:22 ERROR vsmserver:   File "/usr/lib64/python3.12/xmlrpc/client.py", line 536, in __dump
> 2023-12-19 10:19:22 ERROR vsmserver:     f(self, value, write)
> 2023-12-19 10:19:22 ERROR vsmserver:   File "/usr/lib64/python3.12/xmlrpc/client.py", line 540, in dump_nil
> 2023-12-19 10:19:22 ERROR vsmserver:     raise TypeError("cannot marshal None unless allow_none is enabled")
> 2023-12-19 10:19:22 ERROR vsmserver: TypeError: cannot marshal None unless allow_none is enabled
> 2023-12-19 10:19:22 ERROR vsmserver: 
> 2023-12-19 10:19:22 ERROR vsmserver: The above exception was the direct cause of the following exception:
> 2023-12-19 10:19:22 ERROR vsmserver: 
> 2023-12-19 10:19:22 ERROR vsmserver: Traceback (most recent call last):
> 2023-12-19 10:19:22 ERROR vsmserver:   File "/usr/lib64/python3.12/asyncio/events.py", line 84, in _run
> 2023-12-19 10:19:22 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-12-19 10:19:22 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 441, in IiI1I
> 2023-12-19 10:19:22 ERROR vsmserver:     IiI111111i1 = IiI111I1 . call_method ( methodname , * params )
> 2023-12-19 10:19:22 ERROR vsmserver:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 2023-12-19 10:19:22 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 80, in call_method
> 2023-12-19 10:19:22 ERROR vsmserver:     self . _write_request ( methodname , * params )
> 2023-12-19 10:19:22 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 99, in _write_request
> 2023-12-19 10:19:22 ERROR vsmserver:     raise XMLMarshallingError ( "Error encoding XML-RPC call %s: %s" %
> 2023-12-19 10:19:22 ERROR vsmserver: thinlinc.vsm.xmlrpc.XMLMarshallingError: Error encoding XML-RPC call get_load: cannot marshal None unless allow_none is enabled
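
One way to get such call-setup errors back to the caller (a sketch, assuming the client hands the caller a Future for the result; the actual fix may look different) is to catch them where the request is initiated and report them through that Future instead of letting them escape into the event loop:

> import asyncio
>
>
> def start_call(protocol, methodname, *params):
>     loop = asyncio.get_running_loop()
>     result = loop.create_future()
>     try:
>         # May raise e.g. XMLMarshallingError before anything is sent.
>         protocol.call_method(methodname, *params)
>     except Exception as exc:
>         # Report the failure through the caller's Future instead of
>         # letting it surface as an unhandled event loop exception.
>         result.set_exception(exc)
>     # (On success the protocol later fills in `result` when the
>     # response arrives; that part is elided here.)
>     return result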
Comment 118 Pierre Ossman cendio 2023-12-19 10:54:15 CET
Above issue is now fixed. Verified on Fedora 39.
Comment 119 Alexander Zeijlon cendio 2023-12-19 10:55:00 CET
Tested ReqSessionHandler with server and agent installed on RHEL 9:

Used a simple XML-RPC script to send typical vsmserver-to-vsmagent requests. This way, I was able to observe the behavior of the handler in different scenarios without having to manipulate the handler code itself.

All code-paths in ReqSessionHandler that do not result in a successful start of a session are expected to log the failure and respond with failure status.

Results for the code-paths that should lead to a failure:

* Too old client (client_version missing in the request):
  - Note: Removed client version from a typical request.
> WARNING vsmagent: Invalid client capabilities: client_version missing
* User does not exist:
  - Note: Added a nonexistent user to an otherwise valid request.
> WARNING vsmagent: User cendioz does not exist on this system!
* No available display numbers:
  - Note: Made the available range contain only two display numbers, and made
    the corresponding Unix sockets busy with "nc -lU <path to socket>".
> WARNING vsmagent.session: Unable to find free display number when starting session for cendio. Reporting to VSM Server
* Unexpected OSError when trying to read from a unix socket:
  - Note: Made this work by creating a tmpfs mount point with the same name
    and location of the Unix socket that would be checked.
> WARNING vsmagent.session: Error checking X11 socket for display 10
* Unable to create home directory for user:
  - Note: Mounted a read only tmpfs on the home directory to ensure that root
    has no write access.
> WARNING vsmagent: Failed to create /home/cendio: [Errno 30] Read-only file system: '/home/cendio'
* Timeout when waiting for VNC-port to become available:
  - Note: Set timeout for this to zero.
> WARNING vsmagent.session: Xvnc start for username cendio failed (timeout waiting for vnc port)
> DEBUG3 vsmagent.session: Xvnc startup failure for cendio
* Session process startup error:
  - Note: Passed an invalid argument to the process via xserver_args.
> WARNING tl-session: tl-xinit exited with status=1
> DEBUG3 vsmagent.session: Xvnc startup failure for cendio

Otherwise, the ReqSessionHandler creates a running session and returns its session data to the requester. This last part was also tested with the real client.
Comment 120 Pierre Ossman cendio 2023-12-19 12:46:00 CET
Tested core client XML-RPC handling using Fedora 39:

 * Unexpected connection close
 * Invalid parameters
 * Client excessive data (status line, headers, body)
 * Client invalid data (request line, headers, body)
 * Client missing content-length
 * Request rejected with "Go away!"
 * Connection refused
Comment 121 Pierre Ossman cendio 2023-12-19 16:04:48 CET
Checked load balancer on Fedora 39:

 * Correctly updates load every 40 seconds
 * Handles a connection error, and recovers when agent is available again
 * Dead sessions are handled
 * Penalty points are reduced every 40 seconds
Comment 123 Adam Halim cendio 2023-12-19 16:47:54 CET
(In reply to Adam Halim from comment #115)
This should now be fixed.

To test the behaviour on 4.15.0, I modified the XMLRPC-call to send the wrong parameters, which resulted in these messages being logged once per second:
Primary master:
> 2023-12-19 10:41:47 ERROR vsmserver.HA: No changetype key in sessionchange
> 2023-12-19 10:41:47 WARNING vsmserver.HA: Connection to lab-88:9000 closed before receiving response in request to update other HA node with session change (delete, cendio). Retrying later.
Secondary master:
> 2023-12-19 10:42:18 ERROR vsmserver.HA: No changetype key in sessionchange
> 2023-12-19 10:42:19 WARNING vsmserver.HA: Connection to lab-170:9000 closed before receiving response in request to update other HA node with session change (delete, cendio). Retrying later.
I did not get a session when connecting in 4.15.0.

I tried the same scenario using build r40665, and the behaviour was that the following messages were printed once per second:
Primary:
> 2023-12-19 11:01:42 WARNING vsmserver.HA: Error updating other HA node: Request rejected by server: '500' 'Internal error\r\n'
Secondary:
> 2023-12-19 11:02:19 DEBUG vsmserver: Handling method 'sessionchange' from ('10.48.2.170', 984)
> 2023-12-19 11:02:19 ERROR vsmserver.HA: No changetype key in sessionchange
> 2023-12-19 11:02:19 ERROR vsmserver: Exception in callback ServerProtocol._call_callback(<str>, <str>)
> 2023-12-19 11:02:19 ERROR vsmserver: handle: <Handle ServerProtocol._call_callback(<str>, <str>)>
> 2023-12-19 11:02:19 ERROR vsmserver: ----------------------------------------
> 2023-12-19 11:02:19 ERROR vsmserver: Traceback (most recent call last):
> 2023-12-19 11:02:19 ERROR vsmserver:   File "/usr/lib64/python3.4/asyncio/events.py", line 125, in _run
> 2023-12-19 11:02:19 ERROR vsmserver:     self._callback(*self._args)
> 2023-12-19 11:02:19 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 337, in _call_callback
> 2023-12-19 11:02:19 ERROR vsmserver:     self . _response = self . _callback ( methodname , * params )
> 2023-12-19 11:02:19 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 319, in _handle_method
> 2023-12-19 11:02:19 ERROR vsmserver:     ooO0oOO0 = oOOoO0o . handle ( method , params )
> 2023-12-19 11:02:19 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/handler_sessionchange.py", line 28, in handle
> 2023-12-19 11:02:19 ERROR vsmserver:     self . close ( )
> 2023-12-19 11:02:19 ERROR vsmserver: AttributeError: 'SessionChangeHandler' object has no attribute 'close'
> 2023-12-19 11:02:19 ERROR vsmserver: ----------------------------------------
Unlike 4.15.0, I successfully got a session after logging in.

After the fix in r40666, these messages are printed once per second:
Primary:
> 2023-12-19 16:34:55 WARNING vsmserver.HA: Error updating other HA node: Bad parameters to XMLRPC method
Secondary:
> 2023-12-19 16:42:35 ERROR vsmserver.HA: No changetype key in sessionchange
> 2023-12-19 16:42:36 DEBUG vsmserver: Handling method 'sessionchange' from ('10.48.2.170', 1023)
I still get a session when logging in.

The traceback has been removed, and we get a better error message when things go wrong.
Comment 125 Pierre Ossman cendio 2023-12-20 14:52:09 CET
Tested basic server handling on Fedora 39:

 * vsmserver/vsmagent complains and stops if the port is used
 * Timeout handling was checked in comment 113
 * All three method multiplexers' error handling was checked in bug 8272
 * Handling of all the different methods is tested individually
   (see comments on this bug)
 * Verified check for low port for all relevant methods
 * Verified check for trusted address for all relevant methods
 * Verified check for root uid for all relevant methods
Comment 126 Samuel Mannehed cendio 2023-12-20 16:12:19 CET
Tested 'CreateUserSocket' with build 3451 on Ubuntu 22.04:

normal:

> 2023-12-20 15:59:38 DEBUG vsmserver: Handling method 'create_user_socket' from ('127.0.0.1', 58658)
> 2023-12-20 15:59:38 DEBUG vsmserver.session: User with uid 1001 requested a socket

incorrect user id (modified thinlinc-login to use a string instead of a uid):

> 2023-12-20 16:08:16 DEBUG vsmserver: Handling method 'create_user_socket' from ('127.0.0.1', 39390)
> 2023-12-20 16:08:16 WARNING vsmserver.session: create_user_socket: First argument (hej) not an integer

user doesn't exist (modified thinlinc-login to use the wrong uid):

> 2023-12-20 16:06:19 DEBUG vsmserver: Handling method 'create_user_socket' from ('127.0.0.1', 46438)
> 2023-12-20 16:06:19 DEBUG vsmserver.session: User with uid 12345 requested a socket
> 2023-12-20 16:06:19 WARNING vsmserver.session: create_user_socket: User with uid 12345 does not exist
Comment 127 Samuel Mannehed cendio 2023-12-20 16:14:33 CET
Tested 'GetLoadStatus' with build 3451 on Ubuntu 22.04:

normal (tlctl load list & client connect):

> 2023-12-20 16:13:16 DEBUG vsmagent: Handling method 'get_load' from ('127.0.0.1', 1023)
> 2023-12-20 16:13:39 DEBUG vsmagent: Handling method 'get_load' from ('10.48.2.145', 1023)

vsmagent stopped:

> 2023-12-14 16:05:21 WARNING vsmserver.loadinfo: Error getting load from agent 127.0.0.1: [Errno 111] Connect call failed ('127.0.0.1', 904)
> 2023-12-14 16:05:21 WARNING vsmserver.loadinfo: Marking agent 127.0.0.1 as down
Comment 128 Adam Halim cendio 2023-12-20 16:38:52 CET
Tested 'sessionchange' using build r40667 on SLES12 (Python 3.4.10)

I used an HA setup with two masters and two agents in the same cluster. I tested all three changetypes in the handler. All updates were synced between the masters, which was verified by comparing the persistent file on both servers.

* new
  I tested the first two branches in handle_new() and verified that they worked
  as intended. The last branch always crashed when reached (bug 7683). That
  branch is probably very unlikely to be reached in a real-world setting; I
  used a modified client to provoke the server into reaching it.

* update
  Managed to get to all the branches here, everything worked correctly.

* delete
  Managed to get to all the branches here, everything worked correctly.

Also tested error handling in case the wrong parameters were sent to the handler.
Comment 129 Linn cendio 2023-12-20 16:48:32 CET
Tested the session store for HA:

 * Changes to sessions are automatically synced between the masters

 * If the hostname lookup fails for a node listed in /vsmserver/HA/nodes, this error is reported in the log:
> 2023-12-20 13:20:10 ERROR vsmserver.HA: Failed to do hostname lookup on node '<bad hostname>' from /vsmserver/HA/nodes
> 2023-12-20 13:20:10 ERROR vsmserver.HA: 0 matches when comparing hostnames from /vsmserver/HA/nodes with result of gethostname() - should be exactly one match
> 2023-12-20 13:20:10 ERROR vsmserver.HA: Can't determine address of other node safely.
> 2023-12-20 13:20:10 ERROR vsmserver.HA: STARTING WITH HA DISABLED!
 * If the other HA node can't be reached when trying to update it, this warning is shown in the log of the first HA node. The node retries the connection every second, and each failed attempt logs this line:
> 2023-12-20 11:32:56 WARNING vsmserver.HA: Error updating other HA node: [Errno -2] Name or service not known
 * When specifying more than 2 HA nodes, this error is logged:
> 2023-12-20 13:36:43 ERROR vsmserver.HA: /vsmserver/HA/nodes contains more than two entries, which is not a supported configuration
 * If the lookup gives multiple matches for the same IP, this line is logged. I triggered this by adding the same hostname to /vsmserver/HA/nodes twice:
> 2023-12-20 13:38:11 ERROR vsmserver.HA: Multiple matches when comparing IPs of HA nodes with result from connect()
 * When one of the HA nodes cannot load active changes, the following output is shown in the log. To trigger this, I added gibberish to the file /var/lib/vsm/changes on the secondary node.
> # Log on the node with the bad file:
> 2023-12-20 15:56:02 ERROR vsmserver: Error loading HA changes database: Error unpacking /var/lib/vsm/changes (could not find MARK)
> 2023-12-20 15:56:02 ERROR vsmserver: HA changes database needs manual recovery.
> 
> # Log on the primary node:
> 2023-12-20 16:05:12 WARNING vsmserver.HA: Error updating other HA node: [Errno 111] Connect call failed ('10.48.2.149', 9000)
Comment 131 Alexander Zeijlon cendio 2023-12-21 09:45:30 CET
Tested ShadowSessionHandler on RHEL 9:

An attempt at shadowing another user works as expected, given that the shadowing user has been added to the list of shadowers.

In the cases below, shadowing is unsuccessful (which is expected), and the vsmserver continues to process new requests:

* If mode is set to reject, the attempt at shadowing should be rejected even if the shadowing user is in the list of shadowers.
> WARNING vsmserver.shadow: Shadowing is disabled, rejecting request.
* uid of user to shadow missing in input.
> WARNING vsmserver.shadow: Rogue client providing clientsessinfo without uid as key
* Shadow user does not exist.
> WARNING vsmserver.session: shadow_session: User with uid 1234 does not exist
* Shadowed user does not exist.
> WARNING vsmserver.session: shadow_session: User with uid 1111 does not exist
* Shadow user is not in list of users and groups allowed to shadow others.
> WARNING vsmserver.shadow: User with uid 1000 does not have permission to shadow tester
* Attempt to shadow a session that no longer exists (before verify_session).
> WARNING vsmserver: User cendio tried to shadow vanished session
* Attempt to shadow a session without a normal state (before verify_session).
> WARNING vsmserver.session: Tried to shadow session 10.48.2.40:13 for tester with unknown state unreachable
* Attempt to shadow a session that no longer exists (after verify_session).
> WARNING vsmserver: User cendio tried to shadow vanished session
* Attempt to shadow a session without a normal state (after verify_session).
> WARNING vsmserver.session: Tried to shadow session 10.48.2.40:11 for tester with unknown state unreachable
* Assertion that a shadowing uid must be provided.
> 2023-12-21 09:21:47 ERROR vsmserver: Exception in callback ServerProtocol._call_callback(<str>, {...})
> 2023-12-21 09:21:47 ERROR vsmserver: handle: <Handle ServerProtocol._call_callback(<str>, {...}) created at /opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py:329>
> 2023-12-21 09:21:47 ERROR vsmserver: ----------------------------------------
> 2023-12-21 09:21:47 ERROR vsmserver: Object created at (most recent call last):
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/opt/thinlinc/sbin/vsmserver", line 111, in o0OoOOo00OOO
> 2023-12-21 09:21:47 ERROR vsmserver:     iIIiii1iI . run_forever ( )
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
> 2023-12-21 09:21:47 ERROR vsmserver:     self._run_once()
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/base_events.py", line 1897, in _run_once
> 2023-12-21 09:21:47 ERROR vsmserver:     handle._run()
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-12-21 09:21:47 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/selector_events.py", line 811, in _read_ready
> 2023-12-21 09:21:47 ERROR vsmserver:     self._read_ready_cb()
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/selector_events.py", line 868, in _read_ready__data_received
> 2023-12-21 09:21:47 ERROR vsmserver:     self._protocol.data_received(data)
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 313, in data_received
> 2023-12-21 09:21:47 ERROR vsmserver:     self . _handle_request ( )
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 329, in _handle_request
> 2023-12-21 09:21:47 ERROR vsmserver:     asyncio . get_event_loop ( ) . call_soon ( self . _call_callback , O0oo000o , * I1i )
> 2023-12-21 09:21:47 ERROR vsmserver: ----------------------------------------
> 2023-12-21 09:21:47 ERROR vsmserver: Traceback (most recent call last):
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
> 2023-12-21 09:21:47 ERROR vsmserver:     self._context.run(self._callback, *self._args)
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 337, in _call_callback
> 2023-12-21 09:21:47 ERROR vsmserver:     self . _response = self . _callback ( methodname , * params )
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 380, in _handle_user_method
> 2023-12-21 09:21:47 ERROR vsmserver:     ooO0oOO0 = oOOoO0o . handle ( method , params , O0 )
> 2023-12-21 09:21:47 ERROR vsmserver:   File "/opt/thinlinc/modules/thinlinc/vsm/handler_shadowsession.py", line 26, in handle
> 2023-12-21 09:21:47 ERROR vsmserver:     assert None != uid , "ShadowSessionHandler.handle got uid=None, params: %s" % str ( params )
> 2023-12-21 09:21:47 ERROR vsmserver: AssertionError: ShadowSessionHandler.handle got uid=None, params: ({'agenthost': '10.48.2.40', 'client_ip': '10.47.1.50', 'client_params': {'capabilities': {'client_build': '3258', 'client_lang': 'en_US.UTF-8', 'client_platform': 'Linux', 'client_version': '4.15.0', 'localdrives': {'aliases': {}}, 'printer': {}, 'smartcard': {}, 'sound': {}}, 'hwaddr': 'E0:BE:03:15:47:1B'}, 'command': '', 'connectionstatus': 'connected', 'display': 13, 'fbsize_x': 1024, 'fbsize_y': 768, 'last_connection': 1703145121.049371, 'session_pid': 45885, 'sessionstart': 1703079380.753994, 'status': 'normal', 'termserv_hostname': 'lab-40.lkpg.cendio.se', 'uid': 1001},)
> 2023-12-21 09:21:47 ERROR vsmserver: ----------------------------------------
Comment 132 Pierre Ossman cendio 2023-12-21 13:26:33 CET
Tested 'reconnect_session' on Fedora 39:

 * Handles invalid uid
 * Handles that a non-existent session is given
 * Handles that an unreachable session is given
 * Handles that session is discovered to be terminated
 * Handles that session is discovered to be unreachable
 * Handles that 'unbind_ports' fails
 * Handles session reconnect scripts:
   * Timeout
   * Failure starting script
   * Data on stdout
   * Data on stderr
   * Non-zero return code
Comment 133 Samuel Mannehed cendio 2023-12-21 14:01:26 CET
Tested 'GetUserSessions' with build 3451 on Ubuntu 22.04:

normal:

> 2023-12-21 10:30:53 DEBUG vsmserver: Handling method 'get_user_sessions' on socket owned by uid 1001
> 2023-12-21 10:30:53 DEBUG vsmserver.session: User with uid 1001 requested a list of his/her sessions
> 2023-12-21 10:30:53 DEBUG vsmserver: GetUserSessionHandler: List of groups: [1001]

no such uid (modified vsmserver.py to pass the wrong uid):

> 2023-12-21 12:44:05 DEBUG vsmserver: Handling method 'get_user_sessions' on socket owned by uid 1001
> 2023-12-21 12:44:05 WARNING vsmserver: get_user_sessions: User with uid 12345 does not exist

shadowing:

> 2023-12-21 12:56:52 DEBUG vsmserver: Handling method 'get_user_sessions' on socket owned by uid 1001
> 2023-12-21 12:56:52 DEBUG vsmserver.shadow: User with uid 1001 requested a list of sessions for cendio
> 2023-12-21 12:56:52 DEBUG vsmserver: GetUserSessionHandler: List of groups: [1001]

shadowing a non-existent user:

> 2023-12-21 13:57:30 DEBUG vsmserver: Handling method 'get_user_sessions' on socket owned by uid 1001
> 2023-12-21 13:57:30 ERROR vsmserver.shadow: User with uid 1001 requested sessions for non-existing user asf
Comment 134 Alexander Zeijlon cendio 2023-12-21 14:09:02 CET
Tested KillSessionHandler on RHEL 9:
(vsmserver and vsmagent on separate machines)

* Missing uid in session info.
> WARNING vsmserver.session: Rogue client providing clientsessinfo without uid as key
* The user running the session does not exist on the master machine.
> WARNING vsmserver.session: kill_session: User with uid 1002 does not exist
* Not permitted to kill another user's session.
  - Root is always able to kill sessions.
  - Users can kill their own sessions.
  - Anyone else is rejected:
> WARNING vsmserver.session: kill_session: User with uid 1000 tried to kill tester's session
* Attempt to kill a session that no longer exists.
> WARNING vsmserver.session: User cendio tried to kill vanished session
* Attempt to kill a session that is unreachable.
> INFO vsmserver.session: Session lab-40.lkpg.cendio.se:10 for alezetester is unreachable and cannot be terminated. Dropping session from database.
* Attempt to kill a session that has an unknown status code.
  - This branch is currently not reachable, since the only session states at
    the moment are normal and unreachable.
* Timeout when waiting for vsmagent to terminate a session.
  - Note that this was achieved by setting timeout in vsmxmlrpccall to 0.
> WARNING vsmserver.session: Error terminating session lab-40.lkpg.cendio.se:10 for cendio: Operation timed out
* Exception when asking vsmagent to terminate a session.
  - Manually triggered an error by raising a ValueError exception on the vsmagent machine.
> WARNING vsmserver.session: Error terminating session lab-40.lkpg.cendio.se:10 for cendio: Request rejected by server: '500' 'Internal error\r\n'
* Non-zero error code returned from vsmagent when asked to kill a session.
> WARNING vsmserver.session: Failed to kill session lab-40.lkpg.cendio.se:10 for cendio (status=4)

Everything behaves as expected.
Comment 135 Adam Halim cendio 2023-12-21 14:46:09 CET
Tested vsmxmlrpccall using build r40669 on SLES12 (Python 3.4.10)

Race conditions have previously been tested (comment #100).

* Hardcoded getaddrinfo() to return an empty list:
> 2023-12-21 10:52:25 ERROR vsmserver: Exception in callback ServerProtocol._callback_done(<Future>)
> 2023-12-21 10:52:25 ERROR vsmserver: handle: <Handle ServerProtocol._callback_done(<Future>)>
> 2023-12-21 10:52:25 ERROR vsmserver: ----------------------------------------
> 2023-12-21 10:52:25 ERROR vsmserver:                      .
> 2023-12-21 10:52:25 ERROR vsmserver:                      .
> 2023-12-21 10:52:25 ERROR vsmserver:                      .
> 2023-12-21 10:52:25 ERROR vsmserver: OSError: getaddrinfo() returned empty list
> 2023-12-21 10:52:25 ERROR vsmserver: ----------------------------------------
* Tested passing a URL with invalid characters to call_method():
> 2023-12-21 10:44:30 ERROR vsmserver: Exception in callback ServerProtocol._callback_done(<Future>)
> 2023-12-21 10:44:30 ERROR vsmserver: handle: <Handle ServerProtocol._callback_done(<Future>)>
> 2023-12-21 10:44:30 ERROR vsmserver: ----------------------------------------
> 2023-12-21 10:44:30 ERROR vsmserver:                      .
> 2023-12-21 10:44:30 ERROR vsmserver:                      .
> 2023-12-21 10:44:30 ERROR vsmserver:                      .
> 2023-12-21 10:44:30 ERROR vsmserver: socket.gaierror: [Errno -2] Name or service not known
> 2023-12-21 10:44:30 ERROR vsmserver: ----------------------------------------
* Tested not sending a port to call_method():
> 2023-12-21 10:54:44 ERROR vsmserver: Exception in callback ServerProtocol._call_callback(<str>)
> 2023-12-21 10:54:44 ERROR vsmserver: handle: <Handle ServerProtocol._call_callback(<str>)>
> 2023-12-21 10:54:44 ERROR vsmserver: ----------------------------------------
> 2023-12-21 10:54:44 ERROR vsmserver:                    .
> 2023-12-21 10:54:44 ERROR vsmserver:                    .
> 2023-12-21 10:54:44 ERROR vsmserver:                    .
> 2023-12-21 10:54:44 ERROR vsmserver: ValueError: No port specified in 'http://lab-170:'
> 2023-12-21 10:54:44 ERROR vsmserver: ----------------------------------------
* Tested occupying ports 0-1023 using nc:
> 2023-12-21 11:00:23 ERROR vsmserver: Exception in callback ServerProtocol._callback_done(<Future>)
> 2023-12-21 11:00:23 ERROR vsmserver: handle: <Handle ServerProtocol._callback_done(<Future>)>
> 2023-12-21 11:00:23 ERROR vsmserver: ----------------------------------------
> 2023-12-21 11:00:23 ERROR vsmserver:                    .
> 2023-12-21 11:00:23 ERROR vsmserver:                    .
> 2023-12-21 11:00:23 ERROR vsmserver:                    .
> 2023-12-21 11:00:23 ERROR vsmserver: RuntimeError: No free port in range 0-1023
> 2023-12-21 11:00:23 ERROR vsmserver: ----------------------------------------
* Tested unexpected exception when binding socket (commented out continue in loop when EADDRINUSE):
> 2023-12-21 11:06:05 ERROR vsmserver: Exception in callback ServerProtocol._callback_done(<Future>)
> 2023-12-21 11:06:05 ERROR vsmserver: handle: <Handle ServerProtocol._callback_done(<Future>)>
> 2023-12-21 11:06:05 ERROR vsmserver: ----------------------------------------
> 2023-12-21 11:06:05 ERROR vsmserver:                  .
> 2023-12-21 11:06:05 ERROR vsmserver:                  .
> 2023-12-21 11:06:05 ERROR vsmserver:                  .
> 2023-12-21 11:06:05 ERROR vsmserver: OSError: [Errno 98] Address already in use
> 2023-12-21 11:06:05 ERROR vsmserver: ----------------------------------------
* Tested timeouts:
> 2023-12-21 11:02:50 ERROR vsmserver: Exception in callback ServerProtocol._callback_done(<Future>)
> 2023-12-21 11:02:50 ERROR vsmserver: handle: <Handle ServerProtocol._callback_done(<Future>)>
> 2023-12-21 11:02:50 ERROR vsmserver: ----------------------------------------
> 2023-12-21 11:02:50 ERROR vsmserver:                  .
> 2023-12-21 11:02:50 ERROR vsmserver:                  .
> 2023-12-21 11:02:55 ERROR vsmserver:                  .
> 2023-12-21 11:02:50 ERROR vsmserver: concurrent.futures._base.TimeoutError: Operation timed out
> 2023-12-21 11:02:50 ERROR vsmserver: ----------------------------------------
Comment 136 Pierre Ossman cendio 2023-12-21 15:48:31 CET
Tested 'new_session' on Fedora 39:

 * Handles invalid uid
 * Handles that it cannot get the user's groups
 * Handles that allowed_groups is empty
 * Handles that the user is not in allowed_groups
 * Handles that the user is in allowed_groups
 * Handles that the user already has too many sessions
 * Handles that there aren't enough licenses
 * Handles trying multiple agents
 * Handles that all agents fail
 * Handles that an agent reports failure creating a session
 * Handles failing to "ping" an agent
 * Handles an exception calling "req_session"
 * Handles that 'unbind_ports' fails
 * Handles session creation scripts:
   * Timeout
   * Failure starting script
   * Data on stdout
   * Data on stderr
   * Non-zero return code
Comment 138 Samuel Mannehed cendio 2023-12-22 13:06:24 CET
Acceptance criteria:
--------------------

> MUST:
> 
>  * The XML-RPC code no longer relies on the system asyncore or asynchat modules
Yes. After r40670, grepping the entire source tree shows that neither asyncore nor asynchat is used anywhere anymore.

> SHOULD:
> 
>  * The XML-RPC code uses asyncio terminology rather than asyncore such
We can't see any major issues here. It should be noted that our futures aren't consistently named: sometimes they are called “future”, and sometimes their names reflect the data the future will contain once it has finished. That is OK.

Otherwise, our protocols and transports have names that reflect what they are.

>  * The XML-RPC code has no ThinLinc code that is redundant with asyncio functionality
Yes, we don't have much “glue” here anymore. Basically all the extra code has been removed in favor of using asyncio's built-in methods as much as possible. For example, our delayed calls were replaced with the event loop's “call_later()” and friends.
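
For illustration, a delayed call in plain asyncio looks roughly like this (a minimal sketch; the function name and the 40-second interval are just examples, not actual ThinLinc code):

import asyncio

def periodic_job():
    # Hypothetical periodic task that reschedules itself every 40 seconds,
    # instead of relying on a home-grown delayed-call mechanism.
    loop = asyncio.get_event_loop()
    loop.call_later(40, periodic_job)

loop = asyncio.get_event_loop()
loop.call_later(40, periodic_job)
loop.run_forever()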

>  * The XML-RPC code is converted to asyncio models and abstractions (e.g. protocols)
Yes, we now use Transports (for example lowxmlrpc.LowPortXMLRPCTransport) and Protocols (for example xmlrpc.ClientProtocol).
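
For reference, this is the general shape of the asyncio protocol/transport model (a rough sketch in plain asyncio, not the actual ThinLinc classes; the address and port are placeholders):

import asyncio

class ExampleServerProtocol(asyncio.Protocol):
    # Sketch of the protocol/transport model, not a real ThinLinc class.

    def connection_made(self, transport):
        # The transport owns the socket; the protocol only handles bytes.
        self.transport = transport

    def data_received(self, data):
        # Parse the incoming request here; this sketch just echoes it back.
        self.transport.write(data)

    def connection_lost(self, exc):
        pass

loop = asyncio.get_event_loop()
# create_server() pairs each new connection with a new protocol instance.
server = loop.run_until_complete(
    loop.create_server(ExampleServerProtocol, "127.0.0.1", 12345))
loop.run_forever()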

Note that we don't use the common high-level APIs like “await” and “async def”. Instead, we use futures and callbacks, which is what asyncio uses under the surface. This is because the high-level API has changed between Python versions: the coroutine syntax available in the oldest Python version we target (3.4.4) is not compatible with the modern async/await syntax introduced in Python 3.5.
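
As a sketch of what that callback style means in practice (plain asyncio with illustrative names, not ThinLinc code), the work that would follow an “await” goes into a done-callback on a future instead:

import asyncio

def lookup_done(future):
    # This is the code that would otherwise come after an "await".
    try:
        result = future.result()
    except OSError as exc:
        print("lookup failed:", exc)
        return
    print("lookup result:", result)

loop = asyncio.get_event_loop()
# Wrap the coroutine in a task and attach a done-callback instead of awaiting
# it; this style works on Python 3.4.4 as well as on 3.5+.
future = loop.create_task(loop.getaddrinfo("localhost", 9000))
future.add_done_callback(lookup_done)
loop.run_until_complete(future)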

Conclusion:
-----------
This bug has now been verified thoroughly, and things are working well. Closing.
Comment 139 Pierre Ossman cendio 2023-12-28 15:47:49 CET
Unfortunately, the task killing timeout does not work. When the timeout expires, asyncio.wait_for() only sends a second cancel to the remaining tasks, but then continues waiting for them.

So a misbehaving task will prevent the service from stopping.
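
The underlying asyncio behaviour can be reproduced outside ThinLinc with a task that swallows cancellation (a standalone Python sketch, not our code):

import asyncio

async def stubborn():
    # Misbehaving task: swallows every CancelledError and keeps running.
    while True:
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError:
            pass

async def shutdown():
    task = asyncio.ensure_future(stubborn())
    task.cancel()  # the first cancel is ignored by the task
    # On timeout, wait_for() cancels the task again, but then keeps waiting
    # for it to actually finish -- so this call never returns.
    await asyncio.wait_for(task, timeout=0.5)

asyncio.get_event_loop().run_until_complete(shutdown())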

For some very odd reason, I'm getting this crash after a while in this state:

> Fatal Python error: Aborted
> 
> Current thread 0x00007f15a7891740 (most recent call first):
>   File "/usr/lib64/python3.12/selectors.py", line 468 in select
>   File "/usr/lib64/python3.12/asyncio/base_events.py", line 1913 in _run_once
>   File "/usr/lib64/python3.12/asyncio/base_events.py", line 618 in run_forever
>   File "/usr/lib64/python3.12/asyncio/base_events.py", line 651 in run_until_complete
>   File "/opt/thinlinc/sbin/vsmserver", line 154 in oO00o00OO
>   File "/opt/thinlinc/sbin/vsmserver", line 134 in OO0o0O0o0
>   File "/opt/thinlinc/sbin/vsmserver", line 107 in o0OoOOo00OOO
>   File "/opt/thinlinc/sbin/vsmserver", line 161 in <module>
Comment 141 Pierre Ossman cendio 2023-12-28 16:19:01 CET
Timeout now fixed. Tested with an evil coroutine. It gets an initial CancelledError, and the loop continues. After 0.5 seconds, the service proceeds to terminate, and we get an (expected) "Task was destroyed but it is pending!".
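
For illustration only, one way such a bounded wait can be expressed in plain asyncio, assuming asyncio.wait() with a timeout is used (a sketch, not necessarily the actual fix):

import asyncio

async def evil():
    # The "evil" coroutine from the test: ignores cancellation, never finishes.
    while True:
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError:
            pass

async def shutdown(tasks):
    for task in tasks:
        task.cancel()  # each task gets its initial CancelledError
    # Unlike wait_for(), wait() gives up after the timeout and just reports
    # what is still pending, so shutdown can proceed.
    done, pending = await asyncio.wait(tasks, timeout=0.5)

loop = asyncio.get_event_loop()
task = loop.create_task(evil())
loop.run_until_complete(shutdown([task]))
loop.close()
# The still-pending task is later reported with the expected
# "Task was destroyed but it is pending!" warning when it is garbage collected.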
