Bug 8546 - Session monitoring is tightly coupled with load monitoring
Summary: Session monitoring is tightly coupled with load monitoring
Status: RESOLVED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server (show other bugs)
Version: trunk
Hardware: PC Unknown
: P2 Normal
Target Milestone: 4.19.0
Assignee: Alexander Zeijlon
URL:
Keywords: prosaic, tobfa_tester
Depends on:
Blocks: 4429
  Show dependency treegraph
 
Reported: 2025-03-17 10:19 CET by Alexander Zeijlon
Modified: 2025-04-02 08:34 CEST (History)
1 user (show)

See Also:
Acceptance Criteria:
MUST: * Session monitoring should not depend on the load balancer. * Other modules, such as the license handler, should still get up-to-date session state when needed. COULD: * The vsmagent is meant to be stateless, and hence should not track tl-sessions it has started.


Attachments

Description Alexander Zeijlon cendio 2025-03-17 10:19:53 CET
In bug 4429, we are working on simplifying the load balancer.

As a consequence of this, we will no longer be able to piggyback on the load balancer for session monitoring, and we therefore need to decouple session monitoring from the load balancer.
Comment 1 Alexander Zeijlon cendio 2025-03-17 10:26:52 CET
Right now, the load balancer is responsible for purging dead sessions from the session store on the vsmserver. This occurs during a load balancer update, and this is the only place where dead sessions may be purged right now.
Comment 2 Alexander Zeijlon cendio 2025-03-17 10:47:18 CET
The vsmserver shouldn't count dead sessions, so we should still purge them, but we need to look at doing this somewhere else.
Comment 3 Alexander Zeijlon cendio 2025-03-18 09:15:27 CET
(In reply to Alexander Zeijlon from comment #1)
> Right now, the load balancer is responsible for purging dead sessions from
> the session store on the vsmserver. This occurs during a load balancer
> update, and this is the only place where dead sessions may be purged right
> now.

This is not true, it looks like there are two ways the server purges dead sessions.

One is as described above, where the load balancer purges dead sessions (PIDs). This is done as a side effect when the load balancer updates the load info for an agent. The load info object contains a list of dead session PIDs. The RPC-call is called "get_load".

The other way is via the verify sessions RPC-call, where the VerifySessionsHandler goes through an agent's list of known sessions and updates their state. If a session is not alive, it is then removed from the session store (purged) on the vsmserver.
Comment 4 Alexander Zeijlon cendio 2025-03-18 11:36:34 CET
The verify sessions call is triggered in a couple of different situations:

* When a user is logging in. All the user's sessions are verified.
* When a sysadmin is listing session details in tlwebadm. One session is verified.
* When syncing state between HA-nodes. One session is verified.
* When a sysadmin is terminating sessions with tlctl. All known sessions are verified.
* On regular intervals by the session store. All known sessions are verified.

The periodic check, where all sessions are verified, are performed once every 10 minutes.
Comment 5 Alexander Zeijlon cendio 2025-03-18 14:18:08 CET
The load balancer contributes to keeping the session storage up-to-date more often by doing as explained below:

1. The vsmagent runs a housekeeping()-function every 10 seconds, where PIDs of
   dead sessions are collected. This is likely a bit cheaper to do than the more
   "ambitious" check that verify sessions does.
2. This list of PIDs are passed to the load balancer in the get load RPC-call.
   The load info for each agent contains a list of dead sessions.
3. Dead sessions are removed (purged) from the session store at least every 40
   seconds, when the agents' load info are being updated.

It looks like we do not need this housekeeping routine in a purely functional sense since we are also managing dead sessions via the verify sessions call.

But it also looks like this "side effect" of the load balancer is making the vsmserver respond faster to changes in session states.

With only the verify sessions call present, there exists scenarios where the server would report the wrong number of running sessions for up to 10 minutes, while with the load balancer cleanup, this duration is reduced to a maximum of 40 seconds.
Comment 6 Alexander Zeijlon cendio 2025-03-18 14:52:30 CET
I just realized that this is much less of an edge case than I thought.

Dead sessions also include those that do a "normal" exit when a user logs out. It is therefore still important to keep the session database up-to-date more often than every 10 minutes.
Comment 7 Alexander Zeijlon cendio 2025-03-20 09:12:31 CET
A follow-up on comment 6. Purging sessions at the same time as the cluster load is updated might not be as impactful as I previously thought.

The dead_childs API is only able to provide updates for sessions (processes) that were started by the currently running vsmagent since the agent only actively tracks its own child processes. All sessions that persist past a vsmagent restart will therefore be managed only via the less frequently used verify_sessions API.

There are likely more sessions in a normal use case that are only managed via the verify_sessions API than I first thought. We for example know that vsmagent is restarted at least once a week when logs are rotated.
Comment 8 Alexander Zeijlon cendio 2025-03-20 09:21:54 CET
We therefore think that using only the verify_sessions API for keeping the vsmserver's session storage in sync should enough.

Let's therefore remove the dead_childs API. If we notice that the above assumption is wrong, we could look at e.g. verifying sessions more often.
Comment 10 Alexander Zeijlon cendio 2025-03-31 09:00:45 CEST
While we're at it, lets also cleanup som of the code that handles started tl-session subprocesses.

The list of currently running tl-session subprocess objects isn't really required for vsmagent to be able to keep track of running child process pids.

Instead of maintaining a state in vsmagent, we can utilize asyncio to queue periodic checks individually for each tl-session when it is started.
Comment 11 Alexander Zeijlon cendio 2025-03-31 09:10:40 CEST
This means that we also need to handle process termination a bit differently in TerminateSessionHandler() that is used when e.g. "tlctl session terminate" is called.

From the perspective of the TerminateSessionHandler, there is no longer any information about whether a PID belongs to a child process of vsmagent or not.

This means that we have to assume that all PIDs could be a child of the currently running vsmagent process, and then react appropriately if it is or is not.
Comment 15 Alexander Zeijlon cendio 2025-03-31 09:45:54 CEST
> MUST:
> * Session monitoring should not depend on the load balancer.
Session monitoring is no longer intermingled with the load info objects.
> * Other modules, such as the license handler, should still get up-to-date
>   session state when needed.
Here, we have diverged a bit from the original behavior where the load balancer would update session store after tl-sessions had exited. This is intentional, see comment 7 and comment 8.
> COULD:
> * The vsmagent is meant to be stateless, and hence should not track tl-sessions
>   it has started.
The vsmagent no longer holds a list of subprocesses. Monitoring of session PIDs are still performed.

Note You need to log in before you can comment on or make changes to this bug.