Bug 8552 - Penalty points need to be reimagined
Summary: Penalty points need to be reimagined
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server
Version: trunk
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.19.0
Assignee: Tobias
URL:
Keywords: relnotes, samuel_tester
Depends on:
Blocks:
 
Reported: 2025-03-24 08:59 CET by Tobias
Modified: 2025-04-23 16:03 CEST

See Also:
Acceptance Criteria:
MUST:
• Failing agents must be deprioritized by the load balancer
• The new penalty system must be designed with the new load balancer in mind
• The penalty system must not be inert, since agent issues are likely persistent
• Penalties must have some erosion mechanism that returns an agent to normal sorting within reasonable bounds
• Scenarios handled well by the previous penalty system must remain well-handled
SHOULD:
• Failing agents should not inhibit users in scenarios with low session startup frequencies
• Failing agents should not inhibit users in scenarios with high session startup frequencies
• The load balancer should log decisions due to the penalty system
• The penalty system should be documented at a high level, leaving out burdening details
COULD:
• Using tlctl or tlwebadm, admins should be able to see if agents are failing



Description Tobias cendio 2025-03-24 08:59:53 CET
The load balancer is being simplified as of bug 4429. Instead of a balancing scheme that mixes current agent load with penalty points, users will be evenly distributed among agents. Alternatively, distribution may follow a much simpler, comprehensible approach, e.g. custom weighted distribution.

Thus the concept of penalty points in the context of agent rating and session creation is becoming obsolete.
Comment 1 Tobias cendio 2025-03-24 09:33:25 CET
Requesting a new session involves finding a list of suitable agents. So far, this was purely based on the current agent rating -- a score accounting for awarded penalty points and agent load, among other parameters.

Penalty points are adjusted in two situations:

(1) If there was an error in session creation, a set of penalty points is awarded, followed by an updated agent rating. The next best agent in line is then attempted -- taken from the original list, independent of the adjusted rating. The number of penalty points awarded is configurable via hiveconf's /vsmserver/sessionfailure_penalty and defaults to 5.

(2) If session creation was successful, a single penalty point is subtracted, followed by an updated agent rating. The number of penalty points subtracted is not configurable in hiveconf and is hardcoded to 1.
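
For illustration, a minimal Python sketch of these two adjustments follows. All names are hypothetical except the hiveconf parameter, and the actual VSM server code differs:

    # Hypothetical sketch of the penalty point adjustments described
    # above; only /vsmserver/sessionfailure_penalty is a real setting.

    SESSIONFAILURE_PENALTY = 5   # hiveconf /vsmserver/sessionfailure_penalty
    SUCCESS_REDUCTION = 1        # hardcoded, not configurable

    class AgentPenalty:
        def __init__(self):
            self.points = 0

        def on_session_failure(self):
            # (1) Award penalty points after a failed session creation
            self.points += SESSIONFAILURE_PENALTY

        def on_session_success(self):
            # (2) Subtract a single point after a success, never below zero
            self.points = max(0, self.points - SUCCESS_REDUCTION)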
Comment 2 Tobias cendio 2025-03-24 10:13:41 CET
(In reply to Tobias from comment #1)

In the process of removing agent ratings, case (2) is unproblematic with a new simplified distribution. However, it is not immediately obvious how to approach case (1), since we likely don't want to perpetually attempt a failing agent that might currently have the fewest users.
Comment 3 Tobias cendio 2025-03-24 11:50:43 CET
(In reply to Tobias from comment #2)

Some mechanism to mitigate an unwanted cycle of failed attempts is warranted, keeping in mind that we want to move away from complex, unpredictable schemes.

An idea that came up was to attribute a cooldown to a failed agent, temporarily side-lining it. The cooldown could perhaps be a multiple of the configured timeout plus a small margin, starting off with a low multiple -- 1 or 2 -- and increasing with every failed attempt, limited by the number of pooled agents. Once a session is successfully created, the cooldown is removed.
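
A rough Python sketch of that cooldown calculation, where the timeout and margin values as well as all names are illustrative assumptions:

    import time

    SESSION_TIMEOUT = 40   # assumed configured timeout, in seconds
    MARGIN = 5             # assumed small margin, in seconds

    def cooldown_until(fail_count, num_agents, now=None):
        """Timestamp before which a failed agent is side-lined."""
        now = time.time() if now is None else now
        # The multiple starts low (1 on the first fail) and grows with
        # every failed attempt, capped by the number of pooled agents.
        multiple = min(fail_count, num_agents)
        return now + multiple * SESSION_TIMEOUT + MARGIN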
Comment 4 Tobias cendio 2025-03-24 11:57:01 CET
(In reply to Tobias from comment #3)

The suggested scheme in part hinges upon the load balancer not providing us with downed agents -- information that must be updated with sufficient frequency. I am unsure whether the current load balancer pruning alters this.
Comment 5 Tobias cendio 2025-03-25 12:18:36 CET
In general, we want to achieve a framework where failing agents are deprioritized temporarily. The deprioritization should decay over time combined with some opportunity for the agent to prove its worth.

Currently, penalty points are awarded to failing agents with a certain decay over time, combined with point reduction upon successful session creation. These points are baked into the rating system, which in effect will deprioritize the agent in the cluster.

This is a solid setup with fairly predictable behavior, at least in terms of how penalty points are leveraged. Combined with the rating system, however – accounting for load and whatnot – it is a bit difficult to predict and document comprehensibly.

That being said, the rating system is scheduled for removal, which warrants reimagining the penalty system somewhat. The "cooldown" approach mentioned in comment #3 is one option. In the following comment, additional schemes that were discussed are outlined, each with their pros and cons.
Comment 6 Tobias cendio 2025-03-25 12:25:37 CET
1. Cooldown
-----------

Failing agents are rewarded with a cooldown timestamp, which must be reached before they can be suggested by the load balancer again. Each successive fail increments the cooldown further, limited by the cluster size. Conversely, upon success the cooldown resets to zero.

Pros for this approach include not suggesting the agent at all during the cooldown period, which is good if there is some serious problem at hand. On the other hand, a small cluster with agents disabled for minutes may lead to bottleneck issues.

2. Penalty points added to sessions
-----------------------------------

Failing agents are rewarded penalty points, with a certain decay over time, combined with point reduction upon successful session creation. The load balancer accounts for points effectively as sessions and adjusts its initial suggested list of best agents accordingly.

One advantage of this scheme is that failing agents aren't deprioritized too much, which is good if the failure was a single occurrence. Still, a drawback is just that: broken agents must accumulate sufficient penalty before being side-lined, so we keep trying broken agents.

3. Freestanding penalty points
------------------------------

Failing agents are rewarded penalty points, with a certain decay over time, combined with point reduction upon successful session creation. The load balancer places any agents marked with penalties last in the list, sorted among each other based on number of penalty points.

This approach has the benefit of not actually disabling agents, avoiding agent scarcity and bottleneck situations. Compared with the other two approaches, this one achieves a sort of middle ground where agents are strongly deprioritized, but will technically still be available. A disadvantage is that failing agents are perhaps penalized too harshly, in particular if the failure was an isolated incident.
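
To make the contrast concrete, condensed Python sketches of how approaches 2 and 3 could translate into load balancer sort keys (attribute names are assumptions for illustration):

    # 'agents' is assumed to be a list of objects carrying num_users
    # and penalty_points attributes.

    def sort_approach_2(agents):
        # Penalty points count effectively as extra users/sessions.
        return sorted(agents, key=lambda a: a.num_users + a.penalty_points)

    def sort_approach_3(agents):
        # Penalized agents go last, sorted among themselves by points.
        return sorted(agents, key=lambda a: (a.penalty_points > 0,
                                             a.penalty_points,
                                             a.num_users))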
Comment 7 Tobias cendio 2025-03-25 15:51:07 CET
(In reply to Tobias from comment #6)

Suggestion #2 should probably refer to users -- not sessions -- considering it is the number of ThinLinc users a future simplified load balancer should distribute evenly.
Comment 8 Tobias cendio 2025-04-01 08:32:10 CEST
Considering the approaches listed in comment #6, number 2 appears to be a good candidate, based on it being more lenient with first-offenders. It seems quite drastic to place a failing agent at the back after a single isolated incident. With a proper number of awarded penalty points, it will move to the back sufficiently fast, essentially achieving the same result while still hedging for flukes.

However, when thinking about how to implement this strategy in the current load balancer, in which we haven't moved away from the old rating system yet, it can seem a bit clumsy. In part, this is because (as mentioned in comment #7) the simplified load balancer should distribute ThinLinc users evenly. Currently, all users (system users included) are accounted for when estimating an agent rating, which makes the rating a less meaningful quantity to add penalty points to, resulting in an unsatisfying outcome and probably a follow-up bug.

This problem exposes the fact that the handler for new sessions, the penalty system, and the load balancing are quite coupled and should be more modular. We essentially strive towards two modular systems – the load balancing and the handling of broken agents – that work independently to distribute users to the agents in a balanced and robust manner. That would facilitate making changes in either one, and perhaps most importantly allow expanding the broken-agents handling to include more ways of deprioritization than the crude penalty points.

An early goal set for the simplified load balancer was that it should be easy to document and easy for users to grasp. Despite that, it is still perfectly fine to have an advanced broken-agents handling system acting on top of the load balancing. Users likely do not need nor want perfect insight into this system as long as the broad strokes are documented, e.g. "the load balancer will be more reserved in picking faulty agents, based on observed problems".
Comment 9 Tobias cendio 2025-04-01 08:55:16 CEST
(In reply to Tobias from comment #8)

A separate bug 8561 was opened for the lack of modularization.
Comment 10 Tobias cendio 2025-04-01 09:22:36 CEST
Continuing the discussion from comment #8.

We likely want to move towards a load balancer that works with metadata surrounding session startup attempts. This would make it much more flexible in dealing with different situations and able to make smart decisions, contrary to the crude catch-all penalty point system currently in place.

For instance, the load balancer could make deprioritization decisions based on, e.g.:
    • timeouts
    • frequency of failures and successes
    • username (or other user-specific metadata)
    • exactly why session startup failed

The list can be extended depending on how sophisticated we'd like the faulty agent handling to be. Achieving this requires us to take steps towards storing such metadata in the load info objects, and moving the onus of agent deprioritization to the load balancer; load info objects are a "natural" home for this information.
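
As a sketch of what such metadata could look like on a load info object (field names are illustrative assumptions, not the actual data model):

    from dataclasses import dataclass, field

    @dataclass
    class SessionStartupAttempt:
        timestamp: float
        username: str
        succeeded: bool
        timed_out: bool = False
        failure_reason: str = ""   # exactly why session startup failed

    @dataclass
    class AgentLoadInfo:
        hostname: str
        num_users: int = 0
        # History of startup attempts, enabling frequency-based decisions.
        attempts: list = field(default_factory=list)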
Comment 11 Tobias cendio 2025-04-01 09:37:53 CEST
(In reply to Tobias from comment #10)

Extensions of faulty agent handling should account for how some basic load distribution scenarios are affected. These may include, at least, combinations of:

Load
    • Near-zero distribution
    • Halfway to limit
    • Skewed substantial distribution
    • Close to limit

Session startup frequency
    • Low
    • Medium
    • High
Comment 12 Tobias cendio 2025-04-09 11:59:36 CEST
If innocuous reasons behind a fail – such as load-based timeouts or something similarly temporary – are relatively common, one can imagine some inertia in the penalty system being reasonable, since we perhaps don't want to severely punish first-offenders. This inertia could take the form of lenient punishment at first, followed by exponentially harsher punishment.

On the other hand, perhaps more serious problems behind fails are overwhelmingly common. Maybe some agent network problems or something arbitrary that isn't particularly temporary in nature. If we're living in that world, then an instant categorical penalty void of inertia is preferred, since we want to separate this agent ASAP and avoid ruining user experiences. This categorical penalty could take the form of being placed in a second sorting bucket, such that the load balancer would sort in the order of:

    1. agent being down
    2. marked by penalty
    3. number of ThinLinc users.

Perhaps – as a first layer of complexity – the second bucket doesn’t even require internal sorting, apart from the regular user-based sorting. If we do seek some internal sorting as an additional layer of complexity, perhaps sorting by fail timestamps is reasonable, since the latest failing agent is likely to still be the most sketchy agent among all the penalty-marked agents.

How and to what rate these penalized agents relax back into a normal state from the penalty category is another question.
Comment 13 Tobias cendio 2025-04-09 16:13:38 CEST
(In reply to Tobias from comment #12)

A small comment regarding the second paragraph: it is likely not network problems that would cause an agent to fail out of the blue, since that effect would already have been caught by the ping mechanism, which would've labeled it as 'down'.

However, there could be other serious problems at hand inhibiting sessions from being established, such as issues with the desktop environment.
Comment 14 Tobias cendio 2025-04-09 16:36:10 CEST
Having discussed different approaches with varying levels of complexity, we decided to go forward with the simple approach of placing single-offending, non-downed agents in a second sorting bucket, analogous to how downed agents are presently treated. The result would be as laid out in comment #12, i.e. 3 buckets in order, with regular user-based internal sorting. The master would then be suggested one unified list of agents.

So what would the mechanism be that returns failed agents to the first bucket? For this, the simple approach of waiting 3 minutes was chosen, which is similar to how long it presently takes for the 5 penalty points awarded from one fail to decay at 1 penalty point per 40 seconds (5 × 40 s = 200 s, i.e. just over 3 minutes).
Comment 16 Tobias cendio 2025-04-10 11:26:39 CEST
> MUST:
> * Faulty agents must be deprioritized
✅ They are sorted into a secondary tier once they have failed.
> * The penalty system must not be inert since fails are likely caused by persistent issues
✅ The penalty system instantly places failing agents in the secondary sort, and does not successively deprioritize as previously.
> * Penalties must have some erosion mechanism that returns an agent to normal sorting within reasonable bounds
✅ If the failure happened more than 180 seconds ago, it will be ignored. Failed agents that manage to start a session will have their failure stamp removed.
> * Faulty agents must not cause serious disturbances during low or high login frequencies
❌ I realized when reading this criterion that while our solution works pretty well for scenarios with high login frequencies, a perpetually failing agent will be a recurring nuisance in scenarios with low login frequencies.
> SHOULD:
> * The penalty system should be able to be toggled off
❌  Not done.
> * The load balancer should log decisions due to the penalty system
❌  Not done.
> * The penalty system should be documented at a high level, leaving out burdening details
❌  Not done.
Comment 17 Tobias cendio 2025-04-10 11:35:55 CEST
(In reply to Tobias from comment #16)
> > * Faulty agents must not cause serious disturbances during low or high login frequencies
> ❌ I realized when reading this criterion that while our solution works
> pretty well for scenarios with high login frequencies, a perpetually failing
> agent will be a recurring nuisance in scenarios with low login frequencies.

Perhaps a series of failures should be stored, instead of simply saving the latest timestamp. This way we can determine if the agent is a repeat offender and use this information to deprioritize it even further somehow.
Comment 19 Tobias cendio 2025-04-10 15:01:04 CEST
Removed the SHOULD criterion "The penalty system should be able to be toggled off" as that felt beyond the scope and is probably simply not something we want to offer.

Added the SHOULD criterion "Using tlctl or tlwebadm, admins should be able to see if agents are failing". This stems from the fact that repeatedly failing agents must be deprioritized for a relatively long time (hours) to mitigate recurring failing agents in scenarios with low login frequency -- see the corresponding MUST criterion.
Comment 23 Tobias cendio 2025-04-11 19:26:04 CEST
Reworded the MUST acceptance criterion 

"Faulty agents must not cause serious disturbances during low or high login frequencies"

into

"Faulty agents must not cause serious disturbances during low or high session startup frequencies"

since 'login' does not necessarily imply new session startup.
Comment 24 Tobias cendio 2025-04-14 09:32:21 CEST
An additional deprioritization tier was added for agents that
consecutively fail session startup beyond a threshold number.

This penalty is meant to target perpetually failing agents that are only
tried once in a while. Such agents are repeatedly given enough time to
return from the first tier of deprioritization, only to fail once
more. Consider for instance scenarios with low session startup
frequencies, say once or twice an hour.

There's a question of whether this second deprioritization tier should account for session startup failures arbitrarily far back in time, insofar as they are consecutive. Another option is to only count failures that fall within a time window.

I think there are good arguments for the former suggestion. Let's say that an agent has a history of consecutive session startup failures, and quite some time passes before it is tried again. If it proves to fail once more, the original problem likely persists, and harsher deprioritization is warranted. Moreover, this approach hedges better for very low session startup frequencies.
Comment 25 Tobias cendio 2025-04-14 11:09:45 CEST
(In reply to Tobias from comment #24)

To clarify one thing: the time aspect brought up at the end refers to how far back in time one should account for failures.

It still stands that agents have to have failed within the last 3 hours to be placed in this second tier of deprioritization.
Comment 26 Tobias cendio 2025-04-14 11:13:25 CEST
Agents are now suggested by the load balancer for session startup,
sorted by 4 tiers of prioritization (all tiers are internally
sorted by number of users); see the sketch after the list:

1. Healthy agents

   • No registered failures within the last 3 minutes.

   • No run of 5 registered consecutive failures, separated arbitrarily
     in time, with the last one within the last 3 hours

2. Failed agents

   • At least one registered failure within the last 3 minutes

3. Consecutively failed agents

   • At least 5 registered consecutive failures separated arbitrarily in
     time, last one being within the last 3 hours

4. Downed agents

   • Doesn't respond to ping
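
A condensed, hypothetical Python sketch of this 4-tier sorting follows; attribute names like is_down, last_failure, consecutive_failures, and num_users are assumptions, not the actual vsmserver code:

    import time

    FAIL_WINDOW = 180          # tier 2: failure within the last 3 minutes
    CONSEC_THRESHOLD = 5       # tier 3: at least 5 consecutive failures...
    CONSEC_WINDOW = 3 * 3600   # ...with the last one within the last 3 hours

    def tier(agent, now=None):
        """Prioritization tier (1 = best) per the list above."""
        now = time.time() if now is None else now
        if agent.is_down:
            return 4           # downed: doesn't respond to ping
        failed_recently = (agent.last_failure is not None
                           and now - agent.last_failure < CONSEC_WINDOW)
        if agent.consecutive_failures >= CONSEC_THRESHOLD and failed_recently:
            return 3           # consecutively failed
        if (agent.last_failure is not None
                and now - agent.last_failure < FAIL_WINDOW):
            return 2           # failed
        return 1               # healthy

    def suggest_agents(agents):
        # Each tier is internally sorted by number of users.
        return sorted(agents, key=lambda a: (tier(a), a.num_users))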
Comment 27 Tobias cendio 2025-04-15 10:43:30 CEST
During some internal discussions, there have been concerns that the current penalty system implementation is based on too broad assumptions. It may be attempting to hedge for scenarios that are difficult to predict with sufficient accuracy.

Based on that, the complexity of the penalty system will be reduced a bit; in particular, the long penalties will be removed, leaving only a short-term penalty for failing agents. Specifically, failing agents will be placed in the second sorting tier for 5 minutes, renewed on every fail.

Remaining are the 3 tiers of priority: healthy, failed, and downed.
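
With this reduction, the tier function from the sketch in comment 26 collapses to something like (same assumed names):

    PENALTY_WINDOW = 300   # 5 minutes, renewed on every fail

    def tier(agent, now):
        if agent.is_down:
            return 3       # downed
        if (agent.last_failure is not None
                and now - agent.last_failure < PENALTY_WINDOW):
            return 2       # failed within the last 5 minutes
        return 1           # healthy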
Comment 28 Tobias cendio 2025-04-15 10:53:25 CEST
(In reply to Tobias from comment #27)

Regarding long penalties in general: an observation surfaced during discussions that such treatments may have an unhelpful facet, as they might hide the problems a bit too extensively from sysadmins. Perhaps sysadmins would be grateful for frequent failure reports from the users, instead of letting problematic agents persist for hours or days.
Comment 35 Tobias cendio 2025-04-15 16:19:03 CEST
> MUST:
> • Failing agents must be deprioritized by the load balancer
✅ Failed agents are placed in a second sorting tier and will be deprioritized in favor of non-failed agents.
> • The new penalty system must be designed with the new load balancer in mind
✅ The new penalty system was designed to better fit the new simplified load balancer, as opposed to simply adding penalty points to agents sorted by number of users.
> • The penalty system must not be inert, since agent issues are likely persistent
✅ The penalty system instantly places failing agents in the secondary sorting tier. It does not progressively deprioritize as before.
> • Penalties must have some erosion mechanism that returns an agent to normal sorting within reasonable bounds
✅ Agents will return to normal sorting after 5 minutes without any failures.
> • Scenarios handled well by the previous penalty system must remain well-handled
✅ The previous penalty system handled scenarios with high session startup frequencies moderately well. This has been improved upon by more decidedly penalizing failing agents, as opposed to progressively.
> SHOULD:
> • Failing agents should not inhibit users in scenarios with low session startup frequencies
❌ Agents with persistent issues are not slapped with long term punishments, and will thus always return to bother users in scenarios with low session startup frequencies.
> • Failing agents should not inhibit users in scenarios with high session startup frequencies
✅ Since the load balancer instantly places any failed agents at the back, users in scenarios with high session startup frequencies won't be particularly inhibited.
> • The load balancer should log decisions due to the penalty system
✅ Session startup results are logged, if relevant for future decisions taken by the load balancer.
> • The penalty system should be documented at a high level, leaving out burdening details
✅ It is mentioned briefly in the TAG.
> COULD:
> • Using tlctl or tlwebadm, admins should be able to see if agents are failing
❌ Considering the short time of the deprioritization, this wouldn't add much to what the log can provide.
Comment 36 Tobias cendio 2025-04-15 16:26:42 CEST
(In reply to Tobias from comment #35)
> > SHOULD:
> > • Failing agents should not inhibit users in scenarios with low session startup frequencies
> ❌ Agents with persistent issues are not slapped with long term punishments,
> and will thus always return to bother users in scenarios with low session
> startup frequencies.

Although this will indeed occur, it is not entirely settled if we view it as a bug or a feature. As mentioned in comment #28, excessively hiding problems could be unhelpful to sysadmins and the cluster as a whole.
Comment 39 Samuel Mannehed cendio 2025-04-23 16:03:49 CEST
Works well. The commits, release notes and documentation look good. I tested using build 4011 on a cluster consisting of 3 agents running CentOS 8.

> MUST:
> • Failing agents must be deprioritized by the load balancer
Failing agents could be either DOWN or have failed to start a new session.

By turning off the vsmagent service, I could demonstrate an agent being DOWN. In this state the agent is furthest deprioritized and the following is logged:
> 2025-04-23 15:05:15 WARNING vsmserver: Error checking if agent lab-44.lkpg.cendio.se is alive: [Errno 111] Connect call failed ('10.48.2.44', 904)
> 2025-04-23 15:05:15 WARNING vsmserver: Marking agent lab-44.lkpg.cendio.se as down
Even when the other agents had more users, they were used before the DOWN agent was tried.
By replacing /opt/thinlinc/libexec/Xvnc with a bash script that sleeps for 50 seconds I could demonstrate an agent failing to create a session. In this state the agent was properly deprioritized and the following was logged:
> 2025-04-23 15:01:53 WARNING vsmserver.loadinfo: Agent lab-44.lkpg.cendio.se failed to start session. Will be less prioritized by the load balancer for 5 minutes
Even when the other agents had more users, they were used before the failing agent was tried.

> MUST:
> • The new penalty system must be designed with the new load balancer in mind
Yes, it suits well. It will place agents in three different "buckets" where none is off-limits for new sessions. The first bucket with the highest priority consists of working agents. The second bucket consists of agents which respond to pings from the master, but have recently failed to start a new session. The third, and least prioritized bucket, consists of agents which do not respond to pings from the master.

> MUST:
> • The penalty system must not be inert, since agent issues are likely persistent
It is. Agents which fail to respond to a ping or to create a session are instantly placed in the less prioritized buckets.

> MUST:
> • Penalties must have some erosion mechanism that returns an agent to normal sorting within reasonable bounds
Yes, they do. The master will indefinitely continue trying to ping DOWN agents every 40 seconds; as soon as they respond, they will be marked as up again. Agents which failed to start a new session will be deprioritized for 5 minutes. I verified that after ~10 minutes, an agent that was earlier deprioritized was again selected for a new session.

> MUST:
> • Scenarios handled well by the previous penalty system must remain well-handled
Yes. The scenarios I could think of were handled well.

> SHOULD:
> • Failing agents should not inhibit users in scenarios with low session startup frequencies
Well, they might with the current design. But see comment 36 for an explanation.

> SHOULD:
> • Failing agents should not inhibit users in scenarios with high session startup frequencies
The current design works well in such scenarios. An agent that failed to create a new session will not be tried for 5 minutes, given that other agents work.

> SHOULD:
> • The load balancer should log decisions due to the penalty system
Yes, warnings to the log are printed when an agent becomes marked as DOWN or is deprioritized due to a failed session start.
When an agent no longer is deprioritized, that state change is not logged, but I don't view that as a problem.

> SHOULD:
> • The penalty system should be documented at a high level, leaving out burdening details
Yep, the TAG says: "An agent failing to start a session will be less prioritized by the load balancer for 5 minutes. The agent will still be available for session startup — only less prioritized."

> COULD:
> • Using tlctl or tlwebadm, admins should be able to see if agents are failing
Only agents which are DOWN are marked as such in tlctl or web admin. An agent that is deprioritized due to a failed session start is not marked differently in tlctl or web admin. But that's fine since it's only a temporary state for 5 minutes.

All good!
