Bug 8552 - Penalty points need to be reimagined
Summary: Penalty points need to be reimagined
Status: NEW
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server
Version: trunk
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.19.0
Assignee: Tobias
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-03-24 08:59 CET by Tobias
Modified: 2025-04-01 09:37 CEST

See Also:
Acceptance Criteria:
MUST:
* Faulty agents must be deprioritized
SHOULD:
* Add missing documentation


Description Tobias cendio 2025-03-24 08:59:53 CET
The load balancer is being simplified as of bug 4429. Instead of a balancing scheme involving current agent load mixed with penalty points, users will be distributed evenly among agents. Alternatively, distribution may follow a much simpler, more comprehensible approach, e.g. custom weighted distribution.

Thus the concept of penalty points in the context of agent rating and session creation is becoming obsolete.
Comment 1 Tobias cendio 2025-03-24 09:33:25 CET
Requesting a new session involves finding a list of suitable agents. So far, this has been based purely on the current agent rating -- a score accounting for awarded penalty points and agent load, among other parameters.

Penalty points are adjusted in two situations:

(1) If there was an error in session creation, a set of penalty points is awarded, followed by an updated agent rating. The next best agent in line is then attempted -- taken from the original list, independent of the adjusted rating. The number of penalty points awarded is configurable via hiveconf's /vsmserver/sessionfailure_penalty and defaults to 5.

(2) If session creation was successful, a single penalty point is subtracted, followed by an updated agent rating. The number of penalty points subtracted is not configurable in hiveconf and is hardcoded to 1.
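
A minimal sketch of these two cases (Agent, update_rating and on_session_attempt are hypothetical names; only the hiveconf key and the default values above are from the actual setup):

from dataclasses import dataclass

SESSIONFAILURE_PENALTY = 5  # /vsmserver/sessionfailure_penalty (default)
SUCCESS_REDUCTION = 1       # hardcoded, not configurable

@dataclass
class Agent:
    name: str
    penalty_points: int = 0
    rating: float = 0.0

def update_rating(agent):
    # Placeholder: the real rating also accounts for agent load,
    # among other parameters.
    agent.rating = -agent.penalty_points

def on_session_attempt(agent, succeeded):
    if succeeded:
        # Case (2): subtract a single point upon success
        # (assuming points never go below zero).
        agent.penalty_points = max(0, agent.penalty_points - SUCCESS_REDUCTION)
    else:
        # Case (1): award the configured penalty upon failure.
        agent.penalty_points += SESSIONFAILURE_PENALTY
    update_rating(agent)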
Comment 2 Tobias cendio 2025-03-24 10:13:41 CET
(In reply to Tobias from comment #1)

In the process of removing agent ratings, case (2) is unproblematic with a new simplified distribution. However, it is not immediately obvious how to approach case (1), since we likely don't want to perpetually attempt a failing agent that might currently exhibit the lowest number of users.
Comment 3 Tobias cendio 2025-03-24 11:50:43 CET
(In reply to Tobias from comment #2)

Some mechanism to mitigate an unwanted cycle of failed attempts is warranted, keeping in mind that we want to move away from complex, unpredictable schemes.

An idea that came up was to attribute a cooldown to a failed agent, temporarily side-lining it. The cooldown could perhaps be a multiple of the configured timeout plus a small margin, starting off with a low multiple -- 1 or 2 -- and increasing with every failed attempt, limited by the number of pooled agents. Once a session is successfully created, the cooldown is removed.
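
A rough sketch of such a cooldown computation (the names and the timeout value are hypothetical):

TIMEOUT = 30  # placeholder for the configured session startup timeout (s)
MARGIN = 5    # small safety margin on top (s)

def cooldown_seconds(consecutive_failures, num_pooled_agents):
    # The multiple starts low and grows with every failed attempt,
    # limited by the number of pooled agents.
    multiple = min(consecutive_failures, num_pooled_agents)
    return multiple * TIMEOUT + MARGIN

# A successful session creation removes the cooldown entirely.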
Comment 4 Tobias cendio 2025-03-24 11:57:01 CET
(In reply to Tobias from comment #3)

The suggested scheme in part hinges upon the load balancer not providing us with downed agents -- information that must be updated with sufficient frequency. I am unsure whether the current load balancer pruning alters this.
Comment 5 Tobias cendio 2025-03-25 12:18:36 CET
In general, we want to achieve a framework where failing agents are deprioritized temporarily. The deprioritization should decay over time combined with some opportunity for the agent to prove its worth.

Currently, penalty points are awarded to failing agents with a certain decay over time, combined with point reduction upon successful session creation. These points are baked into the rating system, which in effect will deprioritize the agent in the cluster.

This is a solid setup with fairly predictable behavior, at least in terms of how penalty points are leveraged. Combined with the rating system, however – accounting for load and whatnot – it is a bit difficult to predict and document comprehensibly.

That being said, the rating system is scheduled for removal, which warrants reimagining the penalty system somewhat. The "cooldown" approach mentioned in comment #3 is one option. In the following comment, additional schemes that were discussed are outlined, each with their pros and cons.
Comment 6 Tobias cendio 2025-03-25 12:25:37 CET
1. Cooldown
-----------

Failing agents are awarded a cooldown timestamp which must be reached before the agent can be suggested by the load balancer again. Each successive failure increments the cooldown further, limited by the cluster size. Conversely, upon success the cooldown resets to zero.

Pros of this approach include not suggesting the agent at all during the cooldown period, which is good if there is some serious problem at hand. On the other hand, a small cluster with agents disabled for minutes may lead to bottleneck issues.

2. Penalty points added to sessions
-----------------------------------

Failing agents are awarded penalty points, with a certain decay over time, combined with point reduction upon successful session creation. The load balancer effectively accounts for points as sessions and adjusts its initial suggested list of best agents accordingly.

One advantage of this scheme is that failing agents aren't deprioritized too much, which is good if the failure was a single occurrence. Still, a drawback is just that: broken agents must accumulate sufficient penalty before skedaddling, so we keep trying broken agents.
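
A hypothetical sketch of how penalty points could be folded into the load balancer's ordering under this scheme (the agent attributes are assumed names):

def effective_sessions(agent):
    # Penalty points are counted as if they were extra sessions, so a
    # failing agent sinks in the suggested list without being excluded.
    return agent.session_count + agent.penalty_points

def suggest_agents(agents):
    # Least loaded first, penalties included.
    return sorted(agents, key=effective_sessions)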

3. Freestanding penalty points
------------------------------

Failing agents are awarded penalty points, with a certain decay over time, combined with point reduction upon successful session creation. The load balancer places any agents marked with penalties last in the list, sorted among each other by number of penalty points.

This approach has the benefit of not actually disabling agents, avoiding agent scarcity and bottleneck situations. Compared with the other two approaches, it achieves a sort of middle ground where agents are strongly deprioritized but technically still available. A disadvantage is that failing agents are perhaps penalized too much, in particular if it was an isolated incident.
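
Again as a hypothetical sketch (assumed attribute names), this ordering could look like:

def suggest_agents(agents):
    clean = [a for a in agents if a.penalty_points == 0]
    penalized = [a for a in agents if a.penalty_points > 0]
    clean.sort(key=lambda a: a.user_count)          # normal balancing
    penalized.sort(key=lambda a: a.penalty_points)  # least penalized first
    return clean + penalized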
Comment 7 Tobias cendio 2025-03-25 15:51:07 CET
(In reply to Tobias from comment #6)

Suggestion #2 should probably refer to users -- not sessions -- considering it is the number of ThinLinc users that a future simplified load balancer should distribute evenly.
Comment 8 Tobias cendio 2025-04-01 08:32:10 CEST
Considering the approaches listed in comment #6, number 2 appears to be a good candidate, based on it being more lenient with first offenders. It seems quite drastic to place a failing agent at the back after a single isolated incident. With a proper number of awarded penalty points, it will move to the back sufficiently fast anyway, essentially achieving the same result while still hedging for flukes.

However, when thinking about how to implement this strategy in the current load balancer, in which we haven't yet moved away from the old rating system, it can seem a bit clumsy. In part, this is because (as mentioned in comment #7) the simplified load balancer should distribute ThinLinc users evenly. Currently, all users (system users included) are accounted for when estimating an agent rating, which makes the rating a less meaningful quantity to add penalty points to, resulting in an unsatisfying outcome and probably a follow-up bug.

This problem exposes the fact that the handler for new sessions, the penalty system, and the load balancing are quite coupled and should be more modular. We essentially strive towards two modular systems – the load balancing and the handling of broken agents – that work independently to distribute users to the agents in a balanced and robust manner. That would facilitate making changes in either one, and perhaps most importantly allow expanding the broken-agent handling to include more ways of deprioritization than the crude penalty points.
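
As a sketch of the kind of separation meant here (all class and attribute names are hypothetical):

class LoadBalancer:
    """Distributes ThinLinc users evenly; knows nothing about failures."""
    def suggest(self, agents):
        return sorted(agents, key=lambda a: a.user_count)

class BrokenAgentHandler:
    """Deprioritizes agents with observed problems, independently of how
    the load balancer orders them."""
    def reorder(self, suggested):
        # Stable sort: equal-penalty agents keep the balancer's order.
        return sorted(suggested, key=lambda a: a.penalty_points)

def pick_agents(agents, balancer, handler):
    return handler.reorder(balancer.suggest(agents))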

An early goal set for the simplified load balancer was that it should be easy to document and easy for users to grasp. Despite that, it is still perfectly fine to have a more advanced broken-agent handling system acting on top of the load balancing. Users likely do not need nor want perfect insight into this system as long as the broad strokes are documented, e.g. "the load balancer will be more reserved in picking faulty agents, based on observed problems".
Comment 9 Tobias cendio 2025-04-01 08:55:16 CEST
(In reply to Tobias from comment #8)

A separate bug 8561 was opened for the lack of modularization.
Comment 10 Tobias cendio 2025-04-01 09:22:36 CEST
Continuing the discussion from comment #8.

We likely want to move towards a load balancer that works with metadata surrounding session startup attempts. This would make it much more flexible in dealing with different situations and allow it to make smart decisions, in contrast to the crude catch-all penalty point system currently in place.

For instance, the load balancer could make deprioritization decisions based on, e.g.:
    • timeouts
    • frequency of failures and successes
    • username (or other user-specific metadata)
    • exactly why session startup failed

The list can be extended depending on how sophisticated we'd like the faulty agent handling to be. Achieving this requires taking steps towards storing such metadata in the load info objects, and moving the onus of agent deprioritization to the load balancer; load info objects seem a "natural" fit in this regard.
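
A hypothetical sketch of what such metadata on a load info object could look like (all names are illustrative):

import time
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StartupAttempt:
    username: str
    succeeded: bool
    timestamp: float
    timed_out: bool = False
    failure_reason: Optional[str] = None  # exactly why startup failed

@dataclass
class LoadInfo:
    # Existing load figures would live here as well.
    attempts: List[StartupAttempt] = field(default_factory=list)

    def failure_rate(self, window):
        """Fraction of failed attempts within the last `window` seconds."""
        cutoff = time.time() - window
        recent = [a for a in self.attempts if a.timestamp >= cutoff]
        if not recent:
            return 0.0
        return sum(not a.succeeded for a in recent) / len(recent)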
Comment 11 Tobias cendio 2025-04-01 09:37:53 CEST
(In reply to Tobias from comment #10)

Extensions of faulty agent handling should account for how some basic load distribution scenarios are affected. These may include, at least, combinations of the dimensions below; a small enumeration sketch follows the lists.

Load
    • Near-zero distribution
    • Halfway to limit
    • Skewed substantial distribution
    • Close to limit

Session startup frequency
    • Low
    • Medium
    • High
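
A small sketch enumerating these combinations, e.g. as a starting point for test scenarios (the labels are just shorthand for the items above):

from itertools import product

LOADS = ["near-zero", "halfway-to-limit", "skewed-substantial",
         "close-to-limit"]
FREQUENCIES = ["low", "medium", "high"]

for load, freq in product(LOADS, FREQUENCIES):
    print(f"scenario: load={load}, startup-frequency={freq}")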
