Bug 4429 - The load balancer is overly complex
Summary: The load balancer is overly complex
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server
Version: 3.4.0
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.19.0
Assignee: Samuel Mannehed
URL:
Keywords: hanli_tester, linma_tester, relnotes
Depends on: 1174 4771 5268 8542 8543 8546
Blocks:
Reported: 2012-10-15 15:46 CEST by Aaron Sowry
Modified: 2025-04-25 14:40 CEST
CC List: 6 users

See Also:
Acceptance Criteria:
MUST
* The load balancer must select the agent with the lowest number of ThinLinc users.
* The load balancer choice must be made on up-to-date information.

SHOULD
* Old infrastructure and information should be cleared up and removed.
* Remaining code should be refactored to fit the new load balancer.

COULD
* It would be nice if the load balancer could make the decision without having to contact the agents.


Attachments

Description Aaron Sowry cendio 2012-10-15 15:46:25 CEST
I'm sure there was already at least one bug for this, but I can't find anything relating to the general problem (although see bugs #2196 and #1174).

Our load balancing algorithm is not great. There are a number of problems:

1) Bogomips is a strange way to measure CPU performance. Throw in things like hyperthreading and it becomes even more problematic.

2) The general algorithm needs to be reviewed. Just because one server can support 4000 more sessions and another can only support 1000 more, doesn't mean that we should never start sessions on the weaker server. This also assumes that our rating figure is meaningful in this regard.

3) The existing_users_weight parameter is backwards, i.e. the higher the value the less each user matters.

4) Load is also affected by I/O, which isn't necessarily relevant to what we're checking.
Comment 2 Aaron Sowry cendio 2012-10-23 16:47:01 CEST
Moving to NearFuture, so that we remember to revisit this after 4.0.0. See issue 13747.
Comment 3 Pierre Ossman cendio 2015-03-18 15:27:50 CET
(In reply to comment #0)
> 1) Bogomips is a strange way to measure CPU performance. Throw in things like
> hyperthreading and it becomes even more problematic.
> 

Bug 4771.

> 4) Load is also affected by I/O, which isn't necessarily relevant to what we're
> checking

As mentioned, bug 1174.
Comment 4 Pierre Ossman cendio 2015-03-18 15:31:36 CET
See also bug 5268. It has a rough prototype for changing the load balancer to simply pick the agent with the fewest users (not sessions, nor ThinLinc users) on it. Note that it needs work as it doesn't consider varying machine capabilities, lots of logins in a short time, nor putting all sessions for a single user on the same agent.

(there is also still the fundamental question of what the basic principle of the load balancer should be)
Comment 7 Pierre Ossman cendio 2017-06-12 11:14:13 CEST
We've had some more internal discussion about this, and we've tried to summarise the issues and feedback we've gotten:

 * It's difficult to understand (and configure)
 * It can be overly lopsided if servers differ in (perceived) capacity
 * It doesn't spread risk
 * Some would like their own, arbitrary conditions for selecting agents

Our current system is based on the principle of giving every user as many resources as possible, but it assumes that a) the system measures everything relevant, and b) the admin knows the resource usage and configures it accordingly.

Changes that could be made:

 * The system tunes itself (addresses b)
 * Equal number of sessions (or users) per agent (addresses a, or balance risk instead of load)
 * Weighted number of sessions per agent (compromise between current model and simpler one; see the sketch below)
 * Allow a user script to select the agent (let the customer solve the problem)
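
Purely as a hypothetical illustration of the "weighted number of sessions per agent" option, a minimal Python sketch; the agent names and weights are invented, and nothing here is existing ThinLinc code.

# Hypothetical sketch of "weighted number of sessions per agent".
# Agent names and weights are invented for illustration only.
agents = {
    "agentA": {"sessions": 12, "weight": 1.0},   # baseline machine
    "agentB": {"sessions": 20, "weight": 2.0},   # perceived to have twice the capacity
}

def weighted_load(info):
    # Fewer sessions per unit of weight means a less loaded agent.
    return info["sessions"] / info["weight"]

best = min(agents, key=lambda name: weighted_load(agents[name]))
print(best)  # agentB: 20 / 2.0 = 10 is lower than 12 / 1.0 = 12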
Comment 9 Aaron Sowry cendio 2019-08-21 02:20:08 CEST
A point of interest, perhaps: a comment from X2Go's config file about how their load values are calculated.

# The load factor calculation uses this algorithm:
#
#                  ( memAvail/1000 ) * numCPUs * typeCPUs
#    load-factor = -------------------------------------- + 1
#                        loadavg*100 * numSessions
#
# (memAvail in MByte, typeCPUs in MHz, loadavg is (system load *100 + 1) as
# positive integer value)
#
# The higher the load-factor, the more likely that a server will be chosen
# for the next to be allocated X2Go session.
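
As a sanity check, the formula above transcribed into a small Python snippet; all input values are made up for the example and have nothing to do with any real deployment.

# Sketch of the X2Go load-factor formula quoted above; example values only.
mem_avail_mb = 16000       # memAvail in MByte
num_cpus = 8               # numCPUs
type_cpus_mhz = 2400       # typeCPUs in MHz
system_load = 1.5          # raw load average
num_sessions = 10          # numSessions

loadavg = int(system_load * 100) + 1   # "system load * 100 + 1" per the comment

load_factor = (mem_avail_mb / 1000) * num_cpus * type_cpus_mhz \
              / (loadavg * 100 * num_sessions) + 1
print(load_factor)   # higher load-factor => more likely to get the next session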

We have also seen evidence that some customers are using cgroups to limit resource usage per user. This may be worth thinking about with regard to load balancing too, as it offers a certain degree of predictability about future resource consumption.
Comment 13 Pierre Ossman cendio 2020-10-09 09:56:20 CEST
Also note bug 284, which may or may not be relevant depending on what happens here.
Comment 17 Peter Wirdemo 2022-01-30 11:33:21 CET
I don't think that one "load balancer strategy" will fit all customers.

You could implement a couple of different strategies and let the customer choose which to use.
Implement a variable loadbalance_strategy in vsmserver.hconf.
This could contain a single strategy or a list of strategies:

loadbalance_strategy=usercount,default

If the "best" agents have the same usercount, use the default strategy to select the best agent among these equal agents.

The loadbalance_strategy could be default (the current version), X2GO, usercount, memusage, or other "standards" like round-robin, least-connection, source-ip-hash or load.

As a bonus, I would like to be able to add "custom", where the customer supplies one or more scripts to be executed on the agents.

loadbalance_strategy=custom,default
loadbalance_custom=/usr/local/bin/myloadbalance.pl,/usr/local/bin/myotherbalancer.pl
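
A rough, hypothetical Python sketch of how such a strategy chain (e.g. loadbalance_strategy=usercount,default) might be evaluated; none of the function or field names below exist in ThinLinc, they only illustrate the proposal.

# Hypothetical chained load-balancing strategies: each strategy narrows the
# candidate list, and the next one is used to break ties.
def by_usercount(agents):
    fewest = min(a["users"] for a in agents)
    return [a for a in agents if a["users"] == fewest]

def by_default(agents):
    # Stand-in for the current rating-based algorithm; here it just keeps
    # the first candidate.
    return agents[:1]

STRATEGIES = {"usercount": by_usercount, "default": by_default}

def pick_agent(agents, strategy_chain=("usercount", "default")):
    candidates = list(agents)
    for name in strategy_chain:
        candidates = STRATEGIES[name](candidates)
        if len(candidates) == 1:
            break
    return candidates[0]

agents = [{"host": "agentA", "users": 3},
          {"host": "agentB", "users": 2},
          {"host": "agentC", "users": 2}]
print(pick_agent(agents)["host"])   # agentB: tie on usercount, "default" breaks the tie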
Comment 21 Linn cendio 2025-01-07 17:20:02 CET
One of the most commonly reported issues is difficulty getting an even spread of users on agent machines. To combat this, we could take a simpler approach to distributing the users.

For example, if there are 3 agent machines (agentA-agentC) in the same cluster, and 4 users (user1-user4) log in in quick succession, they would be distributed as follows:

 user1 is assigned to agentA
 user2 is assigned to agentB
 user3 is assigned to agentC
 user4 is assigned to agentA
 
Note that this implementation would not take the performance of the machines into consideration, but only distribute the sessions evenly across the agents.
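
A minimal Python sketch of that behaviour, assuming the only input is the current user count per agent (names taken from the example above):

# Minimal sketch of the even-spread idea: always pick the agent with the
# fewest users. Agent names follow the example above.
user_count = {"agentA": 0, "agentB": 0, "agentC": 0}

def assign(user):
    # Pick the agent that currently has the fewest users.
    agent = min(user_count, key=user_count.get)
    user_count[agent] += 1
    return agent

for user in ["user1", "user2", "user3", "user4"]:
    print(user, "->", assign(user))
# user1 -> agentA, user2 -> agentB, user3 -> agentC, user4 -> agentA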

------------------------------

We asked a few people how the above approach would work in their existing setups, and below is a summary of their answers.

Some people said this would work better than the current algorithm (BogoMips), since it would better align with their intended use of ThinLinc. A few of them had also tried to tweak the algorithm to get a more even spread of users.

Other people said their setup would be unaffected, as they didn't really use the load balancer but instead handled the load in another way. One way of doing this was to use user groups to get more control over the users; another was to use the agents only to display applications running on a separate machine.

For a few people, an even distribution of sessions would not solve their issues - the load of each user depends too much on the type of applications they run, and on whether users running heavy applications end up on the same machine. However, spreading the users evenly would at least not make this worse.
Comment 22 Pierre Ossman cendio 2025-03-12 10:06:31 CET
We're going to start addressing this by making the load balancer a lot simpler, as described in some of the comments. I.e. a simple even spread of users over all agents.

This will be a first step, and then we'll evaluate what next steps might be worthwhile.
Comment 33 Samuel Mannehed cendio 2025-04-03 12:52:40 CEST
The load balancer has been reworked now to choose the agent with the lowest number of users. The old load information has been removed.

For now, the penalty points act like "extra users", but this might be redesigned as part of bug 8552.

The LoadInfoKeeper class could be completely removed. The "get_load" call was reworked to "ping" since it is still used for determining if an agent is down or not.

Without knowing whether an agent is up, there would be no way for the master to know if it should send verify_sessions. The information about "down" agents is also used by the load balancer to place such agents at the end of the queue; this way we can avoid having to wait for timeouts.
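
A hypothetical sketch of that ordering: down agents are placed last and penalty points are counted as extra users. The field names are invented for illustration and do not reflect the actual code.

# Hypothetical ordering of agents: reachable agents first, then fewest
# effective users, where penalty points count as extra users.
agents = [
    {"host": "agentA", "users": 2, "penalty": 0, "down": False},
    {"host": "agentB", "users": 1, "penalty": 3, "down": False},
    {"host": "agentC", "users": 0, "penalty": 0, "down": True},
]

def queue_order(agent):
    # False sorts before True, so agents believed to be up come first.
    return (agent["down"], agent["users"] + agent["penalty"])

for agent in sorted(agents, key=queue_order):
    print(agent["host"])
# agentA (2), then agentB (1 + 3 = 4), then agentC (down, tried last)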
Comment 36 Samuel Mannehed cendio 2025-04-04 13:41:02 CEST
> SHOULD
>
> * There should be a way to weigh differently equipped agents in different ways.
This was moved to bug 8566.
Comment 37 Samuel Mannehed cendio 2025-04-04 13:46:41 CEST
I tested build 3981 on a cluster with three CentOS 8 agents. The new system works as intended and will consistently distribute users evenly across the agents.

> MUST
> 
> * The load balancer must select the agent with the lowest number of ThinLinc users.
Yes. Note that it doesn't take multiple sessions into account. This was a deliberate choice, since we want to continue grouping multiple sessions for the same user on the same agent.
> * The load balancer choice must be made on up-to-date information.
Yes, it directly consults the sessionstore now instead of asking the agents for load information. The sessionstore is not instantly updated when a new session is created - it waits for the agent to report that the session was successfully created. Testing on tl.cendio.se shows that this takes less than a second.

> SHOULD
> 
> * Old infrastructure and information should be cleared up and removed.
Yes. The old load balancing information is removed, and classes facilitating this have either been deleted or reworked.
> * Remaining code should be refactored to fit the new load balancer.
Yes.

> COULD
> 
> * It would be nice if the load balancer could make the decision without having to contact the agents.
It can; note, however, that the sessionstore information will indirectly wait for information from the agents.
Comment 39 Hannes cendio 2025-04-09 13:39:42 CEST
Tested on Ubuntu 24.04 with server build 3983. Setup was 1 master with 3 agents.

> MUST
> * The load balancer must select the agent with the lowest number of ThinLinc users.
Yes. We tested logging in 10 users and saw that they were spread evenly. We also tested that an empty agent was prioritised while it had fewer users than the other agents. 

When multi-session is enabled, all sessions for one user were placed on the same agent, regardless of agent load.

> * The load balancer choice must be made on up-to-date information.
We tested logging in multiple users quickly (one login every second) and saw that users were distributed evenly.

> SHOULD
> * Old infrastructure and information should be cleared up and removed.
> * Remaining code should be refactored to fit the new load balancer.
Yes, webadmin and tlctl only contain relevant information. We also looked through the code; only relevant bits remain.

> COULD
> * It would be nice if the load balancer could make the decision without having to contact the agents.
As stated in comment 37, the load balancer doesn't directly communicate with the agent.

---

Also checked that the logging of vsmserver and vsmagent looks good. Documentation and release notes have been updated as well.
Comment 40 Frida Flodin cendio 2025-04-14 12:39:27 CEST
We should probably get rid of the configuration variable "load_update_cycle" now that we don't monitor load in the same way.
Comment 42 Samuel Mannehed cendio 2025-04-15 12:27:51 CEST
(In reply to Frida Flodin from comment #40)
> We should probably get rid of the configuration variable "load_update_cycle"
> now that we don't monitor load in the same way.

Commit looks good, release note was also updated. I can't find any trace of "load_update_cycle" anymore.

The ping/down mechanism still works, tested jenkins build 3993 on Fedora 41.
Comment 43 Samuel Mannehed cendio 2025-04-22 11:32:49 CEST
When handler_newsession fails a ping when creating a new session, the agent isn't marked as down like before. Seems to have been a miss in r41679.
Comment 45 Samuel Mannehed cendio 2025-04-22 12:45:37 CEST
(In reply to Samuel Mannehed from comment #43)
> When handler_newsession fails a ping when creating a new session, the agent
> isn't marked as down like before. Seems to have been a miss in r41679.
Fixed now.
Comment 50 Samuel Mannehed cendio 2025-04-23 16:47:54 CEST
For the tester: the issue in comment 43 should be fixed, and log messages have been adjusted.
Comment 51 Tobias cendio 2025-04-25 14:40:34 CEST
Tested the latest ping issue fix using server build #4020 on Fedora 41.

Followed these steps when testing that a failed ping sets an agent to DOWN:

    1. Verified using tlwebadm and tlctl load that the agent was up
    2. Stopped the vsmagent service
    3. Attempted a session startup
    4. Verified the vsmserver log reporting agent being DOWN
    5. Verified agent being DOWN using tlwebadm and tlctl load

For testing that a successful ping reverts DOWN for an agent, I followed these steps:

    1. Verified agent being DOWN using tlwebadm and tlctl load
    2. Started the vsmagent service
    3. Attempted a session startup
    4. Verified the vsmserver log reporting agent being back up again
    5. Verified using tlwebadm and tlctl load that the agent was up

In conclusion, if an agent ping fails during a new session startup, its status will indeed be changed to DOWN. Conversely, if an agent ping succeeds for a presently DOWN agent during a new session startup, its DOWN status is reverted. 

Checked the relevant code and the altered log messages – everything appears to be in order. Closing.
