Bug 4429 - The load balancer is overly complex
Summary: The load balancer is overly complex
Status: NEW
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server
Version: 3.4.0
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: MediumPrio
Assignee: Peter Åstrand
URL:
Keywords: focus_loadbalancer
Depends on: 1174 4771 5268 8542 8543 8546
Blocks:
Reported: 2012-10-15 15:46 CEST by Aaron Sowry
Modified: 2025-03-27 09:40 CET
CC List: 3 users

Description Aaron Sowry cendio 2012-10-15 15:46:25 CEST
I'm sure there was already at least one bug for this, but I can't find anything relating to the general problem (although see bugs #2196 and #1174).

Our load balancing algorithm is not great. There are a number of problems:

1) Bogomips is a strange way to measure CPU performance. Throw in things like hyperthreading and it becomes even more problematic.

2) The general algorithm needs to be reviewed. Just because one server can support 4000 more sessions and another can only support 1000 more doesn't mean that we should never start sessions on the weaker server. This also assumes that our rating figure is meaningful in this regard. (A sketch of one alternative follows this list.)

3) The existing_users_weight parameter is backwards, i.e. the higher the value the less each user matters.

4) Load is also affected by I/O, which isn't necessarily relevant to what we're checking.
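
For illustration, one alternative to always picking the agent with the most spare capacity (re point 2) would be a weighted random choice, so that a weaker agent still receives a proportional share of new sessions rather than none. A hypothetical sketch, not a concrete proposal:

import random

def pick_agent(spare_capacity):
    # spare_capacity: agent name -> estimated free session slots,
    # e.g. {"agentA": 4000, "agentB": 1000}; agentB then gets roughly
    # 1 in 5 new sessions instead of zero.
    agents = list(spare_capacity)
    weights = [spare_capacity[a] for a in agents]
    return random.choices(agents, weights=weights)[0]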
Comment 2 Aaron Sowry cendio 2012-10-23 16:47:01 CEST
Moving to NearFuture, so that we remember to revisit this after 4.0.0. See issue 13747.
Comment 3 Pierre Ossman cendio 2015-03-18 15:27:50 CET
(In reply to comment #0)
> 1) Bogomips is a strange way to measure CPU performance. Throw in things like
> hyperthreading and it becomes even more problematic.
> 

Bug 4771.

> 4) Load is also affected by I/O, which isn't necessarily relevant to what we're
> checking

As mentioned, bug 1174.
Comment 4 Pierre Ossman cendio 2015-03-18 15:31:36 CET
See also bug 5268. It has a rough prototype for changing the load balancer to simply pick the agent with the fewest users (not sessions, nor ThinLinc users) on it. Note that it needs work, as it doesn't consider varying machine capabilities, lots of logins in a short time, or putting all sessions for a single user on the same agent.

(there is also still the fundamental question of what the basic principle of the load balancer should be)
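
As a rough illustration (not the actual prototype from bug 5268), the fewest-users idea boils down to something like:

def pick_agent(user_counts):
    # user_counts: agent name -> number of users currently on that agent;
    # how these counts are gathered is left out here.
    return min(user_counts, key=user_counts.get)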
Comment 7 Pierre Ossman cendio 2017-06-12 11:14:13 CEST
We've had some more internal discussion about this, and we've tried to summarise the issues and feedback we've gotten:

 * It's difficult to understand (and configure)
 * It can be overly lopsided if servers differ in (perceived) capacity
 * It doesn't spread risk
 * Some would like their own, arbitrary conditions for selecting agents

Our current system is based on the principle of giving every user as many resources as possible, but it assumes a) that the system measures everything relevant, and b) that the admin knows the resource usage and configures it accordingly.

Changes that could be made:

 * The system tunes itself (addresses b)
 * Equal number of sessions (or users) per agent (addresses a, or balance risk instead of load)
 * Weighted number of sessions per agent (compromise between the current model and a simpler one; see the sketch after this list)
 * Allow a user script to select the agent (let the customer solve the problem)
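
A hypothetical sketch of the weighted variant above (the weighting scheme and names are invented for illustration):

def pick_agent(session_counts, weights):
    # session_counts: agent -> current number of sessions
    # weights: agent -> static capacity weight set by the admin
    # Pick the agent with the lowest sessions-per-weight ratio.
    return min(session_counts, key=lambda a: session_counts[a] / weights[a])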
Comment 9 Aaron Sowry cendio 2019-08-21 02:20:08 CEST
A point of interest, perhaps: a comment from X2Go's config file about how their load values are calculated.

# The load factor calculation uses this algorithm:
#
#                  ( memAvail/1000 ) * numCPUs * typeCPUs
#    load-factor = -------------------------------------- + 1
#                        loadavg*100 * numSessions
#
# (memAvail in MByte, typeCPUs in MHz, loadavg is (system load *100 + 1) as
# positive integer value)
#
# The higher the load-factor, the more likely that a server will be chosen
# for the next to be allocated X2Go session.
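
Transcribed literally into Python (a sketch of the quoted formula only; the variable names are ours, and a guard against division by zero sessions is added):

def x2go_load_factor(mem_avail_mb, num_cpus, cpu_mhz, system_load, num_sessions):
    # loadavg is (system load * 100 + 1), per the X2Go comment above
    loadavg = int(system_load * 100) + 1
    denominator = loadavg * 100 * max(num_sessions, 1)
    return (mem_avail_mb / 1000.0) * num_cpus * cpu_mhz / denominator + 1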

We have also seen evidence that some customers are using cgroups to limit resource usage per user. This may be worth thinking about with regard to load balancing too, as it offers a certain degree of predictability about future resource consumption.
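
For instance (an illustrative systemd configuration, not something taken from this bug), such a cap could be a drop-in applied to every user slice:

# /etc/systemd/system/user-.slice.d/50-limits.conf (hypothetical path and values)
[Slice]
MemoryMax=4G
CPUQuota=200%

With hard caps like these, the worst-case load a single user can add to an agent is known in advance, which a load balancer could exploit.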
Comment 13 Pierre Ossman cendio 2020-10-09 09:56:20 CEST
Also note bug 284, which may or may not be relevant depending on what happens here.
Comment 17 Peter Wirdemo 2022-01-30 11:33:21 CET
I don't think that one "load balancer strategy" will fit all customers.

You could implement a couple of different strategies and let the customer choose which to use.
Implement a loadbalance_strategy variable in vsmserver.hconf.
This could contain a single strategy or a list of strategies:

loadbalance_strategy=usercount,default

If the "best" agents have the same usercount, use the default strategy to select the best agent among these equal agents.

The loadbalance_strategy could be default (the current version), X2GO, usercount, memusage, or other "standards" like round-robin, least-connection, source-ip-hash, or load.

As a bonus, I would like to be able to add "custom", where the customer supplies one or more scripts to be executed on the agents.

loadbalance_strategy=custom,default
loadbalance_custom=/usr/local/bin/myloadbalance.pl,/usr/local/bin/myotherbalancer.pl
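
A hypothetical sketch of how such a strategy chain could be evaluated (the interface is invented; this is not ThinLinc's actual API):

def pick_agent(agents, strategies):
    # agents: candidate agent names
    # strategies: ordered list of scoring functions, each mapping an
    # agent to a lower-is-better score; ties fall through to the next
    # strategy in the list.
    candidates = list(agents)
    for score in strategies:
        best = min(score(a) for a in candidates)
        candidates = [a for a in candidates if score(a) == best]
        if len(candidates) == 1:
            break
    return candidates[0]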
Comment 21 Linn cendio 2025-01-07 17:20:02 CET
One of the most common issues reported is the difficulty of getting an even spread of users across the agent machines. To combat this, we could take a simpler approach to distributing the users.

For example, if there are 3 agent machines (agentA-agentC) in the same cluster, and 4 users (user1-user4) log in shortly after one another, they would be distributed as follows:

 user1 is assigned to agentA
 user2 is assigned to agentB
 user3 is assigned to agentC
 user4 is assigned to agentA
 
Note that this implementation would not take the performance of the machines into consideration, but would only distribute the sessions evenly across the agents.
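
A minimal sketch of that even-spread behaviour (plain round-robin, purely illustrative):

import itertools

agents = itertools.cycle(["agentA", "agentB", "agentC"])
for user in ["user1", "user2", "user3", "user4"]:
    print(user, "is assigned to", next(agents))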

------------------------------

We asked a few people how the above approach would work in their existing setups; below is a summary of their answers.

Some people said this would work better than the current (BogoMips-based) algorithm, since it would better align with their intended use of ThinLinc. A few of them had also tried to tweak the algorithm to get a more even spread of users.

Other people said their setup would be unaffected, as they didn't really use the load balancer but instead handled the load in another way. One way of doing this was using user groups to get more control over the users; another was to use the agents just to display applications running on a separate machine.

For a few people, an even distribution of sessions would not solve their issues: the load of each user depends too much on the type of applications they run, and on whether users running heavy applications end up on the same machine. However, spreading the users evenly would at least not make this worse.
Comment 22 Pierre Ossman cendio 2025-03-12 10:06:31 CET
We're going to start addressing this by making the load balancer a lot simpler, as described in some of the comments above: a simple, even spread of users over all agents.

This will be a first step, and then we'll evaluate what next steps might be worthwhile.
