I'm sure there was already at least one bug for this, but I can't find anything relating to the general problem (although see bugs #2196 and #1174).
Our load balancing algorithm is not great. There are a number of problems:
1) Bogomips is a strange way to measure CPU performance. Throw in things like hyperthreading and it becomes even more problematic.
2) The general algorithm needs to be reviewed. Just because one server can support 4000 more sessions and another only 1000 more doesn't mean that we should never start sessions on the weaker server. This also assumes that our rating figure is meaningful in this regard.
3) The existing_users_weight parameter is backwards, i.e. the higher the value the less each user matters.
4) Load is also affected by I/O, which isn't necessarily relevant to what we're checking.
Moving to NearFuture, so that we remember to revisit this after 4.0.0. See issue 13747.
(In reply to comment #0)
> 1) Bogomips is a strange way to measure CPU performance. Throw in things like
> hyperthreading and it becomes even more problematic.
> 4) Load is also affected by I/O, which isn't necessarily relevant to what we're
> checking
As mentioned, bug 1174.
See also bug 5268. It has a rough prototype that changes the load balancer to simply pick the agent with the fewest users (not sessions, nor ThinLinc users) on it. Note that it needs work, as it doesn't consider varying machine capabilities, many logins in a short time, or putting all sessions for a single user on the same agent.
(there is also still the fundamental question of what the basic principle of the load balancer should be)
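To make the "fewest users" idea concrete, here is a minimal sketch. It is not the prototype from bug 5268; the function name, the (agent, username) session tuples and the name-based tie-break are all assumptions made for illustration. It also shares the limitations noted above: no machine capabilities, no handling of login bursts, no user affinity.

```python
from collections import defaultdict


def pick_agent_by_user_count(agents, sessions):
    """Return the agent that currently hosts the fewest distinct users.

    agents:   iterable of agent names (hypothetical identifiers)
    sessions: iterable of (agent, username) pairs, one per running session
    """
    users_per_agent = defaultdict(set)
    for agent, username in sessions:
        # A set counts each user once, however many sessions they have.
        users_per_agent[agent].add(username)

    # Assumption: ties are broken by agent name so the result is deterministic.
    return min(sorted(agents), key=lambda a: len(users_per_agent[a]))


if __name__ == "__main__":
    running = [
        ("agent1", "alice"), ("agent1", "bob"),
        ("agent2", "carol"), ("agent2", "carol"), ("agent2", "carol"),
    ]
    # agent2 has more sessions but only one user, so it is chosen.
    print(pick_agent_by_user_count(["agent1", "agent2"], running))
```

Counting users rather than sessions means an agent with one heavy multi-session user still looks "empty", which is one reason the basic principle needs to be settled first.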
We've had some more internal discussion about this, and we've tried to summarise the issues and feedback we've gotten:
* It's difficult to understand (and configure)
* It can be overly lopsided if servers differ in (perceived) capacity
* It doesn't spread risk
* Some would like their own, arbitrary conditions for selecting agents
Our current system is based on the principle of giving every user as many resources as possible, but it assumes that a) the system measures everything relevant, and b) the admin knows the resource usage and configures it accordingly.
Changes that could be made:
* The system tunes itself (addresses b)
* Equal number of sessions (or users) per agent (addresses a, or balance risk instead of load)
* Weighted number of sessions per agent (compromise between current model and simpler one)
* Allow a user script to select the agent (let the customer solve the problem)
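As an illustration of the "weighted number of sessions" option, the sketch below picks the agent with the lowest sessions-per-weight ratio. The function name and the idea of an admin-assigned positive weight per agent are assumptions for this example, not an existing setting.

```python
def pick_agent_weighted(session_counts, weights):
    """Pick the agent with the lowest sessions/weight ratio.

    session_counts: dict of agent name -> current number of sessions
    weights:        dict of agent name -> positive relative capacity
                    (hypothetical, admin-assigned)
    """
    def ratio(agent):
        return session_counts.get(agent, 0) / weights[agent]

    # Assumption: ties are broken by agent name so the result is deterministic.
    return min(sorted(weights), key=ratio)


if __name__ == "__main__":
    weights = {"big": 4.0, "small": 1.0}
    # big: 6/4 = 1.5, small: 2/1 = 2.0 -> "big"
    print(pick_agent_weighted({"big": 6, "small": 2}, weights))
    # big: 8/4 = 2.0, small: 1/1 = 1.0 -> "small"
    print(pick_agent_weighted({"big": 8, "small": 1}, weights))
```

With a weight of 4 the larger agent ends up with roughly four times as many sessions, but the smaller one still gets its share instead of being starved until the larger one fills up.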
A point of interest, perhaps: a comment from X2Go's config file about how their load values are calculated.
# The load factor calculation uses this algorithm:
#                ( memAvail/1000 ) * numCPUs * typeCPUs
#  load-factor = -------------------------------------- + 1
#                      loadavg*100 * numSessions
# (memAvail in MByte, typeCPUs in MHz, loadavg is (system load *100 + 1) as
# positive integer value)
# The higher the load-factor, the more likely that a server will be chosen
# for the next to be allocated X2Go session.
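For comparison purposes, the formula in that comment can be transcribed roughly as follows. This is a literal reading of the comment, not X2Go's actual code; the function and parameter names are made up, and the handling of zero sessions is an assumption since the comment does not say what happens in that case.

```python
def x2go_load_factor(mem_avail_mb, num_cpus, cpu_mhz, system_load, num_sessions):
    """Load factor as described in the quoted X2Go config comment.

    A higher value means the server is more likely to be chosen.
    """
    # "loadavg is (system load * 100 + 1) as positive integer value"
    loadavg = int(system_load * 100) + 1

    # Assumption: treat zero sessions as one to avoid dividing by zero.
    sessions = max(num_sessions, 1)

    numerator = (mem_avail_mb / 1000) * num_cpus * cpu_mhz
    denominator = loadavg * 100 * sessions
    return numerator / denominator + 1


if __name__ == "__main__":
    # Same hardware and session count, different system load.
    print(x2go_load_factor(8000, 4, 2000, 0.5, 10))
    print(x2go_load_factor(8000, 4, 2000, 2.0, 10))
```

Note that this mixes free memory, CPU count, clock speed, load average and session count into one number, so it shares problems 1 and 4 from comment #0.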
We have also seen evidence that some customers are using cgroups to limit resource usage per user. This may be worth thinking about with regard to load balancing too, as it offers a certain degree of predictability about future resource consumption.
Also note bug 284, which may or may not be relevant depending on what happens here.
I don't think that one "load balancer strategy" will fit all customers.
You could implement a couple of different strategies and let the customer choose which one to use.
Implement a variable loadbalance_strategy in vsmserver.hconf.
This could contain a single strategy or a list of strategies.
With a list, later entries act as tie-breakers: for example, if the "best" agents have the same usercount, use the default strategy to select the best agent among these equal agents.
The loadbalance_strategy could be default (the current version), X2GO, usercount, memusage, or other "standard" strategies such as round-robin, least-connection, source-ip-hash or load.
As a bonus, I would like to be able to add "custom", where the customer supplies one or more scripts to be executed on the agents.
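A rough sketch of how a list of strategies with tie-breaking could work is shown below. Everything here is hypothetical: the space-separated value format, the agent dictionaries, and the two scoring functions are assumptions for illustration, and nothing like this exists in vsmserver today.

```python
# Hypothetical scoring functions: lower score is better.
STRATEGIES = {
    "usercount": lambda agent: agent["users"],
    "memusage": lambda agent: agent["mem_used_mb"],
}


def parse_strategies(value):
    """Turn e.g. "usercount memusage" into a list of scoring functions.

    Assumption: strategies are given as a space-separated list.
    """
    return [STRATEGIES[name] for name in value.split()]


def select_agent(agents, strategies):
    """Apply each strategy in order, keeping only the best-scoring agents.

    Later strategies are only used to break ties left by earlier ones.
    Assumes `agents` is a non-empty list.
    """
    candidates = list(agents)
    for score in strategies:
        best = min(score(a) for a in candidates)
        candidates = [a for a in candidates if score(a) == best]
        if len(candidates) == 1:
            break
    # If ties remain after all strategies, fall back to list order.
    return candidates[0]


if __name__ == "__main__":
    agents = [
        {"name": "agent1", "users": 3, "mem_used_mb": 500},
        {"name": "agent2", "users": 3, "mem_used_mb": 200},
        {"name": "agent3", "users": 5, "mem_used_mb": 100},
    ]
    chosen = select_agent(agents, parse_strategies("usercount memusage"))
    print(chosen["name"])  # agent1 and agent2 tie on users; memusage picks agent2
```

A "custom" entry could fit the same shape if the customer script returns a numeric score per agent, but how that script would be run on the agents is left open here.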