Bug 4768 - Windows load balancing doesn't take the number of CPU:s into account when calculating load averages
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: WTS Tools (deprecated)
Version: trunk
Hardware: PC Unknown
Importance: P2 Major
Target Milestone: 4.1.1
Assignee: Karl Mikaelsson
URL:
Keywords: astrand_tester, prosaic
Depends on:
Blocks:
 
Reported: 2013-08-13 18:32 CEST by Karl Mikaelsson
Modified: 2013-11-19 08:52 CET

Description Karl Mikaelsson cendio 2013-08-13 18:32:22 CEST
(Reported on the ThinLinc-technical mailing list by Jens Langner - thanks!)

The load average numbers for Windows servers have a tendency to drop far into negative numbers, because the algorithm does not take the number of CPUs into account when doing the calculations.

The problematic part of the algorithm is this:

> free_bogomips = EST_BOGOMIPS * (1 - loadinfo.loadavg)

loadinfo.loadavg is a value reported from the Windows side. What it actually means and what range it should have is a bit unclear, judging from comments 18 and 20 of bug 3864. It used to be a number in the range from 0 to 1, but is now a value from 0 to the number of cores in the Windows system.

The load balancer, however, still assumes that the value is 0 for no load and 1 for full load on all cores. As servers gain more and more cores, the load balancer will consider a server fully loaded when it is at merely 1/(number of cores) of its capacity.
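
To illustrate the problem, here is a minimal sketch of the calculation quoted above. The value of EST_BOGOMIPS and both function names are assumptions made for this example and are not taken from the ThinLinc sources. It shows the result turning negative when the load is reported in the range 0 to the number of cores, and how scaling by the core count brings it back into the expected range.

```python
# Illustrative values only: EST_BOGOMIPS and the function names are
# assumptions for this sketch, not ThinLinc's actual code.
EST_BOGOMIPS = 4000.0


def free_bogomips(loadavg):
    """Formula from the report; only valid if loadavg is in the range 0..1."""
    return EST_BOGOMIPS * (1 - loadavg)


def free_bogomips_scaled(loadavg, num_cpus):
    """Same formula after scaling a 0..num_cpus load back to 0..1."""
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    return EST_BOGOMIPS * (1 - loadavg / num_cpus)


# An 8-core server reporting a load of 2.0, i.e. a quarter of its capacity.
print(free_bogomips(2.0))            # negative: server looks overloaded
print(free_bogomips_scaled(2.0, 8))  # positive: three quarters still free
```

With the unscaled formula, the same 8-core server already shows zero free capacity at a reported load of 1.0, which is the 1/(number of cores) effect described above.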
Comment 2 Pierre Ossman cendio 2013-08-14 08:28:34 CEST
For reference, the loadavg we report from VSM agent is adjusted on the agent side, not the master. It's probably best to use the same principle here.
Comment 3 Karl Mikaelsson cendio 2013-09-04 11:06:54 CEST
Fixed in r27829.

With regards to comment #2: I decided against changing the nrpe_nt code back to reporting 0..1 because it would affect all other users of nrpe_nt.
Comment 4 Peter Åstrand cendio 2013-10-15 09:24:44 CEST
(In reply to comment #3)
> Fixed in r27829.
> 
> With regards to comment #2: I decided against changing the nrpe_nt code back to
> reporting 0..1 because it would affect all other users of nrpe_nt.

This is confusing; it's better to revert to the earlier behaviour, i.e. how it worked before:

r116 | hean01 | 2012-05-25 08:28:02 +0200 (fre, 25 maj 2012) | 4 lines
Comment 5 Karl Mikaelsson cendio 2013-10-15 12:52:07 CEST
Fixed in r28035, r28036.
Comment 6 Peter Åstrand cendio 2013-10-17 12:54:39 CEST
Looks good now.
Comment 7 Jens Maus 2013-11-18 11:40:14 CET
As the initial reporter of this bug, and having just installed 4.1.1 on our systems, I am curious about the actual status of the Windows load balancing algorithm in 4.1.1. As I don't have access to the ThinLinc sources, I can only guess from the comments above what was actually changed. To me it seems that nothing was changed and that the behaviour of tl-best-winserver and check_nrpe is the same as in 4.1.0.

Is this actually the case? If so, why wasn't it changed, and why was this bug closed? And if not, what was actually changed in the algorithm?
Comment 8 Karl Mikaelsson cendio 2013-11-18 14:04:04 CET
(In reply to comment #7)
> As I have been the initial reporter of this bug and I just installed 4.1.1 on
> our systems I am curious what might be the actual status of affairs regarding
> the windows load balancing algorithm in 4.1.1? As I don't have access to the
> sources of ThinLinc I can only try to guess from the comments above about what
> was actually changed and to me it seems nothing was actually changed and the
> behavior of the tl-best-winserver and check_nrpe functionality is actually the
> same like in 4.1.0?!?
> 
> Is this actually the case and if so, why wasn't it changed and this bug closed?
> And if not, what was actually changed in the algorithm?

Hi Jens,

The initial fix for this bug was to scale the load value back into the
range of 0-1 from 0-<cpus>. I initially solved this by scaling the
value I received from wts-tools on the "client" side (on the ThinLinc
server). However, not everyone was happy with this solution, which led
me to revert my own fix, and then later to revert the change in
nrpe_nt which had changed the reported load value from 0-1 to 0-<cpus>.

Since all changes in the 4.1.1 release happened on the Windows side of
things, this means you also need to upgrade wts-tools to 4.1.1 when
you upgrade your ThinLinc server to 4.1.1. Perhaps this wasn't
communicated clearly enough in the comments here or in the
release notes.
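
Because the server side now expects a value in the range 0-1, a script that consumes the reported load could guard against a wts-tools installation that has not been upgraded. The following is only a sketch under that assumption: the function name and the dictionary of loads are hypothetical and do not reflect the real tl-best-winserver or check_nrpe interfaces.

```python
def pick_best_server(reported_loads):
    """Return the name of the least loaded server.

    reported_loads maps a server name to its reported load, which is
    assumed to be in the range 0..1 (the behaviour after this fix).
    """
    if not reported_loads:
        raise ValueError("no servers to choose from")
    for name, load in reported_loads.items():
        if not 0.0 <= load <= 1.0:
            raise ValueError(
                "%s reported load %r outside 0..1; "
                "is wts-tools on that server older than 4.1.1?" % (name, load)
            )
    return min(reported_loads, key=reported_loads.get)


print(pick_best_server({"win1": 0.75, "win2": 0.25}))  # win2
```

Note that this check only catches obvious mismatches: a lightly loaded server that still reports 0-<cpus> can produce a value below 1 and pass unnoticed.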
Comment 9 Jens Maus 2013-11-19 08:52:42 CET
(In reply to comment #8)

> The initial fix for this bug was to scale the load value back into the
> range of 0-1 from 0-<cpus>. I initially solved this by scaling the
> value I received from wts-tools on the "client" side (on the ThinLinc
> server). However everyone wasn't happy with this solution, which led
> me to reverting my own fix, and then later reverting the change in
> nrpe_nt which changed the load value reported from 0-1 to 0-<cpus>.

Thanks for that information. Now it's clear to me what exactly was changed and that the load_avg value returned by check_nrpe will only be between 0 and 1. Thus, I changed my own 'tl-best-winserver' script to reflect that change with ThinLinc 4.1.1.

For reference, and in case you are interested in reviewing or somehow integrating my tl-best-winserver script with ThinLinc (it might be interesting for some users), please find the latest version here:

https://github.com/hzdr/thinstation/blob/master/ts/5.1/packages/hzdr/bin/scripts/tl-best-winserver

To explain why we have our own version of tl-best-winserver:

1. On our thin clients (Thinstation-based) we run our own GUI, which lets users either connect directly to our Windows terminal servers via xfreerdp or, if they choose to connect to a Linux server, use ThinLinc instead. Thus we needed a way to query our Windows servers for the same load balancing information that ThinLinc uses internally.
2. We needed a tl-best-winserver command-line program that allows overriding the username, which is currently not possible with the version shipped with ThinLinc.

> Since all changes in the 4.1.1 release happened on the Windows side of
> things, this means you also need to upgrade wts-tools to 4.1.1 when
> you upgrade your ThinLinc server to 4.1.1. Perhaps this wasn't
> communicated in a clear enough way from the comments here or the
> release notes.

Indeed, neither the release notes nor the comments here were particularly clear on that.