Bug 7834 - Commandline tool for listing current loadbalancing status
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Other
Version: trunk
Hardware: PC Linux
Importance: P2 Normal
Target Milestone: 4.15.0
Assignee: Tobias
URL:
Keywords: ossman_tester, relnotes
Depends on:
Blocks: 3707
Reported: 2022-02-09 13:43 CET by Samuel Mannehed
Modified: 2023-04-28 16:38 CEST

See Also:
Acceptance Criteria:
* The admin should be able to use tlctl from the master to see load information about the agents in the cluster.
* It should be possible to compare the number of users per agent in the cluster.
* Agents which are down should be clearly marked as down.
* Agents in different subclusters must be listed separately.
* Using the load status data from tlctl the admin should be able to predict decisions made by the load balancer.
* The load output from tlctl should be easy to understand.
* The tlctl subcommand should give a helpful error message when trying to run without root/sudo permissions.
* The tlctl subcommand should give a helpful error message when the vsmserver service isn't running.
* Output should fit standard terminal width (80 columns)


Description Samuel Mannehed cendio 2022-02-09 13:43:23 CET
There is no command-line tool for showing load balancing numbers for agents in a cluster. Currently, you have to view them through the tlwebadm GUI, which can be a bit inconvenient.
Comment 1 Samuel Mannehed cendio 2022-02-09 13:46:31 CET
The current idea is to make this a read-only tlctl module that simply outputs the cluster load information found in tlwebadm.
Comment 13 Samuel Mannehed cendio 2022-02-18 11:13:31 CET
User stories
============

1. As an admin I want to see which agent has the most users currently.

2. A user reported that he got thrown out from his session and got a completely
   new session when reconnecting. As an admin I want to get an overview of the
   health of the cluster.

3. My monitoring software has indicated issues with one of the agents. As an
   admin I want to know what ThinLinc Master thinks about the state of the agent.

4. As an admin I want to have an overview of the load numbers on the agents
   in my cluster to decide how to tweak the load balancer configuration.

5. As an admin I want to see if my tweaked load balancer configuration behaves
   as I expect it to and predict which agent the load balancer will choose for
   the next new session.
Comment 22 Samuel Mannehed cendio 2022-02-24 14:45:21 CET
One thing that we haven't considered so far is subclusters. Since the load balancer works within each subcluster it's irrelevant to compare load numbers of two agents in different subclusters.

Technically this might be worth giving some extra thought. The load balancer itself and tlwebadm both use hiveconf to check the subcluster configuration. The load data from loadinfokeeper does not include subcluster information today.

I see two options:

 1) Include subcluster information in load data from loadinfokeeper. Either as a new field in the data for each agent, or by grouping the load data by subclusters.

 2) Use hiveconf to check the configuration. Note that this might impose further future limitations on tlctl since hiveconf only works on the local machine, and only ThinLinc Master servers contain relevant subcluster configuration.
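
To illustrate what grouping the load data by subcluster could look like, here is a minimal sketch. The data structures and the function name are hypothetical and are not taken from loadinfokeeper or hiveconf; they only show the idea of combining per-agent load data with a subcluster-to-agents mapping.

```python
from collections import defaultdict

# Hypothetical inputs: neither structure is taken from the ThinLinc code base.
# agent_loads: per-agent load data, as a load data source might report it.
# subclusters: subcluster name -> agent hostnames, as read from configuration.
agent_loads = {
    "agent1.example.com": {"users": 12},
    "agent2.example.com": {"users": 3},
    "agent3.example.com": {"users": 7},
}
subclusters = {
    "Default": ["agent1.example.com", "agent2.example.com"],
    "Lab": ["agent3.example.com"],
}


def group_by_subcluster(agent_loads, subclusters):
    """Return {subcluster: {agent: load}} for agents that have load data."""
    grouped = defaultdict(dict)
    for name, agents in subclusters.items():
        for agent in agents:
            if agent in agent_loads:
                grouped[name][agent] = agent_loads[agent]
    return dict(grouped)
```

With the grouping done up front, each subcluster can be printed as its own table, so load numbers are only ever compared within a subcluster.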
Comment 23 Samuel Mannehed cendio 2022-02-24 14:51:50 CET
(In reply to Samuel Mannehed from comment #22)
>  2) Use hiveconf to check the configuration. Note that this might impose
> further future limitations on tlctl since hiveconf only works on the local
> machine, and only ThinLinc Master servers contain relevant subcluster
> configuration.

Linn pointed out that we actually already use hiveconf in tlctl's __init__.py to get the vsm_server_port. This means we have, in effect, already decided that using hiveconf in tlctl is fine.
Comment 32 Tobias cendio 2022-03-02 09:00:30 CET
> The load output from tlctl should be easy to understand.
Regarding the acceptance criterion above, we are considering the following changes:
- Sorted output: --sort option?
- Number of users: at least specify if system or thinlinc users
- Faulty values when down: some columns show e.g. -100%
- Memory % row output order: in details view, above or below contributing memory
- Sentence case header case: consistent choice
- Align details output: explore different types of align layouts
- Perhaps show instead of details: considering using verbs like show
- Details header: might look better with header
Comment 77 Tobias cendio 2022-03-25 10:47:37 CET
Acceptance criteria:
====================
* The admin should be able to use tlctl from the master to see load information about the agents in the cluster.

Detailed individual agent information achieved through '$ sudo tlctl load show <agent>', and less detailed information about all agents through '$ sudo tlctl load list' with additional sorting options for load list.

* It should be possible to compare the number of users per agent in the cluster.

Number of users displayed in the output mentioned above -- does not differentiate between system users and thinlinc users.

* Agents which are down should be clearly marked as down.

Values of downed agents emphasized to be non-existent and the agent rating displayed as 'DOWN'.

* Agents in different subclusters must be listed separately.

List is produced once for every subcluster, with subcluster included in the agent header if multiple subclusters are active. 

* Using the load status data from tlctl the admin should be able to predict decisions made by the load balancer.

The list subcommand tabulates individual agent ratings. Combined with further explanation of the rating quantity in the man page, an admin should be able to predict roughly how the loadbalancer will distribute additional sessions. 

* The load output from tlctl should be easy to understand.

Output aspects considered include consistent case use, truncation of excessive elements, logical order (particularly in subcommand show output), column alignments, labels clarified in terminal output and in man page, missing values emphasized, sorting option introduced for large outputs. 

* The tlctl subcommand should give a helpful error message when trying to run without root/sudo permissions.

The error text states that data retrieval was denied, but not how to obtain permission -- although that might be obvious to most people.

* The tlctl subcommand should give a helpful error message when the vsmserver service isn't running.

If the server is down, the user is met with a statement that the connection was refused.

* Output should fit standard terminal width (80 columns)

Elements in the table output from tlctl load list are truncated so the whole table fits within 80 characters, provided the number of columns does not increase tenfold in the future. Output from tlctl load show <agent> is not in danger of exceeding this limit.
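
As a rough illustration of the two error-message criteria above, the sketch below maps low-level exceptions to hints an admin can act on. The message texts, the function names, and the port number are assumptions made for the example and are not the actual tlctl implementation.

```python
import socket
import sys


def describe_failure(exc):
    """Map a low-level exception to a hint the admin can act on."""
    if isinstance(exc, PermissionError):
        return "Permission denied. Try running the command as root or with sudo."
    if isinstance(exc, ConnectionRefusedError):
        return "Connection refused. Is the vsmserver service running?"
    return "Unexpected error: %s" % exc


def fetch_load(host, port):
    """Hypothetical stand-in for the request sent to the master."""
    with socket.create_connection((host, port), timeout=5):
        pass  # The real request and response handling would go here.


def main(host="localhost", port=9000):
    # The port is a placeholder; the real value would come from configuration.
    try:
        fetch_load(host, port)
    except OSError as exc:
        # PermissionError and ConnectionRefusedError are OSError subclasses.
        print(describe_failure(exc), file=sys.stderr)
        return 1
    return 0
```

Keeping the mapping in one function makes it easy to check that each failure mode produces a message that names a remedy, not just the symptom.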
Comment 78 Pierre Ossman cendio 2022-03-29 09:57:48 CEST
There is something amiss with the formatting in the man page for tlctl load. Needs to have another look.
Comment 80 Tobias cendio 2022-03-31 08:54:11 CEST
Documentation concerning comment #78 has been revised.
Comment 81 Pierre Ossman cendio 2022-03-31 09:37:16 CEST
The documentation content is nice, but there are some stylistic things that stick out:

1. All other definition lists on the man pages are formatted as normal sentences (i.e. capital initial letter and full stop), so it looks very out of place when the title descriptions don't follow this style. It would also more naturally allow multiple sentences when needed.

2. There is no clear delimiter between the title descriptions and options for "tlctl load list". A paragraph or title delimiting things would be nice here.

3. Possible values to --sort aren't properly tagged as literals, so they don't get the correct formatting. Could possibly avoid mentioning them by just referring to the list of titles earlier on the page?

4. Some title descriptions just repeat the title, which just looks odd. Perhaps it's possible to elaborate a bit more?
Comment 82 Pierre Ossman cendio 2022-03-31 14:19:16 CEST
Testing done on RHEL 7 with one master and three agents.

 ✓ Information matches what is shown in tlwebadm¹, both in "list" and "show"²
 ✓ Agents are grouped per subcluster³
 ✓ No subcluster information is shown when there is just one
 ✓ Non-ASCII in subcluster names
 ✗ Non-ASCII in subcluster names with C locale
 ✓ Giving an unknown agent gives a proper error
 ✓ Downed agent is correctly shown both in "list" and "show"
 ✓ Sorting of agent names is case insensitive
 ✗ Sorting doesn't handle extreme cases like "åäö" or "€uro", compared to e.g. "ls"
 ✓ --sort works correctly for all headers
 ✓ --sort places DOWN at the bottom for all sorting by numerical value (everything except "agent")
 ✓ Agent name is truncated to make everything fit at 80 columns

¹ The agents had different resources, and I also generated various loads to show the numbers truly differed among agents.

² An overloaded CPU will show a percentage above 100%. This is probably confusing since you might think it is using the "top" method where maximum is #cores × 100%, but here 150% means it has 50% more queued than it can deal with.

³ The way subcluster is displayed is a bit odd though. It looks weird since it is in lowercase, but everything else is in uppercase. Were any other models explored?
Comment 87 Tobias cendio 2022-04-04 16:21:54 CEST
Resolved
(1) documentation formatting (comment #81)
(2) CPU load clarified by capping at 100% (comment #82)
(3) the way subcluster information is presented (comment #82)
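
A minimal sketch of the capping in (2), assuming the percentage is derived from the load average divided by the number of cores. That derivation is an assumption based on the description in comment #82, and the function name is hypothetical.

```python
def cpu_percent(load_average, cores):
    """Return CPU load as a percentage of capacity, capped at 100.

    Assumes load is expressed as load average per core; this is an
    illustration, not the formula used by ThinLinc.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return min(100, round(load_average / cores * 100))
```

Under this assumption, a load average of 6 on a 4-core agent would have been reported as 150% before the change and is reported as 100% after it.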
Comment 88 Linn cendio 2022-04-05 10:30:31 CEST
(In reply to Pierre Ossman from comment #82)
> Testing done on RHEL 7 with one master and three agents.
> ...
>  ✗ Non-ASCII in subcluster names with C locale

This has been fixed in commit r38165 for the umbrella tlctl bug 3707. Tested on RHEL 8, and we now replace unprintable chars with '?' instead of getting a traceback when printing non-ascii subcluster names.
Comment 89 Linn cendio 2022-04-05 10:34:29 CEST
(In reply to Linn from comment #88)
> This has been fixed in commit r38165 for the umbrella tlctl bug 3707.
Correction, this was fixed in commit r38179 for bug 425 and bug 7833, which should have also been part of this bug.
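
The '?' replacement described in comment #88 can be sketched with Python's standard encode/decode error handling. This is only an illustration of the technique; the helper name is hypothetical and the referenced commit may well do it differently.

```python
def printable(text, encoding):
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)
```

In practice the encoding would be taken from the output stream, for example sys.stdout.encoding. With an ASCII-only terminal encoding, a subcluster named "Malmö" is printed as "Malm?" instead of raising UnicodeEncodeError.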
Comment 91 Pierre Ossman cendio 2022-04-07 11:18:10 CEST
Retested on a bunch of RHEL 7 machines:

 ✓ Information matches what is shown in tlwebadm, both in "list" and "show"
 ✓ CPU now caps at 100%, even though I have an overloaded machine
 ✓ Agents are grouped per subcluster (looks nice now)
 ✗ Subcluster order is not consistent
 ✓ No subcluster information is shown when there is just one
 ✓ Non-ASCII in subcluster names
 ✓ Non-ASCII in subcluster names with C locale
 ✓ Non-ASCII in agent names
 ✓ Non-ASCII in agent names with C locale
 ✓ Non-ASCII agent name to "show"
 ✓ Non-ASCII agent name to "show" with C locale
 ✓ Giving an unknown agent gives a proper error
 ✓ Downed agent is correctly shown both in "list" and "show"
 ✓ Sorting of agent names is case insensitive
 ✗ Sorting doesn't handle extreme cases like "åäö" or "€uro", compared to e.g. "ls"
 ✓ --sort works correctly for all headers
 ✓ --sort places DOWN at the bottom for all sorting by numerical value (everything except "agent")
 ✓ Agent name is truncated to make everything fit at 80 columns
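
For reference, the "DOWN at the bottom" behaviour tested above could be implemented along these lines. The row format, with None standing in for the missing values of a downed agent, is a hypothetical choice for this sketch.

```python
def sort_agents(rows, column, reverse=False):
    """Sort rows by a numeric column, keeping agents that are down last.

    Rows are dicts; a downed agent has None in every numeric column
    (an assumption made for this sketch).
    """
    up = [row for row in rows if row[column] is not None]
    down = [row for row in rows if row[column] is None]
    up.sort(key=lambda row: row[column], reverse=reverse)
    return up + down
```

Splitting the rows before sorting keeps downed agents at the bottom regardless of sort direction, which a single sort key with a sentinel value would not do when reversed.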
Comment 93 Pierre Ossman cendio 2022-04-08 12:33:17 CEST
I did some digging and found how Python does sorting of strings:

> Strings (instances of str) compare lexicographically using the numerical
> Unicode code points (the result of the built-in function ord()) of their
> characters.

https://docs.python.org/3/reference/expressions.html#value-comparisons

This is not particularly intuitive once you go outside of ASCII. E.g.:

> $ python3 -c 'print(sorted(["å", "ä", "ö"]))'
> ['ä', 'å', 'ö']

Their sorting tutorial mentions how to handle sorting of strings properly though:

> For locale aware sorting, use locale.strxfrm() for a key function or
> locale.strcoll() for a comparison function.

https://docs.python.org/3/howto/sorting.html#odd-and-ends

I think this is a bit obscure and easily overlooked though, so I've filed this issue with Python:

https://bugs.python.org/issue47259
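
A minimal sketch of the locale-aware approach from the sorting tutorial, using locale.strxfrm() as the key function. The helper name is hypothetical, and the resulting order depends on which locales are installed on the machine.

```python
import locale

# Use the collation rules from the user's environment. If the configured
# locale is not available, keep the default "C" rules instead.
try:
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    pass


def locale_sorted(names):
    """Sort strings according to the current LC_COLLATE setting."""
    return sorted(names, key=locale.strxfrm)
```

With a Swedish locale such as sv_SE.UTF-8 active, one would expect locale_sorted(["å", "ä", "ö"]) to follow the Swedish alphabet order (å, ä, ö), whereas plain sorted() always orders by code point.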
Comment 102 Tobias cendio 2022-04-13 17:05:06 CEST
Recent changes:
- Subcluster sorting
- Subclusters displayed in subcommand show
- Numerous columns won't crash the truncation method
- Always preserve 3 leading characters after truncation, instead of potentially ending up with an element of simply '...'. 
- Non-ascii locale sorting of agent and subcluster names
Comment 110 Tobias cendio 2022-04-14 15:03:53 CEST
Recent changes:
- Subcluster information from subcommand show is now displayed correctly.
- Subcluster information in docs concerning subcommand show.
- Table truncation modified to preserve 1 character -- previously 3.
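
The truncation rule described above, where at least one leading character survives, can be sketched as follows. The function name and the use of a three-dot ellipsis as the marker follow the description in comment #102; the rest is an assumption for illustration.

```python
ELLIPSIS = "..."


def truncate(text, width, keep=1):
    """Shorten text to at most width characters, ending in '...'.

    At least `keep` leading characters are always preserved, so the
    result can exceed `width` when width < keep + len(ELLIPSIS).
    """
    if len(text) <= width:
        return text
    visible = max(keep, width - len(ELLIPSIS))
    return text[:visible] + ELLIPSIS
```

For example, truncate("agent-with-long-name.example.com", 12) gives "agent-wit...", and even with a very narrow column the result is "a..." and never a bare "...".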
Comment 115 Pierre Ossman cendio 2022-04-22 14:30:36 CEST
Retested on the RHEL 7 machines again:

 ✓ No subcluster information is shown when there is just one (both for "list" and "show")
 ✓ Subcluster information is included in "show"
 ✓ Multiple subclusters are included in "show"
 ✓ Subcluster order follows locale
 ✓ Agent sorting follows locale (same as "ls")
 ✓ Agent name is truncated to make everything fit at 80 columns


Looks good!
