Bug 4243 - vsm startup race: login fails if vsmagent is not ready when vsmserver starts
Summary: vsm startup race: login fails if vsmagent is not ready when vsmserver starts
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server
Version: 3.2.0
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.12.1
Assignee: Pierre Ossman
URL:
Keywords: frifl_tester, relnotes
Depends on: 4244
Blocks:
Reported: 2012-03-29 13:24 CEST by Peter Åstrand
Modified: 2021-01-21 12:45 CET
CC: 2 users

See Also:
Acceptance Criteria:
* New sessions should be attempted on the top two agents, based on this ordering:
  1. agents where the user already has one or more sessions, in order of number of sessions
  2. agents which are up, in order of rating
  3. agents which are down, in order of rating
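The ordering in the acceptance criteria can be sketched as a single sort key. This is a hedged illustration only: the `Agent` class, `rating`, `is_up`, and `user_sessions` names are made up for the example and are not the actual vsmserver data model.

```python
# Illustrative sketch of the acceptance-criteria ordering; all names here
# are hypothetical, not the real vsmserver API.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    is_up: bool          # last load check succeeded
    rating: float        # load rating, higher is better
    user_sessions: int   # sessions this user already has on the agent

def order_agents(agents):
    """Order agents for a new session attempt:
    1. agents where the user already has sessions, most sessions first;
    2. remaining agents that are up, best rating first;
    3. agents marked down, best rating first."""
    def key(a):
        if a.user_sessions > 0:
            group = 0
        elif a.is_up:
            group = 1
        else:
            group = 2
        # Within a group, more sessions / higher rating sorts first.
        return (group, -a.user_sessions, -a.rating)
    return sorted(agents, key=key)

agents = [
    Agent("a1", is_up=True,  rating=0.9, user_sessions=0),
    Agent("a2", is_up=False, rating=0.5, user_sessions=2),
    Agent("a3", is_up=False, rating=0.8, user_sessions=0),
    Agent("a4", is_up=True,  rating=0.4, user_sessions=0),
]
# Only the top two candidates are actually tried.
candidates = order_agents(agents)[:2]
```

Note that, per the criteria, an agent with existing user sessions ranks first even when it is currently marked down.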


Attachments

Description Peter Åstrand cendio 2012-03-29 13:24:01 CEST
We have had this problem for a long time. If you start vsmserver before vsmagent, or if vsmagent is a tad slow to start, vsmserver will immediately mark the agent as down:

2012-03-29 04:00:20 INFO vsmserver: VSM Server version 3.3.0post build 3428 started
2012-03-29 04:00:20 INFO vsmserver.license: Updating license data from disk to memory
2012-03-29 04:00:20 INFO vsmserver.license: License summary: 10 concurrent users. Hard limit of 11 concurrent users. 
2012-03-29 04:00:20 WARNING vsmserver.loadinfo: Connection refused (ECONNREFUSED) talking to VSM Agent 127.0.0.1:904 in request for loadinfo. Marking as down.

This means that it is impossible to login:

2012-03-29 04:00:57 INFO vsmserver.session: User with uid 1000 (astrand) requested a new session
2012-03-29 04:00:57 WARNING vsmserver: No working agents found trying to start new session for astrand

You need to wait for a full load update cycle before login works. I think this is an error in vsmserver: it shouldn't be so quick to mark the agent as down. Perhaps it should also retry an agent even if it's down, if no other agents are available.
Comment 2 Pierre Ossman cendio 2020-07-09 11:02:12 CEST
This might have gotten worse because of bug 4290 as we might now start before the network is up. See bug 7531 for some more details.
Comment 3 Pierre Ossman cendio 2021-01-04 15:42:14 CET
I see two main approaches here:

 1. Don't check things right away, allowing vsmagent time to get going

 2. Add some handling for agents currently marked as down


The first approach doesn't really solve the race, it just makes it less likely. So the question is: how long is sufficient to wait? And what should happen if a client connects before we've done the initial check?


The second approach is probably easy to get going, since we already retest agents before we actually try to use them. However, we then have to decide how to treat unresponsive agents compared to responsive ones. And if an agent is still unresponsive, what timeouts do we have, and how do they impact the user?
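A minimal sketch of what "retest an agent marked down before using it" could look like, assuming a plain TCP connect probe with a short timeout. The `probe_agent` helper, `AGENT_PORT`, and `PROBE_TIMEOUT` are illustrative assumptions, not vsmserver code.

```python
# Hypothetical probe for the "retest down agents before use" approach.
import socket

AGENT_PORT = 904      # VSM Agent port, as seen in the logs above
PROBE_TIMEOUT = 5.0   # deliberately shorter than the regular load check

def probe_agent(host, port=AGENT_PORT, timeout=PROBE_TIMEOUT):
    """Return True if the agent accepts a TCP connection within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers both connection refused and timeouts.
        return False
```

A real implementation would probably issue an actual loadinfo request rather than a bare connect, but the timeout trade-off is the same: a short probe bounds how long an unresponsive agent can stall a login.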
Comment 4 Pierre Ossman cendio 2021-01-04 16:21:38 CET
If we want to make as minimal a change as possible, then unresponsive agents should not be tried until we've tried the responsive ones. We only try at most two agents, so in most cases there should be no change in behaviour at all, as the cluster is large enough to hide one or two unresponsive agents.

The worst case scenario should be that all agents are unresponsive. Each check takes 40 seconds to time out, so a total of 80 seconds can pass before we pass back a failure to the client. The client has no timeout, so it will happily sit there until we are done.
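The worst-case figure above follows directly from the two constants involved (both taken from this comment, not from the vsmserver source):

```python
# Worst case: every candidate agent is unresponsive.
CHECK_TIMEOUT = 40     # seconds before a single agent check times out
MAX_AGENTS_TRIED = 2   # we try at most two agents per session request

worst_case = CHECK_TIMEOUT * MAX_AGENTS_TRIED  # 80 seconds
```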

If more than one agent is unresponsive, then we also might want to consider which of those to prefer. Our load information will be stale, since the agents are unresponsive. Should we compare them using some other metric?
Comment 5 Pierre Ossman cendio 2021-01-05 10:44:04 CET
Oddly enough, we currently don't care whether an agent is down or not if the user already has a session on it. Since we want to group a user's sessions on the same machine, this agent gets priority.

Note that this is only relevant if the user is allowed to have multiple sessions.
Comment 9 Niko Lehto cendio 2021-01-07 13:56:18 CET
During testing of the solution, I found that if both vsmagent.service instances in a two-agent cluster are down and one is then started up, vsmserver.log keeps writing these two lines (every 40 seconds) even after the vsmagent.service on 127.0.0.1 is up and working:
>> 2021-01-07 13:09:40 WARNING vsmserver.loadinfo: [Errno 111] Connection refused talking to VSM Agent 127.0.0.1:904 in request for loadinfo. Marking as down.
>> 2021-01-07 13:09:40 WARNING vsmserver.loadinfo: [Errno 111] Connection refused talking to VSM Agent 10.48.2.205:904 in request for loadinfo. Marking as down.

The same scenario on 4.12.0 server keeps printing only the agent that is actually down.
Comment 10 Niko Lehto cendio 2021-01-08 13:21:50 CET
(In reply to Niko Lehto from comment #9)
I was a little hasty here; looking more closely at the logs, it is apparent that the vsmagent.service on 127.0.0.1 was also down, which makes this log message expected. So disregard my previous comment.

I could reproduce the issue where we could not connect to vsmagent.service if vsmserver.service was started first on a 4.12.0 server.
I also tested the nightly build on a RHEL 8 system as follows:

- Tested starting vsmserver.service before starting vsmagent.service in a system with one agent.

- Also tested alternately starting and stopping vsmagent.service on clusters of two and three agents.

In all cases, tlclient could find and connect to functioning agents.
Comment 11 Niko Lehto cendio 2021-01-08 16:13:14 CET
Also tested the acceptance criteria:
> agents where the user already has one or more sessions, in order of number of sessions
Ok
> agents which are up, in order of rating
Ok
> agents which are down, in order of rating
Ok
Comment 12 Pierre Ossman cendio 2021-01-08 16:27:42 CET
Seems to work like we want it now.
Comment 13 Frida Flodin cendio 2021-01-11 15:55:35 CET
Reproduced on Ubuntu 20.04 with a 4.12.0 server. I can also verify that the issue is fixed when upgrading to nightly build 6714. Also tested the acceptance criteria:

> * New sessions should be attempted on the top two agents, based on this
>   ordering:
> 
>   1. agents where the user already has one or more sessions, in order of number
>      of sessions
Yes. I tested that when the agent with the user's sessions is down, we still try that one first. If the user has sessions on two agents that are both down, a third running agent will not be tried. If the user has sessions on more than one agent, the one with more sessions is tried first.

>   2. agents which are up, in order of rating
Yes

>   3. agents which are down, in order of rating
Yes
