Bug 4380 - restarting vsmserver and vsmagent at the same time causes sessions to be lost
Summary: restarting vsmserver and vsmagent at the same time causes sessions to be lost
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server (show other bugs)
Version: 3.4.0
Hardware: PC Unknown
: P2 Normal
Target Milestone: 4.1.1
Assignee: Karl Mikaelsson
URL:
Keywords: hean01_tester, relnotes
Depends on:
Blocks:
 
Reported: 2012-08-22 09:25 CEST by Karl Mikaelsson
Modified: 2013-10-24 10:50 CEST (History)
2 users (show)

See Also:
Acceptance Criteria:


Attachments

Description Karl Mikaelsson cendio 2012-08-22 09:25:05 CEST
Reported in Issue 13605.

When both vsmserver and vsmagent are restarted simultaneous (this happens weekends on Ubuntu according to reports), vsmserver tries to verify sessions loaded from file with a vsmagent that hasn't started yet. Log extracts below.

2012-08-19 04:13:14 INFO vsmserver: Got SIGTERM, signaling process to quit
2012-08-19 04:13:14 INFO vsmserver: Terminating. Have a nice day!
2012-08-19 04:13:17 INFO vsmserver: VSM Server version 3.4.0post build 3568 started
2012-08-19 04:13:17 INFO vsmserver.license: Updating license data from disk to memory
2012-08-19 04:13:18 INFO vsmserver.license: License summary: 10 concurrent users. Hard limit of 11 concurrent users.
2012-08-19 04:13:18 INFO vsmserver.session: Loaded 2 sessions for 2 users from file
2012-08-19 04:13:18 WARNING vsmserver.session: Connection refused (ECONNREFUSED) talking to VSM Agent 127.0.0.1:904 in request to verify session for moetiker.
2012-08-19 04:13:18 INFO vsmserver.session: Session for moetiker on 127.0.0.1 have not been updated for 13474 seconds. Removing.
2012-08-19 04:13:18 WARNING vsmserver.loadinfo: Connection refused (ECONNREFUSED) talking to VSM Agent 127.0.0.1:904 in request for loadinfo. Marking as down.
2012-08-19 04:13:18 WARNING vsmserver.session: Connection refused (ECONNREFUSED) talking to VSM Agent 127.0.0.1:904 in request to verify session for zaucker.
2012-08-19 04:13:18 INFO vsmserver.session: Session for zaucker on 127.0.0.1 have not been updated for 13726 seconds. Removing.

2012-08-19 04:13:14 INFO vsmagent: Got SIGTERM, signaling process to quit
2012-08-19 04:13:14 INFO vsmagent: Terminating. Have a nice day!
2012-08-19 04:13:18 INFO vsmagent: VSM Agent version 3.4.0post build 3568 started
2012-08-19 04:13:18 INFO vsmagent: My public hostname is froburg.oetiker.ch
2012-08-19 04:13:19 WARNING vsmagent: Empty output from ssh-keyscan. Sshd not running?
2012-08-19 09:28:18 INFO vsmagent.session: Verified connectivity to newly started Xvnc for zaucker
2012-08-19 09:30:15 WARNING vsmagent.sessions: Broken session for user zaucker, tl-session process 30134 does not exist
2012-08-19 09:30:17 INFO vsmagent.session: Verified connectivity to newly started Xvnc for zaucker

I suppose vsmserver could mark the sessions as unreachable and give vsmagent a few more chances of responding before removing the session.
Comment 2 Karl Mikaelsson cendio 2013-08-08 11:24:10 CEST
Possible workaround:

I've asked the reporter to try to add "sleep 10" before restarting vsmserver in the postrotate script for vsm-server. Waiting for feedback.
Comment 6 Karl Mikaelsson cendio 2013-09-06 16:23:54 CEST
vsmserver tried to give the agent 3*10 minutes to recover from errors, but the timestamp it compared to was never written to disk unless a session was added/removed/changed. So for a server that did logrotates in the middle of the night with no users active, all sessions would have timestamps that were hours old. When vsmserver was restarted it did one try to verify the session, and when vsmagent couldn't be reached it looked at the timestamp and discarded the session for being too old.

Instead of using a timestamp to check when the session was last updated, we're now using an error counter that's reset when vsmserver is restarted, giving vsmserver three tries to verify the session before it gives up. 

The tester should make sure that HA still works, and that the value of this counter is unique per vsmserver.

Committed in r27842, r27843, r27844.
Comment 7 Karl Mikaelsson cendio 2013-09-10 16:01:35 CEST
Fixed an additional off-by-one problem found by autotests in r27857.
Comment 8 Henrik Andersson cendio 2013-10-17 14:03:08 CEST
Verified functionality using build 4116, 30 minutes passed before vsmserver removed user from session logs.

Still needs to test following:

1. Test the cleanup functionality:
 - stop and start agent to make sessions processes childs of the system.
 - verify that server can verify session by pid
 - kill session process
 - verify that server removes session within 10 minutes

2. Test HA functionality.
Comment 9 Henrik Andersson cendio 2013-10-17 14:35:12 CEST
(In reply to comment #8)

> Still needs to test following:
> 
> 1. Test the cleanup functionality:
>  - stop and start agent to make sessions processes childs of the system.
>  - verify that server can verify session by pid
>  - kill session process
>  - verify that server removes session within 10 minutes
> 

Verified that a broken session got removed after 10 minutes when tl-session process was killed.
Comment 10 Henrik Andersson cendio 2013-10-18 09:32:56 CEST
Tested using 2 x SLED11 Sp2, ThinLinc build 4116

A: Master service
B: Master and Agent services

> 2. Test HA functionality.

- Verified HA functionality, connection/reconnection through both masters etc.

- Killed tl-session and verified that session was cleaned up and removed from 
  master A and B.

- Stopped vsmagent and restarted vsmservers.
  Verified that session was cleaned up after 30 minutes. Master A deleted 
  session and sent delete notice to master B that deleteed and stopped trying to
  verify the session.

- With a active session, restarting vsmagent to make it a child of init process.
  Killed the tl-session process and verified that vsm servers cleaned up the 
  session from session database.

Note You need to log in before you can comment on or make changes to this bug.