7530 – Crash in GetLoad call causing agent to no longer be polled


Status:	CLOSED FIXED

Alias:	None

Product:	ThinLinc
Classification:	Unclassified
Component:	VSM Server
Version:	trunk
Hardware:	PC Unknown

Importance:	P2 Normal
Target Milestone:	4.13.0
Assignee:	Frida Flodin

URL:
Keywords:	nikle_tester, prosaic

Depends on:
Blocks:

Reported:	2020-07-09 10:26 CEST by Pierre Ossman
Modified:	2021-01-18 13:29 CET
CC List:	2 users

See Also:
Acceptance Criteria:


Description Pierre Ossman cendio

2020-07-09 10:26:54 CEST

Right now we have a rather fragile design for the load monitoring. If anything unexpected happens, the loop that handles regular polling of an agent might break, and that agent will no longer get properly updated load information.

This gets worse if the agent is ever detected as down, as it will then be stuck in a downed state forever. Agents also start off as being considered down, so a failure in the very first poll leaves that agent permanently unavailable.

The technical reason for the problem is that the response callback from GetLoad must always be called, as that is what schedules a new check. The current design wasn't built with the goal of making sure the callback is always called. So we either need a redesign, or we need to make sure the scheduling happens in some other manner.

Once this has happened, vsmserver must be restarted to recover.
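One way to make sure the scheduling always happens can be sketched as follows. This is only an illustration, not the vsmserver code: it is synchronous rather than callback-based, and `AgentPoller`, `get_load`, `schedule` and `interval` are hypothetical names chosen for the example. The point is that rescheduling sits in a `finally` clause, so it runs whether or not the poll itself failed.

```python
import logging

log = logging.getLogger("loadmonitor")


class AgentPoller:
    """Hypothetical poller; all names here are illustrative assumptions."""

    def __init__(self, agent, get_load, schedule, interval=40.0):
        self.agent = agent
        self.get_load = get_load    # callable(agent) -> load info, may raise
        self.schedule = schedule    # callable(delay, func): run func later
        self.interval = interval
        self.last_load = None
        self.down = True            # agents start off as considered down

    def poll(self):
        try:
            self.last_load = self.get_load(self.agent)
            self.down = False
        except Exception:
            # Any unexpected error marks the agent as down for now,
            # but must not break the polling loop.
            log.exception("Unexpected error polling %s", self.agent)
            self.down = True
        finally:
            # Runs on success and on failure, so a new check is always
            # scheduled and a downed agent can recover on a later poll.
            self.schedule(self.interval, self.poll)
```

With this shape, an agent that fails its first poll stays marked as down only until a later poll succeeds, without needing a restart.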

Comment 2 Frida Flodin cendio

2021-01-14 16:32:40 CET

This should work better now. We now handle all unexpected errors from the load updater, and the callback is always called regardless.

To reproduce: see bug 7531

The tester needs to make sure that we never stop trying to update the load for an agent, even if we got an error the last time.

Comment 3 Niko Lehto cendio

2021-01-15 14:46:30 CET

Tested on a RHEL 8 server with a nightly build.

If loadbalancer.py/update_loadinfo() encounters an error early on (before we update our loadstatus), we will retry updating loadinfo immediately, which is probably what we want. However, this will spam the log heavily if the error is persistent.
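For illustration, one possible way to keep retrying without flooding the log is to wait a fixed delay before the next attempt and to log a repeated error only once. This is a sketch under assumptions, not a description of the change that was actually made to loadbalancer.py; `ErrorThrottle`, `retry_delay`, `on_error` and `on_success` are hypothetical names.

```python
import logging

log = logging.getLogger("loadmonitor")


class ErrorThrottle:
    """Hypothetical helper: delays retries and suppresses repeated log lines."""

    def __init__(self, retry_delay=10.0):
        self.retry_delay = retry_delay
        self.last_error = None

    def on_error(self, agent, exc):
        """Log the error if it is new and return the delay before retrying."""
        message = "%s: %s" % (type(exc).__name__, exc)
        if message != self.last_error:
            log.warning("Load update for %s failed: %s", agent, message)
            self.last_error = message
        return self.retry_delay

    def on_success(self):
        # Forget the last error so that a later failure is logged again.
        self.last_error = None
```

The caller would schedule the next update after the returned delay instead of retrying immediately, so a persistent error produces one log line rather than a continuous stream.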

Comment 5 Frida Flodin cendio

2021-01-15 16:16:11 CET

(In reply to Niko Lehto from comment #3)
> If loadbalancer.py/update_loadinfo() encounters an error early on (before we
> update our loadstatus), we will retry updating loadinfo immediately, which
> is probably what we want. However, this will spam the log heavily if the
> error is persistent.

Fixed now

Comment 6 Niko Lehto cendio

2021-01-18 13:29:46 CET

Re-tested with nightly build 6721.
Works well now! The polling continues even if crashes happen, and this does not spam the log.
