Bug 7900 - tl-mount-localdrives may hang indefinitely
Summary: tl-mount-localdrives may hang indefinitely
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Local drive redirection
Version: trunk
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.16.0
Assignee: Samuel Mannehed
URL:
Keywords: ossman_tester, relnotes
Depends on:
Blocks:
 
Reported: 2022-04-26 16:11 CEST by William Sjöblom
Modified: 2024-01-24 09:21 CET
CC List: 4 users

See Also:
Acceptance Criteria:
MUST:
* tl-mount-localdrives should take a reasonable time to finish
* Common use cases should still work:
    - Large folders, many files
    - Slow network
SHOULD:
* Communicate that localdrives are not working to the admins
* If there is an issue in pynfs, we should share the solution with upstream.
COULD:
* Communicate that localdrives are not working to the end-user
* A possible timeout should be configurable


Attachments

Description William Sjöblom cendio 2022-04-26 16:11:02 CEST
Running the ThinLinc 4.14.0 server on a SLES12 VM, I managed to get tl-mount-localdrives to hang during session startup. This issue seems very racy and temperamental, as I've had a hard time reproducing it. Nonetheless, out of sheer luck, I managed to reproduce it while running inside pdb and obtained the following stack trace:

> (Pdb) bt
>   /usr/lib64/python3.4/runpy.py(170)_run_module_as_main()
> -> "__main__", mod_spec)
>   /usr/lib64/python3.4/runpy.py(85)_run_code()
> -> exec(code, run_globals)
>   /usr/lib64/python3.4/pdb.py(1688)<module>()
> -> pdb.main()
>   /usr/lib64/python3.4/pdb.py(1661)main()
> -> pdb._runscript(mainpyfile)
>   /usr/lib64/python3.4/pdb.py(1542)_runscript()
> -> self.run(statement)
>   /usr/lib64/python3.4/bdb.py(431)run()
> -> exec(cmd, globals, locals)
>   <string>(1)<module>()
>   /opt/thinlinc/bin/tl-mount-localdrives(9)<module>()
> -> if 82 - 82: Iii1i
>   /opt/thinlinc/bin/tl-mount-localdrives(327)OoOOooO0oOO0Oo()
> -> oo00 = OoOo ( "127.0.0.1" , i1iIIi1I1iiI )
>   /opt/thinlinc/bin/tl-mount-localdrives(97)__init__()
> -> self . mountcl . null ( )
>   /opt/thinlinc/bin/tl-mount-localdrives(71)null()
> -> self . make_call ( mount_const . MOUNTPROC3_NULL )
>   /opt/thinlinc/bin/tl-mount-localdrives(61)make_call()
> -> Iio0 , i1i = self . pipe . listen ( oOoo0 )
>   /opt/thinlinc/modules/thinlinc/rpc/rpc.py(375)listen()
> -> self._pending[xid].wait(timeout)
>   /opt/thinlinc/modules/thinlinc/rpc/rpc.py(150)wait()
> -> self._filled.wait(timeout)
>   /usr/lib64/python3.4/threading.py(546)wait()
> -> signaled = self._cond.wait(timeout)
> > /usr/lib64/python3.4/threading.py(290)wait()
> -> waiter.acquire()

The issue was observed with a fresh ThinLinc install on an otherwise clean system. Looking at the client and xinit logs, nothing looks out of the ordinary.
Comment 1 Robert Henschel cendio 2023-09-08 22:00:32 CEST
I see the same issue. I have seen it on TL 4.14 (client and server) as well as TL 4.15 (client and server).
I am connecting to a number of systems, including the IU Research Desktop installation, the Cendio HPC installation on prem (tl.hpc.cendio.se), the Cendio general ThinLinc system (tl.cendio.se), as well as a system in the Oracle cloud and one that I maintain myself in AWS.

Only the tl.cendio.se system shows the issue of not being able to map storage. My client is Ubuntu 22.04.3 LTS.

Happy to provide more info as needed.
Comment 2 Linn cendio 2023-09-15 17:04:48 CEST
I got this issue again on SLES 12, and after closing down the hung session I saw that I had a bad/broken symlink to thindrives/ in my home directory. Unfortunately, I did not inspect the symlink in more detail; I only noticed that it looked weird when running 'ls'.

The faulty symlink was probably set up before the client hung, which might be a clue to where in the mounting process this bug happens.
Comment 6 Pierre Ossman cendio 2024-01-16 09:34:35 CET
I got this repeatedly when using ThinLinc 4.16.0rc1 from a Fedora 38 client to a SLES 15 server. It locked up during session startup, and then also locked up every time I ran tl-mount-localdrives manually. After a reboot, the issue mysteriously went away.

Doing a Ctrl+C showed the same backtrace as mentioned previously. At one point, I got this extra bit from a second thread, though:

> Exception in thread PollingThread:
> Traceback (most recent call last):
>   File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
>     self.run()
>   File "/usr/lib64/python3.6/threading.py", line 864, in run
>     self._target(*self._args, **self._kwargs)
>   File "/opt/thinlinc/modules/thinlinc/rpc/rpc.py", line 509, in start
>     r,w,e = select.select(self.readlist, self.writelist, self.errlist)
> ValueError: file descriptor cannot be a negative integer (-1)

I could also see this secondary thread waiting for select() when I used strace.

To make things extra odd, ss shows no active connection between tl-mount-localdrives and unfsd. Perhaps it somehow misses that the connection has terminated/failed?
Comment 7 Pierre Ossman cendio 2024-01-16 10:46:18 CET
The connection management in pynfs is really poor. If a connection dies, for whatever reason, all pending requests will just hang.

This can easily be simulated by replacing the running unfsd with a listening netcat. If you run tl-mount-localdrives and then close netcat, tl-mount-localdrives will hang indefinitely.
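
For reference, a rough Python stand-in for the "listening netcat" could look like the sketch below. The port number is a placeholder for illustration only; use whatever loopback port tl-mount-localdrives actually connects to.

> import socket
>
> PORT = 7023  # placeholder; not the actual port used by ThinLinc
>
> srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
> srv.bind(("127.0.0.1", PORT))
> srv.listen(1)
> conn, _ = srv.accept()   # start tl-mount-localdrives now; it connects here
> input("Press Enter to drop the connection... ")
> conn.close()             # pending requests in pynfs now wait forever
> srv.close()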

The fix is not obvious, since pynfs heavily uses threads for handling communication. So it's not clear where it is safe to abort pending requests.

It is also still unknown if this is the scenario we're seeing in the "real" cases, and why the connection dies unexpectedly.
Comment 8 Pierre Ossman cendio 2024-01-16 10:49:18 CET
This is made worse by the fact that we are using ssh tunnels. If unfsd dies for some reason, you will also trigger the bug, because sshd keeps the tunnel open and cannot propagate things like ECONNREFUSED; all it can do is close the connection.
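
The general forwarding problem can be sketched in Python; this is only an illustration of why a tunnel behaves this way, not the actual sshd code:

> import socket
>
> def forward(client_sock, target_host, target_port):
>     # The client's connection is already accepted before the target is
>     # dialled, so a refusal from the target can only be reported back to
>     # the client as an ordinary close, never as ECONNREFUSED.
>     try:
>         target = socket.create_connection((target_host, target_port))
>     except ConnectionRefusedError:
>         client_sock.close()
>         return
>     # ... otherwise shuffle bytes between client_sock and target ...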
Comment 9 Pierre Ossman cendio 2024-01-16 10:55:05 CET
This could happen during login. We set up the ssh tunnels before we start the services. So there is a brief period before unfsd is up and running where the bug could be triggered.
Comment 10 Pierre Ossman cendio 2024-01-16 11:06:44 CET
I found the tlclient.log from when I was repeatedly getting this issue with SLES 15, and I found this:

> 2024-01-15T17:24:03: unfsd: segmentation fault

No idea why unfsd crashed, but I think we fully understand the issue with tl-mount-localdrives now, at least.
Comment 11 Pierre Ossman cendio 2024-01-16 11:09:49 CET
I tested adding a timeout to listen(), and it is sufficient to break the hang. It raises a custom exception, though, which we need to catch.
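
As a minimal sketch of the mechanism (assuming pynfs waits on a threading primitive, as the backtrace in the description suggests; RPCTimeout and PendingReply are illustrative names, not the actual identifiers):

> import threading
>
> class RPCTimeout(Exception):
>     """Raised when a pending reply never arrives."""
>
> class PendingReply:
>     def __init__(self):
>         self._filled = threading.Event()
>         self._data = None
>
>     def fill(self, data):
>         self._data = data
>         self._filled.set()
>
>     def wait(self, timeout=None):
>         # Event.wait() returns False on timeout rather than raising, so the
>         # timeout must be turned into an exception the caller can catch.
>         if not self._filled.wait(timeout):
>             raise RPCTimeout("no reply within %r seconds" % timeout)
>         return self._data
>
> # Caller side: pass a finite timeout and catch the custom exception.
> reply = PendingReply()
> try:
>     reply.wait(timeout=10)  # blocks for at most 10 seconds
> except RPCTimeout:
>     print("tl-mount-localdrives: Cannot connect to client: Operation timed out")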
Comment 12 Adam Halim cendio 2024-01-18 15:54:45 CET
Two seemingly independent hang-ups can be observed: one stemming from the lack
of a timeout in tl-mount-localdrives when communicating with unfsd on the
client side, and another, of unknown cause, that triggers a timeout in
tl-mount-personal, which is called from tl-mount-localdrives.

When an issue occurs in tl-mount-personal, it times out after 30 seconds,
while in tl-mount-localdrives we can get stuck in perpetuity. In the former
case, we still enter the session after the timeout, but without any mounted
local drives. In the latter case, we never get past the startup scripts. These
bugs seem separate enough to warrant a separate bug.
Comment 16 Adam Halim cendio 2024-01-19 14:28:05 CET
(In reply to Adam Halim from comment #12)
> ...
> These bugs seem separate enough to warrant a separate bug.

This is no longer relevant since the issue seen in tl-mount-personal does not hang indefinitely, and, as far as we know, it is only seen with an old ThinLinc client (prior to the fix for bug 8291).

However, this other timeout is useful to be aware of when testing.
Comment 17 Adam Halim cendio 2024-01-19 14:56:19 CET
During the investigation, we found that tl-mount-localdrives did not time out in some situations, for example when waiting for a crashed unfsd process. Adding a timeout as described in comment #11 seems to solve the issue.

To decide on the timeout value, we tested with the network throttled to 500 ms latency and 200 kbps, which should correspond to a slower-than-average internet connection. At this level of throttling, a timeout of 10 seconds was long enough to still manage to mount drives in a session. Hence, we chose 10 seconds as our timeout value.

The script tl-mount-localdrives has a retry loop in which this timeout can occur. The loop attempts to establish a connection 10 times, once per second, and there are also multiple internal calls that may time out. Hence, the total timeout duration can be longer than 10 seconds, as illustrated below.
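
A sketch of how the numbers add up (the identifiers below are hypothetical, not the actual ones in tl-mount-localdrives):

> import time
>
> CALL_TIMEOUT = 10        # seconds per call, as chosen above
> CONNECT_ATTEMPTS = 10    # retry loop: one connection attempt per second
>
> def connect_with_retries(connect):
>     # Each attempt may itself block for up to CALL_TIMEOUT seconds, and the
>     # later mount calls have their own timeouts on top, so the total wait
>     # can be considerably longer than 10 seconds.
>     for _ in range(CONNECT_ATTEMPTS):
>         try:
>             return connect(timeout=CALL_TIMEOUT)
>         except (OSError, TimeoutError):
>             time.sleep(1)
>     raise RuntimeError("Cannot connect to client: Operation timed out")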
Comment 18 Adam Halim cendio 2024-01-19 15:00:56 CET
Testing:
========

For testing, we made sure to use a Linux client without the fix from bug 8291 (tlclient 4.15.0) running on RHEL 8. The server was installed on RHEL 9.

- Regular login
  No timeout reported in xinit.log and thindrives available
  
- Kill unfsd process on client side
  Timeout reported in xinit.log and thindrives unavailable. This
  process was killed while the profile chooser was awaiting a
  selection. 
  
- Throttled login
  Timeout reported in xinit.log and thindrives unavailable. The server-side
  network was throttled to a minimum latency of 11 s and 200 kbps.
Comment 19 Samuel Mannehed cendio 2024-01-19 15:41:50 CET
> MUST:
> * tl-mount-localdrives should take a reasonable time to finish

As described in comment 17, we can't give an exact upper limit for the total timeout. However, the added timeout of 10 seconds is hopefully not long enough for someone to give up.

> * Common use cases should still work:
>     - Large folders, many files
>     - Slow network

Yes, tested as described in comment 18. We also tested a "large folder" with ~100 000 files and folders.

It should be noted that some steps are run separately for each "mount point", i.e. each selected local drive folder.

> SHOULD:
> * Communicate that localdrives are not working to the admins

We added a log message for when the new timeout hits; it can be found in xinit.log.

> * If there is an issue in pynfs, we should share the solution with upstream.

All changes were made to our tl-mount-localdrives script, nothing to pynfs at this point.

> COULD:
> * Communicate that localdrives are not working to the end-user

No additional logging or messages were added. This timeout should hopefully be rare.

> * A possible timeout should be configurable

Not added at this time; we want to move this bug forward and move on to a release.
Comment 25 Adam Halim cendio 2024-01-23 10:14:59 CET
Unfortunately, the timeout mechanisms that we added don't cover all parts of
the mounting process, which means there are cases where we have to wait longer
than the timeout we've specified. In my testing, I consistently had to wait
~75 seconds if unfsd crashed during startup. This is not ideal but, in my
opinion, acceptable, and at least not worse than before.
Comment 26 Pierre Ossman cendio 2024-01-23 14:46:40 CET
Tested using SLES 15 as a server and Fedora 38 as the client.

> MUST:
> * tl-mount-localdrives should take a reasonable time to finish

Used a client with a crashing unfsd, and it was stuck for less than a minute during login.

> * Common use cases should still work:

Works fine as far as I can tell. I also tested logging in using Web Access; there was no delay and no "thindrives" folder.

>     - Large folders, many files

Works, but is slow (bug 525). I shared a folder¹ with 10k files, which mounts fine but takes 2.5 minutes to list. Not a regression, though, as I see the same with 4.15.0.

¹ /opt/cendio-build/arch/i386/usr/share/man/man3

>     - Slow network

Tested with 300 ms RTT and 1 Mbps and everything worked nicely.

> SHOULD:
> * Communicate that localdrives are not working to the admins

It's in xinit.log, which is a bit too ephemeral to communicate perfectly to the admins. But it's in line with how other session issues are handled, so good enough for now.

I got this during the login:

> Running /opt/thinlinc/etc/xstartup.d/40-tl-mount-localdrives (Mounting local drives)
> tl-mount-localdrives: tl-mount-personal: mount: Connection timed out
> tl-mount-personal: Check that kernel supports all specified options and is 2.6.23 or later
> 
> Exception in thread PollingThread:
> Traceback (most recent call last):
>   File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
>     self.run()
>   File "/usr/lib64/python3.6/threading.py", line 864, in run
>     self._target(*self._args, **self._kwargs)
>   File "/opt/thinlinc/modules/thinlinc/rpc/rpc.py", line 509, in start
>     r,w,e = select.select(self.readlist, self.writelist, self.errlist)
> ValueError: file descriptor cannot be a negative integer (-1)
> 
> tl-mount-localdrives: Timeout when mounting local drives.

Running it again in a terminal, I got:

> cendio@lab-134:~> tl-mount-localdrives 
> tl-mount-localdrives: Cannot connect to client: Operation timed out


> * If there is an issue in pynfs, we should share the solution with upstream.

That does not seem to have been done. We didn't make any fixes, but we could at least report the problems.

> COULD:
> * Communicate that localdrives are not working to the end-user

Not done, hopefully not needed.

> * A possible timeout should be configurable

Not done. The current timeouts seem reasonable, and hitting them should hopefully be rare.



Checked the new code, and everything looks good.

Release notes also look good.
Comment 27 Adam Halim cendio 2024-01-24 08:51:26 CET
(In reply to Pierre Ossman from comment #26)
> > * If there is an issue in pynfs, we should share the solution with upstream.
> 
> That does not seem to have been done. We didn't do any fixes, but we could
> report the problems at least.
The issue has now been reported upstream to pynfs: https://bugzilla.linux-nfs.org/show_bug.cgi?id=420
Comment 28 Adam Halim cendio 2024-01-24 09:11:16 CET
(In reply to Adam Halim from comment #27)

Also reported the bug where listen() will hang indefinitely: https://bugzilla.linux-nfs.org/show_bug.cgi?id=421
Comment 29 Pierre Ossman cendio 2024-01-24 09:21:46 CET
With the bug reports in place, that should be everything. Closing.
