Bug 7479 - Client doesn't report network issues from SSH
Summary: Client doesn't report network issues from SSH
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Client (show other bugs)
Version: trunk
Hardware: PC Unknown
: P2 Normal
Target Milestone: 4.14.0
Assignee: Frida Flodin
URL:
Keywords: linma_tester, relnotes, wilsj_tester
Depends on:
Blocks: 7273
  Show dependency treegraph
 
Reported: 2020-03-20 09:45 CET by Pierre Ossman
Modified: 2022-01-14 16:24 CET (History)
3 users (show)

See Also:
Acceptance Criteria:
- If the SSH connection disconnects unexpectedly the user should get a notification about this. - The notification should give sufficient information so that the user know what to do next. - The user should only get a single error dialog.


Attachments

Description Pierre Ossman cendio 2020-03-20 09:45:09 CET
If ssh encounters a network error then we just terminate without a peep. That can be very confusing to the user who just sees their session vanish.

E.g. when the remote sshd crashes we have this in the log:

> 2020-03-20T09:40:54: ssh[E]: client_loop: send disconnect: Broken pipe
> 2020-03-20T09:40:54: Process 193612 exited with code 255

We should probably present an error dialog for any disconnect that wasn't a clean shutdown.

(See also bug 7273 which is about handling even more details)
Comment 1 Pierre Ossman cendio 2020-03-20 09:46:11 CET
A customer has some network corruption resulting in:

> 2020-03-19T13:59:27: ssh[E]: ssh_dispatch_run_fatal: Connection to 10.144.42.232 port 22: message authentication code incorrect

Which again didn't result in any message for the user.
Comment 4 Frida Flodin cendio 2021-08-31 14:39:16 CEST
I did some investigation on how to reproduce and some other normal shut-downs of ssh. 

To reproduce the broken pipe from comment #1 you need to kill the ssh process on the server side while the client tries to write to the pipe. I managed to do this by setting a delayed kill on the process and then moving the mouse around in the client. There might be some easier way to do this but this is what I did. 

If I do a hard kill on the ssh process on the server I get:
> 2021-08-31T14:23:37: ssh[E]: Connection to 10.48.2.172 closed by remote host.
> 2021-08-31T14:23:37: Process 18551 exited with code 0
Where PID 18551 is the ssh PID. As you can see this is a clean exit but still unexpected from the users point of view. 


Normal logout and disconnect will not generate any errors (ssh[E]) in the log after the connection is established. The same when admin terminates a session via tlwebadm. 
 
'Stealing' a session on the other hand, i.g. logging in as the same user from webaccess, will give this in the client log:
> 2021-08-31T14:29:20: ssh[E]: Connection to 10.48.2.172 closed by remote host.


To summarize, our ssh writes to stderr even when we consider it a normal shutdown. However, not sure if 'stealing' a session counts as a normal shutdown? Should we consider a shutdown bad as soon as the ssh process is ended before the vncviewer process?
Comment 10 Pierre Ossman cendio 2021-09-21 15:33:00 CEST
Socket I/O is a bit spread out in the OpenSSH code base, but we can see some themes that we can exploit:

 * Most high level functions (used mainly during startup and shutdown) will call sshpkt_fatal() on error

 * The main loop (used once connected) uses sshpkt_fatal() for write errors, and client_process_net_input() to read data from the socket and do error handling

Fortunately we already emit the message "SYSTEM ERROR: ..." from sshpkt_fatal(), so most cases are already caught. If we also emit the same during error handling in client_process_net_input() then we should cover almost everything. And that should be a safe place to assume the errors are fatal as the same code sets "quit_pending" that breaks the connection loop.
Comment 11 Pierre Ossman cendio 2021-09-22 09:09:04 CEST
You can force a connection reset on a reasonably modern Linux system via ss:

> sudo ss -t --kill "sport ssh"

Useful for testing.
Comment 21 Pierre Ossman cendio 2021-09-23 12:36:05 CEST
Everything should be in place now. We just need to test everything.

Besides the new functionality, there are some possible regression areas that need testing:

 * macOS event delegation (quit/show/hide)

 * vncviewer termination detection (remember to check smart card disconnect)

 * shadow confirmation handling (basically retest bug 7070)
Comment 22 William Sjöblom cendio 2021-09-23 14:34:23 CEST
To test for regressions after the refactoring applied to the subprocess
output handling I retested bug 7070 running Jenkins client build #2208
on Fedora 34.

The feedback from vncviewer arrives as expected and the logs look the
same apart from some intended additional verbosity.
Comment 23 William Sjöblom cendio 2021-09-23 15:45:52 CEST
Tested smart card automatic disconnect using Jenkins client build #2209
on a number of platforms:
• ☑ Fedora 34
• ☑ Windows 10
• ☑ macOS 11.6
Comment 24 William Sjöblom cendio 2021-09-23 17:10:56 CEST
Tested that the ThinLinc client (build #2209) is terminated along with `vncviewer'
when in the following scenarios:
• ☑ `vncviewer' killed by `tlclient'
  • See comment 23
• ☑ `vncviewer' killed directly
  • ☑ Fedora 34
  • ☑ Windows 10
  • ☑ macOS 11.6
• ☑ `Xvnc' killed on the server, indirectly terminating `vncviewer'
  • ☑ Fedora 34
  • ☑ Windows 10
  • ☑ macOS 11.6
Comment 25 Pierre Ossman cendio 2021-09-24 09:00:45 CEST
Tested that macOS event delegation still works:

 ✓ Quit
 ✗ Show¹
 ✗ Hide¹

¹ Bug 7769, so not a regression
Comment 26 Pierre Ossman cendio 2021-09-24 16:46:02 CEST
Tested the following scenarios on Fedora 34 (network test was on CentOS 7 though):

OK cases:

 ✓ Close vncviewer window
 ✓ Run tl-disconnect
 ✓ Log out of session
 ✓ Configure disconnect on smart card removal, and remove card

Bad cases:

 ✓ Simulate TCP RST using ss --kill (got "connection reset")
 ✓ Simulate crash of ssh using kill -SEGV (got generic message)
 ✓ Simulate crash of sshd using kill -SEGV (got "connection reset")
 ✓ Disconnect network from VM (got "no route to host" after long delay)
 ✓ Simulate network problem by iptables -j DROP (got "connection timed out" after long delay)
 ✓ Disconnected ethernet cable from client (got "network unreachable" after long delay)

Unclear cases:

 ✓ Session killed from tlwebadm (got no message)
 ✓ Session killed using "End existing session" (got no message)
 ✓ Session stolen by other client (got generic message)
Comment 27 Pierre Ossman cendio 2021-09-27 13:52:48 CEST
Tested the following scenarios on macOS 11:

OK cases:

 ✓ Close vncviewer window
 ✓ Run tl-disconnect
 ✓ Log out of session
 ✓ Configure disconnect on smart card removal, and remove card

Bad cases:

 ✓ Simulate TCP RST using ss --kill (got "connection reset")
 ✓ Simulate crash of sshd using kill -SEGV (got "connection reset")
 ✓ Disconnect network from VM (got "operation timed out" after a small delay)
 ✓ Simulate network problem by iptables -j DROP (got "operation timed out" after a small delay)
 ✓ Disconnected ethernet cable from client (got "operation timed out" after a small delay)

Unclear cases:

 ✓ Session killed from tlwebadm (got no message)
 ✓ Session killed using "End existing session" (got no message)
 ✓ Session stolen by other client (got generic message)
Comment 28 Pierre Ossman cendio 2021-09-28 09:59:28 CEST
Tested the following scenarios on Windows 10:

OK cases:

 ✓ Close vncviewer window
 ✓ Run tl-disconnect
 ✓ Log out of session
 ✓ Configure disconnect on smart card removal, and remove card

Bad cases:

 ✓ Simulate TCP RST using ss --kill (got "connection reset")
 ✓ Simulate crash of sshd using kill -SEGV (got "connection reset")
 ✓ Disconnect network from VM (got "connection reset" after a small delay)
 ✓ Simulate network problem by iptables -j DROP (got "connection reset" after a small delay)
 ✓ Disconnected ethernet cable from client (got "connection reset" after a small delay)

Unclear cases:

 ✓ Session killed from tlwebadm (got no message)
 ✓ Session killed using "End existing session" (got no message¹)
 ✓ Session stolen by other client (got no message²)

¹ Probably because I had no local devices enabled
² Got ECONNABORTED once (bug 289)
Comment 29 Pierre Ossman cendio 2021-09-28 09:59:46 CEST
That should hopefully be everything. Seems to work very nicely.
Comment 30 Linn cendio 2021-10-01 12:31:15 CEST
Tested the following scenarios against client build 2211 on Ubuntu 20.04. 


OK cases:

 ✓ Close vncviewer window
 ✓ Run tl-disconnect
 ✓ Log out of session
 - Configure disconnect on smart card removal, and remove card
     Skipping since testing on a VM

Bad cases:

 ✓ Simulate TCP RST using iptables (got "connection reset")
    - iptables -A INPUT -p TCP -j REJECT --reject-with tcp-reset
 ✓ Simulate crash of ssh using kill -SEGV (got generic message)
 ✓ Simulate crash of sshd using kill -SEGV (got generic message)
 ✓ Disconnect network from server VM (got "no route to host" after long delay)
 ✓ Simulate network problem by iptables -j DROP (got "connection timed out" after long delay)
 ✓ Disconnect network from client VM (got "connection timed out" after long delay)

Unclear cases:

 ✓ Session killed from tlwebadm (got no message)
 ✓ Session killed using "End existing session" (got no message)
 ✓ Session stolen by other client (got generic message)


Also read through the relnotes, they look good.
Comment 31 Pierre Ossman cendio 2022-01-10 13:58:20 CET
This doesn't seem to be completely reliable unfortunately. I've been testing killing ssh repeatedly and every now and then the error message doesn't show up. I guess there is a race somewhere we don't fully handle.
Comment 32 Pierre Ossman cendio 2022-01-10 14:02:34 CET
We're also getting double error messages if this happens within 10 seconds of connecting, which is probably not what we want. The "Your session only lasted 10 seconds..." message is for when we don't have anything better to show.
Comment 33 Pierre Ossman cendio 2022-01-13 13:23:23 CET
I'm unable to reproduce this. And I've looked over the code and can't really understand how this can happen, except for some race inside the kernel.

The sequence is:

 1. ssh gets a read error
 2. ssh terminates, becoming a zombie
 3. the kernel implicitly closes all sockets for ssh
 4. vncviewer gets an EOF on the socket
 5. vncviewer terminates, becoming a zombie
 6. client reaps the ssh zombie
 7. client reaps the vncviewer zombie
 8. client no longer finds vncviewer pid
 9. client checks for ssh pid

Some of these could be reordered. E.g. the client can reap the ssh zombie before vncviewer has had time to react and shut down. However the crucial steps are that 6 has happened before 9. And I don't see how that can be reordered.

vncviewer won't die until ssh has died, since that's what closes the socket. tlclient reaps all zombies in one go, so I don't see a way it could reap the vncviewer zombie, but not the ssh one.

The only way I could see something going wrong here is if the kernel has somehow done the implicit closing of ssh's sockets, but has not yet turned ssh in to a zombie. And at the same time, vncviewer is able to terminate and get reaped by tlclient.

This seems extremely unlikely, so I don't know what circumstances caused me to be able to reproduce this bug before.

As such we'll have to leave this for now until we get a more reliable way to reproduce the issue.
Comment 36 Pierre Ossman cendio 2022-01-13 13:38:05 CET
The issue of double error messages is fixed now at least.
Comment 37 William Sjöblom cendio 2022-01-14 16:24:15 CET
Tested using client build #2317 on Fedora 35:
• ☑ Manually disconnecting the session within 10 second of session
  startup still shows the "10 second dialog"
  • ☑ Clicking the close button on the window decoration
  • ☑ Pressing F8 -> Disconnect
  • ☑ `killall -9 vncviewer'
  • ☑ `tl-disconnect'
• ☑ Killing Xvnc within 10 seconds of session startup still results in
  the "10 second dialog" being shown
• ☑ Killing ssh within 10 seconds of session startup now *only* results
  in the correct "Unexpected shutdown" error dialog being shown
• ☑ Running `ss -t –kill "sport ssh"' on the server within 10 seconds of
  session startup now *only* results in the correct "Unexpected
  shutdown" error dialog being shown


Please note that after the above introduced changes, the "10 second"
dialog will no longer be shown in the following cases:
• When smart cards are removed (and the client is configured to exit on
  smart card removal)
• When tlclients receives a signal (e.g. SIGINT)

Note You need to log in before you can comment on or make changes to this bug.