Bug 6125 - Unable to login due to excessive (audio?) connections to ssh
Summary: Unable to login due to excessive (audio?) connections to ssh
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Other
Version: trunk
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.8.0
Assignee: Karl Mikaelsson
URL:
Keywords: ossman_tester, relnotes
Depends on:
Blocks:
 
Reported: 2017-01-09 15:34 CET by Karl Mikaelsson
Modified: 2017-03-20 17:09 CET
CC List: 3 users

See Also:
Acceptance Criteria:


Attachments
strace -s0 -p PID of ssh(d) on the LISTEN side of the zillion connection attempts (308.44 KB, application/x-gzip)
2017-01-30 19:24 CET, Dave Botsch

Description Karl Mikaelsson cendio 2017-01-09 15:34:24 CET
Report from a customer:

Once in a while, the master stops communicating with the agent, so people
will be denied reconnecting to their sessions.

/var/log/vsmserver.log:
> 2017-01-05 10:10:20 WARNING vsmserver: Timeout trying to verify session for user on VSM Agent host:904
> 2017-01-05 10:10:20 INFO vsmserver.session: User with uid 1000 (user) requested a new session
> 2017-01-05 10:10:20 WARNING vsmserver.session: User user has 1 sessions, only 1 are allowed per user. Denying start of new session

> So, after enabling DEBUG2 in a few places... it looks like on each
> attempt to reconnect to a session, part of checking the session status
> involves running "netstat -n -A inet -A inet6 -t" on the agent.

Here, something is filling up /var/log/secure with "open failed: connect failed: Connection refused" messages:

> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan  5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused


> This fills up the return of netstat with a boatload of TIME_WAIT
> entries, so netstat can take a little while to return.


> In addition to all those timeouts on connect, it looks like something,
> whilst a session is connected, is sitting there and opening connections
> to the local sshd for a particular user.

> [root@host ~]# netstat -anp | grep 4926 | grep -v WAIT
> tcp        0      0 127.0.0.1:4926              0.0.0.0:*      LISTEN      28920/sshd          
> tcp        0      0 127.0.0.1:59732             127.0.0.1:4926 ESTABLISHED -                   
> tcp        0     40 127.0.0.1:64894             127.0.0.1:4926 ESTABLISHED -                   
> tcp        0      1 127.0.0.1:46892             127.0.0.1:4926 SYN_SENT    -                   
> tcp        0      0 ::1:4926                    :::*           LISTEN      28920/sshd          

> [root@host ~]# ps -ef | grep 28920
> user2    28920 17074 19 09:57 ?        00:03:33 sshd: user2@notty 

> What's it doing, that it's sitting here repeatedly opening up
> connections to the same sshd?

> Well, I think I finally tracked it down, at least why the sshd process
> was going bananas... stracing of it showed:
>
> ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffdce201760) = -1 EINVAL (Invalid argument)
>
> lots of those, as if something screwed up and it could no longer get to
> whatever audio device.
>
> I started a brand new session and that error is not there and there
> aren't a zillion ssh connections.
Comment 2 Pierre Ossman cendio 2017-01-09 15:58:13 CET
Also note bug 5982 where we are looking at a netstat replacement.
Comment 3 Samuel Mannehed cendio 2017-01-10 13:21:30 CET
The customer reported that the issue was encountered on a RHEL 6.8 system with ThinLinc 4.7.0.
Comment 4 Samuel Mannehed cendio 2017-01-10 14:32:59 CET
The customer was using netstat from net-tools version 1.60, which is very old. This means the issue could not have been caused by a netstat update.
Comment 5 Pierre Ossman cendio 2017-01-11 16:12:01 CET
Did some more digging. There are two issues here:

 a) Reading /proc/net/tcp is slow

 b) netstat also does some more processing per entry that is slow

So skipping netstat and reading /proc/net/tcp directly gives some improvement, but doesn't fully solve the issue.

The cause of the massive amount of sockets is probably bug 6133.


The proper fix seems to be to switch to getting sockets via Netlink (directly or via "ss"). There is no performance problem with that method.
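
For reference, a minimal sketch (not the actual ThinLinc code) of what the "via ss" variant could look like from Python, assuming ss from iproute/iproute2 is installed and Python 2.7+ for subprocess.check_output. The option set matches the invocation seen in the strace in comment 24; the output parsing is only illustrative:

import subprocess

def established_tcp_connections():
    """Return (local, peer) address:port pairs for established TCP sockets."""
    # Same option set as the fast invocation in comment 24:
    #   ss -n -t state established
    out = subprocess.check_output(["ss", "-n", "-t", "state", "established"])
    conns = []
    for line in out.decode("utf-8", "replace").splitlines()[1:]:  # skip header
        fields = line.split()
        if len(fields) >= 4:                # Recv-Q Send-Q Local Peer
            conns.append((fields[2], fields[3]))
    return conns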
Comment 6 Karl Mikaelsson cendio 2017-01-30 10:56:37 CET
> The proper fix seems to be to switch to getting sockets via Netlink (directly
> or via "ss"). There is no performance problem with that method.

Non-exhaustive list of netstat alternatives I've found so far:

Pure Python:

 ❌ socket.AF_NETLINK (Python >= 2.5; see the sketch after this list)

 ❌ gnlpy (Python >= 2.6)
   https://github.com/facebook/gnlpy

 ❌ pyroute2 (Python >= 2.7)
   https://pypi.python.org/pypi/pyroute2

 ❌ NetLink (Python >= 3)
   https://pypi.python.org/pypi/NetLink/

Subprocesses:

 - ss (iproute2)
   https://wiki.linuxfoundation.org/networking/iproute2
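
For completeness, a rough sketch of what the "pure Python" route (raw socket.AF_NETLINK + sock_diag, no third-party library) would look like. This is not ThinLinc code; the constants and struct layouts are taken from the Linux UAPI headers (linux/netlink.h, linux/sock_diag.h, linux/inet_diag.h):

import socket
import struct

NETLINK_SOCK_DIAG = 4       # from linux/netlink.h
SOCK_DIAG_BY_FAMILY = 20    # from linux/sock_diag.h
NLM_F_REQUEST = 0x01
NLM_F_DUMP = 0x300          # NLM_F_ROOT | NLM_F_MATCH
NLMSG_ERROR = 2
NLMSG_DONE = 3
TCP_ESTABLISHED = 1

def count_established_tcp():
    """Count established IPv4 TCP sockets via an inet_diag dump request."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_SOCK_DIAG)
    try:
        # inet_diag_req_v2: family, protocol, ext, pad, states bitmask,
        # followed by a zeroed 48-byte inet_diag_sockid (wildcard match).
        req = struct.pack("=BBBxI", socket.AF_INET, socket.IPPROTO_TCP,
                          0, 1 << TCP_ESTABLISHED) + b"\x00" * 48
        # nlmsghdr: total length, type, flags, sequence number, port id
        nlh = struct.pack("=IHHII", 16 + len(req), SOCK_DIAG_BY_FAMILY,
                          NLM_F_REQUEST | NLM_F_DUMP, 1, 0)
        sock.send(nlh + req)

        count = 0
        while True:
            data = sock.recv(65536)
            offset = 0
            while offset < len(data):
                msg_len, msg_type = struct.unpack_from("=IH", data, offset)
                if msg_type == NLMSG_DONE:
                    return count
                if msg_type == NLMSG_ERROR:
                    raise OSError("netlink error while dumping sockets")
                count += 1                      # one inet_diag_msg per entry
                offset += (msg_len + 3) & ~3    # NLMSG_ALIGN()
    finally:
        sock.close()

The ss subprocess route (sketched under comment 5) avoids maintaining this kind of protocol code ourselves, at the cost of a fork/exec per check.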
Comment 7 Karl Mikaelsson cendio 2017-01-30 10:58:31 CET
See also: bug 6142.
Comment 8 Karl Mikaelsson cendio 2017-01-30 14:06:19 CET
ss availability on select distributions

| Dist         | Present | Package Name |
|--------------+---------+--------------|
| RHEL 5       | ✓       | iproute      |
| RHEL 6       | ✓       | iproute      |
| RHEL 7       | ✓       | iproute      |
|--------------+---------+--------------|
| Ubuntu 14.04 | ✓       | iproute2     |
| Ubuntu 16.04 | ✓       | iproute2     |
| Ubuntu 16.10 | ✓       | iproute2     |
|--------------+---------+--------------|
| SLES 10      | ✓       | iproute2     |
| SLES 11      | ✓       | iproute2     |
| SLES 12      | ✓       | iproute2     |
|--------------+---------+--------------|
Comment 9 Dave Botsch 2017-01-30 19:23:09 CET
I just reproduced the issue, again.

In both cases, the ThinLinc session was started from my home Fedora 25 machine over a Cisco VPN connection. I disconnected the session there and then connected to it here at work from a thinstation, and that's when we get the multiple connection attempts to sshd.

So, I can provide strace output from that sshd.

I have not yet had the problem when starting a session at work on a thinstation and switching between thinstations.

I also haven't specifically tested starting at work on my RHEL6 box and then switching to a thinstation.

At least this time, the Cisco VPN disconnected me in the middle of sessions several times while at home. I would hope that didn't contribute to it. I checked netstat a lot from home and never saw the issue pop up.
Comment 10 Dave Botsch 2017-01-30 19:24:40 CET
Created attachment 770 [details]
strace -s0 -p PID of ssh(d) on the LISTEN side of the zillion connection attempts
Comment 11 Dave Botsch 2017-01-30 22:29:51 CET
I found a systemtap script from http://unix.stackexchange.com/questions/186932/log-pid-of-each-connection-attempt that shows me connect attempts.

First thing I'm noticing... if I start a new session on my RHEL6 desktop here, and then disconnect that session, the escd I had running instantly goes bonkers trying to connect to my sshd which is no longer there.

e.g. (systemtap output):

connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926

[root@universe ~]# ps -fu dwb7 |grep escd
dwb7      4519     1 82 16:17 ?        00:02:55 ./escd --key_Inserted="/usr/bin/esc" --on_Signal="/usr/bin/esc"

While this doesn't appear to fill up netstat (since there is nothing for escd to connect to), I imagine it could lead to file handle exhaustion. It would appear that at *some* point in the future, escd finally stops.

If I reconnect, now on the thinstation, my sshd then starts back up, and for whatever reason seems to be listening on port 4926... same as before.

And escd is now sitting there still doing zillions of connection attempts, but now resulting in lots of TIME_WAIT states and filling up the netstat output.

[root@universe ~]# lsof -nPi tcp:4926
COMMAND   PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
sshd    15829 dwb7   12u  IPv6 2417213418      0t0  TCP [::1]:4926 (LISTEN)
sshd    15829 dwb7   13u  IPv4 2417213420      0t0  TCP 127.0.0.1:4926 (LISTEN)

Disconnect, and I'm back to escd trying to talk to the non-existent sshd.

Reconnect on my desktop, and escd is now happy and not trying to connect all over the place.

Disconnect on my desktop, and escd goes nuts, again trying to connect to the non-existent sshd.

If I *start* the session on the ThinStation, I don't see any of this behaviour with escd, even though it is running:

dwb7     15711     1  0 16:27 ?        00:00:00 ./escd --key_Inserted="/usr/bin/esc" --on_Signal="/usr/bin/esc"
Comment 21 Karl Mikaelsson cendio 2017-02-03 16:32:17 CET
Switching from netstat to ss will also solve bug 5982.
Comment 23 Karl Mikaelsson cendio 2017-02-06 10:37:37 CET
That should be it - all netstat calls have been replaced with ss, which should not exhibit the same poor performance with many simultaneous connections. To be clear, this doesn't fix applications that make a lot of connections, but they should no longer cause a problem.
Comment 24 Pierre Ossman cendio 2017-02-06 14:51:07 CET
The connection status check is now very fast. Whilst netstat takes many seconds:

> [root@dhcp-253-21 tl-4.7.0post-server]# time netstat -n -t | wc -l
> 12578
> 
> real	0m12.156s
> user	0m0.101s
> sys	0m2.000s

ss is extremely fast:

> [pid  2442] 14:37:31.037513 execve("/usr/sbin/ss", ["ss", "-n", "-t", "state", "established"], [/* 22 vars */]) = 0
> [pid  2442] 14:37:31.055406 +++ exited with 0 +++

However, the invocation used for unbinding ports is still concerning:

> [pid  2462] 14:37:31.187538 execve("/usr/sbin/ss", ["ss", "-n", "-t", "-a", "-p"], [/* 22 vars */]) = 0
> [pid  2462] 14:37:49.116805 +++ exited with 0 +++

That's a good 18 seconds. Not enough to trigger a timeout, but a fairly hefty wait for the user.

Using ss's filters seems to make quite a dent in the time needed. Perhaps we can use that?
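
A hypothetical sketch of that idea, keeping -p (we still need the owning process in order to unbind the port) but letting an ss filter restrict the dump to the single port we care about; the helper name is made up for illustration:

import subprocess

def socket_entries_for_port(port):
    # '( sport = :PORT )' is standard ss filter syntax; with the filter in
    # place, ss only has to resolve process info for sockets bound to this
    # port instead of for every socket on the system.
    out = subprocess.check_output(
        ["ss", "-n", "-t", "-a", "-p", "( sport = :%d )" % port])
    return out.splitlines()[1:]          # entries (if any) after the header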
Comment 25 Pierre Ossman cendio 2017-02-06 15:01:14 CET
Side note: I can't reproduce these issues on my Fedora 25 machine, so it seems modern kernels are more efficient. ss is still way faster than netstat, but netstat never seems to get into those absurd conditions.
Comment 28 Pierre Ossman cendio 2017-02-14 14:54:36 CET
Retested, and it works extremely well. For the "normal" case of no one occupying ports, it completely avoids digging around in /proc and is as fast as a normal invocation of ss. Otherwise it still finishes in less than a second.
