| Summary: | Unable to login due to excessive (audio?) connections to ssh | | |
|---|---|---|---|
| Product: | ThinLinc | Reporter: | Karl Mikaelsson <derfian> |
| Component: | Other | Assignee: | Karl Mikaelsson <derfian> |
| Status: | CLOSED FIXED | | |
| Severity: | Normal | CC: | botsch, ossman, samuel |
| Priority: | P2 | Keywords: | ossman_tester, relnotes |
| Version: | trunk | | |
| Target Milestone: | 4.8.0 | | |
| Hardware: | PC | | |
| OS: | Unknown | | |
| Acceptance Criteria: | | | |
| Attachments: | strace -s0 -p PID of ssh(d) on the LISTEN side of the zillion connection attempts | | |
Description
Karl Mikaelsson
Also note bug 5982 where we are looking at a netstat replacement.

The customer reported that the issue was encountered on a RHEL 6.8 system with ThinLinc 4.7.0. The customer was using netstat from net-tools version 1.60, which is very old. This means the issue could not have been caused by a netstat update.

Did some more digging. There are two issues here:

a) Reading /proc/net/tcp is slow
b) netstat also does some more processing per entry that is slow

So skipping netstat and reading /proc/net/tcp gives us some improvement, but doesn't fully solve the issue. The cause of the massive number of sockets is probably bug 6133.

The proper fix seems to be to switch to getting sockets via Netlink (directly or via "ss"). There is no performance problem with that method.

> The proper fix seems to be to switch to getting sockets via Netlink (directly
> or via "ss"). There is no performance problem with that method.

Non-exhaustive list of netstat alternatives I've found so far:

Pure Python:
❌ socket.AF_NETLINK (Python >= 2.5)
❌ gnlpy (Python >= 2.6) https://github.com/facebook/gnlpy
❌ pyroute2 (Python >= 2.7) https://pypi.python.org/pypi/pyroute2
❌ NetLink (Python >= 3) https://pypi.python.org/pypi/NetLink/

Subprocesses:
- ss (iproute2) https://wiki.linuxfoundation.org/networking/iproute2

ss availability on select distributions

| Dist         | Present | Package Name |
|--------------+---------+--------------|
| RHEL 5       | ✓       | iproute      |
| RHEL 6       | ✓       | iproute      |
| RHEL 7       | ✓       | iproute      |
|--------------+---------+--------------|
| Ubuntu 14.04 | ✓       | iproute2     |
| Ubuntu 16.04 | ✓       | iproute2     |
| Ubuntu 16.10 | ✓       | iproute2     |
|--------------+---------+--------------|
| SLES 10      | ✓       | iproute2     |
| SLES 11      | ✓       | iproute2     |
| SLES 12      | ✓       | iproute2     |
|--------------+---------+--------------|

I just reproduced the issue again. In both cases, the ThinLinc session was started from my home Fedora 25 machine over a Cisco VPN connection. I disconnected the session there and then connected to it here at work from a thinstation, and that's when we get the multiple connection attempts to sshd. So I can provide strace output from that sshd.

I have not yet had the problem when starting a session at work on a thinstation and switching between thinstations. I also haven't specifically tested starting at work on my RHEL6 box and then switching to a thinstation. At least this time, the Cisco VPN disconnected me in the middle of sessions several times while at home. I would hope that didn't contribute to it. I checked netstat a lot from home and never saw the issue pop up.

Created attachment 770 [details]
strace -s0 -p PID of ssh(d) on the LISTEN side of the zillion connection attempts
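
As an illustration of the "via ss" approach proposed above, here is a minimal sketch of counting established TCP connections by calling ss as a subprocess instead of netstat. The function name is made up, the invocation mirrors the `ss -n -t state established` call seen later in this bug, and it assumes Python 2.7+ for subprocess.check_output; it is not the actual ThinLinc code.

```python
import subprocess


def count_established_tcp():
    """Count established TCP connections via ss instead of netstat.

    Sketch only: assumes "ss -n -t state established" prints one header
    line followed by one line per established connection.
    """
    output = subprocess.check_output(
        ["ss", "-n", "-t", "state", "established"],
        universal_newlines=True)
    lines = output.splitlines()
    # First line is the column header; the rest are connections.
    return max(len(lines) - 1, 0)


if __name__ == "__main__":
    print("established TCP connections: %d" % count_established_tcp())
```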
I found a systemtap script from http://unix.stackexchange.com/questions/186932/log-pid-of-each-connection-attempt that shows me connection attempts.

First thing I'm noticing... if I start a new session on my RHEL6 desktop here, and then disconnect that session, the escd I had running instantly goes bonkers trying to connect to my sshd, which is no longer there, e.g. (systemtap output):

connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926

[root@universe ~]# ps -fu dwb7 |grep escd
dwb7      4519     1 82 16:17 ?        00:02:55 ./escd --key_Inserted="/usr/bin/esc" --on_Signal="/usr/bin/esc"

While this doesn't appear to fill up netstat (since there is nothing for escd to connect to), I imagine it could lead to file handle exhaustion. It would appear that at *some* point in the future, escd finally stops.

If I reconnect, now on the thinstation, my sshd then starts back up and, for whatever reason, seems to be listening on port 4926... same as before. And escd is now sitting there still doing zillions of connection attempts, but now resulting in lots of TIME_WAIT states and filling up the netstat.

[root@universe ~]# lsof -nPi tcp:4926
COMMAND   PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
sshd    15829 dwb7   12u  IPv6 2417213418      0t0  TCP [::1]:4926 (LISTEN)
sshd    15829 dwb7   13u  IPv4 2417213420      0t0  TCP 127.0.0.1:4926 (LISTEN)

Disconnect, and I'm back to escd trying to talk to a non-existent sshd. Reconnect on my desktop, and escd is now happy and not trying to connect all over the place. Disconnect on my desktop, and escd goes nuts again, trying to connect to a non-existent sshd.

If I *start* the session on the ThinStation, I don't see any of this behaviour with escd, even though it is running:

dwb7     15711     1  0 16:27 ?        00:00:00 ./escd --key_Inserted="/usr/bin/esc" --on_Signal="/usr/bin/esc"

Switching from netstat to ss will also solve bug 5982.

That should be it - all netstat calls have been replaced with ss, which should not exhibit the same poor performance with many simultaneous connections. To be clear, this doesn't fix applications that make a lot of connections, but it should not be a problem any longer. The connection status check is now very fast.
Whilst netstat takes many seconds:

> [root@dhcp-253-21 tl-4.7.0post-server]# time netstat -n -t | wc -l
> 12578
>
> real    0m12.156s
> user    0m0.101s
> sys     0m2.000s

ss is extremely fast:

> [pid  2442] 14:37:31.037513 execve("/usr/sbin/ss", ["ss", "-n", "-t", "state", "established"], [/* 22 vars */]) = 0
> [pid  2442] 14:37:31.055406 +++ exited with 0 +++

However, the invocation for unbinding ports is still concerning:

> [pid  2462] 14:37:31.187538 execve("/usr/sbin/ss", ["ss", "-n", "-t", "-a", "-p"], [/* 22 vars */]) = 0
> [pid  2462] 14:37:49.116805 +++ exited with 0 +++

That's a good 18 seconds. Not enough to trigger a timeout, but a fairly hefty wait for the user. Using ss's filters seems to make quite a dent in the time needed. Perhaps we can use that?

Side note: I can't reproduce these issues on my Fedora 25 machine, so it seems modern kernels are more efficient. ss is still way faster than netstat, but netstat never seems to get into those absurd conditions.

Retested, and it works extremely well. For the "normal" case of no one occupying ports it completely avoids digging around in /proc and is as fast as a normal invocation of ss. Otherwise it still finishes in less than a second.
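
To show what the filtered approach might look like, here is a rough sketch. Port 4926 is just the example port from the lsof output above; the function name and exact filter expression are my own illustration under the assumption that a per-port `sport = :PORT` filter is used, not the actual implementation.

```python
import subprocess


def port_is_occupied(port):
    """Return True if any TCP socket is bound to the given local port.

    Sketch only: uses an ss filter ("sport = :<port>") so the kernel only
    has to report sockets on that port, instead of dumping every socket
    with the slow "ss -n -t -a -p" invocation.
    """
    output = subprocess.check_output(
        ["ss", "-n", "-t", "-a", "sport", "=", ":%d" % port],
        universal_newlines=True)
    # One header line; any further output means the port is in use.
    return len(output.splitlines()) > 1


# Only when this cheap, filtered check says the port is occupied would we
# need the expensive step of finding the owning process (ss -p or /proc).
if port_is_occupied(4926):
    print("port 4926 is occupied; look up the owner")
else:
    print("port 4926 is free")
```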