Report from a customer: Once in a while, the master stops communicating with the agent, so people will be denied reconnecting to their sessions.

/var/log/vsmserver.log:

> 2017-01-05 10:10:20 WARNING vsmserver: Timeout trying to verify session for user on VSM Agent host:904
> 2017-01-05 10:10:20 INFO vsmserver.session: User with uid 1000 (user) requested a new session
> 2017-01-05 10:10:20 WARNING vsmserver.session: User user has 1 sessions, only 1 are allowed per user. Denying start of new session

> So, after enabling DEBUG2 in a few places... it looks like on each
> attempt to reconnect to a session, part of checking the session status
> involves running "netstat -n -A inet -A inet6 -t" on the agent.

Here, something is filling up /var/log/secure with "open failed: connect failed: Connection refused" messages:

> Jan 5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan 5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan 5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan 5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan 5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan 5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused
> Jan 5 09:36:46 host sshd[24798]: channel 11: open failed: connect failed: Connection refused

> This fills up the return of netstat with a boatload of TIME_WAIT
> entries, so netstat can take a little while to return.

> In addition to all those timeouts on connect, it looks like something,
> whilst a session is connected, is sitting there and opening connections
> to the local sshd for a particular user.

> [root@host ~]# netstat -anp | grep 4926 | grep -v WAIT
> tcp        0      0 127.0.0.1:4926     0.0.0.0:*          LISTEN      28920/sshd
> tcp        0      0 127.0.0.1:59732    127.0.0.1:4926     ESTABLISHED -
> tcp        0     40 127.0.0.1:64894    127.0.0.1:4926     ESTABLISHED -
> tcp        0      1 127.0.0.1:46892    127.0.0.1:4926     SYN_SENT    -
> tcp        0      0 ::1:4926           :::*               LISTEN      28920/sshd
> [root@host ~]# ps -ef | grep 28920
> user2    28920 17074 19 09:57 ?        00:03:33 sshd: user2@notty

> What's it doing, that it's sitting here repeatedly opening up
> connections to the same sshd?

> Well, I think I finally tracked it down, at least why the sshd process
> was going bananas... stracing of it showed:
>
> ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffdce201760) = -1 EINVAL (Invalid argument)
>
> lots of those, as if something screwed up and it could no longer get to
> whatever audio device.
>
> I started a brand new session and that error is not there and there
> aren't a zillion ssh connections.
Also note bug 5982 where we are looking at a netstat replacement.
The customer reported that the issue was encountered on a RHEL 6.8 system with ThinLinc 4.7.0.
The customer was using netstat from net-tools version 1.60, which is very old. This means the issue could not have been caused by a netstat update.
Did some more digging. There are two issues here:

a) Reading /proc/net/tcp is slow
b) netstat also does some more processing per entry that is slow

So skipping netstat and reading /proc/net/tcp directly gives us some improvement, but doesn't fully solve the issue. The cause of the massive amount of sockets is probably bug 6133.

The proper fix seems to be to switch to getting sockets via Netlink (directly or via "ss"). There is no performance problem with that method.
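For reference, a minimal sketch (assuming the standard /proc/net/tcp layout, not the actual ThinLinc code) of what "skipping netstat and reading /proc/net/tcp" amounts to. It drops netstat's per-entry lookups but still has to read and split every line, so a flood of TIME_WAIT entries keeps it slow:

import collections

# Common TCP state codes as they appear (hex) in column 4 of /proc/net/tcp;
# anything else is reported under its raw code.
TCP_STATES = {'01': 'ESTABLISHED', '02': 'SYN_SENT', '06': 'TIME_WAIT',
              '0A': 'LISTEN'}

def count_tcp_states():
    counts = collections.defaultdict(int)
    for path in ('/proc/net/tcp', '/proc/net/tcp6'):
        try:
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    st = line.split()[3]  # state field, hex encoded
                    counts[TCP_STATES.get(st, st)] += 1
        except IOError:
            pass
    return dict(counts)

if __name__ == '__main__':
    print(count_tcp_states())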
> The proper fix seems to be to switch to getting sockets via Netlink (directly
> or via "ss"). There is no performance problem with that method.

Non-exhaustive list of netstat alternatives I've found so far:

Pure Python:
❌ socket.AF_NETLINK (Python >= 2.5)
❌ gnlpy (Python >= 2.6) https://github.com/facebook/gnlpy
❌ pyroute2 (Python >= 2.7) https://pypi.python.org/pypi/pyroute2
❌ NetLink (Python >= 3) https://pypi.python.org/pypi/NetLink/

Subprocesses:
- ss (iproute2) https://wiki.linuxfoundation.org/networking/iproute2
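To make the "ss as a subprocess" option concrete, here is a rough sketch (not the ThinLinc implementation) of listing established TCP connections via ss, so the kernel filters out the TIME_WAIT flood before any text reaches userspace. The ss path and the exact output column layout are assumptions and may vary between distributions and iproute2 versions:

import subprocess

def established_tcp(ss_path='/usr/sbin/ss'):
    proc = subprocess.Popen([ss_path, '-n', '-t', 'state', 'established'],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    conns = []
    for line in out.decode('utf-8', 'replace').splitlines()[1:]:  # skip header
        fields = line.split()
        if len(fields) >= 2:
            # Assume the last two columns are "Local Address:Port" and
            # "Peer Address:Port", as in ss's default output.
            conns.append((fields[-2], fields[-1]))
    return conns

if __name__ == '__main__':
    for local, peer in established_tcp():
        print('%s -> %s' % (local, peer))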
See also: bug 6142.
ss availability on select distributions

| Dist         | Present | Package Name |
|--------------+---------+--------------|
| RHEL 5       | ✓       | iproute      |
| RHEL 6       | ✓       | iproute      |
| RHEL 7       | ✓       | iproute      |
|--------------+---------+--------------|
| Ubuntu 14.04 | ✓       | iproute2     |
| Ubuntu 16.04 | ✓       | iproute2     |
| Ubuntu 16.10 | ✓       | iproute2     |
|--------------+---------+--------------|
| SLES 10      | ✓       | iproute2     |
| SLES 11      | ✓       | iproute2     |
| SLES 12      | ✓       | iproute2     |
|--------------+---------+--------------|
I just reproduced the issue again. In both cases, the ThinLinc session was started from my home Fedora 25 machine over a Cisco VPN connection. I disconnected the session there and then connected to it here at work from a thinstation, and that's when we get the multiple connection attempts to sshd. So I can provide strace output from that sshd.

I have not yet had the problem when starting a session at work on a thinstation and switching between thinstations. I also haven't specifically tested starting at work on my RHEL6 box and then switching to a thinstation.

At least this time, the Cisco VPN disconnected me in the middle of sessions several times while at home. I would hope that didn't contribute to it. I checked netstat a lot from home and never saw the issue pop up.
Created attachment 770: strace -s0 -p PID of ssh(d) on the LISTEN side of the zillion connection attempts
I found a systemtap script from http://unix.stackexchange.com/questions/186932/log-pid-of-each-connection-attempt that shows me connect attempts.

First thing I'm noticing... if I start a new session on my RHEL6 desktop here, and then disconnect that session, the escd I had running instantly goes bonkers trying to connect to my sshd, which is no longer there, e.g. (systemtap output):

connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
(many more identical lines)

[root@universe ~]# ps -fu dwb7 | grep escd
dwb7      4519     1 82 16:17 ?        00:02:55 ./escd --key_Inserted="/usr/bin/esc" --on_Signal="/usr/bin/esc"

While this doesn't appear to fill up netstat (since there is nothing for escd to connect to), I imagine it could lead to file handle exhaustion. It would appear that at *some* point in the future, escd finally stops.

If I reconnect, now on the thinstation, my sshd then starts back up, and for whatever reason seems to be listening on port 4926... same as before. And escd is now sitting there still doing zillions of connection attempts, but now resulting in lots of TIME_WAIT states and filling up the netstat.

[root@universe ~]# lsof -nPi tcp:4926
COMMAND   PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
sshd    15829 dwb7   12u  IPv6 2417213418      0t0  TCP [::1]:4926 (LISTEN)
sshd    15829 dwb7   13u  IPv4 2417213420      0t0  TCP 127.0.0.1:4926 (LISTEN)

Disconnect, and I'm back to escd trying to talk to a non-existent sshd. Reconnect on my desktop, and escd is now happy and not trying to connect all over the place. Disconnect on my desktop, and escd goes nuts again, trying to connect to the non-existent sshd.

If I *start* the session on the ThinStation, I don't see any of this behaviour with escd, even though it is running:

dwb7     15711     1  0 16:27 ?        00:00:00 ./escd --key_Inserted="/usr/bin/esc" --on_Signal="/usr/bin/esc"
Switching from netstat to ss will also solve bug 5982.
That should be it - all netstat calls have been replaced with ss, which should not exhibit the same poor performance with many simultaneous connections. To be clear, this doesn't fix applications that open a lot of connections, but that should no longer be a problem.
The connection status check is now very fast. Whilst netstat takes many seconds:

> [root@dhcp-253-21 tl-4.7.0post-server]# time netstat -n -t | wc -l
> 12578
>
> real    0m12.156s
> user    0m0.101s
> sys     0m2.000s

ss is extremely fast:

> [pid  2442] 14:37:31.037513 execve("/usr/sbin/ss", ["ss", "-n", "-t", "state", "established"], [/* 22 vars */]) = 0
> [pid  2442] 14:37:31.055406 +++ exited with 0 +++

However, the invocation for unbinding ports is still concerning:

> [pid  2462] 14:37:31.187538 execve("/usr/sbin/ss", ["ss", "-n", "-t", "-a", "-p"], [/* 22 vars */]) = 0
> [pid  2462] 14:37:49.116805 +++ exited with 0 +++

That's a good 18 seconds. Not enough to trigger a timeout, but a fairly hefty wait for the user. Using ss's filters seems to make quite a dent in the time needed. Perhaps we can use that?
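A hypothetical sketch (not the actual ThinLinc change) of how a filter could help the port-unbinding case: the expensive part of "ss -n -t -a -p" is -p, which has to crawl /proc to map socket inodes to processes, so the idea is to only pay for -p when a cheap, filtered query shows the port is actually occupied. The port number and ss path here are illustrative assumptions:

import subprocess

SS = '/usr/sbin/ss'  # path assumed; may simply be "ss" on $PATH

def _run_ss(args):
    proc = subprocess.Popen([SS] + args, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return out.decode('utf-8', 'replace')

def port_occupied(port):
    # Cheap check: kernel-side filter, no process lookup.
    out = _run_ss(['-n', '-t', '-a', 'sport = :%d' % port])
    return len(out.splitlines()) > 1  # anything beyond the header line?

def who_holds_port(port):
    # Only resolve process info (-p, i.e. the /proc crawl) if the port is in use.
    if not port_occupied(port):
        return None
    return _run_ss(['-n', '-t', '-a', '-p', 'sport = :%d' % port])

The point of the two-step structure is that in the common case, where nothing is bound to the port, we never touch /proc at all and the call is as fast as a plain ss invocation.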
Side note: I can't reproduce these issues on my Fedora 25 machine, so it seems modern kernels are more efficient. ss is still way faster than netstat, but netstat never seems to get into those absurd conditions.
Retested, and it works extremely well. For the "normal" case of no one occupying ports, it completely avoids digging around in /proc and is as fast as a normal invocation of ss. Otherwise it still finishes in less than a second.