| Summary: | Unable to login due to excessive (audio?) connections to ssh | | |
|---|---|---|---|
| Product: | ThinLinc | Reporter: | Karl Mikaelsson <derfian> |
| Component: | Other | Assignee: | Karl Mikaelsson <derfian> |
| Status: | CLOSED FIXED | | |
| Severity: | Normal | CC: | botsch, ossman, samuel |
| Priority: | P2 | Keywords: | ossman_tester, relnotes |
| Version: | trunk | | |
| Target Milestone: | 4.8.0 | | |
| Hardware: | PC | | |
| OS: | Unknown | | |
| Acceptance Criteria: | | | |
| Attachments: | strace -s0 -p PID of ssh(d) on the LISTEN side of the zillion connection attempts | | |
Description
Karl Mikaelsson
Also note bug 5982 where we are looking at a netstat replacement.

The customer reported that the issue was encountered on a RHEL 6.8 system with ThinLinc 4.7.0. The customer was using netstat from net-tools version 1.60, which is very old. This means the issue could not have been caused by a netstat update.

Did some more digging. There are two issues here:

a) Reading /proc/net/tcp is slow
b) netstat also does some more processing per entry that is slow

So skipping netstat and reading /proc/net/tcp gives us some improvement, but doesn't fully solve the issue. The cause of the massive number of sockets is probably bug 6133.

The proper fix seems to be to switch to getting sockets via Netlink (directly or via "ss"). There is no performance problem with that method.

> The proper fix seems to be to switch to getting sockets via Netlink (directly
> or via "ss"). There is no performance problem with that method.

Non-exhaustive list of netstat alternatives I've found so far:

Pure Python:
❌ socket.AF_NETLINK (Python >= 2.5)
❌ gnlpy (Python >= 2.6) https://github.com/facebook/gnlpy
❌ pyroute2 (Python >= 2.7) https://pypi.python.org/pypi/pyroute2
❌ NetLink (Python >= 3) https://pypi.python.org/pypi/NetLink/

Subprocesses:
- ss (iproute2) https://wiki.linuxfoundation.org/networking/iproute2

ss availability on select distributions

| Dist         | Present | Package Name |
|--------------+---------+--------------|
| RHEL 5       | ✓       | iproute      |
| RHEL 6       | ✓       | iproute      |
| RHEL 7       | ✓       | iproute      |
|--------------+---------+--------------|
| Ubuntu 14.04 | ✓       | iproute2     |
| Ubuntu 16.04 | ✓       | iproute2     |
| Ubuntu 16.10 | ✓       | iproute2     |
|--------------+---------+--------------|
| SLES 10      | ✓       | iproute2     |
| SLES 11      | ✓       | iproute2     |
| SLES 12      | ✓       | iproute2     |
|--------------+---------+--------------|

I just reproduced the issue again. In both cases, the ThinLinc session was started from my home Fedora 25 machine over a Cisco VPN connection. I disconnected the session there and then connected to it here at work from a thinstation, and that's when we get the multiple connection attempts to sshd. So I can provide strace output from that sshd.

I have not yet had the problem when starting a session at work on a thinstation and switching between thinstations. I also haven't specifically tested starting at work on my RHEL6 box and then switching to a thinstation. At least this time, the Cisco VPN disconnected me in the middle of sessions several times while at home. I would hope that didn't contribute to it. I checked netstat a lot from home and never saw the issue pop up.

Created attachment 770 [details]
strace -s0 -p PID of ssh(d) on the LISTEN side of the zillion connection attempts
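
As an illustration of the "via ss" approach proposed above, here is a minimal sketch of counting established TCP connections by calling ss as a subprocess instead of netstat. The function name is made up, the invocation mirrors the `ss -n -t state established` call seen later in this bug, and it assumes Python 2.7+ for subprocess.check_output; it is not the actual ThinLinc code.

```python
import subprocess


def count_established_tcp():
    """Count established TCP connections via ss instead of netstat.

    Sketch only: assumes "ss -n -t state established" prints one header
    line followed by one line per established connection.
    """
    output = subprocess.check_output(
        ["ss", "-n", "-t", "state", "established"],
        universal_newlines=True)
    lines = output.splitlines()
    # First line is the column header; the rest are connections.
    return max(len(lines) - 1, 0)


if __name__ == "__main__":
    print("established TCP connections: %d" % count_established_tcp())
```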
I found a systemtap script from http://unix.stackexchange.com/questions/186932/log-pid-of-each-connection-attempt that shows me connection attempts.

First thing I'm noticing... if I start a new session on my RHEL6 desktop here, and then disconnect that session, the escd I had running instantly goes bonkers trying to connect to my sshd, which is no longer there, e.g. (systemtap output):

connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926
connect [escd:4519] -> 127.0.0.1:4926

[root@universe ~]# ps -fu dwb7 |grep escd
dwb7      4519     1 82 16:17 ?        00:02:55 ./escd --key_Inserted="/usr/bin/esc" --on_Signal="/usr/bin/esc"

While this doesn't appear to fill up netstat (since there is nothing for escd to connect to), I imagine it could lead to file handle exhaustion. It would appear that at *some* point in the future, escd finally stops.

If I reconnect, now on the thinstation, my sshd then starts back up and, for whatever reason, seems to be listening on port 4926... same as before. And escd is now sitting there still doing zillions of connection attempts, but now resulting in lots of TIME_WAIT states and filling up the netstat.

[root@universe ~]# lsof -nPi tcp:4926
COMMAND   PID USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
sshd    15829 dwb7   12u  IPv6 2417213418      0t0  TCP [::1]:4926 (LISTEN)
sshd    15829 dwb7   13u  IPv4 2417213420      0t0  TCP 127.0.0.1:4926 (LISTEN)

Disconnect, and I'm back to escd trying to talk to a non-existent sshd. Reconnect on my desktop, and escd is now happy and not trying to connect all over the place. Disconnect on my desktop, and escd goes nuts again, trying to connect to a non-existent sshd.

If I *start* the session on the ThinStation, I don't see any of this behaviour with escd, even though it is running:

dwb7     15711     1  0 16:27 ?        00:00:00 ./escd --key_Inserted="/usr/bin/esc" --on_Signal="/usr/bin/esc"

Switching from netstat to ss will also solve bug 5982.

That should be it - all netstat calls have been replaced with ss, which should not exhibit the same poor performance with many simultaneous connections. To be clear, this doesn't fix applications that make a lot of connections, but it should not be a problem any longer. The connection status check is now very fast.
Whilst netstat takes many seconds:

> [root@dhcp-253-21 tl-4.7.0post-server]# time netstat -n -t | wc -l
> 12578
>
> real    0m12.156s
> user    0m0.101s
> sys     0m2.000s

ss is extremely fast:

> [pid  2442] 14:37:31.037513 execve("/usr/sbin/ss", ["ss", "-n", "-t", "state", "established"], [/* 22 vars */]) = 0
> [pid  2442] 14:37:31.055406 +++ exited with 0 +++

However, the invocation for unbinding ports is still concerning:

> [pid  2462] 14:37:31.187538 execve("/usr/sbin/ss", ["ss", "-n", "-t", "-a", "-p"], [/* 22 vars */]) = 0
> [pid  2462] 14:37:49.116805 +++ exited with 0 +++

That's a good 18 seconds. Not enough to trigger a timeout, but a fairly hefty wait for the user. Using ss's filters seems to make quite a dent in the time needed. Perhaps we can use that?

Side note: I can't reproduce these issues on my Fedora 25 machine, so it seems modern kernels are more efficient. ss is still way faster than netstat, but netstat never seems to get into those absurd conditions.

Retested, and it works extremely well. For the "normal" case of no one occupying ports it completely avoids digging around in /proc and is as fast as a normal invocation of ss. Otherwise it still finishes in less than a second.
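
To show what the filtered approach might look like, here is a rough sketch. Port 4926 is just the example port from the lsof output above; the function name and exact filter expression are my own illustration under the assumption that a per-port `sport = :PORT` filter is used, not the actual implementation.

```python
import subprocess


def port_is_occupied(port):
    """Return True if any TCP socket is bound to the given local port.

    Sketch only: uses an ss filter ("sport = :<port>") so the kernel only
    has to report sockets on that port, instead of dumping every socket
    with the slow "ss -n -t -a -p" invocation.
    """
    output = subprocess.check_output(
        ["ss", "-n", "-t", "-a", "sport", "=", ":%d" % port],
        universal_newlines=True)
    # One header line; any further output means the port is in use.
    return len(output.splitlines()) > 1


# Only when this cheap, filtered check says the port is occupied would we
# need the expensive step of finding the owning process (ss -p or /proc).
if port_is_occupied(4926):
    print("port 4926 is occupied; look up the owner")
else:
    print("port 4926 is free")
```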