Bug 7097 - vsmserver uses 100% CPU (broken control pipe)
Summary: vsmserver uses 100% CPU (broken control pipe)
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: VSM Server
Version: 1.3.1
Hardware: PC Unknown
Importance: P2 Normal
Target Milestone: 4.10.0
Assignee: Pierre Ossman
URL:
Keywords: relnotes, samuel_tester
Depends on:
Blocks:
 
Reported: 2018-01-15 13:18 CET by Pierre Ossman
Modified: 2019-02-13 15:41 CET
CC List: 2 users

See Also:
Acceptance Criteria:


Attachments

Description Pierre Ossman cendio 2018-01-15 13:18:02 CET
I was setting up smart card authentication and somewhere during the process I managed to get vsmserver in a state where it was consuming 100% CPU and spamming this to the log:

> Unknown control command  received on control pipe
> Unknown control command  received on control pipe

No relevant log lines appear before that, unfortunately, so I am not sure what the trigger was.

Stopping the process also shows that something is wrong with the control pipe:

> 2018-01-15 13:02:03 INFO vsmserver: Got SIGTERM, signaling process to quit
> Traceback (most recent call last):
>   File "/opt/thinlinc/sbin/vsmserver", line 20, in <module>
>     VSMServer ( sys . argv )
>   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 156, in __init__
>     self . loop ( )
>   File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 418, in loop
>     if 31 - 31: iII111i . OOooOOo - ooOoO0o . OoooooooOO / OoooooooOO
>   File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 170, in signal_handler
>     os . write ( self . control_pipe_write , "T" )
> OSError: [Errno 32] Broken pipe

This was seen on Fedora 27 with ThinLinc 4.8.0. I have failed to reproduce it though.
Comment 2 Pierre Ossman cendio 2018-07-09 17:27:36 CEST
I can't find a sane way for this bug to occur. Nothing ever closes that fifo.

My best guess right now is that we have a bug elsewhere which closes the wrong fd and happens to nuke this one. It then gets replaced by something else (a closed socket?). Rather unlikely, but it would match the observed behaviour.
Comment 4 Pierre Ossman cendio 2018-10-16 13:35:25 CEST
Note that the above suggestion also needs to solve how to break a select() that is waiting for events. Either we need to start polling (i.e. set a maximum timeout for select()), or we need to keep some file descriptor that we can make active. The latter could still be made much simpler than the current system though.

We also looked at using pselect() and relying on EINTR, but pselect() doesn't exist in Python.
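
For reference, the second option (keeping a file descriptor we can make active) could look roughly like the following, using Python's built-in signal.set_wakeup_fd(). This is a minimal sketch, not ThinLinc code, and assumes Python 3.5+:

> import os, select, signal
>
> # The interpreter writes the signal number to this fd whenever a handled
> # signal arrives, which wakes up select() without any custom protocol.
> wakeup_r, wakeup_w = os.pipe()
> os.set_blocking(wakeup_w, False)
> signal.set_wakeup_fd(wakeup_w)
> signal.signal(signal.SIGTERM, lambda signum, frame: None)
>
> # Blocks until a SIGTERM is delivered to the process.
> readable, _, _ = select.select([wakeup_r], [], [])
> if wakeup_r in readable:
>     print("woken up by signals:", list(os.read(wakeup_r, 4096)))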

There is also still the question of where this bug occurs. If it's in some unrelated code then we might get other symptoms if we get rid of the control pipe. Perhaps some other file descriptor will start randomly closing in that case?


For now we'll wait and see if we can get more information on when this occurs and hopefully get some clues to this mystery.
Comment 6 Karl Mikaelsson cendio 2018-12-21 13:33:13 CET
To summarize yesterday's investigations and theories:

We believe the pipe is closed by libnss-sss when doing a username lookup from Python; a rough sketch of the suspected sequence follows the list below.

* When our services daemonize, we close all open fds,
  including one presumably already opened by libnss-sss.

* Our services then start using this fd for the control pipe.

* When another username lookup is performed, nss_sss takes
  over the fd again, closing the pipe.

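Here is a rough sketch of that suspected sequence (not ThinLinc code; the fd numbers are just examples):

> import os, pwd
>
> pwd.getpwnam("root")    # nss_sss mmaps its cache and keeps an fd open to it, e.g. fd 3
> os.closerange(3, 1024)  # daemon_init(): close all fds, including nss_sss's cache fd
> r, w = os.pipe()        # the control pipe now reuses the same fd numbers (3 and 4)
> # If nss_sss later decides its cache is no longer valid, the next getpw*()
> # call makes it "clean up" what it still believes is its cache fd, closing
> # one end of our pipe. Writes from the signal handler then start failing.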

The commit from comment #5 does not seem to be related to the issue at hand.
Comment 7 Pierre Ossman cendio 2018-12-21 14:36:42 CET
sssd also has a cache that clients access via a mmap file. So they will only mess with the socket if they get a cache miss.

Fortunately you can force it to ignore the cache by starting a program with $SSS_NSS_USE_MEMCACHE=NO.
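For example (getent is used here purely as an illustration; "someuser" is a placeholder):

> $ SSS_NSS_USE_MEMCACHE=NO getent passwd someuser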

I've yet to reproduce the issue though. The key problem is getting an nss call in before we open the control pipe.
Comment 8 Pierre Ossman cendio 2019-01-03 14:57:05 CET
The following test run produces an strace very similar to that given by the customer:

> $ strace -vv python -c "import os, pwd; pwd.getpwnam('derfian'); os.close(3); pwd.getpwnam('derfian')"
Comment 9 Pierre Ossman cendio 2019-01-03 15:01:03 CET
I think the cache might actually be important here. The strace shows an munmap() right before the evil close(). This matches well with the cleanup of the cache in nss-sss, so it looks like it is trying to close the cache and not a socket.

Will have to dig further to understand when this happens.


I also found an sssd bug related to the socket handling that might be able to cause this. It was fixed in sssd 1.6.1 and is not relevant in the reported case, but older installations might have multiple ways of triggering the bug in ThinLinc.
Comment 10 Pierre Ossman cendio 2019-01-04 08:42:27 CET
So the python test case above shows this sequence:

> openat(AT_FDCWD, "/var/lib/sss/mc/passwd", O_RDONLY|O_CLOEXEC) = 3
> fstat(3, {st_dev=makedev(253, 0), st_ino=1323918, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16424, st_size=8406312, st_atime=1546526363 /* 2019-01-03T15:39:23.979942098+0100 */, st_atime_nsec=979942098, st_mtime=1546526389 /* 2019-01-03T15:39:49.699373353+0100 */, st_mtime_nsec=699373353, st_ctime=1546526389 /* 2019-01-03T15:39:49.699373353+0100 */, st_ctime_nsec=699373353}) = 0
> mmap(NULL, 8406312, PROT_READ, MAP_SHARED, 3, 0) = 0x7fde90df1000
> fstat(3, {st_dev=makedev(253, 0), st_ino=1323918, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16424, st_size=8406312, st_atime=1546526448 /* 2019-01-03T15:40:48.976367300+0100 */, st_atime_nsec=976367300, st_mtime=1546526389 /* 2019-01-03T15:39:49.699373353+0100 */, st_mtime_nsec=699373353, st_ctime=1546526389 /* 2019-01-03T15:39:49.699373353+0100 */, st_ctime_nsec=699373353}) = 0

Here nss-sss initially opens the cache as a result of the getpwnam() call.

> close(3)                                = 0

Here we deliberately screw up the cache file descriptor for nss-sss.

> fstat(3, 0x7fff6da21790)                = -1 EBADF (Bad file descriptor)

So the next call to getpwnam() gets a failure trying to verify the file descriptor.

> munmap(0x7fde90df1000, 8406312)         = 0
> close(3)                                = -1 EBADF (Bad file descriptor)

This results in nss-sss trying to free the cache, i.e. unmapping it and closing the file descriptor (even though it is already closed).

> getpid()                                = 21609
> fstat(-1, 0x7fff6da21870)               = -1 EBADF (Bad file descriptor)
> getpid()                                = 21609
> socket(AF_UNIX, SOCK_STREAM, 0)         = 3
> fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
> fcntl(3, F_GETFD)                       = 0
> fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
> connect(3, {sa_family=AF_UNIX, sun_path="/var/lib/sss/pipes/nss"}, 110) = 0

This is then the socket fallback being used as it had problems with the cache.
Comment 11 Pierre Ossman cendio 2019-01-04 08:47:00 CET
Compare this with the sequence from the customer report:

> getpeername(6, {sa_family=AF_INET, sin_port=htons(1022), sin_addr=inet_addr("xxxxxxxxx")}, [16]) = 0

This is the low port check in vsmagent, just before it calls the handler. And the next thing we see is when ReqSessionHandler calls getpwnam() to normalise the username:

> munmap(0x7fb43c40e000, 8406312)         = 0
> close(4)                                = 0

This is the cleanup sequence for the nss-sss cache. Note that there is no call to fstat() before this. This means the mapped memory failed to validate, since that is checked before it looks at the file descriptor (see sss_nss_check_header() in nss-sss).

> getpid()                                = 1061
> fstat(-1, 0x7ffd0e1347d0)               = -1 EBADF (Ungültiger Dateideskriptor)
> getpid()                                = 1061
> socket(AF_UNIX, SOCK_STREAM, 0)         = 4
> fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
> fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
> fcntl(4, F_GETFD)                       = 0
> fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
> connect(4, {sa_family=AF_UNIX, sun_path="/var/lib/sss/pipes/nss"}, 110) = 0

And then we have the socket fallback again since it considered the cache bad somehow.
Comment 12 Pierre Ossman cendio 2019-01-04 08:49:18 CET
What we can learn from this is that the final triggering condition is the sss cache going bad somehow. We have yet to determine exactly when this happens. Perhaps just when the cache gets invalidated or rotated? Restart of sssd?

It doesn't matter terribly much though as this should work fine anyway. The key issue is why nss-sss thinks the cache file descriptor is the same as the control pipe (i.e. 4). They should have independent fds.

We need earlier traces to see when and how this mixup occurs.
Comment 13 Pierre Ossman cendio 2019-01-04 13:19:18 CET
Alright, so I've understood what is going on now. The key ingredients are:

 * $HOME must not be set when vsmserver/vsmagent starts

 * "sss" must be configured before "files" in nsswitch.conf

The latter is the case on very modern systems, as they want to use sssd's cache even for local users. This requires sssd 1.16 or newer. RHEL 7 has a new enough version, but does not ship an nsswitch.conf that uses it, i.e. its default configuration does not hit this bug.
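
That is, nsswitch.conf entries along these lines:

> passwd: sss files
> group:  sss files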

What happens under these conditions is that python wants to find ~/.local/lib/python2.x/site-packages. But since it doesn't have a $HOME, it needs to do a getpwuid(getuid()) call to figure out the home directory. This opens the cache and sets up the fd (which is then messed up by daemon_init()).

I'm not sure how the cache is then invalidated. I can easily provoke things using the sss_cache tool, but I doubt that is what happens on the live systems. I don't fully understand the cache handling though, so I'm having trouble finding the conditions for when this can trigger. We'll have to assume it happens under many real world conditions as we have both the customer report and my initial description.
Comment 14 Pierre Ossman cendio 2019-01-04 13:35:01 CET
You can see the problem before things actually explode by looking at the fds and maps of the process:

> $ sudo ls -l /proc/`pgrep -f vsmagent`/fd
> total 0
> lr-x------. 1 root root 64 Jan  4 13:33 0 -> /dev/null
> l-wx------. 1 root root 64 Jan  4 13:33 1 -> /var/log/vsmagent.log
> l-wx------. 1 root root 64 Jan  4 13:33 2 -> /var/log/vsmagent.log
> lr-x------. 1 root root 64 Jan  4 13:33 3 -> 'pipe:[16229793]'
> l-wx------. 1 root root 64 Jan  4 13:33 4 -> 'pipe:[16229793]'
> lrwx------. 1 root root 64 Jan  4 13:33 5 -> 'socket:[16229798]'
> 
> $ sudo grep lib/sss /proc/`pgrep -f vsmagent`/maps
> 7fdfc86be000-7fdfc8ec3000 r--s 00000000 fd:00 1324160                    /var/lib/sss/mc/passwd

The cache file (/var/lib/sss/mc/passwd) should never be mapped without a corresponding fd open to it. So this state shows a process that will get screwed up as soon as the cache gets invalidated.
Comment 15 Pierre Ossman cendio 2019-01-04 13:56:02 CET
Web Access and tlwebadm are also affected by this bug, but not in the same way. In both servers fd 4 will be a new incoming connection. This gets forked off to a child that does all the meaty stuff. If that child does a getpw*() call, then it will drop the TCP connection (if the cache has been invalidated). In the browser you generally get some generic error message in these cases.

A key point to note here is that each child inherits the cache state from the parent. So once the cache is invalidated it will remain so until tlwebaccess/tlwebadm is restarted. The parent doesn't do any getpw*() calls itself to get the cache re-validated.

Web Access becomes entirely unusable at this point as it has to do a user lookup for the login. tlwebadm mostly works as it does very few such calls. The only places are when editing TLDC entries, or doing test lookups on the health page.
Comment 19 Pierre Ossman cendio 2019-01-07 13:41:10 CET
All fixed. Tester should make sure the bug can no longer be provoked by:

 * Configuring "passwd: sss files" in nsswitch.conf on a modern system (sssd 1.16+)

 * Running sss_cache -U after the services have started

General testing should also be done:

 * Starting a session

 * Logging in via web access

 * User lookup in tlwebadm
Comment 20 Pierre Ossman cendio 2019-01-09 12:57:02 CET
It seems like this can trigger whenever /etc/passwd or /etc/group changes. So that could be what's been happening on the client system.
Comment 21 Samuel Mannehed cendio 2019-02-13 15:41:14 CET
Verified that I could reproduce the problem on Fedora 29 with:
 * sssd-2.0.0.4
 * sss before files for passwd & groups in /etc/nsswitch.conf
 * ThinLinc server build 5957

- 100% CPU for the vsmagent & vsmserver processes on the server
- Spam in logs: "Unknown control command received on control pipe"
- Couldn't log in using Web Access: ERR_EMPTY_RESPONSE

Verified that it works with build 6040 on the same server, same config:
 * Web Access login
 * Native client login
 * Web Adm
 * No spam in log
 * Sane CPU usage
