We've seen unfsd crash with a segfault, with varying frequency, during session startup. This seems to be the primary trigger for bug 7900. All we see in tlclient.log is:

> 2024-01-16T14:25:49: unfsd: segmentation fault
> 2024-01-16T14:25:49: Process 1306849 exited with code 1

It's unclear exactly which clients have seen this, but we've seen it with at least Fedora 38 and Fedora 39.
I caught this with gdb, and got this:

> Program received signal SIGSEGV, Segmentation fault.
> (gdb) bt full
> #0  0x00007fdb1cc5a4d0 in __GI_svc_getreq_poll (pollretval=<optimized out>, pfdp=<optimized out>)
>     at svc.c:440
>         p = 0x1117448
>         i = 1
>         fds_found = 1
> #1  __GI_svc_getreq_poll (pfdp=0x1117440, pollretval=2) at svc.c:426
>         fds_found = <optimized out>
>         i = <optimized out>
>         p = <optimized out>
> #2  0x0000000000403421 in unfs3_svc_run () at daemon.c:828
>         r = <optimized out>
>         r = <optimized out>
> #3  main (argc=<optimized out>, argv=<optimized out>) at daemon.c:1007
>         tcptransp = <optimized out>
>         udptransp = <optimized out>
>         pid = <optimized out>
>         act = {__sigaction_handler = {sa_handler = 0x1, sa_sigaction = 0x1}, sa_mask = {__val = {0,
>               4857, 9175848, 12582912, 64, 140724358969024, 140724358969104, 140579058093551, 0,
>               32768, 64, 12582912, 16, 0, 4, 262144}}, sa_flags = 0, sa_restorer = 0x0}
>         actset = {__val = {0, 4857, 9175848, 12582912, 64, 140724358969024, 140724358969104,
>               140579058093551, 0, 32768, 64, 12582912, 16, 0, 4, 262144}}
>         res = 1

I have no idea why svc_getreq_poll() appears twice in the stack, with the arguments switched.

The crash happens because p->fd evaluates to -927297757, which results in a far out-of-bounds access. The reason for that is that the table given in pfdp is absurd. It contains:

> (gdb) print pfdp
> $4 = (struct pollfd *) 0x1117440
> (gdb) print pfdp[0]
> $5 = {fd = 4375, events = 0, revents = 0}
> (gdb) print pfdp[1]
> $6 = {fd = -927297757, events = -24116, revents = 27757}
> (gdb) print pfdp[2]
> $7 = {fd = 5, events = 195, revents = 65}
> (gdb) print pfdp[3]
> $8 = {fd = 49, events = 0, revents = 0}

The actual sockets opened by unfsd have fds 3, 4, 5 and 6, so only entry 2 above looks reasonable. The real table has these contents:

> (gdb) print svc_pollfd
> $10 = (struct pollfd *) 0x111e780
> (gdb) print svc_pollfd[0]
> $11 = {fd = 3, events = 195, revents = 65}
> (gdb) print svc_pollfd[1]
> $12 = {fd = 4, events = 195, revents = 0}
> (gdb) print svc_pollfd[2]
> $13 = {fd = 5, events = 195, revents = 65}
> (gdb) print svc_pollfd[3]
> $14 = {fd = 6, events = 195, revents = 0}

This is very surprising, since unfsd calls svc_getreq_poll() with this table:

> #2  0x0000000000403421 in unfs3_svc_run () at daemon.c:828
> 828             svc_getreq_poll(svc_pollfd, r);

Somewhere between the call to svc_getreq_poll() and the crash, the value of svc_pollfd changes from 0x1117440 to 0x111e780.

Things become less surprising once you look closer. That table gets realloc():ed whenever more room is needed, which happens when the server reaches a new maximum number of connected clients. That in turn happens a few times during login. So presumably handling one of the requests caused new connections to be registered, svc_pollfd got realloc():ed to 0x111e780, and svc_getreq_poll() kept iterating over the old, now freed, table at 0x1117440.

glibc doesn't have a man page for svc_pollfd (or svc_getreq_poll()), but BSD does, and it explicitly warns that you need to make a copy of svc_pollfd:

> svc_pollfd is a global variable reflecting the RPC service side's read file
> descriptor array. This variable is only of interest if service implementors do
> not call svc_run(), but rather do their own asynchronous event processing. This
> variable is read-only, and it may change after calls to svc_getreq_poll() or
> any creation routines. Do not pass it directly to poll(2)! Instead, make a copy
> and pass that instead.
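For reference, here is a minimal sketch of the pattern the BSD man page describes, for anyone doing their own event loop instead of calling svc_run(). The function name is made up and this is not the actual unfsd code; the point is simply that poll() and svc_getreq_poll() must be given a private copy of the table, since dispatching a request can realloc() svc_pollfd behind our backs:

> #include <rpc/rpc.h>      /* svc_pollfd, svc_max_pollfd, svc_getreq_poll() */
> #include <poll.h>
> #include <stdlib.h>
> #include <string.h>
>
> /* Hypothetical replacement for svc_run(); a sketch, not the actual fix. */
> static void my_svc_loop(void)
> {
>     for (;;) {
>         int nfds = svc_max_pollfd;
>         if (nfds == 0)
>             break;                          /* no transports registered */
>
>         /* Copy the global table; it may be realloc():ed during dispatch. */
>         struct pollfd *fds = malloc(nfds * sizeof(struct pollfd));
>         if (fds == NULL)
>             break;
>         memcpy(fds, svc_pollfd, nfds * sizeof(struct pollfd));
>
>         int r = poll(fds, nfds, -1);
>         if (r > 0)
>             svc_getreq_poll(fds, r);        /* pass the copy, not svc_pollfd */
>
>         free(fds);
>         if (r < 0)
>             break;
>     }
> }

As far as I can tell, this is also roughly what glibc's own svc_run() does internally: it keeps a private copy of the array and refreshes it from svc_pollfd on every iteration.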
It's a big mystery why we're seeing this more frequently now. This code has been present in unfsd for 11 years (since bug 3104). The relevant code in glibc has been unchanged for 20 years. One theory is that the bug has always been present, but we've historically been lucky that the freed memory hasn't been overwritten. Something could have changed in glibc that makes it more likely that the old memory gets reused and hence corrupted.
The Windows code does not look affected. It doesn't use the poll interface, and it also makes a copy of the fd_set used in the older interface.
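For comparison, the safe pattern with the older select()-based interface looks roughly like this. This is just a sketch with a made-up function name, not the actual Windows code (which goes through Winsock), but it shows the fd_set copy that makes that path safe:

> #include <rpc/rpc.h>          /* svc_fdset, svc_getreqset() */
> #include <sys/select.h>
>
> /* Hypothetical event loop using the older fd_set interface. */
> static void my_svc_loop_select(void)
> {
>     for (;;) {
>         fd_set readfds = svc_fdset;   /* copy; the global set may change */
>         int r = select(FD_SETSIZE, &readfds, NULL, NULL, NULL);
>         if (r > 0)
>             svc_getreqset(&readfds);
>         else if (r < 0)
>             break;
>     }
> }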
macOS also uses the older interface, so it should be safe as well.
The fix is now in place: we now try to follow more closely what svc_run() does. The fix has also been pushed upstream: https://github.com/unfs3/unfs3/commit/9ed6a635fafc43fd97d8977a9ea4f075dde8c96e

I tested a bunch of new logins to my SLES 15 server and was not able to provoke any crashes, so the issue is hopefully fully resolved.

> MUST:
>
> * Exported drives should successfully mount during login

Yes, I can see the mounted drives from my client in the session.

> SHOULD:
>
> * The fix should also be applied upstream

Yup, see above.
The reasoning seems sane, the commit looks good, and the acceptance criteria are fulfilled:

> MUST:
>
> * Exported drives should successfully mount during login

Yes. Using a 4.16.0rc1 client on Fedora 38, we could see the segfault crash in ~80% of the logins. After upgrading to build 3385, we could no longer reproduce the crash. The server was running 4.16.0rc1 on RHEL 7.

> SHOULD:
>
> * The fix should also be applied upstream

Yep, see above links.