Some distributions set a low default limit on the number of file descriptors a user may have open. On CentOS, RHEL and Fedora, for example, the soft limit is only 1024 and the hard limit 4096. On large clusters vsmserver can exhaust this allocation, for example during a period when many users are logging out at the same time (lots of agent hostname lookups, session database updates, logfile writes, etc.). In most cases this simply results in specific threads crashing, for example:

--
2015-02-19 08:33:56 ERROR vsmserver.session: Unhandled exception trying to verify session for maastr1-ST0017 on VSM Agent cehca040:904: <class 'socket.gaierror'> [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 332, in xmlrpc_call
  File "/usr/lib64/python2.6/asyncore.py", line 337, in connect
  File "<string>", line 1, in connect_ex
gaierror: [Errno -2] Name or service not known
--

and errors like:

--
2015-02-19 08:33:56 ERROR vsmserver.session: FAILURE WRITING SESSION DATABASE!: [Errno 24] Too many open files: '/var/lib/vsm/sessions.temp'
--

However, in at least one instance it has resulted in vsmserver crashing completely:

--
Traceback (most recent call last):
  File "/opt/thinlinc/sbin/vsmserver", line 22, in <module>
  File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 142, in __init__
  File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 430, in loop
  File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 388, in run_delayed_calls
  File "/opt/thinlinc/modules/thinlinc/vsm/sessionstore.py", line 239, in periodic_session_update
  File "/opt/thinlinc/modules/thinlinc/vsm/call_verifysession.py", line 34, in __init__
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 280, in __init__
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 136, in __init__
  File "/usr/lib64/python2.6/asyncore.py", line 288, in create_socket
  File "/usr/lib64/python2.6/socket.py", line 184, in __init__
socket.error: [Errno 24] Too many open files
--

We should probably do a couple of things here:

1) Investigate whether we can handle this error condition better somehow. It is difficult to continue normally in such a situation, but perhaps we could wait and try again later.

2) Investigate whether we can make vsmserver more efficient with its fd usage.

3) Document this limitation (and possibly methods of increasing it) in the TAG and/or Platform Specific Notes.
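For reference regarding item 3, a process can inspect and (up to the hard limit) raise its own fd limit via the standard `resource` module; this is just a sketch of the mechanism, not a proposed change to vsmserver:

```python
import resource

# Inspect the current file-descriptor limits for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%d hard=%d" % (soft, hard))

# A process may raise its soft limit up to the hard limit without
# special privileges; raising the hard limit itself requires root
# (or a change to /etc/security/limits.conf or the systemd unit).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Raising the soft limit at startup would only paper over the problem on very large clusters, but it is cheap and easy to document.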
This bug is now about investigating these specific tracebacks, not a general fix for running out of file descriptors. We should look at these stacks and see whether we can improve the error handling, e.g. by trying again later.
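A minimal sketch of the "try again later" idea for the connect path in the stacks above. The function name, retry count and delay are hypothetical, not vsmserver's actual API; the point is catching EMFILE and backing off instead of letting the exception escape the main loop:

```python
import errno
import socket
import time

RETRY_DELAY = 5.0  # hypothetical back-off before retrying (seconds)


def connect_with_retry(address, retries=3):
    """Create and connect a socket, backing off when the process is
    out of file descriptors (EMFILE) instead of crashing outright."""
    for attempt in range(retries):
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        except OSError as e:
            if e.errno == errno.EMFILE and attempt < retries - 1:
                time.sleep(RETRY_DELAY)  # wait for fds to be released
                continue
            raise
        try:
            sock.connect(address)
            return sock
        except socket.gaierror:
            sock.close()  # don't leak the fd on a name-lookup failure
            raise
```

In the real (asyncore-based) server the sleep would of course have to be a rescheduled delayed call rather than a blocking wait, but the error-classification logic would be the same.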