Some distributions set a low default limit on the number of file descriptors a user may have open. On CentOS, RHEL and Fedora, for example, the soft limit is only 1024 and the hard limit 4096. On large clusters vsmserver can exhaust this allocation, for example during a period where many users are logging out at the same time (lots of agent hostname lookups, session database updates, log file writes, etc.). In most cases this simply results in individual threads crashing, for example:
2015-02-19 08:33:56 ERROR vsmserver.session: Unhandled exception trying to verify session for maastr1-ST0017 on VSM Agent cehca040:904: <class 'socket.gaierror'> [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 332, in xmlrpc_call
  File "/usr/lib64/python2.6/asyncore.py", line 337, in connect
  File "<string>", line 1, in connect_ex
gaierror: [Errno -2] Name or service not known
and errors like:
2015-02-19 08:33:56 ERROR vsmserver.session: FAILURE WRITING SESSION DATABASE!: [Errno 24] Too many open files: '/var/lib/vsm/sessions.temp'
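For reference, the limits in play can be inspected (and the soft limit raised up to the hard limit) from Python itself via the standard resource module; a minimal sketch:

```python
import resource

# Current per-process limit on open file descriptors:
# soft is the enforced limit, hard is the ceiling soft may be raised to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%d hard=%d" % (soft, hard))

# An unprivileged process may raise its own soft limit up to the hard
# limit; raising the hard limit itself requires root.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

On an affected system this would confirm whether vsmserver is indeed running with the 1024/4096 defaults described above.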
However, in at least one instance it has resulted in vsmserver crashing completely:
Traceback (most recent call last):
  File "/opt/thinlinc/sbin/vsmserver", line 22, in <module>
  File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 142, in __init__
  File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 430, in loop
  File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 388, in run_delayed_calls
  File "/opt/thinlinc/modules/thinlinc/vsm/sessionstore.py", line 239, in periodic_session_update
  File "/opt/thinlinc/modules/thinlinc/vsm/call_verifysession.py", line 34, in __init__
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 280, in __init__
  File "/opt/thinlinc/modules/thinlinc/vsm/xmlrpc.py", line 136, in __init__
  File "/usr/lib64/python2.6/asyncore.py", line 288, in create_socket
  File "/usr/lib64/python2.6/socket.py", line 184, in __init__
socket.error: [Errno 24] Too many open files
We should probably do a few things here:
1) Investigate whether we can handle this error condition more gracefully. It is difficult to continue normally in such a situation, but perhaps we could wait and retry the operation later.
2) Investigate whether we can make vsmserver more efficient in its file descriptor usage.
3) Document this limitation (and possibly methods of raising the limit) in the TAG and/or the Platform Specific Notes.
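For 3), the usual way of raising the per-user limit on EL-family systems is /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/); the values below are purely illustrative, not a recommendation:

```
# /etc/security/limits.conf -- per-user fd limits (values illustrative)
# <domain>  <type>  <item>   <value>
root        soft    nofile   8192
root        hard    nofile   16384
```

Note that pam_limits only applies these at login, so a restart of vsmserver from a fresh session would be needed for new limits to take effect.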
This bug is now about investigating these specific tracebacks, not a general fix for running out of file descriptors. We should look at these stacks and see whether we can improve the error handling, e.g. by retrying later.
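A sketch of what "retrying later" could look like: intercept EMFILE/ENFILE at the point where the socket or file would be created, and hand the operation back to a delayed-call scheduler instead of letting the exception propagate up to the main loop. The schedule_retry hook below is a stand-in for vsmserver's run_delayed_calls machinery, not its real API:

```python
import errno
import socket


def is_fd_exhaustion(exc):
    """True if an exception means the process ran out of file
    descriptors (EMFILE) or the whole system did (ENFILE)."""
    return getattr(exc, 'errno', None) in (errno.EMFILE, errno.ENFILE)


def open_with_retry(open_fn, schedule_retry, delay=5.0):
    """Call open_fn(); on fd exhaustion, schedule the same call to be
    retried after `delay` seconds instead of crashing the thread.

    schedule_retry(delay, fn) is a hypothetical hook; in vsmserver it
    would be backed by the existing delayed-call mechanism.
    """
    try:
        return open_fn()
    except (socket.error, OSError) as e:
        if is_fd_exhaustion(e):
            schedule_retry(delay, lambda: open_with_retry(
                open_fn, schedule_retry, delay))
            return None  # caller must tolerate "not yet available"
        raise
```

The awkward part, as noted above, is that callers such as periodic_session_update would need to cope with the operation completing later (or not at all) rather than synchronously.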