With VSM Server rev 14337 and earlier, at high load and with some
bad luck, the following traceback was sometimes triggered:
2008-01-29 11:15:51 ERROR vsmserver.extcmd: Unhandled exception getting list of groups for user13: exceptions.NotImplementedError Unhandled write event on <thinlinc.vsm.extproc.GetUserGroups instance at 0x8e37b6c>:['/opt/thinlinc/libexec/tl-initgroups', u'user13']:35 Traceback (most recent call last):
File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 115, in iii1i
obj . handle_write_event ( fd )
File "/opt/thinlinc/modules/thinlinc/vsm/extproc.py", line 317, in handle_write_event
raise NotImplementedError ( "Unhandled write event on %s:%s:%d" % ( repr ( self ) , repr ( self . args ) , fd ) )
NotImplementedError: Unhandled write event on <thinlinc.vsm.extproc.GetUserGroups instance at 0x8e37b6c>:['/opt/thinlinc/libexec/tl-initgroups', u'user13']:35
2008-01-29 11:15:51 ERROR vsmserver: Killing process with pid 3486
Traceback (most recent call last):
File "/opt/thinlinc/sbin/vsmserver", line 20, in ?
VSMServer ( sys . argv )
File "/opt/thinlinc/modules/thinlinc/vsm/vsmserver.py", line 141, in __init__
self . loop ( )
File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 381, in loop
self . _select_loop ( oO000O0oo )
File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 245, in _select_loop
iii1i ( oO0 , OOOo0OO000 )
File "/opt/thinlinc/modules/thinlinc/vsm/async.py", line 120, in iii1i
obj . handle_error ( fd )
File "/opt/thinlinc/modules/thinlinc/vsm/extproc.py", line 351, in handle_error
self . handle_close ( fd )
File "/opt/thinlinc/modules/thinlinc/vsm/extproc.py", line 282, in handle_close
self . handle_exit ( )
File "/opt/thinlinc/modules/thinlinc/vsm/extproc.py", line 440, in handle_exit
Ii11Ii = [ int ( i1iiIII ) for i1iiIII in self . stdout_data . strip ( ) . split ( " " ) ]
ValueError: invalid literal for int():
The root cause for this problem was that the following could happen:
1) The list of readable fd's as returned by select.select was iterated over.
2) A readable object with fd X registered that was also in the
writable list would find that it was no longer needed, and
3) Another readable object would start up a third object, that
registered its FD, and that fd was by bad luck X.
4) When the list of writable fd's were iterated over, the wrong
object would be found when looking up fd X in fd_map, and if that
object was not equipped with a handle_write function, the
traceback would occur.
A good long-term solution to this is to make sure that the writable()
function only returns True when absolutely neccesary. As a short term
solution, I've made unregister_fd remove fd's that are unregistered
from the list of writable objects.
We don't have our own select loop any more as of bug 7636. It's likely that this issue is resolved.
This issue is indeed solved now.
The select loop is now handled by Python's asyncio, so the complexity is definitely out of our hands.
The listed scenario is also safe under asyncio as it builds a queue of all callbacks after calling select(). That means any changes in callbacks for different fd:s won't take effect until the next iteration of the loop.
In the scenario listed above, the original write callback will be called¹, not the new one.
¹ Which will likely crash as the fd is now closed. But that's a different issue and isn't necessarily the event loop's fault.
When testing, I decided to try and recreate the scenario described in comment 0 but in asyncio. Unfortunately, I did not succeed in recreating it.
However, the code in asyncio looks appropriate for this scenario and we have fully swapped from using our own implementation to using asyncio. Because of this, I believe this bug can be considered done, even without recreating this edge case. Closing.