Bug 8013 - Socket timeouts can unintentionally be ignored
Summary: Socket timeouts can unintentionally be ignored
Status: CLOSED FIXED
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Misc (show other bugs)
Version: 4.14.0
Hardware: PC Unknown
: P2 Normal
Target Milestone: 4.15.0
Assignee: Linn
URL:
Keywords: ossman_tester, prosaic, tobfa_tester
Depends on:
Blocks:
 
Reported: 2022-10-25 10:16 CEST by Frida Flodin
Modified: 2022-12-14 09:00 CET (History)
2 users (show)

See Also:
Acceptance Criteria:
* Timeout signals cast for a TLS handshake timeout should behave similarly to Python's SysExit. * If a timeout occurs, a timeout signal should be raised. * Other errors, e.g. internal errors, should still have correct error handling.


Attachments

Description Frida Flodin cendio 2022-10-25 10:16:21 CEST
Our HTTPServer handles all socket timeouts as expected and no traceback is logged.  

This is correct for the case where the timeout belongs to the socket of this object, but it will also catch socket.timeout from other sockets. Timeouts from these other sockets are real, unexpected errors, and we should not just ignore these.
Comment 1 Frida Flodin cendio 2022-10-26 14:55:37 CEST
It went a bit fast when I wrote the description. 

There are no real socket.timeout that we want to handle. It's our own HTTPServer that throws socket.timeout when the TLS handshake timeout is exceeded. It's this specific case that we want to handle as expected, other socket.timeout exceptions should result in logged traceback.
Comment 4 Linn cendio 2022-11-08 17:00:34 CET
There is a small difference between using 'except:' and 'except Exception:' in error handling.

'except' will catch all types of exceptions, even SysExit and other exception that inherit directly from BaseException. 'except Exception' will not catch SysExit and friends.

Since we want our new timeout exception to function more as a signal than an actual exception, we want it to mimic the behaviour of SysExit.
Comment 7 Linn cendio 2022-11-10 08:49:09 CET
Tested and everything seems to be working:

* Triggering a TimeoutSignal in handle_one_request() or in finish_request() causes the side not to load at all and the following log message to be printed:
> Request timed out: TimeoutSignal('client exceeded maximum timeout')
* Causing an internal error in the code still gives a traceback in the log.
Comment 9 Pierre Ossman cendio 2022-11-16 16:04:38 CET
The name of the exception class leaks in to the log files, which gets extra bad once it is obfuscated:

> 2022-11-16 16:03:15 ERROR tlwebaccess[10848]: [::ffff:10.47.1.240] Request timed out: OO0o000o('client exceeded maximum timeout')
Comment 10 Pierre Ossman cendio 2022-11-17 12:43:46 CET
Tested build 2924 using a RHEL 9 server. All the tests were done with both tlwebadm and tlwebaccess.

> * Timeout signals cast for a TLS handshake timeout should behave similarly to Python's SysExit.
>
> * If a timeout occurs, a timeout signal should be raised.

Timeouts seem to trigger correctly. I've tested:

 * TLS timeout by simply doing "telnet server 300"
 * Request timeout via a python script doing TLS handshake, then send "GET..." without an empty line
 * Response timeout via a python script doing a request for a 50 MiB file, then not reading anything back

> * Other errors, e.g. internal errors, should still have correct error handling.

Yup. I added a bad variable reference, which resulted in a 500 code and a traceback:

> ...
> 2022-11-16 16:28:46 ERROR tlwebaccess[11557]: [::ffff:10.47.1.240]   File "/opt/thinlinc/modules/thinlinc/tlwebaccess/server.py", line 92, in post_or_get
> 2022-11-16 16:28:46 ERROR tlwebaccess[11557]: [::ffff:10.47.1.240]     a = b
> 2022-11-16 16:28:46 ERROR tlwebaccess[11557]: [::ffff:10.47.1.240] NameError: name 'b' is not defined

I also checked that unhandled socket timeouts now also give a 500 code and a traceback by adding a socket connection to some random ip:

> ...
> 2022-11-16 16:27:09 ERROR tlwebaccess[11482]: [::ffff:10.47.1.240]   File "/opt/thinlinc/modules/thinlinc/tlwebaccess/server.py", line 91, in post_or_get
> 2022-11-16 16:27:09 ERROR tlwebaccess[11482]: [::ffff:10.47.1.240]     s.connect(("1.2.3.4", 9000))
> 2022-11-16 16:27:09 ERROR tlwebaccess[11482]: [::ffff:10.47.1.240] TimeoutError: [Errno 110] Connection timed out

However, I got the same thing when I tried in with 4.14.0. So it seems the issue didn't always occur.

After digging around I found that the issue only seems to happen with Python < 3.3 (which we don't support anyway), and Python >= 3.10. And RHEL 9 uses Python 3.9.

Testing the same thing with ThinLinc 4.14.0 on Fedora 35, which has Python 3.10, I get this in the log:

> 2022-11-17 12:37:24 ERROR tlwebadm[3181434]: [::ffff:127.0.0.1] Request timed out: TimeoutError(110, 'Connection timed out')

Which of course is incorrect. I should be getting a traceback. So the issue was real, and it is now confirmed fixed.

Lastly, normal usage seems to still work fine. Tested using Firefox.
Comment 12 Linn cendio 2022-11-17 14:36:19 CET
The error message should no longer contain the exception class, but rather only the informative string.

Tested on both Fedora 35 (Python 3.10) and RHEL 8 (Python 3.6), and the error message now looks like the following:
> Request timed out: client exceeded maximum timeout
Comment 13 Pierre Ossman cendio 2022-11-17 15:42:55 CET
Logs look much better now:

> 2022-11-17 15:41:50 ERROR tlwebaccess[26795]: [::ffff:10.47.1.240] Request timed out: client exceeded maximum timeout

Also looked at the unit tests, and they look good.

That should be everything. Closing.
Comment 14 Pierre Ossman cendio 2022-11-22 13:09:28 CET
Unfortunately, this doesn't work correctly with Python 3.10+. I just tested 4.14.0 in comment 10, not that the current build gives a correct traceback for socket timeouts.

The remaining problem is within the Python library, where it catches TimeoutError. We happened to get the desired behaviour on some versions of Python as they changed the sockets to throw TimeoutError, but the http.server code still tried to catch the old socket.timeout exception. This got "fixed" in 3.10.

It also doesn't look like Python has always caught these errors.

We should report this problem to Python, and we should see if there are any ways to work around it.
Comment 15 Pierre Ossman cendio 2022-11-25 15:51:11 CET
I've done some digging, and got a clearer picture of things now. The following scenarios are problematic in different Python versions:

Up to 2.6: Everything works properly
2.6 to 3.9: Timeouts from Python sockets are hidden (uses special exception socket.timeout, which has no low-level equivalent)
3.10 and forward: All types of timeouts are hidden. Everything that throws TimeoutError, which is everything where errno is ETIMEDOUT (connection timeout) and most (or all?) explicit timeout conditions within Python.

What happens is that we just get a call to log_error(), instead of handle_error().

Reported upstream:

https://github.com/python/cpython/issues/99777
Comment 16 Pierre Ossman cendio 2022-11-28 16:45:50 CET
I think it's possible to work around this.

The problem is in BaseHTTPRequestHandler.handle_one_request(), which catches TimeoutError and then immediately calls log_error(). We should be able to detect this in log_error(), and just let the exception propagate again.

It's very straight-forward to detect that we are in an exception handler, and that we are indeed handling a TimeoutError. The issue is that we don't want to block other calls to log_error() when handling TimeoutError as those might be reasonable. So, we want to detect that we are being called from the offending code.

Unfortunately, Python has no clean way of getting the callers. We can get at the stack, which gives us a bunch of information, but not the functions or methods themselves. It does give us the code object, though. Which we should be able to compare with the code object attached to BaseHTTPRequestHandler.handle_one_request().
Comment 18 Pierre Ossman cendio 2022-11-29 09:34:23 CET
Seems to work well. Tested on Python 3.10 and 3.4 to make sure it works even if socket.timeout is aliased to TimeoutError or not.

There is a small gotcha when re-raising exceptions though, and that is that it keeps adding the extra stuff to the traceback:

> ...
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]   File "/opt/thinlinc/modules/thinlinc/httpserver.py", line 380, in handle
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]     super ( ) . handle ( )
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]   File "/usr/lib64/python3.10/http/server.py", line 432, in handle
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]     self.handle_one_request()
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]   File "/opt/thinlinc/modules/thinlinc/websocketserver.py", line 57, in handle_one_request
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]     super().handle_one_request()
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]   File "/opt/thinlinc/modules/thinlinc/httpserver.py", line 155, in handle_one_request
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]     super ( ) . handle_one_request ( )
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]   File "/usr/lib64/python3.10/http/server.py", line 424, in handle_one_request
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]     self.log_error("Request timed out: %r", e)
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]   File "/usr/lib64/python3.10/http/server.py", line 420, in handle_one_request
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]     method()
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]   File "/opt/thinlinc/modules/thinlinc/websocketserver.py", line 69, in _websocket_do_GET
> 2022-11-29 09:19:09 ERROR tlwebaccess[3289159]: [::ffff:127.0.0.1]     self.do_GET()
> ...

Notice that the call to log_error() has been included there.
Comment 19 Tobias cendio 2022-12-13 16:26:31 CET
General
==================================================================
Tested using dev build #2953 on RHEL8 (Python 3.6.8) and Fedora37 (Python
3.11.0). All tests were done in the framework of ThinLinc Web Access in Firefox.

Acceptance criteria
==================================================================
[✓] Timeout signals cast for a TLS handshake timeout should behave similarly to
    Python's SysExit.

The idea is to mimic what generally happens when SystemExit exception is raised,
i.e. the exception filters through all exception handling of the sort 'except
Exception'. Since the introduced exception TimeoutSignal directly inherits
BaseException, it has the same effect in terms of exception filtering.

[✓] If a timeout occurs, a timeout signal should be raised.

Fabricating a timeout yields the desired TimeoutSignal exception handling with
no traceback.

[✓] Other errors, e.g. internal errors, should still have correct error
    handling.

Introducing e.g. a faulty reference or a TimeoutError correctly results in a
traceback.

Further testing
==================================================================
[✓] socket.timeout/TimeoutError ambiguity issue (comment 14 to comment 18).

This issue was in part tested using Python version 3.6.8. It was selected since
ThinLinc supports version 3.4+, and the socket.timeout/TimeoutError ambiguity
exists (at least) within versions 3.4-3.9. Fabricated exceptions of
socket.timeout and TimeoutErrors are introduced in
http.server/handle_one_request(). The expected behavior in ThinLinc is that
TimeoutError is treated as most regular exceptions, while socket.timeout slips
through with little notice (save for a line in the log) without the workaround
in place.

Missing workaround:
- socket.timeout: no traceback, a line in the log.
- TimeoutError: traceback.

With workaround:
- socket.timeout: traceback.
- TimeoutError: traceback.

Furthermore, this issue was tested using a Python version (Python 3.11) where
the socket.timeout/TimeoutError ambiguity is resolved. Here, the expected
behavior in ThinLinc is that both types of timeout exceptions results in a
traceback, should the workaround be in place. Else, they both slip through, save
for a line in the log.

Missing workaround:
- socket.timeout: no traceback, a line in the log.
- TimeoutError: traceback.

With workaround:
- socket.timeout: no traceback, a line in the log.
- TimeoutError: traceback.

Conclusion
==================================================================
The general issue how socket.timeout was being misused has been resolved. In
addition, a workaround for the ambiguity surrounding socket.timeout/TimeoutError
in various Python versions has been introduced.

Closing.
Comment 20 Tobias cendio 2022-12-14 08:52:10 CET
In the closing comment, comment 19, towards the end where the workaround was evaluated for Python version 3.11, it should've said that both aliases of the timeout error (socket.timeout & TimeoutError) yielded a traceback as expected.
Comment 21 Tobias cendio 2022-12-14 09:00:05 CET
Correcting comment 19 and clarifying comment 20: the workaround evaluated for Python version 3.11 yielded the following

Missing workaround:
- socket.timeout: no traceback, a line in the log.
- TimeoutError: no traceback, a line in the log.

With workaround:
- socket.timeout: traceback.
- TimeoutError: traceback.

The results above reflect the expected behavior we are looking for.

Note You need to log in before you can comment on or make changes to this bug.