Bug 8161 - Python cgi module is deprecated
Summary: Python cgi module is deprecated
Alias: None
Product: ThinLinc
Classification: Unclassified
Component: Other (show other bugs)
Version: trunk
Hardware: PC Unknown
: P2 Normal
Target Milestone: 4.16.0
Assignee: Tobias
Keywords: relnotes, samuel_tester
Depends on:
Reported: 2023-05-31 12:55 CEST by Pierre Ossman
Modified: 2023-12-13 14:56 CET (History)
4 users (show)

See Also:
Acceptance Criteria:
MUST: * There should be no regressions in functionality for POST events done to our tlwebadm and tlwebaccess when replacing cgi.py with a new HTTP-API. * If the new HTTP-API does not cover a certain type of POST request/event, it should be made clear that this is not supported (and needs to be implemented). * The new HTTP-API should be written with extensibility in mind. * tlclient.cgi should not depend on any modules provided by ThinLinc.


Description Pierre Ossman cendio 2023-05-31 12:55:16 CEST
The cgi module in Python will be removed in Python 3.13 (likely released late 2024). We use this in a few places in ThinLinc, so we need to find a different solution:

 * tlclient.cgi
 * profile chooser
 * cheetah
 * tlwebadm
 * tlwebaccess
Comment 1 Tobias cendio 2023-08-25 09:35:34 CEST
As mentioned in comment 1, the cgi module is used in various components of ThinLinc, primarily concerning web services, with the profile chooser being the odd one here.

In the profile chooser code however, the module is effectively no longer in use as that part of the code was redesigned in r38481 leaving only the import behind. Hence, we can scrap importing the module here without problems.

For the web services tlwebadm, tlwebaccess, and the web integration component tlclient.cgi, the cgi module is used for some POST requests and should be fairly straightforward to replace -- see the documentation https://docs.python.org/3/library/cgi.html where there are some suggestions.

I haven't evaluated cheetah component thoroughly yet, but at first glance the replacement process seems a bit more complicated than the ones mentioned above.
Comment 2 Tobias cendio 2023-08-28 08:08:25 CEST
There are several paths one might go in replacing/decoupling the deprecated cgi module.

For starters, since it is a relatively compact module relying only on the standard library -- for which no other included modules have planned deprecation as far as I can tell -- one might simply ship the entire thing independently.

Another solution is to trace the logic in cgi.py and pick what we need for parsing of the http requests. In tlwebaccess, tlwebadm, and tlclient.cgi, we need to parse POST requests and obtain a tractable query. At its core, this is performed in the function read_urlencoded of the class FieldStorage, by decoding the stdin bytes buffer and translate it to keywords and values in a few steps:

qs = self.fp.read(self.length)
qs = qs.decode(self.encoding, self.errors)
query = urllib.parse.parse_qsl(
            qs, self.keep_blank_values, self.strict_parsing,
            encoding=self.encoding, errors=self.errors,
            max_num_fields=self.max_num_fields, separator=self.separator)

Note that for multipart requests, there's a separate function. From that point the information is dictionary-like and the class function FieldStorage.getvalue() that is used abundantly in the aforementioned ThinLinc services should have a trivial replacement. 

A third solution, as suggested on https://docs.python.org/3/library/cgi.html, is employing the multipart module not included in the standard library. This path does not seem like the most appropriate solution since we should be able to get what we need through the standard library which is always appreciated.
Comment 3 Samuel Mannehed cendio 2023-08-29 19:35:54 CEST
See bug 8162 comment 12 for estimates as to which future distros this deprecation are likely to affect. That bug also regards a Python module being deprecated in version 3.13.
Comment 4 Samuel Mannehed cendio 2023-08-29 19:37:08 CEST
(In reply to Samuel Mannehed from comment #3)
> That bug also regards a Python module being deprecated in version 3.13.

removed in version 3.13*
Comment 5 Alexander Zeijlon cendio 2023-09-01 12:49:56 CEST
If we chose to go for the option [1] to ship cgi.py in its entirety, we should also bring the upstream unit tests for it into our environment.

[1] Let's call this Option 1.

These can be found in Lib/test/test_cgi.py in the upstream git repo at https://github.com/python/cpython
Comment 6 Alexander Zeijlon cendio 2023-09-01 13:09:23 CEST
Also for Option 1 we need to be sure that we select (or modify) a version of cgi.py that is compatible with our Python requirement version.
Comment 7 Alexander Zeijlon cendio 2023-09-04 14:21:37 CEST
More on Option 1.

This is likely not the best choice for us due to a couple of reasons:

* On the docs page for cgi.py, its deprecation is partly motivated by "[…] designed poorly and are now near-impossible to fix (cgi) […]". With that in mind, we would rather not have to maintain this module ourselves.

* cgi.py supports only one request per connection, whereas we in the future want to be able to support multiple asynchronous requests in parallel.

* cgi.FieldStorage looks like it is built to be extended by subclasses. This is something we are not interested in, since we currently only want to handle basic POST requests before passing them to our backend.

In summary, Option 1 is not the most beneficial alternative for us.
Comment 8 Alexander Zeijlon cendio 2023-09-04 14:38:07 CEST
Option 2 would be an external library replacement for cgi.py. One such library is suggested in the cgi.py documentation, Multipart [1] for POST requests, which is exactly what we need.

[1] https://pypi.org/project/multipart

However, external libraries are not an ideal solution, since we need Thinlinc to work on multiple platforms, and thus we would need to make sure that the library is available on all platforms.

Additionally, in the case of the Multipart-library, we do not actually need to support multipart-requests at the moment.
Comment 9 Alexander Zeijlon cendio 2023-09-04 14:43:38 CEST
Option 3 would be to write our own HTTP-API.

At the moment, this API would only need to support non-multipart POST-requests to replace the functionality we are using from cgi.py

All such requests from e.g. tlwebadm always have the same format, only posting form data as key-value pairs in clear text format.
Comment 10 Alexander Zeijlon cendio 2023-09-04 14:51:17 CEST
Since our use case is relatively simple at the moment, we will select Option 3 as our way forward.

Implementing a simple API that currently only needs to handle our tlwebadm and tlwebaccess POST-requests. But that can also be easily extended with more functionality when needed in the future.
Comment 12 Linn cendio 2023-09-06 08:33:46 CEST
The issue that Cheetah uses cgi has been reported upstream:
Comment 15 Alexander Zeijlon cendio 2023-09-07 17:06:19 CEST
The CGI python module (cgi.py) has now been replaced with our own implementation for parsing POST requests.

It is more limited in functionality and scope than the CGI module in that it does not serve as a general purpose request handler. It is instead written to provide the subset of features we need, that are found in the CGI module.

In practice, this means that we assume that the POST request to be processed has a URL-encoded body:

* If the request header has Content-Type set to application/x-www-form-urlencoded
  we read and parse the request body accordingly.
  - If there is no Content-Type set, we set to application/x-www-form-urlencoded
    and try to parse it anyway [1].
* If the request header has Content-Type set to something else, we do not parse
  the request body.

[1] This behavior also exists in the CGI module.

This means that we do not support parsing of multipart POST requests, which is something that can be done when using the CGI module.

For now, our limited implementation is sufficient, since we do not provide the option for users to upload large files through tlwebadm and tlwebaccess.
Comment 16 Linn cendio 2023-09-08 17:37:56 CEST
I looked into the usage of cgi.Fieldstorage in tlclient.cgi, and in this file we call FieldStorage without any arguments, meaning that it only uses default values or values from environment variables.

Since FieldStorage in many ways worked as a dict, I tried replacing it with an empty dict to start and replaced getvalue(<val>) with get(<val>, None). I don't get any errors, but it seems like parameters that are retrieved with get_cgi_bool() does not work as before.

The properties we use get_cgi_bool() on are of type <input type="hidden"> in our html, and my theory is that the name hints about that FieldStorage did something magical to these values. However, there are also other hidden input elements that are not bools that are retrieved with a simple getvalue(). More investigation is needed.
Comment 17 Linn cendio 2023-09-11 12:45:08 CEST
Looked into get_cgi_bool(), and despite the name we don't do anything magical there, just another getvalue() call.

The constructor for FieldStorage also takes 3 environment variables into consideration:

For CONTENT_LENGTH and CONTENT_TYPE, we want to add these value into Response.http_headers to mimic the headers we would get from FieldStorage. Regarding QUERY_STRING, this variable is only used for urlencoded post requests [1], but since this variable is empty it is not interesting for us.

[1]: https://github.com/python/cpython/blob/3.12/Lib/cgi.py#L605
Comment 18 Linn cendio 2023-09-11 13:05:27 CEST
FieldStorage also uses sys.stdin.buffer as the default "file" to read if no other file is provided, which is the case for us in tlclient.cgi. 

Tested if stdin buffer was used by replacing it with an empty BufferedReader in cgi.py, and could not retrieve the values from <input> then. Hence, the <input> values are coming from the stdin buffer.

The way FieldStorage handles urlencoded input is to just call urllib.parse.parse_qsl on the file (i.e. stdin buffer in our case). The difference between parse_qsl and parse_qs is just the output format (list of tuples vs dict), so we should hopefully just be able to reuse our function for parsing post request data when parsing the stdin data.
Comment 26 Alexander Zeijlon cendio 2023-09-12 14:34:42 CEST
The HeaderDict class has now been updated with a constructor that will properly capitalize its dictionary keys when it is initialized as a copy of another container.

To achieve the same effect before this update, first an empty HeaderDict had to be initialized, and then key-values had to be added to it one at a time, e.g. in a for-loop.
Comment 27 Alexander Zeijlon cendio 2023-09-12 14:50:36 CEST
Errors originating from receiving a bad request no longer raise an exception. Instead, parse_post_request now returns None, signaling that the request could not be parsed properly. A caller can then act based on this and e.g. return "400 Bad request" to the user agent.
Comment 30 Alexander Zeijlon cendio 2023-09-12 15:51:44 CEST
We also decided to mimic the behavior of cgi.FieldStorage about the Content-Length header; If the Content-Length header is not in the input, we assume that the input buffer, representing the request body, should be read until EOF.

This also means that we need to consider if the Transfer-Encoding header is in the input, since it also concerns reading (parts of) the body. For now, we raise an exception if the Transfer-Encoding header is present, since this is something we have not implemented yet.
Comment 31 Alexander Zeijlon cendio 2023-09-12 16:03:46 CEST
According to the specification [1], handling both Content-Length and Transfer-Encoding during the same request could have security implications.

We have therefore future-proofed parse_post_request by having it return None, signaling "400 Bad request" if both headers are found in the input. This is also supported by the specification [2]. This will become more relevant when the exception discussed in comment 30 is replaced by an implementation for Transfer-Encoding.

[1] https://www.rfc-editor.org/rfc/rfc9112#section-6.1-14
[2] https://www.rfc-editor.org/rfc/rfc9112#section-6.2-2
Comment 32 Linn cendio 2023-09-12 16:58:40 CEST
The dependency on the cgi modul in tlclient.cgi has now been removed. Since we want to move tlclient.cgi out of ThinLinc eventually, we used a separate parsing function in this file to not have any lingering ThinLinc dependencies.

The parsing function in tlclient.cgi is based on parse_post_request() in httpserver, but it has been tweaked to better fit its use in tlclient.cgi. Our use case for cgi.FieldStorage() in this file was different from the one in webadmin and webaccess. Instead, the FieldStorage in tlclient.cgi was dependent on the value of some environment variables (CONTENT_LENGTH, CONTENT_TYPE and QUERY_STRING). 

To mimic the previous behaviour, these environment variables were taken into consideration in our parsing function.
Comment 33 Linn cendio 2023-09-12 17:00:08 CEST
Tested the changes in tlclient.cgi by comparing the content of the downloaded files before and after the changes, and they looked identical. Tested the following scenarios:

1) Visited <url>/thinlinc/tlclient.cgi, wrote name and password into the respective field and pressed "Login". File should contain the username and password in the corresponding variable.

2) Visited e.g. <url>/thinlinc/tlclient.cgi?server_name=asb&username=abc. Did not changing anything in the form and pressed "Login". File should contain the corresponding server name and username. All variables except REMOVE_CONFIGURATION can be set using a query string like this.

3) Visited e.g. <url>/thinlinc/tlclient.cgi?server_name=asb&username=abc. Changed the username in the form and pressed "Login". Could see that the user name from the form had been prioritised.

4) Visited <url>/thinlinc/cgitest.html, pressed "Submit". Could see that the file contained the same value as present in the form.

5) In tlwebadmin, I went to Status -> Sessions, chose a running session and clicked "Shadow session". The downloaded file had shadowing enabled and the username of the session as the user to be shadowed. The page also stayed on the chosen session after downloading.
Comment 34 Alexander Zeijlon cendio 2023-09-12 17:00:34 CEST
Testing, for the parts that do not concern the tlclient.cgi script.

> * There should be no regressions in functionality for POST events done to our
>   tlwebadm and tlwebaccess when replacing cgi.py with a new HTTP-API.
Unit tests have been added that verify the behavior of the new parse_post_request function, where expected outputs have been verified to be equivalent to the old implementation that uses cgi.FieldStorage to parse POST requests.

I have also tested that posting through tlwebadm and tlwebaccess works as expected after the changes were introduced.

> * If the new HTTP-API does not cover a certain type of POST request/event, it
>   should be made clear that this is not supported.
In cases where we cannot process a request due to a missing feature, an exception is raised, to make this clear to the developer.

> * The new HTTP-API should be written with extensibility in mind.
The new HTTP-API consists of a single function for handling POST requests that only implements the features we need right now. Hence, we have not made any decisions on how to handle other request methods.

Furthermore, since it is not overly complicated, extending it should be easy in the future if/when we need e.g. support for the Transfer-Encoding header or the multipart Content-Type.
Comment 35 Linn cendio 2023-09-12 17:20:43 CEST
Updating acceptance criteria to include:
> * tlclient.cgi should not depend on any modules provided by ThinLinc.
It doesn't, it only depends on modules in Python's standard lib.

Also removing the EXTRA criterion:
> * All features of cgi.py should be supported in our HTTP-API.
This will not be taken care of in this bug, but instead moved to a separate bug.

With that, all parts of this bug have been taken care of, marking as resolved.
Comment 37 Samuel Mannehed cendio 2023-09-15 16:29:43 CEST
Verified using jenkins build 3381 on RHEL 9.

> * There should be no regressions in functionality for POST events done to our tlwebadm and tlwebaccess when replacing cgi.py with a new HTTP-API.
They still work well. I tested logging in through Web Access, then terminating the session using Web Admin. In Web Access, I then returned to the login page. Thereafter, I stopped, and started the vsmserver service using Web Admin.

> * If the new HTTP-API does not cover a certain type of POST request/event, it should be made clear that this is not supported (and needs to be implemented).
An exception is raised in this case, code changes here look good.

> * The new HTTP-API should be written with extensibility in mind.

> * tlclient.cgi should not depend on any modules provided by ThinLinc.
The Python file ctc/vsm/tlclient.cgi doesn't import any thinlinc modules. However, it does assume that the file opt/thinlinc/etc/tlclient.conf.webtemplate is present on the machine running it, I'll make a note on bug 6082 for this.

I also checked all the commits, I had a few comments, but nothing major. The changes look good.

Lastly, I checked the source tree for imports of cgi. Cheetah/Tests/Regressions.py still imports cgi, but that is in order to support Python versions older than 3.8:

> try:
>     from cgi import escape as html_escape
> except ImportError:  # Python 3.8+
>     from html import escape as html_escape
This code is still safe for the Python versions we care about (> 3.4).
Furthermore, I would assume we don't run Cheetah's tests.

All good, closing.

Note You need to log in before you can comment on or make changes to this bug.