8240 – Files and subprocess use wrong encoding in future Python

Status:	NEW

Alias:	None

Product:	ThinLinc
Classification:	Unclassified
Component:	Other
Version:	trunk
Hardware:	PC Unknown

Importance:	P2 Normal
Target Milestone:	LowPrio
Assignee:	Bugzilla mail exporter

Reported:	2023-10-19 10:33 CEST by Pierre Ossman
Modified:	2023-10-24 13:15 CEST
CC List:	0 users

Description Pierre Ossman cendio

2023-10-19 10:33:46 CEST

Python has signalled that they will likely change the default encoding for files and subprocesses to UTF-8 in the future, regardless of the configured locale:

https://peps.python.org/pep-0597/

The rationale is that many files are meant to be shared, so they should use a common encoding rather than a system specific one.

The vast majority of systems already use UTF-8, so in practice this will change very little. It can affect the few users that use a non-standard encoding, though.

In ThinLinc, we've been very careful about specifying the correct encoding when needed, so everything is designed with the assumption that Python will respect the system encoding if none is explicitly specified. If Python changes this default behaviour, then many of our calls need to be adjusted.

For convenience, they've added the special "locale" encoding for these cases¹ so you don't have to look up the actual encoding in every place. However, that support was not added until Python 3.10, which means we will likely have a period where we need to support both newer Pythons with the changed default and older Pythons without encoding="locale" support.

¹ It's not a standard codec though, so it cannot be used everywhere.
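One way to bridge that period could look like the sketch below. It is only an illustration: the helper name locale_encoding_arg() is made up for this example and is not existing ThinLinc code. The fallback to locale.getpreferredencoding(False) on older Pythons is an assumption based on what "locale" meant when it was introduced.

```python
import locale
import sys


def locale_encoding_arg():
    """Return a value for open(encoding=...) that means "use the locale encoding".

    Hypothetical helper, not part of ThinLinc or the standard library.
    """
    if sys.version_info >= (3, 10):
        # The special "locale" name is accepted by open() from Python 3.10.
        return "locale"
    # Older Pythons: look up the actual encoding name ourselves.
    return locale.getpreferredencoding(False)
```

A caller would then write open(path, encoding=locale_encoding_arg()). Since "locale" is not a real codec, the return value is not suitable for str.encode() or bytes.decode(); those places would still need the actual encoding name.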

Comment 1 Pierre Ossman cendio

2023-10-19 13:43:12 CEST

To complicate things even further, they've slightly changed what "current locale" means in Python 3.11. Previously, it meant "locale.getpreferredencoding(False)", but they've now changed it to "locale.getencoding()".

The difference between the two isn't terribly clear, and the documentation just states that they are basically the same, except that UTF-8 mode is ignored in the new one. Which seems like an odd minor difference.

To make things extra confusing, this is how the new handling is implemented in subprocess:

>     if sys.flags.utf8_mode:
>         return "utf-8"
>     else:
>         return locale.getencoding()

Which makes it look like it is basically doing the same thing as before, i.e. "locale.getpreferredencoding(False)".
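To compare the two functions on a given system, something like the following sketch could be used. The function name describe_locale_encodings() is invented for this example. locale.getencoding() only exists from Python 3.11, so it is looked up defensively.

```python
import locale
import sys


def describe_locale_encodings():
    """Collect what the interpreter reports for the locale encoding.

    Hypothetical diagnostic helper, for illustration only.
    """
    info = {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "getpreferredencoding(False)": locale.getpreferredencoding(False),
    }
    # locale.getencoding() was added in Python 3.11.
    getencoding = getattr(locale, "getencoding", None)
    if getencoding is not None:
        info["getencoding()"] = getencoding()
    return info


if __name__ == "__main__":
    for name, value in describe_locale_encodings().items():
        print("%s: %s" % (name, value))
```

Running this with and without "-X utf8" under different LC_CTYPE settings should show where the two functions diverge.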

Comment 2 Pierre Ossman cendio

2023-10-19 13:48:18 CEST

I checked the code¹, and locale.getpreferredencoding() is now only a wrapper around locale.getencoding() that first checks for UTF-8 mode. Which makes it super confusing what subprocess is doing.

¹ https://github.com/python/cpython/blob/main/Lib/locale.py

Comment 3 Pierre Ossman cendio

2023-10-19 13:50:13 CEST

Python has added a new warning for this transition, but left that warning disabled by default. They indicate that they'll upgrade that warning to a deprecation warning at some point.

One thing to note is that the warning isn't only present in the file and subprocess handling, but also in locale.getpreferredencoding(). So I guess that function is meant to become deprecated as well?
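A rough way to find the affected calls could be to run code with the warning switched on, as in the sketch below. The helper name run_with_encoding_warnings() and the SNIPPET string are made up for this example. The "-X warn_default_encoding" option is the opt-in switch from PEP 597 and only has an effect on Python 3.10 and newer.

```python
import subprocess
import sys


def run_with_encoding_warnings(code):
    """Run `code` in a child interpreter with the opt-in warning enabled.

    Hypothetical helper; returns whatever the child wrote to stderr.
    """
    result = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Decode explicitly so this helper does not rely on the default itself.
    return result.stderr.decode("utf-8", errors="replace")


# A call that relies on the default encoding: open() without encoding=.
SNIPPET = "import os; open(os.devnull).close()"

if __name__ == "__main__":
    print(run_with_encoding_warnings(SNIPPET))
```

On 3.10 and newer the captured stderr should contain an EncodingWarning pointing at the open() call. The same thing can be enabled for a whole process via the PYTHONWARNDEFAULTENCODING environment variable.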

Comment 4 Pierre Ossman cendio

2023-10-19 15:47:55 CEST

I found the PEP for this change:

https://peps.python.org/pep-0686/

They've apparently decided to change the default in Python 3.15, which should be released in 2026.

Unfortunately for us, we won't be able to raise our requirements to Python 3.10 until 2032 when RHEL 9 is EOL.

So the situation is that doing open() on a system with LC_CTYPE=latin1 will give you:

Python < 3.15: latin1
Python >= 3.15: UTF-8

With LC_CTYPE=C it gets more complex because of UTF-8 mode:

Python < 3.7: ASCII
Python >= 3.7, < 3.11: UTF-8
Python >= 3.11, < 3.15: ASCII
Python >= 3.15: UTF-8
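The tables above can be checked on a specific interpreter and locale with a small probe like this sketch. The name default_open_encoding() is invented here. It just opens a file without specifying an encoding and reports what Python picked.

```python
import os


def default_open_encoding():
    """Report the encoding open() picks when none is given.

    Hypothetical probe, for illustration only.
    """
    # Deliberately no encoding= argument, since the default is what we test.
    with open(os.devnull) as stream:
        return stream.encoding


if __name__ == "__main__":
    print(default_open_encoding())
```

Running it as e.g. "LC_CTYPE=C python3 probe.py" with different Python versions should show which row of the table applies (probe.py being whatever file the sketch is saved as).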

Comment 5 Pierre Ossman cendio

2023-10-19 17:27:46 CEST

Note that we mimic Python's subprocess.Popen handling of encodings in our extproc.subprocess_run(). We might want to mimic whatever Python is doing here as well, when it comes to chosen encoding and warnings.
