
I understand the author's reasoning in the context of a transition, but as a "non-Latin" language user, defaulting str to Unicode literals is the best change in Python 3. Coming from C#, I never got used to Python 2's approach. It's a pain in the ass working with non-Latin characters in Py2, starting with something as simple as console output, especially on Windows.

>assuming the world is Unicode is flat out wrong

True, but Py2's approach makes lots of developers assume the world is Latin-1. I see way too many examples of things breaking in a Chinese locale environment, including Python's official IDLE ([1]).

[1] https://bugs.python.org/issue15809 (Summary of this bug: in 2.x IDLE, an explicit unicode literal would still be encoded using the system's ANSI encoding instead of, well, Unicode.)



The most amusing quote in the entire article is this (emphasis mine):

> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

Requiring developers to think about which one it should be is, of course, the whole point of the changes in Python 3 - and it's what produces better apps that are more aware of i18n issues in general and Unicode in particular.

And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? Just use b'' throughout; no need to worry about anything else. Of course, the devil is in the details, which is reflected by the word "practically" in that sentence - it implies that there are places where Unicode strings are used. At which point you do want the developers to think about bytes vs Unicode.

So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default and Unicode had to be requested explicitly. Which, of course, is the right change for the vast majority of code out there that operates at a higher level of abstraction, where "all strings are Unicode by default" is a perfectly reasonable assumption to force.


> And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else.

The article directly answers that question. Many, many things in the standard library now only accept unicode strings, not byte strings. So a wholesale change to b'' everywhere breaks lots of stuff.

> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly.

Once again, the article directly states that the default is not the problem. The lack of escape hatches is. Paths are not unicode strings, and pretending they are does not work. Using bytes when you need bytes works only until you need to call a library function that only accepts strings.
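The escape hatch that eventually appeared is the surrogateescape error handler behind os.fsencode/os.fsdecode, which round-trips arbitrary path bytes through str. A minimal sketch, assuming a UTF-8 locale:

    import os

    raw = b"caf\xe9.txt"           # Latin-1 'é': not valid UTF-8

    # fsdecode smuggles the undecodable byte through as a lone surrogate...
    name = os.fsdecode(raw)        # 'caf\udce9.txt'

    # ...so the value can pass through str-only APIs and still be
    # recovered exactly:
    assert os.fsencode(name) == raw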


Paths ARE Unicode strings on 99% of the computers with humans sitting in front of them. NTFS, HFS+, and APFS all use Unicode, but more importantly, the experience of not using valid Unicode where that's possible is horrible: undeletable files, crashes, etc. I've seen that many times over the years (it was popular with malware authors), but never a case where it was desirable behavior.

The default should always be Unicode with only people writing low-level backup and security tools dealing with bytes.


This just isn't true. On Windows, paths are UCS-2, i.e. arbitrary sequences of 16-bit code units, including unpaired surrogates. This means that there are paths that will work on Windows but cannot be encoded as, e.g., valid UTF-8. As a result, Rust has a bespoke encoding just for representing Windows paths in a way that's compatible with UTF-8 ("WTF-8"). It also means that you can't make a guaranteed lossless conversion from a filesystem path to a Rust string; you have to handle the possibility of errors.
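Python can illustrate the same constraint: a lone surrogate is representable in a str, but the strict UTF-8 codec rejects it, and round-tripping it requires the non-standard surrogatepass handler (conceptually what WTF-8 does for Rust). A small sketch:

    s = "\ud835"                        # unpaired high surrogate

    try:
        s.encode("utf-8")               # strict UTF-8 forbids surrogates
    except UnicodeEncodeError:
        print("not representable as valid UTF-8")

    s.encode("utf-8", "surrogatepass")  # b'\xed\xa0\xb5' (WTF-8-style bytes)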

On Mac paths are some weird NFKD-ish thing, so equality comparisons are complicated.

As a rule, if you think filesystem paths are easy, then you're probably ignoring all the edge cases. In applications where you don't deal with arbitrary user files, that's fine. In a programming language, that's a huge design error.


This all - including complicated equality comparisons - is why paths should have their own dedicated type, and not just be raw strings. Thankfully, Python has had pathlib for a while now.
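A quick illustration of what the dedicated type buys you, using pathlib's pure (no-I/O) path classes:

    from pathlib import PurePosixPath, PureWindowsPath

    # Comparison rules travel with the type, not with the caller:
    PurePosixPath("Foo.txt") == PurePosixPath("foo.txt")      # False: case-sensitive
    PureWindowsPath("Foo.txt") == PureWindowsPath("foo.txt")  # True: case-folded

    # And structure is manipulated as structure, not as substrings:
    p = PurePosixPath("/srv/data") / "logs" / "app.log"
    p.suffix    # '.log'
    p.parent    # PurePosixPath('/srv/data/logs')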


Paths are Unicode strings on Windows. Yes, POSIX adds a lot more spice to the mix, but if the intent is a cross-platform tool, then Unicode is a reasonable lowest-common-denominator assumption for filenames in 2020.


Paths are Unicode strings everywhere but Unix/Linux. And I would even argue that this is a broken aspect of POSIX today. We should make Unicode the baseline for paths in POSIX-compliant systems, but there's probably too much hand-wringing for that to ever happen.


Paths are sequences of 16-bit values on Windows, not necessarily valid UTF-16. It's basically the same as in POSIX, just one byte wider per character.


> if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else.

The author explains later in the article that many system-level Python 3 APIs that are important to a VCS require unicode and won't accept bytes. So apparently it wasn't as easy as just sticking 'b' in front of every literal.


Right. But that's a very different issue, and it's not at all about string literals as such.

Furthermore, the way they solve it - by using their own wrapper helpers that allow bytes - means that the end result should be b'' throughout, no?


>> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated

The author made it clear. The issue wasn't just that the default changed. It was that 3.0 took away the ability to always make your choice explicit.

Changing the default would have no effect on code that was always explicit. Going over the code and making all implicit strings explicit would allow them to know when they had full coverage, and also make the code work with both 2 and 3.

With 3, any implicit string had to get a b added, while any string with a u had to be made implicit (drop the u). You couldn't tell by looking at the code whether it had been converted or not. At least that's how I read it.


The lack of u'' in early versions of Python 3 is a valid complaint, but it's a separate one.

It's also not that big of a deal in practice, because you could always write a helper function like u('foo') that would call unicode() on Python 2, and just pass the value through on Python 3. This only breaks when you need a Unicode literal with actual Unicode characters inside, which is a rare case - and should be especially rare in something like Mercurial.
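A minimal sketch of such a helper (the name u() is just the convention some compatibility shims used):

    import sys

    if sys.version_info[0] >= 3:
        def u(s):
            return s            # Python 3: literals are already Unicode
    else:
        def u(s):
            return unicode(s)   # Python 2: decode (ASCII by default)

    greeting = u("hello")       # unicode on 2, str on 3

As noted above, this only works cleanly while the wrapped literals stay ASCII-only.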


Another reason the complaint doesn't make sense is that the author then praises Rust, which is more similar to Python 3 than to 2.


From other comments, the annoyances for the author were about the standard library using Unicode for system-level APIs; Rust has an OsString type that works with the GIGO model of POSIX.


> but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3

I'm also a "non-latin" user and I will keep repeating this point ad nauseam: there would have been many strictly superior solutions to this problem, and most of them would have been closer to what we had in Python 2 than 3.

Both Rust and Go decided to go with Unicode support that is largely based around UTF-8, with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3, but that path was dismissed without a lot of evidence.

A Unicode model that was a bad idea in 2005 was picked, and we now have it in 2020, where it's a lot worse because, thanks to emojis, we are now well outside the basic plane.


> Both Rust and Go decided to go with Unicode support that is largely based around UTF-8, with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3, but that path was dismissed without a lot of evidence.

Both of those are newer languages that happened to take a stance from day 1. So not quite comparable.

That said, UTF-8 is one of the best pragmatic solutions to this Unicode problem. Most engineers I meet who throw their hands up in the air complaining about Unicode haven't read the simple Wikipedia page for UTF-8.

Python 2 was already halfway there; they just had to tweak a few places where bytes are converted to strings. Of course this is easier for newer languages to solve. We can't blame Python for having to provide backward compatibility.

PS: I also blame all the "encoding detection" libraries, which exist to try to solve an unsolvable problem. Nobody can detect an encoding, at least not reliably. If these half-assed libraries did not exist, people would have finally settled on UTF-8 and given up on the others by now.


> Both of those are newer languages that happened to take a stance from day 1. So not quite comparable.

Python 3 predates Rust and Go, and I can tell you from personal interactions with people how much opposition there was to UTF-8 as either the default or the internal encoding. A lot of the arguments against it were already invalid then, and they definitely are not valid today.

Python 3 launched despite a lot of vocal opposition against it. I think many do not even remember how badly broken the URL, HTTP, and email modules were when they were first ported to Python 3. There was a complete misunderstanding of what platform abstractions should look like.

All of this was known back then.


Is there any hope of "fixing" it now without going through another massive migration struggle (which will simply not happen)?


No one is complaining that Python 2 didn't DTRT when it comes to Unicode.

But when Python 3 made its decision, it was known to be the wrong thing. People who had done Unicode in other languages told them it was the wrong thing. People who had taken the effort to do Unicode right in Python 2 told them it was the wrong thing. The only people telling them they were doing the right thing were Python 2 programmers who thought they were going to get Unicode support for free without thinking about it (or worse, who had done horribly wrong things in Python 2 - the mess PyGTK wrote itself into, for example).

Python 3 has no excuses for what are now often unusable APIs when you truly do need to process binary data. And all we gained is that we don't need to type "u" before some string constants anymore. It wasn't worth it, and it's still not good.


> Both Rust and Go decided to go with Unicode support that is largely based around UTF-8, with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3, but that path was dismissed without a lot of evidence.

What do you mean by "free"? Rust requires you to explicitly convert a string to bytes or vice versa, no? Which is pretty much what you do in Python - the only difference I can see is that you have shortcut methods to encode/decode using UTF-8, but semantically they're no different from encode/decode in Python.
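In Python terms, the Rust shortcuts correspond to spelling out UTF-8 each time, and the validation asymmetry is the same in both languages. A quick sketch:

    s = "naïve"

    b = s.encode("utf-8")            # str -> bytes, like Rust's as_bytes()
    assert b.decode("utf-8") == s    # bytes -> str, like String::from_utf8

    # Only the decode direction can fail, exactly as in Rust:
    try:
        b"\xff\xfe".decode("utf-8")
    except UnicodeDecodeError:
        print("not valid UTF-8")     # Rust's from_utf8 returns an Err here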

I'm pretty dubious about specifying that the internal representation must be UTF-8. That's a failure of abstraction (because the program shouldn't know or care what the internal representation is), leads to inherent performance/interop problems on several compile targets (Windows, the JVM, JavaScript), and seems to imply that Han unification is forced at the language level.


str -> [u8] is free from a performance perspective. It is internally equivalent to a type cast.

[u8] -> str requires a UTF-8 validity check, but is otherwise also internally equivalent to a type cast (i.e., no allocations). I assume this is what Armin meant by "almost" free.

FWIW, I do think that "internally and externally UTF-8" is the best approach to take. If Rust's string type used, say, a sequence of 32-bit codepoints instead, then lots of lower level string handling implementations would be quite a bit slower than their UTF-8 counterparts. (For at least a few reasons that I can think of.) UTF-8 also happens to be quite practical from a performance perspective because it lets you reuse highly optimized routines like memchr in lots of places.

In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.
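The byte-offset failure mode is easy to mimic from Python by slicing the encoded bytes at an arbitrary position. A small sketch:

    s = "a\u00e9z"             # 'aéz': the 'é' occupies two bytes in UTF-8
    b = s.encode("utf-8")      # b'a\xc3\xa9z'

    try:
        b[:2].decode("utf-8")  # byte offset 2 lands in the middle of 'é'
    except UnicodeDecodeError:
        print("split mid-codepoint")   # where Rust panics or returns None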

You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.


With an opaque string type there's nothing stopping a particular Python implementation from using UTF-8 as an internal representation - it would likely perform worse than CPython at iterating over the code units of a string, but that's likely an acceptable cost. Particularly for a language like Python, defining the precise performance characteristics is rarely the priority, especially if it comes at the cost of confusing the semantics.

> In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.

> You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.

I'd argue that offering APIs that can panic is a poor tradeoff in a default/general-use/beginner-facing type. There's maybe a place for a type that implements the same traits as strings while also offering unsafe things like indexing by byte offset (if it's really impossible to achieve what's needed in a safe way, which I'm dubious about), but it's a niche one for specialist use cases (even if it might be the same underlying implementation as the "safe" string type).


I feel like you picked at the least interesting aspects of my comment. It continues to be frustrating to talk to you. :-(

And yes, you can index by byte offset in a zero cost way by converting the string to a byte slice first.

Have you used Rust strings (or any similarly designed string abstraction) in anger before? It might help to get some boots-on-the-ground experience with it.


> Both Rust and Go decided to go with Unicode support that is largely based around UTF-8, with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3, but that path was dismissed without a lot of evidence.

Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?

If so, that violates the "Explicit is better than implicit" part of the Zen of Python. Encoding/Decoding bytes to/from strings shouldn't happen automatically because doing so means you have to make an assumption about the encoding.


> Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?

No, the types are separate and not implicitly converted P2-style; however, "unicode strings" are guaranteed to be proper UTF-8, so encoding to UTF-8 is completely free, and decoding from UTF-8 just requires validating.

Python's maintainers rejected this approach because "it doesn't provide non-amortised O(1) access to codepoints", and while Python 3 broke a lot of things they sadly refused to break this one completely useless thing, only to have to come up with PEP 393 a few years later.
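PEP 393's flexible storage is easy to observe from CPython (a quick sketch; exact byte counts vary a little across versions):

    import sys

    # The whole string is widened to fit its largest code point:
    sys.getsizeof("a" * 1000)           # ~1 KB: 1 byte per code point
    sys.getsizeof("\u0416" * 1000)      # ~2 KB: 2 bytes per code point
    sys.getsizeof("\U0001F600" * 1000)  # ~4 KB: 4 bytes per code point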


Ah, that makes sense. Thank you for the clarification.


To add to your earlier dialog partner, here are the doc pages for the relevant Rust functions/methods, embedded with runnable examples:

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/primitive.str.html#method.as_b...

https://doc.rust-lang.org/std/str/fn.from_utf8.html

Also, as explained in those docs, if and when you are absolutely sure that the Vec or slice of bytes is valid UTF-8, you could use the following "unsafe" methods to not incur the overhead of validation (warnings in the docs):

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.htm...


IMO Python is doing exactly the same thing that Go does (I know too little about Rust to comment); the only difference is that Python respects the LANG variable while Go is just fixed on using UTF-8.


> Python is doing exactly the same thing that Go does

It doesn't. Go's internal string encoding is UTF-8, and it can even be malformed. Go's strings are in fact pretty much what Python 2's byte strings were, except that string operations such as converting to uppercase or iterating over runes understand UTF-8 and Unicode.


Here's your problem: you should not care how Python represents it internally.

> Go's strings are in fact pretty much what Python 2's byte strings were, except that string operations such as converting to uppercase or iterating over runes understand UTF-8 and Unicode.

Why do you care about the internal representation, though? What are you gaining, if both Go's string and Python's str can express all characters? In Go you still need to convert a string into []byte when doing I/O.


Python 2's approach was bad, no argument, but the transition plan for 2-to-3 just didn't work. They thought everyone would run 2to3 in a big bang, and then we'd all switch over to 3 in a few years. Instead it dragged out over a decade, because in reality we needed to write code that was compatible with both 2 and 3 (the "six" approach) until enough things were on 3 to drop 2 support.

Hindsight is 20/20 naturally, but in retrospect, they should have just made `bytes` into the name for old `str` and used `from __future__ import` to create a gradual system for moving from 2 to 3 instead of a big bang "we'll break everything once and then never again".
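One piece of exactly that machinery did ship in Python 2.6+, and it is what made the gradual "six"-style approach workable at all:

    from __future__ import unicode_literals

    s = "hello"     # now a unicode object on Python 2, as on Python 3
    b = b"hello"    # explicit bytes; the b'' prefix parses on 2.6+ and 3.x alike
    print(type(s))  # <type 'unicode'> on 2, <class 'str'> on 3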


I'm not sure they really thought 2to3 would be used for a big bang. I seem to recall the general initial messaging was that Python 3 was a new language and you would need to do a language port to get to it.


> I understand author's reasoning in the context of a transition, but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3

I think this is misreading the author's criticism. The fact that string literals are now Unicode is not the fundamental problem; the fact that standard library APIs that formerly took bytes now incorrectly take Unicode strings is the problem.

IMO it's great that the world is moving towards opaque blobs of Unicode for strings, but that requires understanding when something shouldn't simply be a string in the first place (for reasons of legacy or otherwise).


My comment is about this sentence:

>Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode

>standard library APIs that formerly took bytes now incorrectly take Unicode strings

What do you mean by "incorrectly"?


POSIX APIs take bytes, generally. Python wraps these APIs to take unicode and doesn't allow you to pass bytes, even if you need to. Filenames, for example, are just bytes, and if you force them to always be valid unicode, you make it impossible to interact with files whose names aren't valid unicode. That's just one example.
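Concretely, such filenames are trivial to create on POSIX, and they are exactly what a unicode-only API chokes on. A sketch (the surrogateescape behavior shown is how modern Python 3 copes, assuming a UTF-8 locale):

    import os

    # POSIX permits any bytes except '/' and NUL in a filename:
    with open(b"bad\xffname", "w") as f:
        f.write("hi")

    os.listdir(b".")   # bytes in, bytes out: [..., b'bad\xffname', ...]
    os.listdir(".")    # str in: the \xff byte survives only as a lone
                       # surrogate, 'bad\udcffname', via surrogateescape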



An extremely frustrating part of the Python 3 migration is how many times Python module maintainers have had to hear "oh, now it's safe to migrate." This page currently leads off with a comment saying it's been fine any time since 3.4. You say 3.6. When I was maintaining a popular Python module, I heard the same at 3.1 and 3.2. (I didn't maintain it long after that.)


There are very few places where the bytes/string difference matters for posix paths. Python is far from the only popular tool to assume paths must be valid unicode.


> There are very few places where the bytes/string difference matters for posix paths.

It's nothing to do with "places", points in your program, or entry points into the stdlib. It's entirely about what path names you need to process, and for large classes of software you have zero control over that. If you have a path that doesn't encode properly with your LC_CTYPE, you're in for a bad time with Python 3. (Of course you won't if you control all your own path names, but then you also don't have a problem assuming and enforcing ASCII.)

People were still migrating home systems to Unicode-compatible encodings long after Py3 came out. I still find files in archives with paths in weird (and undeclared/undeclarable) encodings. Lots of people had such files; non-native English speakers were the most likely to have them.

> Python is far from the only popular tool to assume paths must be valid unicode.

It and Java are the only ones I use regularly. Java doesn't have a good reputation for playing well with the outside world, vs. Python which had been sold for years as "better shell scripts."


> There are very few places where the bytes/string difference matters for posix paths.

There’s only every single input from the system at large, no big.


I don't quite agree. There are lots of systems where it's always unicode, a lot of systems where it's always ASCII, and then some systems where stuff is weird (and should be unicode :x)


There was a different API to get this behavior since 3.4: https://www.python.org/dev/peps/pep-0428/#id39


Which means it's been true (and broken) for many many years until maintainers finally succumbed to external pressure and unbroke the API.


Just beware that C# is not exactly "Unicode" either.

A C# char is a UTF-16 code unit, not a Unicode code point.

Most code points "fit" into just one UTF-16 code unit, but not all.

For example: 𝐀 ("Mathematical Bold Capital A", code point U+1D400) is encoded in UTF-16 as a surrogate pair of code units: U+D835 and U+DC00. So reversing "x𝐀y" should produce "y𝐀x" ("y\ud835\udc00x") - note how U+D835 and U+DC00 were not reversed in the result.
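The same pitfall is visible from Python if you look at the raw UTF-16 code units. A sketch:

    s = "x\U0001D400y"                    # "x𝐀y"
    u16 = s.encode("utf-16-le")
    units = [int.from_bytes(u16[i:i+2], "little")
             for i in range(0, len(u16), 2)]
    [hex(u) for u in units]               # ['0x78', '0xd835', '0xdc00', '0x79']

    # Naive char-array reversal flips the surrogate pair into an
    # invalid sequence:
    rev = b"".join(u16[i:i+2] for i in range(len(u16) - 2, -2, -2))
    try:
        rev.decode("utf-16-le")
    except UnicodeDecodeError:
        print("surrogate pair broken")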


C# isn't exactly quiet about this property, and yes, it can be annoying from an API perspective, but in C# this was likely a pragmatic choice to remain compatible (and familiar) with C++, COM, etc. where most developers would be coming from.

API members that operate on code points universally take a string and an index.

That being said, treating strings as arrays of characters is fraught with peril in most cases anyway. You can't trivially reverse strings in any encoding, as you need to reverse the sequence of grapheme clusters (to account for diacritics, etc.). You can't trivially truncate strings either, for pretty much the same reason. You can't trivially grab a single character from the middle of a string, again, for the same reason. So basically, indexing, reversing, truncating, copying a subsequence, etc. are all not trivially possible regardless of the encoding. UTF-16 is not the main problem here, as even in UTF-32 it'd be broken.
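Python's code-point-based str hits the same wall as soon as combining marks are involved:

    s = "e\u0301tude"   # 'étude', with 'é' spelled as e + combining acute

    s[::-1]             # 'edut\u0301e': the accent now decorates the 't'
    s[:1]               # 'e': truncating after one code point drops the accent
    len(s)              # 6 code points, though a reader sees 5 characters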


I think the actual pain in Python 2 came from the misguided decision not to adopt UTF-8 as the default character encoding, combined with silent coercion between unicode/bytes whenever needed. Those two features in combination made Python brittle and dangerous when handling non-ascii characters, not the "strings are bytes" default.

Making strings Unicode by default is wonderful compared to the alternatives (and OP's assertion that this amounts to "assuming the world is Unicode" is disingenuous: there's nothing stopping programs from handling bytes correctly - Python 3 merely resolved the ambiguity).


> I think the actual pain in Python 2 came from the misguided decision not to adopt UTF-8 as the default character encoding

The decision of a default encoding surely dates back to Python 1.0 or earlier, which predates not just UTF-8 but even Unicode itself. Python is an old language!

And if the assertion is that Python 2.0 should have made the tumultuous Unicode jump when it released in 2000, I could get behind that (especially in retrospect!), but enthusiasm for both Unicode and UTF-8 was not nearly as high then as it is today, so I don't begrudge them for not jumping at the opportunity.


Interestingly enough, Ruby's big version jump, 1.8 -> 1.9, had this kind of transition. The remainder of this post is all IIRC; it's been a while...

Ruby 1.8 had "everything as bytes" and there was no concept of encodings.

Ruby 1.9 introduced explicit encodings on every string. By default, strings would have the same encoding as your source file, and the default source encoding was US-ASCII. You could control this explicitly with a magic comment, and so many folks added the "UTF-8" comment to get strings encoded as UTF-8 by default.

Ruby 2.0, which was not as large a transition as 1.8 -> 1.9 even though it sounds like one, made UTF-8 the default encoding for source files, and therefore strings generally became UTF-8 by default as well. Most folks just removed their magic comments.


It's surprising how many people believe that you can use a magic comment to make Python use UTF-8 encoding as the default. All the magic comment affects is the encoding of the source file, not the run-time.
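Under Python 2 the distinction looks like this (the coding comment is real syntax; it affects only how the file is parsed):

    # -*- coding: utf-8 -*-
    # The line above tells the parser how to decode THIS source file...

    s = u"héllo"   # ...so this literal is now parsed correctly,

    s.encode()     # ...but the runtime default codec is still ASCII,
                   # so this line raises UnicodeEncodeError on Python 2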


Enforcing UTF-8 as the default encoding, barring a magic comment otherwise, would hardly have been the biggest compatibility break in the 2.x line. It could have been done in any minor release, IMO.


To be fair, IDLE is pretty garbage in most ways.



