> Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode. [..] However, the approach of assuming the world is Unicode is flat out wrong and has significant implications for systems level applications (like version control tools).
Isn't this more a problem with Python not easily differentiating between String and Byte types? Both Go and Rust ("""systems""" level languages) have decided that "utf-8 ought to be enough for anybody" and that seems to be a good decision.
Yes, but that insistence that Bytes and Unicode are two different things that Shall Not Be Mixed was mostly a Python 3-ism. Python 2 had different types but you could be sloppy and it would kinda work out.
There was this assumption that Unicode code points were the correct single unit to talk about Unicode. You iterate over code points, you talk about string lengths in terms of code points, you slice in terms of code points. Much like the infamy of 16-bit Unicode, this is an assumption that has kinda gotten worse over time. Now we can and do want to talk about bytes, code points, and newer sets like extended grapheme clusters. I think this is probably the big failing of Python 3's Unicode model. Making a string type operate on extended grapheme clusters might fix it, but we'd be in for the same sort of pain, and the flexibility of "everything is bytes, we can iterate over it differently" of Go and Rust is much nicer in comparison.
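To make the "which unit?" problem concrete, here's a minimal sketch measuring the same visual text three ways (the decomposed spelling is an assumption for illustration; both forms render as "naïve"):

```python
precomposed = "na\u00efve"      # 'ï' as a single precomposed code point
decomposed  = "nai\u0308ve"     # 'i' + combining diaeresis: two code points

print(len(precomposed))                  # 5 code points
print(len(decomposed))                   # 6 code points, same rendered text
print(len(precomposed.encode("utf-8")))  # 6 bytes ('ï' is 2 bytes in UTF-8)
print(len(decomposed.encode("utf-8")))   # 7 bytes

# Neither count matches the 5 "perceived characters" (extended grapheme
# clusters); counting those needs a third-party library in Python.
```

None of these numbers is wrong, which is exactly the point: you want to pick bytes, code points, or grapheme clusters per use case, not have one of them baked in as "the" length.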
The second thing was the assumption that everything remotely looking like text was Unicode, even when that isn't true. HTTP has parts that look like plain text, like "GET" and "POST" and headers like "Content-Type: text/html". But the correct way to view these is as ASCII bytes; no other encoding makes sense. Binary data intermixed with "plain text" definitely happens, and the need to pick either Unicode or bytes caused major damage in the standard library which persists to this day -- some parts definitely chose the wrong side. Take a look at the craziness in the "zipfile" module for one example. It's probably fixed now, but back then I basically had to rewrite it from scratch in one of my other projects.
They eventually relented and added back a lot of the conveniences that blur the line between bytes and Unicode, like the % formatting operator for bytes, which I think shows that their insistence on separating the two didn't really pan out in practice. And yet, migration is still a pain.
> Python 2 had different types but you could be sloppy and it would kinda work out.
It would "kinda work out" only if your Unicode strings were ASCII in practice. Whenever a Unicode and a non-Unicode string had to be combined, Python 2 used ASCII as the default encoding to convert between them.
Which is to say, it only worked out for English input, and even then only until you hit a foreign name or something like "naïve". Then you'd suddenly get an exception -- and it happened not at the point where the offending input was generated, but at the point where two strings happened to be combined.
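A Python 3 sketch of what Python 2 did implicitly when mixing the two types (the byte string here is an assumption standing in for data read from a file or socket):

```python
# In Python 2, b"hello " + u"..." implicitly decoded the bytes as ASCII.
# The equivalent explicit operation in Python 3:
name_bytes = "naïve".encode("utf-8")   # b'na\xc3\xafve', e.g. read from disk

try:
    greeting = "hello " + name_bytes.decode("ascii")
except UnicodeDecodeError as e:
    # Blows up at the point of combination, far from where
    # the non-ASCII bytes actually entered the program.
    print("boom:", e.reason)
```

The error surfacing at the concatenation site rather than the input site is what made these bugs so hard to track down.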
This was a horrible state of affairs for basically everybody except the English speakers, because there was a lot of Python code out there that was written against and tested solely on inputs that wouldn't break it like that.
Intermixing binary data with text can be represented just fine in a type system where the two are different. For your HTTP example, the obvious answer is that the values that are fundamentally binary, like the method name or the headers, should be bytes, while the parts that have a known encoding should be str - there's nothing there that requires actually mixing them in a single value. In those very rare cases where you genuinely do have something like Unicode followed by binary followed by Unicode in a single value, that is trivially represented by a (str, bytes, str) tuple.
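As a minimal sketch of that split (the request bytes and the latin-1 choice for header values are assumptions for illustration): the wire format stays bytes, and only fields with a known encoding become str.

```python
raw = b"GET /index.html HTTP/1.1\r\nContent-Type: text/html\r\nHost: example.com\r\n\r\n"

head, _, body = raw.partition(b"\r\n\r\n")
request_line, *header_lines = head.split(b"\r\n")
method, target, version = request_line.split(b" ")

headers = {}
for line in header_lines:
    name, _, value = line.partition(b": ")
    # Header names are ASCII by spec; values are decoded as latin-1 here.
    headers[name.decode("ascii").lower()] = value.decode("latin-1")

print(method)                    # still bytes: b'GET'
print(headers["content-type"])   # decoded text: 'text/html'
```

Nothing here ever needed a value that is half text and half binary.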
The problem with the Python stdlib isn't that bytes and Unicode are distinct. It's that it's overly strict about only accepting Unicode in some places where bytes should be legal, too. This is orthogonal to them being separate types.
It would still be a mess any time you have to deal with byte strings that aren't UTF-8. The problem is with the implicit conversion itself - it shouldn't happen, because there's no way to properly guess the encoding. But there was no way to get rid of it without breaking things.
That change was at the heart of the breaking changes around strings in Python 3. If the conversions remained implicit, most people would probably have never even noticed that string literals default to Unicode, or that some library functions now require Unicode strings.
> There was this assumption that Unicode code points were the correct single unit to talk about Unicode.
The most messed-up thing about Python 3 is that it's supposed to be justified by doing Unicode right and they still got it wrong.
Having strings be sequences of Unicode code points is a super-bizarre design. That is, Python 3 strings indeed are semantically sequences of Unicode code points rather than sequences of Unicode scalar values. You can not only materialize lone surrogates (defensible for compatibility with UTF-16) but you can also materialize surrogate pairs in addition to actual astral characters. You still can't materialize units that are above the Unicode range, though, so it's not like C++'s std::u32string.
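You can see all of this directly in the interpreter; a short sketch of the code-points-not-scalar-values behavior:

```python
# A lone surrogate is a valid one-element str, though no encoding wants it.
lone = "\ud800"
assert len(lone) == 1

try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("surrogates are not encodable scalar values")

# An astral character spelled as an explicit surrogate pair does NOT
# collapse into the single code point U+10000 -- they stay distinct strs.
pair = "\ud800\udc00"
assert len(pair) == 2
assert len("\U00010000") == 1

# But nothing above the Unicode range is representable:
try:
    chr(0x110000)
except ValueError:
    print("chr() stops at U+10FFFF")
```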
Looking at the old PEPs, it appears to have arisen by accident rather than as an actual design.
I'm confused; there isn't an insistence that everything is Unicode. HTTP headers are treated as bytes before you decode them, but you can totally decode an HTTP request or response as ASCII -- at least until you're interacting with a website that has Unicode code points in its URL.
I think the issue is people being used to the Python 2 approach, where the distinction was between str (bytes) and unicode. In Python 3 you should not think of bytes vs. Unicode; you should think of text vs. bytes, and stay in text for as long as possible.
BTW: I believe HTTP headers are supposed to be encoded as ISO-8859-1, which is essentially the same thing as US-ASCII except that it covers the entire byte range.
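That full-byte-range property is what makes latin-1 a lossless escape hatch; a quick sketch:

```python
# ISO-8859-1 (latin-1) maps each of the 256 byte values to the Unicode
# code point with the same number, so any byte string round-trips.
every_byte = bytes(range(256))
text = every_byte.decode("latin-1")
assert text.encode("latin-1") == every_byte

assert text[0x41] == "A"   # the low half overlaps exactly with US-ASCII
```

This is why latin-1 is the conventional choice for "it's byte-ish but I need a str" situations like header values.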
> Yes, but that insistence that Bytes and Unicode are two different things that Shall Not Be Mixed was mostly a Python 3-ism
Go has string and []byte, and you can't mix them; you have to convert. Java has String, char[] and byte[], and similarly you need to convert. Rust has Bytes and String (I don't know Rust well enough, but I'm pretty sure it doesn't do implicit conversion between them).
Also, Python 3 doesn't distinguish between bytes and Unicode; Python 3 distinguishes between bytes and text (str -- BTW, Guido actually expressed regret that he used "str" instead of "text", because it would have been much clearer).
In Python 3 you don't have Unicode (as far as you should be concerned); you have text and bytes. How the text is stored internally is an implementation detail. If you need to write to a file or to the network, you encode the text using some encoding (the most popular is UTF-8), and you decode it back when reading.
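A minimal sketch of that encode-at-the-boundary pattern (the BytesIO buffer is an assumption standing in for a file or socket):

```python
import io

text = "héllo, wörld"
data = text.encode("utf-8")          # text -> bytes at the I/O boundary
assert isinstance(data, bytes)

buf = io.BytesIO()                   # pretend this is a file or socket
buf.write(data)

round_tripped = buf.getvalue().decode("utf-8")   # bytes -> text on read
assert round_tripped == text
```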
Go's string is guaranteed to be a series of bytes, not Unicode code points. I'm unsure about Java. Rust has a more complicated text model that I won't summarize in this post, but it's far better than Python 3's.
> In Python 3 you don't have Unicode (as far as you should be concerned), you have text and bytes
Python 3 strings store Unicode code points. When you iterate over a Python 3 str, you get back Unicode code points. As mentioned elsewhere, these are not necessarily Unicode scalar values, and can include things like unpaired surrogates. They are also not extended grapheme clusters, which are the current best-effort definition of what counts as a "single character".
So, you really do need to be concerned about what your strings contain. If you don't want people to care, don't give them the ability to iterate, slice, or index into str to retrieve Unicode code points, and leave them as opaque blobs, as some of those other languages do.
> Go's string is guaranteed to be a series of bytes, not Unicode code points. I'm unsure about Java. Rust has a more complicated text model that I won't summarize in this post, but it's far better than Python 3's.
Yes, but at this point you're arguing about implementation details. The idea is that if you use it as a string, it is a string; if you need bytes, you perform a conversion. How it is stored internally shouldn't be your concern.
If we're going into Python internals: a string can be stored in one of several representations, from a compact one-byte-per-character form up to full four-byte code points, and if you perform a conversion the result is cached so it can be reused elsewhere. I don't remember the details, since I looked at the code a long time ago, but it isn't that simple.
I don't know how to explain it any simpler. Iterating over a str type in Python 3 enumerates Unicode code points. The length of a str type is the number of code points it contains. Reversing a str will reverse the Unicode code points it contains (not guaranteed to be a sane operation). Indexing into a str with foo[0] gives you back a str containing a single Unicode code point.
This is not an implementation detail, it is fundamental to how the str type in Python 3 operates. I have not talked at any point about the internal storage of this type, just the interface it publicly exposes.
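All four of those behaviors are observable directly, including the case where reversing code points is not sane (the decomposed spelling is an assumption for illustration):

```python
s = "héllo"                  # 'é' as a single precomposed code point
print(list(s)[:3])           # iteration yields code points: ['h', 'é', 'l']
print(len(s))                # 5 -- counts code points
print(s[1])                  # 'é' -- indexing returns a 1-code-point str
print(s[::-1])               # 'olléh' -- reversal happens to work here...

# ...but not in general: reversing code points moves a combining mark
# onto the wrong base letter.
decomposed = "e\u0301x"      # 'e' + combining acute + 'x', renders "éx"
print(decomposed[::-1])      # 'x' + combining acute + 'e', renders "x́e"
```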
This is called a leaky abstraction. I can't see how exposing it that way is good behavior for a high-level language. If you index into a byte-based string, you can always end up with something invalid; at least in Python or Java you get code points.
Python 3 strs should not be iterated over, sure. Ban that in your linter, then you're in the same position you would be in Rust. It's a misfeature but it's still a detail.
Zipfile has always been a mess. I have no idea why, but its interfaces have been consistently poor from a usability perspective. This well before py3 was a factor.
The blog post talks about this a bit in Rust, but we don't actually say that. We do make that the default, but we also give you the ability to get at the underlying things as well. There's a lot of interesting work here, actually, like WTF-8...
In the wild WTF-8 and its 16-bit equivalent show up more often than you'd expect. I ended up discovering a case recently where part of the .NET executable file format is actually encoding strings as WTF-16 (not UTF-16) and any internal lowering needs to map them to WTF-8 instead of UTF-8. Until that point I had expected to only ever encounter WTF-8 in web browsers!
I would say that it is just shitty design to not differentiate between bytestrings and regular strings in a way that causes problems. The biggest design flaw here was not forcing people to understand the difference in Python 2.