> but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3
I'm also a "non-Latin" user and I will keep repeating this point ad nauseam: there would have been many strictly superior solutions to this problem, and most of them would have been closer to what we had in Python 2 than to what we got in 3.
Both Rust and Go decided to go with Unicode support that is largely based around UTF-8, with free transmutations to bytes and almost-free conversions from bytes. This could have been Python 3, but that path was dismissed without much evidence.
A Unicode model that was a bad idea in 2005 was picked, and we now have it in 2020, where it's a lot worse because, thanks to emoji, we are now well outside the Basic Multilingual Plane.
> Both Rust and Go decided to go with Unicode support that is largely based around UTF-8, with free transmutations to bytes and almost-free conversions from bytes. This could have been Python 3, but that path was dismissed without much evidence.
Both of those are newer languages that happened to take a stance from day one. So they're not quite comparable.
That said, UTF-8 is one of the best pragmatic solutions to this Unicode problem. Most engineers I meet who throw their hands up in the air complaining about Unicode haven't even read the simple Wikipedia page for UTF-8.
Python 2 was already halfway there; they just had to tweak the few places where bytes are converted to strings. Of course this is easier for newer languages to solve. We can't blame Python for having to provide backward compatibility.
PS: I also blame all the "encoding detection" libraries, which exist to try to solve an unsolvable problem. Nobody can detect an encoding, at least not reliably. If these half-assed libraries did not exist, people would have finally settled on UTF-8 and given up on the others by now.
> Both of those are newer languages that happened to take a stance from day one. So they're not quite comparable.
Python 3 predates Rust and Go, and I can tell you from personal interactions with people how much opposition there was to UTF-8 as either the default or the internal encoding. A lot of the arguments against it were already invalid then, and they are definitely invalid today.
Python 3 launched despite a lot of vocal opposition. I think many people don't even remember how badly broken the URL, HTTP, and email modules were when they were first ported to Python 3. There was a complete misunderstanding of what platform abstractions should look like.
No one is complaining that Python 2 didn't DTRT when it comes to Unicode.
But when Python 3 made its decision, it was known to be the wrong thing. People who had done Unicode in other languages told them it was the wrong thing. People who had taken the effort to do Unicode right in Python 2 told them it was the wrong thing. The only people telling them they were doing the right thing were Python 2 programmers who thought they were going to get Unicode support for free, without thinking about it (or worse, who had done horribly wrong things in Python 2 - the mess PyGTK wrote itself into, for example).
Python 3 has no excuses for what are now often unusable APIs when you truly do need to process binary data. And all we gained is that we don't need to type "u" before some string constants anymore. It wasn't worth it, and it's still not good.
> Both Rust and Go decided to go with Unicode support that is largely based around UTF-8, with free transmutations to bytes and almost-free conversions from bytes. This could have been Python 3, but that path was dismissed without much evidence.
What do you mean by "free"? Rust requires you to explicitly convert a string to bytes or vice versa, no? Which is pretty much what you do in Python - the only difference I can see is that you have shortcut methods to encode/decode using UTF-8, but semantically they're no different from encode/decode in Python.
I'm pretty dubious about specifying that the internal representation must be UTF-8. That's a failure of abstraction (because the program shouldn't know or care what the internal representation is), leads to inherent performance/interop problems on several compile targets (Windows, the JVM, JavaScript), and seems to imply that Han unification is forced at the language level.
str -> [u8] is free from a performance perspective. It is internally equivalent to a type cast.
[u8] -> str requires a UTF-8 validity check, but is otherwise also internally equivalent to a type cast (i.e., no allocations). I assume this is what Armin meant by "almost" free.
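A minimal sketch of those two conversion costs, using nothing beyond Rust's standard library:

```rust
fn main() {
    let s = "héllo"; // &str is guaranteed to hold valid UTF-8

    // &str -> &[u8]: zero cost, just a byte-slice view of the same memory.
    let bytes: &[u8] = s.as_bytes();
    assert_eq!(bytes.len(), 6); // 'é' encodes as two bytes in UTF-8

    // &[u8] -> &str: runs a UTF-8 validity check, but performs no
    // allocation or copy on success.
    let back: &str = std::str::from_utf8(bytes).unwrap();
    assert_eq!(back, "héllo");

    // Invalid UTF-8 is rejected rather than silently accepted.
    assert!(std::str::from_utf8(&[0xff, 0xfe]).is_err());
}
```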
FWIW, I do think that "internally and externally UTF-8" is the best approach to take. If Rust's string type used, say, a sequence of 32-bit codepoints instead, then lots of lower level string handling implementations would be quite a bit slower than their UTF-8 counterparts. (For at least a few reasons that I can think of.) UTF-8 also happens to be quite practical from a performance perspective because it lets you reuse highly optimized routines like memchr in lots of places.
In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.
You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.
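A small sketch of that intentionally leaky behavior in Rust's std: `str::get` is the non-panicking byte-offset API, and `as_bytes` gives zero-cost byte access.

```rust
fn main() {
    let s = "héllo"; // 'é' occupies bytes 1..3

    // Checked slicing: offset 2 falls inside 'é', so we get None.
    assert_eq!(s.get(0..2), None);

    // On a codepoint boundary it succeeds.
    assert_eq!(s.get(0..3), Some("hé"));

    // Zero-cost access by byte offset via the byte slice.
    assert_eq!(s.as_bytes()[0], b'h');

    // let bad = &s[0..2]; // would panic: byte 2 is not a char boundary
}
```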
With an opaque string type there's nothing stopping a particular Python implementation from using UTF-8 as an internal representation - it would likely perform worse than CPython at iterating over the code units of a string, but that's likely an acceptable cost. Particularly for a language like Python, defining the precise performance characteristics is rarely the priority, especially if it comes at the cost of confusing the semantics.
> In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.
> You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.
I'd argue that offering APIs that can panic is a poor tradeoff in a default/general-use/beginner-facing type. There's maybe a place for a type that implements the same traits as strings while also offering unsafe things like indexing by byte offset (if it's really impossible to achieve what's needed in a safe way, which I'm dubious about), but it's a niche one for specialist use cases (even if it might be the same underlying implementation as the "safe" string type).
I feel like you picked at the least interesting aspects of my comment. It continues to be frustrating to talk to you. :-(
And yes, you can index by byte offset in a zero cost way by converting the string to a byte slice first.
Have you used Rust strings (or any similarly designed string abstraction) in anger before? It might help to get some boots-on-the-ground experience with it.
> Both Rust and Go decided to go with Unicode support that is largely based around UTF-8, with free transmutations to bytes and almost-free conversions from bytes. This could have been Python 3, but that path was dismissed without much evidence.
Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?
If so, that violates the "Explicit is better than implicit" part of the Zen of Python. Encoding/Decoding bytes to/from strings shouldn't happen automatically because doing so means you have to make an assumption about the encoding.
> Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?
No, the types are separate and not implicitly converted P2-style. However, "unicode strings" are guaranteed to be valid UTF-8, so encoding to UTF-8 is completely free, and decoding from UTF-8 just requires validating.
Python's maintainers rejected this approach because "it doesn't provide non-amortised O(1) access to codepoints", and while Python 3 broke a lot of things they sadly refused to break this one completely useless thing, only to have to come up with PEP 393 a few years later.
Also, as explained in those docs, if and when you are absolutely sure that the Vec or slice of bytes is valid UTF-8, you can use the "unsafe" variants, such as str::from_utf8_unchecked, to not incur the overhead of validation (heed the warnings in the docs).
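A sketch of the checked vs. unchecked paths in Rust's std; the unchecked variant skips validation entirely and is undefined behavior if the bytes are not actually valid UTF-8, which is why it requires an `unsafe` block:

```rust
fn main() {
    let bytes: Vec<u8> = "grüße".as_bytes().to_vec();

    // Checked: validates the bytes and returns a Result.
    let s1 = String::from_utf8(bytes.clone()).unwrap();
    assert_eq!(s1, "grüße");

    // Unchecked: no validation at all. Sound here only because we know
    // the bytes came from a valid &str in the first place.
    let s2 = unsafe { String::from_utf8_unchecked(bytes) };
    assert_eq!(s2, "grüße");
}
```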
IMO Python is doing exactly the same thing that Go does (I know too little about Rust to comment); the only difference is that Python respects the LANG variable while Go is just fixed on using UTF-8.
> Python is doing exactly the same thing that Go does
It doesn't. Go's internal string encoding is UTF-8, and it can even hold malformed bytes. Go's strings are in fact pretty much what Python 2's byte strings were, except that string operations such as converting to uppercase or iterating over runes understand UTF-8 and Unicode.
Here's your problem: you should not care how Python represents it internally.
> Go's strings are in fact pretty much what Python 2's byte strings were, except that string operations such as converting to uppercase or iterating over runes understand UTF-8 and Unicode.
Why do you care about the internal representation, though? What are you gaining, if Go's string and Python's str can both express all characters? In Go you still need to convert a string into []byte when doing I/O.