> Both Rust and Go decided to go with Unicode support that is largely based around UTF-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.
What do you mean by "free"? Rust requires you to explicitly convert a string to bytes or vice versa, no? Which is pretty much what you do in Python - the only difference I can see is that you have shortcut methods to encode/decode using UTF-8, but semantically they're no different from encode/decode in Python.
I'm pretty dubious about specifying that the internal representation must be UTF-8. That's a failure of abstraction (because the program shouldn't know or care what the internal representation is), leads to inherent performance/interop problems on several compile targets (Windows, the JVM, JavaScript), and seems to imply that Han unification is forced at the language level.
str -> [u8] is free from a performance perspective. It is internally equivalent to a type cast.
[u8] -> str requires a UTF-8 validity check, but is otherwise also internally equivalent to a type cast (i.e., no allocations). I assume this is what Armin meant by "almost" free.
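To make the "free" vs. "almost free" distinction concrete, here is a minimal sketch using only the standard library. The helper names `as_raw_bytes` and `as_text` are made up for illustration; the pointer comparisons are just one way to see that no copy happens.

```rust
use std::str;

/// Views a `&str` as bytes: no copy and no validation.
/// (Hypothetical helper; it only wraps `str::as_bytes`.)
fn as_raw_bytes(s: &str) -> &[u8] {
    s.as_bytes()
}

/// Views bytes as a `&str`: one UTF-8 validity scan, no allocation.
/// (Hypothetical helper; it only wraps `std::str::from_utf8`.)
fn as_text(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    let s = "héllo";
    let bytes = as_raw_bytes(s);

    // Same address and length: the byte slice is a view of the same memory.
    assert_eq!(bytes.as_ptr(), s.as_ptr());
    assert_eq!(bytes.len(), s.len());

    // Valid UTF-8 converts back and still points at the same buffer.
    let back = as_text(bytes).expect("valid UTF-8");
    assert_eq!(back.as_ptr(), s.as_ptr());

    // Invalid UTF-8 is rejected by the validity check.
    assert!(as_text(&[0xff, 0xfe]).is_none());
}
```

The only work on the bytes-to-string path is the scan inside `from_utf8`; nothing is allocated in either direction.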
FWIW, I do think that "internally and externally UTF-8" is the best approach to take. If Rust's string type used, say, a sequence of 32-bit codepoints instead, then lots of lower level string handling implementations would be quite a bit slower than their UTF-8 counterparts. (For at least a few reasons that I can think of.) UTF-8 also happens to be quite practical from a performance perspective because it lets you reuse highly optimized routines like memchr in lots of places.
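A rough sketch of why byte-oriented routines fit UTF-8 so well: in UTF-8, bytes below 0x80 never appear inside a multi-byte sequence, so a raw byte scan for an ASCII delimiter always lands on a codepoint boundary. The helper `find_ascii` is hypothetical, and the plain iterator scan is only a stand-in for a real `memchr` routine.

```rust
/// Finds the first occurrence of an ASCII byte by scanning raw bytes.
/// Hypothetical helper: the plain scan stands in for an optimized
/// `memchr`-style routine.
fn find_ascii(haystack: &str, needle: u8) -> Option<usize> {
    assert!(needle.is_ascii(), "only ASCII needles are safe to scan for this way");
    haystack.as_bytes().iter().position(|&b| b == needle)
}

fn main() {
    let s = "naïve café, ok";
    let i = find_ascii(s, b',').expect("comma present");

    // The byte offset is a valid codepoint boundary, so slicing is safe.
    assert!(s.is_char_boundary(i));
    assert_eq!(&s[..i], "naïve café");

    assert_eq!(find_ascii(s, b'!'), None);
}
```

With a 32-bit codepoint representation the same search would have to compare four-byte units instead of reusing a byte scanner directly.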
In any case, the abstraction is not lost completely. Rust's string types provide higher level methods that don't require knowing the encoding used by strings, e.g., you can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky, though: you are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position, because that doesn't make a lot of semantic sense for a Unicode string.
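For anyone who hasn't used these APIs, a small sketch of both sides of that leaky abstraction, standard library only. `checked_slice` is a hypothetical wrapper around `str::get`, which is the `None`-returning variant of slicing.

```rust
/// Checked slicing by byte offsets: `None` if an offset is out of range
/// or falls inside a multi-byte codepoint.
/// (Hypothetical helper; it only wraps `str::get`.)
fn checked_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let s = "aé€"; // encoded as 1 + 2 + 3 bytes

    // Encoding-agnostic operations.
    assert_eq!(s.chars().count(), 3);
    assert!(s.contains('é'));

    // Operations that expose byte offsets.
    assert_eq!(s.len(), 6);
    assert_eq!(s.find('€'), Some(3));

    // Offsets on codepoint boundaries work.
    assert_eq!(checked_slice(s, 0, 3), Some("aé"));
    // Offset 2 falls in the middle of 'é', so the checked API returns None.
    assert_eq!(checked_slice(s, 0, 2), None);

    // `&s[0..2]` would panic at run time for the same reason, and
    // `s[0]` does not compile at all: `str` has no integer indexing.
}
```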
You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Hiding the internal representation, with no zero-cost way to get at it, would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.
With an opaque string type there's nothing stopping a particular Python implementation from using UTF-8 as an internal representation - it would likely perform worse than CPython at iterating over the code units of a string, but that's likely an acceptable cost. Particularly for a language like Python, defining the precise performance characteristics is rarely the priority, especially if it comes at the cost of confusing the semantics.
> In any case, the abstraction is not lost completely. Rust's string types provide higher level methods that don't require knowing the encoding used by strings, e.g., you can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky, though: you are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position, because that doesn't make a lot of semantic sense for a Unicode string.
> You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Hiding the internal representation, with no zero-cost way to get at it, would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.
I'd argue that offering APIs that can panic is a poor tradeoff in a default/general-use/beginner-facing type. There's maybe a place for a type that implements the same traits as strings while also offering unsafe things like indexing by byte offset (if it's really impossible to achieve what's needed in a safe way, which I'm dubious about), but it's a niche one for specialist use cases (even if it might be the same underlying implementation as the "safe" string type).
I feel like you picked at the least interesting aspects of my comment. It continues to be frustrating to talk to you. :-(
And yes, you can index by byte offset in a zero cost way by converting the string to a byte slice first.
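A minimal sketch of what that looks like, assuming nothing beyond the standard library. `byte_at` is a hypothetical helper; direct indexing with `s.as_bytes()[i]` works too and panics only if `i` is out of range, never because of codepoint boundaries.

```rust
/// Returns the raw byte at `offset`, or `None` if it is out of range.
/// (Hypothetical helper; it only wraps `as_bytes` plus slice `get`.)
fn byte_at(s: &str, offset: usize) -> Option<u8> {
    s.as_bytes().get(offset).copied()
}

fn main() {
    let s = "aé";

    // Direct indexing on the byte view: ordinary bounds-checked slice access.
    assert_eq!(s.as_bytes()[0], b'a');

    // 'é' is U+00E9, encoded in UTF-8 as the two bytes 0xC3 0xA9.
    assert_eq!(byte_at(s, 1), Some(0xC3));
    assert_eq!(byte_at(s, 2), Some(0xA9));

    // Past the end of the byte view.
    assert_eq!(byte_at(s, 3), None);
}
```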
Have you used Rust strings (or any similarly designed string abstraction) in anger before? It might help to get some boots-on-the-ground experience with it.