
> Go's string is guaranteed to be a series of bytes, not Unicode code points. I'm unsure about Java. Rust has a more complicated text model that I won't summarize in this post, but it's far better than Python 3's.

Yes, but at this point you're arguing about implementation details. The idea is that if you use it as a string, it is a string; if you need bytes, you need to perform a conversion. How it is stored internally shouldn't be your concern.

If we are going into Python internals, the string can be stored in multiple representations, from a basic C string to Unicode code points. If you perform a conversion, it will cache the result so it can be reused in other places. I don't remember the details, since I looked at the code a long time ago, but it isn't that simple.
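
A rough way to see the variable-width storage from the outside, assuming CPython 3.3 or later (other interpreters may behave differently). The helper name bytes_per_code_point is made up for this sketch, and sys.getsizeof reports interpreter-specific numbers, so treat the result as an observation and not a guarantee:

```python
import sys

def bytes_per_code_point(ch: str, n: int = 1000) -> float:
    """Estimate storage per code point by growing a string and measuring the size delta.

    Hypothetical helper for illustration; relies on CPython's sys.getsizeof.
    """
    small = ch * n
    large = ch * (2 * n)
    # The fixed object header cancels out in the subtraction.
    return (sys.getsizeof(large) - sys.getsizeof(small)) / n

ascii_width = bytes_per_code_point("a")            # ASCII-only string
bmp_width = bytes_per_code_point("\u20ac")         # euro sign, inside the Basic Multilingual Plane
astral_width = bytes_per_code_point("\U0001F600")  # emoji, outside the BMP

print(ascii_width, bmp_width, astral_width)
```

On CPython the three widths should come out increasing, which matches the "multiple representations" point. None of this changes what len(), indexing, or iteration return.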



I don't know how to explain it any simpler. Iterating over a str type in Python 3 enumerates Unicode code points. The length of a str type is the number of code points it contains. Reversing a str will reverse the Unicode code points it contains (not guaranteed to be a sane operation). Indexing into a str with foo[0] gives you back a str containing a single Unicode code point.

This is not an implementation detail, it is fundamental to how the str type in Python 3 operates. I have not talked at any point about the internal storage of this type, just the interface it publicly exposes.
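
A small sketch of the interface behavior described above. The sample string is my own choice: "cafe" followed by U+0301 COMBINING ACUTE ACCENT, which renders as four visible characters but contains five code points.

```python
# "é" spelled as "e" + COMBINING ACUTE ACCENT: one visible character, two code points.
s = "cafe\u0301"

code_points = list(s)        # iteration yields one str per code point
length = len(s)              # counts code points, not visible characters or bytes
first = s[0]                 # indexing returns a str holding a single code point
reversed_s = s[::-1]         # reverses code points; the accent is detached from its "e"
utf8 = s.encode("utf-8")     # getting bytes requires an explicit conversion

print(code_points)           # ['c', 'a', 'f', 'e', '\u0301']
print(length, len(utf8))     # 5 6
print(repr(first), type(first).__name__)
print(reversed_s == "\u0301efac")  # True
```

The reversed string starts with a combining mark that has nothing to attach to, which is why reversal is "not guaranteed to be a sane operation".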


This is called a leaky abstraction. I can't see a good reason for a high-level language to behave this way. If you index into a string you will always get something that's invalid; at least in Python or Java you get code points.


Python 3 strs should not be iterated over, sure. Ban that in your linter, then you're in the same position you would be in Rust. It's a misfeature but it's still a detail.



