UTF-8 is one of the most brilliant things I've ever seen. I only wish it had been invented and caught on before so many influential bodies started using UCS-2 instead.
Like anything new, people had a hard time with it at the beginning.
I remember that I got a home assignment in an interview for a PHP job. The person evaluating my code said I should not have used UTF8, which causes "compatibility problems". At the time, I didn't know better, and I answered that no, it was explicitly created to solve compatibility problems, and that they just didn't understand how to deal with encoding properly.
Needless to say, I didn't get the job :)
Same with Python 2 code. So many people, when migrating to Python 3, suddenly thought Python 3 encoding management was broken, since it was raising so many UnicodeDecodeErrors.
Only much later did people realize the huge number of programs that couldn't deal with non-ASCII characters in file paths, HTML attributes or user names, because they just implicitly assumed ASCII. "My code used to work fine", they said. But it worked fine on their machine, set to an English locale, tested only using ASCII plain text files in their ASCII-named directories with their ASCII last name.
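To make the "implicit ASCII" failure concrete, here is a minimal Python 3 sketch. The function names and the sample user name are made up for illustration; the only point is that the ASCII-only path passes the developer's own test data and then fails on a real-world name.

```python
# Hypothetical user name with non-ASCII characters ("Zoë Müller"),
# arriving as bytes the way it would from a file, socket or HTTP request.
raw_name = "Zo\u00eb M\u00fcller".encode("utf-8")


def greet_assuming_ascii(raw: bytes) -> str:
    # Mirrors code that silently assumes every input is ASCII.
    return "Hello, " + raw.decode("ascii")


def greet_utf8(raw: bytes) -> str:
    # States the encoding explicitly at the input boundary.
    return "Hello, " + raw.decode("utf-8")


# Works fine with the developer's own ASCII test data...
ascii_ok = greet_assuming_ascii(b"Bob")

# ...and raises as soon as a non-ASCII name shows up.
try:
    greet_assuming_ascii(raw_name)
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

utf8_greeting = greet_utf8(raw_name)
```

The crash is the useful part: the assumption was always there, Python 3 just refuses to hide it.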
That's in general a problem with dynamic languages with weak type systems. How "Your code runs without crashing" is really really != "your code works". How do people even manage production python! A bug could be lurking anywhere, undetected until it's actually run. Whereas in a compiled language with a strong type system, "your code compiles" is much closer to "your code is correct".
A type system can refuse to turn a `bytes` into a `utf8str` until it's been appropriately parsed.
(It doesn't even need to be a very good or strongly-enforced type system - Go makes it dangerously easy to convert between `[]byte` and `string` by other-type-system standards, and yet everything works pretty well. It's enough to hitch your thinking and make you realize you need another step.)
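A rough Python 3 illustration of that "extra step" (Python only checks this at run time, so it is a weaker version of what a static type system gives you; a static checker such as mypy should flag the same line before the program runs). The variable names are hypothetical.

```python
payload = b"caf\xc3\xa9"   # raw bytes, e.g. as read from a socket or file

# Mixing bytes and text is rejected instead of being silently coerced.
try:
    message = "order: " + payload
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The explicit parsing step: bytes -> str, with a named encoding.
text = payload.decode("utf-8")
message = "order: " + text
```

Five bytes go in and four code points come out, which is exactly the distinction the two types exist to preserve.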
But this is not a matter of UTF-8 in the code, rather a matter of UTF-8 in the input or output. How does compiling a program ensure that it is robust on a range of inputs?
> How does compiling a program ensure that it is robust on a range of inputs?
This is quite literally the job of a type system: to impose a semantic interpretation on sequences of "raw bits" and let you specify legal (and only legal) operations in terms of the semantic interpretation rather than the bits.
There are a number of mitigations, so those kinds of bugs are quite rare. In our large code base, about 98% of bugs we find are of the "we need to handle another case" variety. Pyflakes quickly finds typos, which eliminates most of the rest.
This is the difference between people who embrace static typing and everyone else. A static type lover hears that 98% of your bugs are of the "we need to handle another case" variety and says, "well, that means you could have gotten rid of 98% of your bugs with better typing".
No, what I mean is that an additional key comes in (with the json or similar hash) and we now need to do some thing with it, or something different than we thought we were supposed to with it. Typing is not going to fix it because the full cases were unknown at development time.
How is it anything but the truth? The express purpose of static analysis, like a type system, is to catch bugs before running your code. That pretty clearly means that code that successfully compiles is closer to being correct than code that doesn't.
The parser assures your code is grammatically correct; the type system assures your code is semantically consistent, which is usually a much stronger guarantee, and by most practical measures will be closer - often much closer, and for total functions on total types, sometimes all the way - to "logically correct".
Python 3 encoding management was broken, because it tried to impose Unicode semantics on things that were actually byte streams. For anyone actually correctly handling encodings in Python 2 it was awful because suddenly the language runtime was hiding half the data you needed.
Nowadays, passing bytes to any os function returns bytes objects, not unicode. You'll get str back if you pass str objects though, with undecodable bytes preserved through the surrogateescape error handler.
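A small sketch of that behaviour, assuming Python 3 on a system where the current directory is readable. The file name is invented; the surrogateescape handler is applied explicitly here so the example does not depend on the platform's filesystem encoding.

```python
import os

# bytes in, bytes out; str in, str out.
names_as_bytes = os.listdir(b".")
names_as_str = os.listdir(".")

# Bytes that are not valid UTF-8, as can appear in a Linux file name.
raw = b"report-\xff.txt"

# Strict decoding refuses them.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate
# U+DCNN, and maps it back when encoding, so the round trip is lossless.
name = raw.decode("utf-8", errors="surrogateescape")
round_trip = name.encode("utf-8", errors="surrogateescape")
```

The resulting str is not well-formed Unicode (it contains a lone surrogate), which is the price of being able to round-trip arbitrary file names.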
And nowadays, lots of people left the Python ecosystem completely because the 3 upgrade was broken and raising so many UnicodeDecodeErrors for so long. I'm glad it's fixed but it cost too much.
(And it had nothing to do with UTF-8. Actually, it's at least partially caused by the CPython developers avoiding UTF-8 for poor reasons.)
I'd argue that the people correctly handling encodings in Python 2 were vastly outnumbered by the people that weren't, but were getting away with it because the code didn't outright crash. Now in Python 3 it crashes, which is a pain in the short term but better in the long run.
I don't really think it's better - certainly it was not better until many years after release when they started admitting their big mistakes.
But even today, codepoint indexing is too easy and doesn't crash so lots of code is still subtly wrong. Memory usage of an ASCII string in CPython grows roughly 4x if you add a single emoji, because the whole string switches from 1 to 4 bytes per code point. Most libraries are still agnostic about whether they take bytes or str so you still get exceptions thrown with no easy solution. (The growing popularity of type hints is fixing this, but that's not really related to Python 3.)
My Slack name at work is "Τĥιs ñåmè įß ą váĺîδ POSIX paτĥ". My hope is that it serves as an amusing reminder to consider things like spaces and non-ASCII characters.
One of my friends is these days a colleague, with an utterly ordinary English name but his identity management data is full of spurious accents to check APIs do the Right Thing™.
I was delighted recently to stumble on a history of modern women course HIST1158 named Liberté Egalité Beyoncé and I immediately thought two things: 1. Why are our Computer Science courses given unimaginative names? and 2. What a useful test input, I bet some of our systems don't work correctly for this input even though an acute accent is hardly a bleeding edge feature.
I haven't been able to interest any Computer Science professors in fun names for their courses, but I was able in my test environment to name a COMP series course "Untitled Course Name" with a description explaining that "It is a lovely day in the village and there are only two hard problems in Computer Science".
Absolutely. At least it’s well supported now in very old languages (like C) and very new languages (like Rust). But Java, Javascript, C# and others will probably be stuck using UCS-2 forever.
There's actually a proposal with a decent amount of support to add utf-8 strings to C#. Probably won't be added to the language for another 3 or 4 years (if ever) but it's not outside the realm of possibility.
>What is stopping [...] Java, JS, and C# files in UTF-8?
The output of files on disk can be UTF-8. The continued use of UCS-2 (later revised to UTF-16) is happening in the runtime because things like the Win32 API, which C# uses, are UCS-2. The internal raw memory layout of strings in Win32 is UCS-2.
Code page 65001 has existed for a long time now, but it was discouraged because there were a lot of corner cases that didn't work. Did they finally get all the kinks out of it?
When Windows adopted Unicode, I think the only encoding available was UCS-2. They converted pretty quickly to UTF-16 though, and I think the same is true of everybody else who started with UCS-2. Unfortunately UTF-16 has its own set of hassles.
Yeah, there are sometimes a lot more hacks like WTF-8 and WTF-16 in practice on originally-UCS-2 systems (including Windows and JS) than is healthy: https://simonsapin.github.io/wtf-8/
Nothing at all, and in fact there's a site set up specifically to advocate for this: https://utf8everywhere.org/
The biggest problem is when you're working in an ecosystem that uses a different encoding and you're forced to convert back and forth constantly.
I like the way Python 3 does it - every string is Unicode, and you don't know or care what encoding it is using internally in memory. It's only when you read or write to a file that you need to care about encoding, and the default has slowly been converging on UTF-8.
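A minimal sketch of what "only care at the file boundary" looks like in Python 3. The sample text and temporary file are just for illustration; the encoding is named explicitly rather than relying on the platform default.

```python
import os
import tempfile

text = "na\u00efve \u2603"   # "naïve ☃": in memory this is just code points

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Encoding is chosen here, at the boundary, and nowhere else.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "rb") as f:
        on_disk = f.read()            # the bytes actually stored
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()
finally:
    os.remove(path)
```

Seven code points become ten bytes on disk and come back as the same seven code points.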
The problem with "every string is Unicode" arises when you want to represent things that look like Unicode but aren't really guaranteed to be Unicode. This includes filenames on Windows (WTF-16, aka arbitrary WCHAR sequences) and Linux (arbitrary byte sequences). These are interpreted as UTF-16 / UTF-8 for display purposes, but limiting yourself to valid UTF-16 / UTF-8 means that you cannot represent all paths that you might come across.
Yep. In javascript (and Java and C# from memory) the String.length property is based on the encoding length in UTF16. It’s essentially useless. I don’t know if I’ve ever seen a valid use for the javascript String.length field in a program which handles Unicode correctly.
There’s 3 valid (and useful) ways to measure a string depending on context:
- Number of Unicode characters (useful in collaborative editing)
- Byte length when encoded (these days usually in utf8)
- and the number of rendered grapheme clusters
All of these measures are identical in ASCII text - which is an endless source of bugs.
Sadly these languages give you a deceptively useless .length property and make you go fishing when you want to make your code correct.
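For a feel of how far these measures drift apart, here is a short Python sketch using a made-up three-code-point string. The UTF-16 count is included because that is what JavaScript's String.length reports. Grapheme clusters are not computed: that needs the Unicode text segmentation rules (UAX #29), which Python's standard library does not implement.

```python
# "e" + COMBINING ACUTE ACCENT, followed by a grinning-face emoji.
s = "e\u0301\U0001F600"

code_points = len(s)                            # what Python's len() counts
utf8_bytes = len(s.encode("utf-8"))             # size when encoded as UTF-8
utf16_units = len(s.encode("utf-16-le")) // 2   # what JS String.length counts

# A reader sees two "characters" here (an accented e and an emoji);
# that is the grapheme cluster count, which none of the above gives you.

# With plain ASCII all the measures agree, which is why the bugs hide.
ascii_text = "hello"
ascii_measures = (
    len(ascii_text),
    len(ascii_text.encode("utf-8")),
    len(ascii_text.encode("utf-16-le")) // 2,
)
```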
This is also rarely useful unless you are working with a monospace font where all grapheme clusters have the same width, which is probably none if you support double-width characters. More likely what you are interested in is the display length with a particular font or column count with a monospace font.
Java's char is a strong competitor for most stupid "char" type award.
I would give it to Java outright if not for the fact that C's char type doesn't define how big it is at all, nor whether it is signed. In practice it's probably a byte, but you aren't actually promised that, and even if it is a byte you aren't promised whether this byte is treated as signed or unsigned, that's implementation dependent. Completely useless.
For years I thought char was just pointless, and even today I would still say that a high level language like Java (or Javascript) should not offer a "char" type because the problems you're solving with these languages are so unlikely to make effective use of such a type as to make it far from essential. Just have a string type, and provide methods acting on strings, forget "char". But Rust did show me that a strongly typed systems language might actually have some use for a distinct type here (Rust's char really does only hold the 21-bit Unicode Scalar Values, you can't put arbitrary 32-bit values in it, nor UTF-16's surrogate code points) so I'll give it that.
And POSIX does guarantee that CHAR_BIT == 8, so in practice this is only a concern on embedded platforms where you are only dealing with "C-ish" anyway.
How many non-embedded non-POSIX systems do you know? Windows also guarantees CHAR_BIT == 8 and since most software is first written for Windows or POSIX there is plenty of software that assumes that CHAR_BIT == 8. That means that anything that will want to run general software needs to also ensure CHAR_BIT == 8 - not to mention all the algorithms and data formats designed around you being able to efficiently access octets. The only platforms that can get away with CHAR_BIT != 8 are precisely those that have software specially written for them, i.e. embedded systems.
You’re being far too harsh. The Java char type isn’t “stupid”; really, it’s just unfortunate in hindsight. There are plenty of decisions that were stupid even at the time they were decided, and this isn’t that: people actually thought that 2 bytes was enough for all characters, and Han unification was going to work. Looking backward this is “obviously” futile but certainly not then.
C’s character type, FWIW, has a use: it more or less indicates the granularity that is efficiently addressable by the host architecture. Trying to use it for more than that is generally not that fruitful, but it definitely has a purpose and it’s pretty good at that.
Finally, speaking of unfortunate decisions, Rust happens to make one that I don’t particularly like: it lets you misalign characters (and panics), which is…not great. It would be much nicer if the view just didn’t let you do this unless you specifically asked for bytes or something.
When Java was first conceived UTF-16 didn't exist, but we shouldn't rewrite history entirely here, Java 1.0 and Unicode 2.0 (with UTF-16) are from the same year. It would have been wiser (albeit drastic) to pull char in Java 1.0, reserve the word char and the literal syntax and spend a year or two deciding what you actually wanted here in light of the fact Unicode is not going to be a 16-bit encoding.
And again, I don't think Java probably needed 'char' at all, it's the sort of low-level implementation detail Java has been trying to escape from so this is a needless self-inflicted wound. I think there's a char in Java for the reason it has both increment operators - C does it and Java wants to look like C so as not to scare the C programmers.
C's unsigned char could just be named "byte" and if signed char must exist, call that "signed byte". The old C standard actually pretends these are characters, which of course today they clearly aren't which is why this is a thread about UTF-8. I don't have any objection to a byte type especially in a low-level language.
Presumably your Rust annoyance is related to things like String::insert? But I don't understand how this problem arises, if you are inserting characters at random positions in a String, that's just going to be nonsense. I can't conceive of a situation where I want to insert characters (or sub-strings) unless I know where they're supposed to go exactly relative to what is in the string already, whereupon it won't panic.
I don’t get your argument at all. People want characters from their strings, and Java decided on UTF-16 because at the time it seemed like the “right” way to do Unicode. What would you suggest they have adopted back then? Similarly C’s char type is named “char” because people dealt with ASCII back then and characters used to be a byte. It turns out that sucks but being able to do byte arithmetic is cool so it’s still around for that purpose (and C++ actually has added std::byte for exactly this; perhaps C will get it as well at some point). For Rust, this is just a thing about holding it wrong: the operation is generally not relevant, so why even expose it? It doesn’t make sense to allow for random indexing if you’re just going to crash on misalignment. It would be better to just have an API that doesn’t allow misalignment at all: see Swift’s implementation for example.
People should stop wanting "characters from their strings" especially in the sort of high level software you'd attempt in Java - and Java was in a good position to do that the way we've successfully done it for similar things, by not providing the misleading API shape. Reserve char but don't implement it is what I'm saying, like goto.
Compare for example decryption, where we learned not to provide decrypt(someBytes) and checkIntegrity(someBytes) even though that's what People often want, it's a bad idea. Instead we provide decrypt(wholeBlock) and you can't call it until you've got a whole block we can do integrity checks on, it fails without releasing bogus plaintext if the block was tampered with. An entire class of stupid bugs becomes impossible.
Java should have provided APIs that work on Strings, and said if you think you care about the things Strings are made up of, either you need a suitable third party API (e.g. text rendering, spelling) or you want bytes because that's how Strings are encoded for transmission over the network or storage on disk. You don't want to treat the string as a series of "characters" because they aren't.
The idea that a String is just a vector of characters is wrong, that's not what it is at all. A very low level language like C, C++ or Rust can be excused for exposing something like that, because it's necessary to the low-level machinery, but almost nobody should be programming that layer.
Imagine if Java insisted on acting as though your Java references were numbers and that it could make sense to add them together. Sure in fact they are pointers, and the pointer is an integral type and so you could mechanically add them together, but that's nonsense, you would never write code that needs to do this in Java.
K&R C claimed that char isn't just for representing "ASCII" (which wasn't at that time set in stone as the encoding you'll be using) but for representing the characters on the system you're programming regardless of whether they're ASCII. 'A' wasn't defined as 65 but as whatever the code happens to be for A on your computer. Presumably the current ISO C doesn't make the same foolish claim.
I think you're being too harsh on the C char. It is guaranteed sizeof(char) == 1, and it is guaranteed to be at least 8 bits long, i.e. long enough for any ascii character.
These requirements are perfectly good for the needs of a CHARacter type. If you need to control signed / unsigned because you want to use the char as a small integer, you can specify yourself whether it is signed or not.
In reality, where chars are used to store ASCII, the signedness of the datatype is meaningless because the highest bit is never set.
The really tragic thing is that UTF-8 was invented before UTF-16. But a few big companies had put a couple of years of heavy investment into UCS-2, and weren’t willing to let that and the attendant pain they were beginning to foist on developers and users alike go to waste, and so ruined Unicode with UTF-16 and the disaster called surrogates that is the cause of the significant majority of programming languages handling strings incorrectly (e.g. JavaScript uses potentially ill-formed Unicode, indexed by UTF-16 code unit; and Python strings are sequences of code points rather than scalar values). If only they had reversed course and said “sorry for all the pain we were just starting to put you through, that fixed-width 16-bit encoding thing didn’t pan out, we’re going back to 8-bit encodings with this UTF-8 thing that is conveniently also backwards-compatible with ASCII”. If only.
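The "code points rather than scalar values" point about Python can be shown in a few lines. This is just an illustrative sketch: a lone surrogate is accepted as a string element, but it has no UTF-8 encoding, so the problem only surfaces at the output boundary.

```python
lone = "\ud800"   # a lone high surrogate: a code point, not a scalar value

# Python accepts it as a one-element string...
length = len(lone)

# ...but it cannot be encoded as UTF-8.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# A well-formed supplementary character is one code point in Python but
# two UTF-16 code units (a surrogate pair), which is the unit JavaScript
# strings are indexed by.
emoji = "\U0001F600"
utf16 = emoji.encode("utf-16-be")
```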
Constant time subscripting is a myth. There's nothing(*) useful to be obtained by adding a fixed offset to the base of your string, in any unicode encoding, including UTF-32.
If you're hoping that a fixed offset gives you a user-perceived character boundary, then you're not handling composed characters or zero-width-joiners or any number of other things that may cause a grapheme cluster to be composed of multiple Unicode code points.
The "fixed"-size units in encodings like UTF-32 are just that: code points. Determining whether a code point corresponds with anything useful, like the boundary of a visible character, will always require a linear-time scan of the string, in any encoding.
(*) Approximately nothing. If you're in a position where you've somehow already vetted that the text is of a subset of human languages where you're guaranteed to never have grapheme clusters that occupy more than a single code point, then you maybe have a use case for this, but I'd argue you really just have a bunch of bugs waiting to happen.
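A quick Python sketch of the problem, since Python strings already index by code point in constant time. The two sample strings are my own choice: a decomposed "é" and a family emoji joined with zero-width joiners.

```python
import unicodedata

accented = "e\u0301"   # "e" + COMBINING ACUTE ACCENT, rendered as one "é"
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"   # man, ZWJ, woman, ZWJ, girl

# O(1) subscripting works, but it splits the accent off its base letter.
first = accented[0]

# Typically rendered as one glyph, yet it is five code points long.
family_len = len(family)

# Normalization helps for the accent (NFC has a precomposed U+00E9)...
composed = unicodedata.normalize("NFC", accented)

# ...but there is no precomposed form for the emoji sequence.
family_nfc_len = len(unicodedata.normalize("NFC", family))
```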
Getting tired of people calling things "useless". Clearly I have a usecase for fixed width text encodings.
Source code manipulation is frequently Unicode aware but doesn't care about combinations or things outside of a strict subset of Unicode to modify lexing control flow.
Being able to store (and later refer to) character offsets in the source code is a plus because they'll only ever occur in places where the strict subset is enforced.
This is especially true of languages with line-only comments, etc, where different writing systems being used won't affect the error message information.
Like I said, there are a few useful cases where having a fixed width encoding is beneficial. It's less helpful to the discussion to assert you know better for every case, ever.
> Constant time subscripting is a myth. There's nothing(*) useful to be obtained by adding a fixed offset to the base of your string, in any unicode encoding, including UTF-32.
What about UTF-256? Maybe not today, maybe not tomorrow, but someday...
I know you're kidding, but I want to note that UTF-256 isn't enough. There's an Arabic ligature that decomposes into 20 codepoints. That was already in Unicode 20 years ago. You can probably do something even crazier with the family emoji. These make "single characters" that do not have precomposed forms.
Also, if you want O(1) indexing by grapheme cluster you can get that with less memory overhead by precomputing a lookup table of the location in the string where you can find every k-th grapheme cluster, for some constant k >= 1. (This requires a single O(n) pass through the string to build the index, but you were always going to have to make at least one such pass through the string for other reasons.)
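Here is a rough sketch of that lookup-table idea in Python. One substitution to be clear about: it indexes code points in a UTF-8 buffer rather than grapheme clusters, because the standard library has no grapheme segmentation; the table structure would be the same with a real segmenter. The class name is hypothetical, and the input is assumed to be valid UTF-8.

```python
class CheckpointIndex:
    """Byte offsets of every k-th code point in a valid UTF-8 buffer."""

    def __init__(self, data: bytes, k: int = 4) -> None:
        self.data = data
        self.k = k
        self.offsets = []   # offsets[j] = byte offset of code point j * k
        count = 0
        # Single O(n) pass: a code point starts at every byte that is
        # not a continuation byte (continuation bytes look like 10xxxxxx).
        for pos, byte in enumerate(data):
            if byte & 0xC0 != 0x80:
                if count % k == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count

    def _skip_continuation(self, pos: int) -> int:
        while pos < len(self.data) and self.data[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    def byte_offset(self, i: int) -> int:
        """One table lookup, then at most k - 1 forward steps."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        pos = self.offsets[i // self.k]
        for _ in range(i % self.k):
            pos = self._skip_continuation(pos + 1)
        return pos

    def code_point_at(self, i: int) -> str:
        start = self.byte_offset(i)
        end = self._skip_continuation(start + 1)
        return self.data[start:end].decode("utf-8")


text = "a\u00e9\u2603\U0001F600bc\u00e9d\u2603"
index = CheckpointIndex(text.encode("utf-8"), k=3)
decoded = [index.code_point_at(i) for i in range(index.length)]
```

The memory cost is one integer per k units instead of padding every unit to a fixed width, and k trades table size against the length of the short scan after each lookup.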
I see this mentioned periodically in discussions about UTF-8 and it just doesn't seem to match reality. Very often you can be certain you're not operating with multi-codepoint grapheme clusters. Whether through string literals, conversion from other types (e.g., numeric to string), restrictions on identifiers, specification for file formats, company-policy on language for source files, conversion from strings with an ASCII charset, etc., you very often can be certain about the contents of that string. And optimizing around that information is considerably faster than a naive linear scan for the string, constantly rediscovering properties about that string.
E.g., Ruby runtimes scan the bytes in a string and then cache data about them in a value called a code range. Knowing the code range, you can optimize many operations to not require additional linear scans of the string. Knowing a UTF-8 string consists only of ASCII characters can allow operations to be just as fast as if the string truly were ASCII-only (Ruby supports 100+ string encodings). And that fact is used throughout the core library to provide fast implementations of many operations (upcase, downcase, capitalize, gsub, substring, and so on). Moreover, a JIT can generate extremely tight code in those situations. Having to take a linear pass through the string to discover codepoint boundaries incurs a huge performance cost. While all strings could be treated uniformly and use Unicode tables for case mapping and such, the extra overhead is brutal. It has a measurable impact on string-heavy applications, such as template rendering and text processing.
In the most general case, yes, you know nothing about the string and can't make any assumptions. You can't even be sure the byte sequence is valid UTF-8. But, very often you do know properties of those strings. And you can manage boundaries where strings with known properties are joined with strings with unknown properties (e.g., variable interpolation in a template file).
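To give a flavour of the cached-property idea, here is a small Python sketch. It is not how any Ruby runtime is actually implemented; the class and method names are invented, and the "code range" is reduced to a single ASCII-only flag.

```python
class Utf8Text:
    """UTF-8 bytes plus a cached 'ASCII only' flag (hypothetical sketch)."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        # One scan at construction time; the answer is cached afterwards.
        self.ascii_only = data.isascii()

    def char_at(self, i: int) -> str:
        if self.ascii_only:
            # Fast path: byte index and code point index coincide.
            return chr(self.data[i])
        # Slow path: variable-width code points, so decode first.
        return self.data.decode("utf-8")[i]

    def concat(self, other: "Utf8Text") -> "Utf8Text":
        result = Utf8Text.__new__(Utf8Text)   # skip the rescan in __init__
        result.data = self.data + other.data
        # The known property of the result follows from the parts.
        result.ascii_only = self.ascii_only and other.ascii_only
        return result


plain = Utf8Text(b"hello")
fancy = Utf8Text("h\u00e9llo".encode("utf-8"))
joined = plain.concat(fancy)
```

The boundary handling is the `concat` method: joining a known-ASCII string with an unknown one simply downgrades the result to the general path.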
> Whether through string literals, conversion from other types (e.g., numeric to string), restrictions on identifiers, specification for file formats, company-policy on language for source files, conversion from strings with an ASCII charset, etc., you very often can be certain about the contents of that string.
With the exception of conversion from numbers (which has its own optimizations that are likely equally applicable in UTF8 since Arabic numbers are just ASCII anyway), I’d say all of your examples sound like bugs waiting to happen.
Why shouldn’t string literals be allowed to contain complex emoji? Why should identifiers disallow them? Why should there be a company policy around putting complex emoji places?
Just saying “let’s just declare things such that strings aren’t allowed to have multi-code point grapheme clusters” sounds great until you accidentally let that assumption leak into a place where a user wants to use an emoji and can’t make it match their skin tone.
I’d also say that such restrictions are putting the cart before the horse; the typical reasons for restricting the allowed character set, are precisely because you want to make lazy assumptions about things like string offsets. Saying that such assumptions are a good thing because you have these restrictions in place, seems like circular logic to me.
> Why shouldn’t string literals be allowed to contain complex emoji?
I didn't say they shouldn't, just that many do not and you know that at parse time.
> Why should identifiers disallow them?
I don't write the language specs. Many languages don't allow classes, methods, variables, etc. to have complex grapheme clusters in them.
> Why should there be a company policy around putting complex emoji places?
Performance. Code sanity. Indexing. Ease of typing. Again, I'm not the one writing the policies. But, they exist.
> Just saying “let’s just declare things such that strings aren’t allowed to have multi-code point grapheme clusters” sounds great until you accidentally let that assumption leak into a place where a user wants to use an emoji and can’t make it match their skin tone.
I'm making a clear distinction between situations where you have user-supplied data and data under control of the language runtime, developer-created files, or those just adhering to well-defined file formats. These are all strings and commonly consist of simple codepoints; indeed, many times they're just ASCII characters.
I addressed user-supplied values when I wrote "And you can manage boundaries where strings with known properties are joined with strings with unknown properties (e.g., variable interpolation in a template file)." TruffleRuby, for example, uses ropes as its underlying structure, so if you have a template written using all ASCII characters (rather common) and interpolate a user-supplied value, you can put the user string in one rope, the template in others and link them all together into a tree with ConcatRopes. The template ropes still know they only have simple codepoints and operations on those parts can be fast. The user variable only knows it's a generic UTF-8 string and operations on that string, if any, can go down the slower path. Oftentimes, there are no operations to perform on that user string other than to display it. Its mere presence doesn't need to adversely affect the rest of the template.
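A toy version of that rope arrangement in Python, to show the shape of it. This is only a sketch of the idea and not TruffleRuby's implementation; `Leaf`, `Concat`, `upcase` and `flatten` are names I made up, and the template fragments are invented.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    data: bytes        # UTF-8 encoded text
    ascii_only: bool   # known property, computed once when the leaf is built


@dataclass(frozen=True)
class Concat:
    left: "Rope"
    right: "Rope"


Rope = Union[Leaf, Concat]


def leaf(text: str) -> Leaf:
    data = text.encode("utf-8")
    return Leaf(data, data.isascii())


def upcase(rope: Rope) -> Rope:
    if isinstance(rope, Concat):
        return Concat(upcase(rope.left), upcase(rope.right))
    if rope.ascii_only:
        # Fast path: bytes.upper() only touches ASCII letters.
        return Leaf(rope.data.upper(), True)
    # Slow path: full Unicode case mapping on the decoded text.
    return leaf(rope.data.decode("utf-8").upper())


def flatten(rope: Rope) -> str:
    if isinstance(rope, Concat):
        return flatten(rope.left) + flatten(rope.right)
    return rope.data.decode("utf-8")


# ASCII template pieces around one user-supplied value ("zoë").
page = Concat(Concat(leaf("<p>hello, "), leaf("zo\u00eb")), leaf("</p>"))
shouted = flatten(upcase(page))
```

Only the user-supplied leaf takes the slow path; the template leaves keep their ASCII-only flag and the cheap byte-wise operation.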
> I’d also say that such restrictions are putting the cart before the horse; the typical reasons for restricting the allowed character set, are precisely because you want to make lazy assumptions about things like string offsets. Saying that such assumptions are a good thing because you have these restrictions in place, seems like circular logic to me.
I'm not making any assumptions. I've spent an awful lot of time optimizing string performance in the context of a Ruby runtime and your initial claim of constant-time subscripting being a myth doesn't match my experience. Ruby allows as complex a string as you want, but the reality is there are many situations where strings, either by restriction or de facto, will not have multi-codepoint grapheme clusters. In many situations you'll have strings with all the codepoints in the ASCII range. If the only information the runtime records when parsing a string is that "this is a UTF-8 string" and then operates on all UTF-8 strings uniformly, you leave a lot of performance on the table. The best performing situation is when you don't have to deal with variable-width codepoints in a UTF-8 string. UTF-16 and UTF-32 aren't terribly common in Ruby, but they exist as valid encodings (well, UTF-16BE/UTF-16LE and UTF-32BE/UTF-32LE) and have simpler execution paths than UTF-8 for many use cases.
> For example, constant time subscripting, or improved length calculations, are made possible by encodings other than utf-8.
Assuming you mean different encoding forms of Unicode (rather than entirely different and far less comprehensive character sets, such as ASCII or Latin-1), there are very few use cases where "subscripting" or "length calculations" would benefit significantly from using a different encoding form, because it is rare that individual Unicode code points are the most appropriate units to work with.
(If you're happy to sacrifice support for most of the world's writing systems in favour of raw performance for a limited subset of scripts and text operations, that's different.)