
In this exchange UTF-8 got dragged into a list of ugly hacks, but it is a beautiful hack.

Endian-independent, more efficient than UTF-16 for most languages (often including CJK web pages: the halved cost of HTML markup & URLs makes up for the 50% extra text cost), supports easy and safe substring search, can detect truncated characters, and all that with ASCII and C-string backwards-compatibility.

If I could redesign the entire computing platform from scratch, UTF-8 is one thing I wouldn't change.



The disadvantage of UTF-8 is that it's a variable-length encoding. This means that certain operations which are usually O(1) become O(n) with UTF-8: one is finding the n-th character in a string, the other is finding the length in characters (though that's also true for null-terminated C strings).
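To make the O(n) claim concrete, here's a sketch (in Python, working on raw bytes) of how finding the n-th codepoint in UTF-8 forces you to walk the string from the start, decoding each lead byte as you go:

```python
def nth_codepoint_offset(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th codepoint (0-based) in UTF-8 data.
    Has to walk from the start, so the cost is O(n)."""
    offset = 0
    for _ in range(n):
        lead = data[offset]
        if lead < 0x80:      # 1-byte sequence (ASCII)
            offset += 1
        elif lead < 0xE0:    # 2-byte sequence (0b110xxxxx)
            offset += 2
        elif lead < 0xF0:    # 3-byte sequence (0b1110xxxx)
            offset += 3
        else:                # 4-byte sequence (0b11110xxx)
            offset += 4
    return offset

s = "héllo"  # 'é' encodes as 2 bytes in UTF-8
assert nth_codepoint_offset(s.encode("utf-8"), 2) == 3  # first 'l' at byte 3
```

With a fixed-width encoding the same lookup would be a single multiplication.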

Another problem is that swapping a character in a string might cause the (byte) length of the string to change, which might force a reallocation (also slow) and if you're doing it in a loop over the string, your loop-ending condition might now have changed because the string (byte) length has changed.

In fact, iterating over a UTF-8 string's characters is not something you can do with a simple for loop any more; it requires at least one, possibly two function calls per character (one to find the next character, one to find the end of the string, which might just have moved due to your modification).

Finally, efficiency: for English text, UTF-8 is the most efficient Unicode encoding, but for other languages that isn't true. A Chinese text requires three bytes per character (four for characters outside the BMP), as opposed to just two in UCS-2 (which is what most OSes and languages use, even though it can't encode all of Unicode).
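The size difference is easy to check directly. Here's a quick comparison of encoded byte lengths for a short Chinese string and an English one (strings chosen just as examples):

```python
zh = "统一码"   # 3 CJK characters, all in the BMP
en = "Unicode"  # 7 ASCII characters

assert len(zh.encode("utf-8")) == 9       # 3 bytes per CJK char in UTF-8
assert len(zh.encode("utf-16-le")) == 6   # 2 bytes per BMP char in UTF-16
assert len(en.encode("utf-8")) == 7       # 1 byte per ASCII char in UTF-8
assert len(en.encode("utf-16-le")) == 14  # 2 bytes per char in UTF-16
```

So for pure CJK text UTF-8 costs 50% more, while for ASCII-heavy text UTF-16 costs 100% more.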

For these reasons, dealing with a fixed-length encoding is much more convenient (and speedier) while the string is loaded into memory. UTF8 is great for i/o and storage on disk, but in memory, it's inconvenient.

UCS-2 or UTF-16 is the reverse: it's very inconvenient on disk and for i/o (need I say more than BOM), but in-memory UCS-2 is very convenient, even though it doesn't support all of Unicode. It's in fact so convenient that it's used by most programming environments (see yesterday's discussion about strings being broken).

Take Python 3.3 and later for example: even though they now have full support for Unicode, including characters outside of the BMP that require more than two bytes of storage, they didn't go with a variable-length in-memory encoding. Instead they chose the narrowest fixed-length encoding that can represent a particular string.
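You can observe this flexible representation (PEP 393) from user code: CPython picks 1, 2, or 4 bytes per codepoint depending on the widest character in the string, and `sys.getsizeof` reflects the difference. Exact byte counts vary by interpreter version, so this sketch only asserts the relative ordering:

```python
import sys

# CPython (PEP 393) stores each string at 1, 2, or 4 bytes per codepoint,
# chosen by the widest character the string contains.
narrow = "a" * 1000    # ASCII: 1 byte per char
bmp    = "中" * 1000   # BMP but > U+00FF: 2 bytes per char
astral = "𝄞" * 1000   # outside the BMP: 4 bytes per char

assert sys.getsizeof(bmp) > sys.getsizeof(narrow)
assert sys.getsizeof(astral) > sys.getsizeof(bmp)
```

Indexing stays O(1), at the cost of scanning the string once at creation to pick the width.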

This seems like an awful lot of work to me, but they decided the fixed-lengthness was still worth it.


Umm, is there any Unicode encoding where finding the n-th character (not codepoint) in a string is O(1)? In any encoding you can have a single 'composite character' that consists of dozens of bytes, but needs to be counted as a single character for the purposes of string length, n-th symbol, and cutting substrings.

This is not a disadvantage of UTF-8 but of unicode (or natural language complexity) as such.


"UTF-32 (or UCS-4) is a protocol to encode Unicode characters that uses exactly 32 bits per Unicode code point. All other Unicode transformation formats use variable-length encodings. The UTF-32 form of a character is a direct representation of its codepoint." (http://en.wikipedia.org/wiki/UTF-32)

Of course, the problem of combining marks and CJK ideographs remains.
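To illustrate why combining marks survive even a fixed-width encoding: one user-perceived character can still span several codepoints, so UTF-32's O(1) indexing only gets you codepoints, not characters. A small example:

```python
import unicodedata

# 'é' written as a base letter plus a combining acute accent:
s = "e\u0301"
assert len(s) == 2                        # two codepoints...
assert len(s.encode("utf-32-le")) == 8    # ...so 8 bytes even in UTF-32

# NFC normalization folds it into the single precomposed codepoint U+00E9,
# but not every combination has a precomposed form, so this isn't a fix.
assert unicodedata.normalize("NFC", s) == "\u00e9"
```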


That's the point - you get O(1) functions that work on codepoints. Since for pretty much all practical purposes you don't want to work on codepoints but on characters, codepoint-function efficiency is pretty much irrelevant.

I'm actually hard-pressed to find any example where I'd want to use a function that works on codepoints. Text editor internals and direct implementation of keyboard input? For what I'd say 99% of usecases, if codepoint-level functions are used then that's simply a bug (the code would break on valid text that has composite characters, say, a foreign surname) that's not yet discovered.

If a programmer doesn't want to go into detail of encodings, then I'd much prefer for the default option for string functions to be 'safe but less efficient' instead of 'faster but gives wrong results on some valid data'.


For a lot of usecases, you're just dealing with ASCII though (hello HTML). Wouldn't it be possible, in a string implementation, to have a flag indicating that the string is pure ASCII (set by the language internals), thereby indicating that fast, O(1) operations are safe to use?


What you describe can be done with UTF-8 plus such a flag - if the string is pure ASCII (all codes under 128) then the UTF-8 representation is byte-identical. IIRC the latest Python does exactly that, passing UTF-8 to C functions that expect ASCII, if the strings are 'clean'.
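The byte-identity claim is easy to verify, and Python even exposes the cheap check directly (`str.isascii`, available since Python 3.7):

```python
s = "hello, world"
# A pure-ASCII string encodes to identical bytes in ASCII and UTF-8,
# so the same buffer can be handed to either kind of consumer.
assert s.encode("ascii") == s.encode("utf-8")
assert s.isascii()          # the "pure ASCII" flag check
assert not "naïve".isascii()
```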

But for what common use cases are you just dealing with ASCII? Unless your data comes from a COBOL mainframe, you're going to get non-ASCII input at random places.

HTML is a prime example of that - the default encoding is UTF-8, HTML pages very often include unescaped non-ASCII content, and even if you're US-English only, your page content can include things such as accented proper names or the various non-ASCII quotation marks - such as the '»' used on the NYTimes front page.


It really depends on what kind of thing you are doing. Say you're processing financial data from a big CSV. Sure, you may run into non-ASCII characters on some lines. So what? As long as you're streaming the data line-by-line, it's still a big win. You could say the same for HTML - you're going to pay the Unicode price on accented content, but not with all your DOM manipulations which only involve element names (though I don't know by how much something like == takes a hit when dealing with non-ASCII), or when dealing with text nodes which don't have special characters.

I'm happy not to pay a performance price for things I don't use :)


The scenarios you mention would actually have significantly higher performance in UTF8 (identical to ASCII) rather than in multibyte encodings with fixed character size such as UCS2 or UTF32 that were recommended above. That's why utf-8 is the recommended encoding for html content.

Streaming, 'dealing with text nodes' while ignoring their meaning, and equality operations are byte operations that mostly depend on size of the text after encoding.


I think we're actually in agreement :)


You're still wrong on UCS-2: except for JavaScript, where it's still used (afaik there are unportable workarounds, but it's still specified that way), all the others use UTF-16. Yes, many methods operate on code units instead of code points (because the character type is usually 16 bits wide), but code-point-aware methods usually exist in Java, .NET and others. Just because a string is treated as a sequence of UTF-16 code units, that doesn't make it UCS-2. You could with the same argument say that everything that uses UTF-8 and gives you access to the underlying bytes doesn't support Unicode at all, but only ASCII.


> The disadvantage of utf-8 is that it's a variable length encoding.

So is UTF-16, by the way.
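For anyone who hasn't hit this: codepoints outside the BMP take a surrogate pair in UTF-16, i.e. two 16-bit code units instead of one, which is what makes it variable-length. A quick demonstration:

```python
g_clef = "\U0001d11e"   # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

assert len(g_clef.encode("utf-16-le")) == 4   # surrogate pair: 2 code units
assert len("A".encode("utf-16-le")) == 2      # BMP char: 1 code unit
```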


"Text normalization in Go" https://hackertimes.com/item?id=6806062 was here yesterday and made me wonder: how can you handle ligatures, accents, digraphs etc in fixed-width?

Perhaps text parsing is just a hard problem?


Yes it is hard. Instead of code points, humans usually handle text by graphemes, which have an arbitrary length in edge cases [0] no matter what encoding is used. An O(1) solution for graphemes would be an array of pointers to grapheme objects, which is slower and more memory hungry than necessary in most cases.
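A rough illustration of the codepoint/grapheme gap, using only the stdlib. Note this is an approximation: real grapheme segmentation needs UAX #29 rules (via ICU or the third-party `regex` module); the sketch below only folds combining marks into the preceding character, which is enough for the Zalgo case:

```python
import unicodedata

def approx_grapheme_count(s: str) -> int:
    """Approximate grapheme count: don't count combining marks as
    separate characters. (Not full UAX #29 segmentation.)"""
    return sum(1 for ch in s if not unicodedata.combining(ch))

zalgo = "Z\u0351\u036b\u0343"     # 'Z' with three stacked combining marks
assert len(zalgo) == 4            # four codepoints
assert approx_grapheme_count(zalgo) == 1  # one user-perceived character
```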

[0] http://stackoverflow.com/questions/6579844/how-does-zalgo-te...


O(n) is only an issue for large strings - let's say it might start being a concern at n > 1024, although on modern systems, quite frankly, n can be substantially larger without incurring any noticeable time loss. In my 20-year career, it has been rare to deal with strings larger than 1024 characters. In those cases, I don't think it's unreasonable to have to use a specialised class that tracks string indexing such that lookup becomes O(1). With enough mucking around you can even get insertion to be O(1).
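One possible shape for that specialised class (a hypothetical sketch, not any particular library's design): precompute the byte offset of every codepoint once in O(n), then lookups become O(1) array indexing:

```python
class IndexedUTF8:
    """UTF-8 buffer with a precomputed codepoint-offset index.
    Building the index is O(n); codepoint_at() is then O(1)."""

    def __init__(self, data: bytes):
        self.data = data
        self.offsets = []
        i = 0
        while i < len(data):
            self.offsets.append(i)
            b = data[i]
            # advance by the sequence length implied by the lead byte
            i += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4

    def codepoint_at(self, n: int) -> str:
        start = self.offsets[n]
        end = self.offsets[n + 1] if n + 1 < len(self.offsets) else len(self.data)
        return self.data[start:end].decode("utf-8")

s = IndexedUTF8("héllo".encode("utf-8"))
assert s.codepoint_at(1) == "é"
assert s.codepoint_at(4) == "o"
```

The trade-off is the extra index memory and having to rebuild (or patch) the index on mutation, which is where the O(1)-insertion "mucking around" comes in.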

For the remaining use cases of strings (the vast majority - filenames, URLs, UI labels, database fields), O(n) performance is perfectly adequate. This is why UTF-8 is such a successful hack.


Or, of course, many smaller strings...


A variable-length encoding is absolutely a good choice for i18n. It forces you to give up random access, which is nearly impossible for i18n text processing anyway (just think of a vowel followed by 2 or more accents). i18n libraries, such as the popular regexp package, can easily deal with UTF-8 without much penalty. Of course, the library has to deal with all kinds of complexity anyway; the random access was never a real need. Disclaimer: I contributed to the regexp package of a major programming language.




