
Umm, is there any Unicode encoding where finding the n-th character (not codepoint) in a string is O(1)? In any encoding you can have a single 'composite character' that consists of dozens of bytes but needs to be counted as a single character for the purposes of string length, n-th symbol, and cutting substrings.

This is not a disadvantage of UTF-8 but of Unicode (or of natural-language complexity) as such.
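To make the codepoint/character distinction concrete, here is a small Python sketch (not from the original comment) showing that one user-perceived character can be several codepoints, so `len()` does not count characters:

```python
import unicodedata

# "é" can be one precomposed codepoint, or 'e' followed by a
# combining acute accent (U+0301) -- same character on screen.
composed = "\u00e9"
decomposed = "e\u0301"

print(len(composed))    # 1
print(len(decomposed))  # 2 -- len() counts codepoints, not characters

# NFC normalization recombines this particular pair, but not every
# cluster has a precomposed form, so normalization alone doesn't
# make codepoint counting equal character counting.
print(len(unicodedata.normalize("NFC", decomposed)))  # 1
```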



"UTF-32 (or UCS-4) is a protocol to encode Unicode characters that uses exactly 32 bits per Unicode code point. All other Unicode transformation formats use variable-length encodings. The UTF-32 form of a character is a direct representation of its codepoint." (http://en.wikipedia.org/wiki/UTF-32)

Of course, the problem of combining marks and CJK ideographs remains.
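A sketch of what that fixed-width property buys you (the `codepoint_at` helper is a made-up name for illustration): every codepoint is exactly 4 bytes, so indexing is plain arithmetic.

```python
# In UTF-32, byte offset of the n-th codepoint is simply 4 * n.
text = "Zo\u00eb"                # 3 codepoints (precomposed ë)
data = text.encode("utf-32-le")  # "-le" variant has no BOM
print(len(data))                 # 12 bytes = 3 codepoints * 4

def codepoint_at(buf: bytes, n: int) -> str:
    # O(1) indexing into a UTF-32-LE buffer.
    return buf[4 * n : 4 * n + 4].decode("utf-32-le")

print(codepoint_at(data, 2))  # ë
```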


That's the point - you get O(1) functions that work on codepoints. Since for pretty much all practical purposes you don't want to work on codepoints but on characters, then codepoint-function efficiency is pretty much irrelevant.

I'm actually hard-pressed to find any example where I'd want a function that works on codepoints. Text-editor internals and direct implementation of keyboard input? For, I'd say, 99% of use cases, if codepoint-level functions are used then that's simply a bug (the code would break on valid text that has composite characters, say, a foreign surname) that's not yet been discovered.
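The surname bug is easy to demonstrate in Python. The `graphemes` helper below is a deliberately simplified sketch that only handles combining marks; real grapheme-cluster segmentation (UAX #29) covers far more cases and is not in the stdlib:

```python
import unicodedata

# Codepoint-level slicing silently corrupts composite characters.
name = "Zoe\u0308"   # "Zoë" spelled with a combining diaeresis
print(name[:3])      # "Zoe" -- the accent was cut off

def graphemes(s: str) -> list[str]:
    # Naive cluster segmentation: attach combining marks to the
    # preceding base character. A sketch, not full UAX #29.
    out: list[str] = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

print(len(name))             # 4 codepoints
print(len(graphemes(name)))  # 3 characters
```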

If a programmer doesn't want to go into detail of encodings, then I'd much prefer for the default option for string functions to be 'safe but less efficient' instead of 'faster but gives wrong results on some valid data'.


For a lot of usecases, you're just dealing with ASCII though (hello HTML). Wouldn't it be possible, in a string implementation, to have a flag indicating that the string is pure ASCII (set by the language internals), thereby indicating that fast, O(1) operations are safe to use?


What you describe is done with UTF-8 plus such a flag - if the string is pure ASCII (codepoints 0-127) then the UTF-8 representation is byte-identical. IIRC the latest Python does exactly that, passing UTF-8 strings straight to C functions that expect ASCII if they are 'clean'.

But for what common use cases are you just dealing with ASCII? Unless your data comes from a COBOL mainframe, you're going to get non-ASCII input in random places.

HTML is a prime example of that - the default encoding is UTF-8, HTML pages very often include unescaped non-ASCII content, and even if you're US-English only, your page content can include things such as accented proper names or the various non-ASCII quotation-mark characters - such as the '»' used on the NYTimes front page.


It really depends on what kind of thing you are doing. Say you're processing financial data from a big CSV. Sure, you may run into non-ASCII characters on some lines. So what? As long as you're streaming the data line-by-line, it's still a big win. You could say the same for HTML - you're going to pay the Unicode price on accented content, but not with all your DOM manipulations which only involve element names (though I don't know by how much something like == takes a hit when dealing with non-ASCII), or when dealing with text nodes which don't have special characters.

I'm happy not to pay a performance price for things I don't use :)


The scenarios you mention would actually perform significantly better in UTF-8 (byte-identical to ASCII for ASCII text) than in fixed-width encodings such as UCS-2 or UTF-32 that were recommended above. That's why UTF-8 is the recommended encoding for HTML content.

Streaming, 'dealing with text nodes' while ignoring their meaning, and equality checks are byte operations whose cost mostly depends on the size of the text after encoding.
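The size difference is easy to measure; a Python sketch with a made-up mostly-ASCII HTML snippet:

```python
# Byte-level costs (streaming, memcmp-style equality) scale with
# encoded size. For mostly-ASCII text, UTF-8 is far more compact
# than a fixed-width encoding.
html = '<p class="note">Caf\u00e9 \u00bb more</p>'  # 31 codepoints

print(len(html.encode("utf-8")))     # 33 -- only é and » take 2 bytes
print(len(html.encode("utf-32-le"))) # 124 -- every codepoint takes 4
```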


I think we're actually in agreement :)



