
Umm, is there any Unicode encoding where finding the n-th character (not codepoint) in a string is O(1)? In any encoding you can have a single 'composite character' that consists of dozens of bytes but needs to be counted as a single character for the purposes of string length, n-th symbol, and cutting substrings.

This is not a disadvantage of UTF-8 but of Unicode (or of natural-language complexity) as such.
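To make the codepoint/character distinction concrete, here is a small Python sketch (not from the original comment) showing that one user-perceived character can be several codepoints, so `len()` does not count characters:

```python
import unicodedata

# "é" can be one precomposed codepoint, or 'e' followed by a
# combining acute accent (U+0301) -- same character on screen.
composed = "\u00e9"
decomposed = "e\u0301"

print(len(composed))    # 1
print(len(decomposed))  # 2 -- len() counts codepoints, not characters

# NFC normalization recombines this particular pair, but not every
# cluster has a precomposed form, so normalization alone doesn't
# make codepoint counting equal character counting.
print(len(unicodedata.normalize("NFC", decomposed)))  # 1
```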



"UTF-32 (or UCS-4) is a protocol to encode Unicode characters that uses exactly 32 bits per Unicode code point. All other Unicode transformation formats use variable-length encodings. The UTF-32 form of a character is a direct representation of its codepoint." (http://en.wikipedia.org/wiki/UTF-32)

Of course, the problem of combining marks and CJK ideographs remains.
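A sketch of what that fixed-width property buys you (the `codepoint_at` helper is a made-up name for illustration): every codepoint is exactly 4 bytes, so indexing is plain arithmetic.

```python
# In UTF-32, byte offset of the n-th codepoint is simply 4 * n.
text = "Zo\u00eb"                # 3 codepoints (precomposed ë)
data = text.encode("utf-32-le")  # "-le" variant has no BOM
print(len(data))                 # 12 bytes = 3 codepoints * 4

def codepoint_at(buf: bytes, n: int) -> str:
    # O(1) indexing into a UTF-32-LE buffer.
    return buf[4 * n : 4 * n + 4].decode("utf-32-le")

print(codepoint_at(data, 2))  # ë
```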


That's the point - you get O(1) functions that work on codepoints. Since for pretty much all practical purposes you don't want to work on codepoints but on characters, then codepoint-function efficiency is pretty much irrelevant.

I'm actually hard-pressed to find any example where I'd want a function that works on codepoints. Text-editor internals and direct implementation of keyboard input? For, I'd say, 99% of use cases, if codepoint-level functions are used then that's simply a bug (the code would break on valid text that has composite characters, say, a foreign surname) that's not yet been discovered.
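The surname bug is easy to demonstrate in Python. The `graphemes` helper below is a deliberately simplified sketch that only handles combining marks; real grapheme-cluster segmentation (UAX #29) covers far more cases and is not in the stdlib:

```python
import unicodedata

# Codepoint-level slicing silently corrupts composite characters.
name = "Zoe\u0308"   # "Zoë" spelled with a combining diaeresis
print(name[:3])      # "Zoe" -- the accent was cut off

def graphemes(s: str) -> list[str]:
    # Naive cluster segmentation: attach combining marks to the
    # preceding base character. A sketch, not full UAX #29.
    out: list[str] = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

print(len(name))             # 4 codepoints
print(len(graphemes(name)))  # 3 characters
```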

If a programmer doesn't want to go into detail of encodings, then I'd much prefer for the default option for string functions to be 'safe but less efficient' instead of 'faster but gives wrong results on some valid data'.


For a lot of usecases, you're just dealing with ASCII though (hello HTML). Wouldn't it be possible, in a string implementation, to have a flag indicating that the string is pure ASCII (set by the language internals), thereby indicating that fast, O(1) operations are safe to use?


What you describe is done with UTF-8 plus such a flag - if the string is pure ASCII (codepoints 0-127) then the UTF-8 representation is byte-identical. IIRC the latest Python does exactly that, passing UTF-8 strings straight to C functions that expect ASCII if they are 'clean'.

But for what common use cases are you just dealing with ASCII? Unless your data comes from a COBOL mainframe, you're going to get non-ASCII input in random places.

HTML is a prime example of that - the default encoding is UTF-8, HTML pages very often include unescaped non-ASCII content, and even if you're US-English only, your page content can include things such as accented proper names or the various non-ASCII quotation-mark characters - such as the '»' used on the NYTimes front page.


It really depends on what kind of thing you are doing. Say you're processing financial data from a big CSV. Sure, you may run into non-ASCII characters on some lines. So what? As long as you're streaming the data line-by-line, it's still a big win. You could say the same for HTML - you're going to pay the Unicode price on accented content, but not with all your DOM manipulations which only involve element names (though I don't know by how much something like == takes a hit when dealing with non-ASCII), or when dealing with text nodes which don't have special characters.

I'm happy not to pay a performance price for things I don't use :)


The scenarios you mention would actually perform significantly better in UTF-8 (byte-identical to ASCII for ASCII text) than in fixed-width encodings such as UCS-2 or UTF-32 that were recommended above. That's why UTF-8 is the recommended encoding for HTML content.

Streaming, 'dealing with text nodes' while ignoring their meaning, and equality checks are byte operations whose cost mostly depends on the size of the text after encoding.
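The size difference is easy to measure; a Python sketch with a made-up mostly-ASCII HTML snippet:

```python
# Byte-level costs (streaming, memcmp-style equality) scale with
# encoded size. For mostly-ASCII text, UTF-8 is far more compact
# than a fixed-width encoding.
html = '<p class="note">Caf\u00e9 \u00bb more</p>'  # 31 codepoints

print(len(html.encode("utf-8")))     # 33 -- only é and » take 2 bytes
print(len(html.encode("utf-32-le"))) # 124 -- every codepoint takes 4
```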


I think we're actually in agreement :)



