About 20 years ago, I implemented a hash table that used binary trees for buckets. It was supposed to support Unicode strings as keys, and I used the Windows comparison functions to figure out how to compare the keys, in order to handle case insensitive lookup IIRC.
I tested the table with randomly generated strings, and was puzzled to discover expected lookups would fail with some table constructions. I narrowed it down to inconsistencies in collation.
It turns out that the collation order implemented by Windows was not transitive. You could have three code points, a, b and c, where a > b and b > c and c > a. Sort order is well defined within blocks, but there isn't necessarily a meaningful sort order across blocks; how should your Cyrillic letter compare with your Greek letter vs your Armenian letter?
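For anyone who wants to test a comparison function for this kind of cycle, a brute-force check is easy to sketch. Everything below is illustrative: `find_intransitive_triple` and `cyclic_cmp` are made-up names, and the cyclic comparator is a stand-in for a misbehaving collation, not the actual Windows behavior.

```python
from itertools import permutations

def find_intransitive_triple(items, cmp):
    """Return the first (a, b, c) with a > b and b > c but not a > c, else None."""
    for a, b, c in permutations(items, 3):
        if cmp(a, b) > 0 and cmp(b, c) > 0 and not cmp(a, c) > 0:
            return (a, b, c)
    return None

# Hypothetical comparator with a deliberate cycle: a > b, b > c, c > a.
BEATS = {("a", "b"), ("b", "c"), ("c", "a")}

def cyclic_cmp(x, y):
    if x == y:
        return 0
    return 1 if (x, y) in BEATS else -1

def codepoint_cmp(x, y):
    """Plain code point ordering, which is always transitive."""
    return (x > y) - (x < y)

print(find_intransitive_triple(["a", "b", "c"], cyclic_cmp))
print(find_intransitive_triple(["a", "b", "c"], codepoint_cmp))
```

The check is cubic in the number of items, so it only makes sense on small samples, but a single hit is enough to explain why a binary tree built with that comparator loses keys.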
Case transformations are locale-dependent. That is, in French lower case of "I" would be "i", and upper case of "i" would be "I". In Turkish, which uses largely the same letters, lower case of "I" would be "ı", and upper case of "i" would be "İ". Also, in German, upper case of "s" is "S", but upper case of "ß" would be "SS", and you have to guess what lower case of "SS" would be.
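A quick way to see this is with Python's built-in string methods, which apply Unicode's default, locale-independent case mappings. The `turkish_lower` and `turkish_upper` helpers are hypothetical and only handle the two dotted/dotless pairs in precomposed form; a real application would more likely lean on a locale-aware library.

```python
# Default Unicode mappings, as applied by Python's str methods.
assert "ß".upper() == "SS"
assert "SS".lower() == "ss"   # the round trip does not give back "ß"
assert "I".lower() == "i"     # Turkish expects "ı" here
assert "i".upper() == "I"     # Turkish expects "İ" here

def turkish_lower(s):
    """Hypothetical helper: map İ -> i and I -> ı before the default mapping."""
    return s.replace("İ", "i").replace("I", "ı").lower()

def turkish_upper(s):
    """Hypothetical helper: map i -> İ and ı -> I before the default mapping."""
    return s.replace("i", "İ").replace("ı", "I").upper()

print(turkish_lower("ISPARTA"), turkish_upper("istanbul"))
```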
Universal case insensitivity is hard if not impossible. It's best to preserve both a "canonical case" version and the raw data in a search index, if different.
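A minimal sketch of that idea, assuming NFKC normalization plus `str.casefold()` is an acceptable "canonical case" (that choice is mine, and it will be wrong for some languages). `SearchIndex` and `canonical` are illustrative names.

```python
import unicodedata
from collections import defaultdict

def canonical(s):
    """Assumed canonical form: NFKC normalization followed by case folding."""
    return unicodedata.normalize("NFKC", s).casefold()

class SearchIndex:
    def __init__(self):
        # canonical key -> list of raw strings exactly as they were added
        self._by_key = defaultdict(list)

    def add(self, raw):
        self._by_key[canonical(raw)].append(raw)

    def lookup(self, query):
        """Case-insensitive lookup: every raw string sharing the canonical key."""
        return list(self._by_key.get(canonical(query), []))

    def lookup_exact(self, query):
        """Exact lookup, answered from the raw strings that were kept alongside."""
        return [raw for raw in self.lookup(query) if raw == query]

idx = SearchIndex()
idx.add("Straße")
idx.add("STRASSE")
print(idx.lookup("strasse"), idx.lookup_exact("Straße"))
```

Keeping the raw strings means the canonical form can be changed later and the index rebuilt without going back to the original source.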
Nice! German has funnier problems, though, because a vowel with an umlaut can be represented as that vowel + e, that is, "fuer" is a legit representation of "für". How do you normalize that? Turning to one canonical representation works most of the time, but sometimes you also need letter-to-letter correspondence.
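One way to get a canonical form while keeping letter-to-letter correspondence is to record, for each output character, which input character produced it. This is only a sketch: `EXPANSIONS` and `expand_with_offsets` are made-up names, and it assumes the input uses precomposed characters (NFC).

```python
# Conventional German expansions, assuming precomposed (NFC) input.
EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def expand_with_offsets(text):
    """Return (expanded, offsets); offsets[i] is the index in `text`
    of the character that produced expanded[i]."""
    out, offsets = [], []
    for i, ch in enumerate(text):
        for piece in EXPANSIONS.get(ch, ch):
            out.append(piece)
            offsets.append(i)
    return "".join(out), offsets

print(expand_with_offsets("für"))
print(expand_with_offsets("Schröder")[0] == expand_with_offsets("Schroeder")[0])
```

The expansion is one-way: once "Schröder" and "Schroeder" share a key, nothing in the key says which spelling the original used, which is why the offsets (or the raw string) need to be kept.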
It also gets interesting with proper names. It's perfectly legitimate to transliterate "Herr Schröder" to "Herr Schroeder" (and if you don't have ö at your disposal you have to), but a proper name that starts out with the transliteration, like "Dr. Oetker" can usually not be transliterated to "ö".
In some cases that might be because the "oe" was never an "ö" in this case, even if people might have shifted to pronouncing it that way. But in other cases I imagine that it was intended and did start out as an "ö" sound, but people just decided to write the name with "oe".
Is that really legit in German? I know Finnish umlauts (ä, ö) sometimes get mangled to ae or oe, but those are definitely not valid alternative spellings, nor are they pronounced anywhere close to the same.
This brings up something that is perhaps easy to forget: the interpretation of diacritics is far from universal. The diacritic in the character 'ä' can be one of two semantically different diacritics. It can be a diaeresis, whose function is to mark that a vowel starts the next syllable rather than forming part of a diphthong; or it can be an umlaut, whose purpose is to indicate a different vowel sound altogether. As far as any charset is concerned, though, despite those very different semantics, the two are the same diacritical mark [1].
Even beyond the issue of two concepts sharing one glyph, the interpretation of the same diacritics is inconsistent across languages. English tends to drop diacritics, to the point that many people think English doesn't use them; German uses expansion (so ä becomes ae). As you mention, some languages are incomprehensible either way, so the diacritics need to be preserved. And sorting and collation are even more fun!
[1] This does mean that Unicode's insistence that characters represent semantic differences rather than graphical differences can come across as rather arbitrary. The original purpose of Unicode was to unify different character sets together, so it preserves character differentiation that existed in antecedent charsets but tends to otherwise unify characters in practice.
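The unification of umlaut and diaeresis is easy to see with Python's standard `unicodedata` module: decompose a French-style diaeresis and a German umlaut and both yield the same combining character. `combining_marks` is just an illustrative helper name.

```python
import unicodedata

def combining_marks(word):
    """Names of the combining characters in the canonical decomposition of `word`."""
    decomposed = unicodedata.normalize("NFD", word)
    return [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]

# A diaeresis (naïve) and an umlaut (für) decompose to the same mark, U+0308.
print(combining_marks("naïve"))
print(combining_marks("für"))
```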
Yeah, completely legit and not uncommon at all (though advances in locale support have probably made it less necessary in recent decades). As an official transliteration, documented and taught in school, it is definitely pronounced the same.
It seems like German could actually be at fault for your bogus transliteration issues in Finnish, then. Sorry about that! 8)
The thing I most remember is the surprise of intransitive collation order. I also recall implementing case insensitive lookup.
It might have been an early version of this one: http://codecentral.embarcadero.com/Item/15171 - from 2001 - but I've written quite a few hash tables over the years, and may be blending the different recollections. I never used Vista, and I'm pretty sure my experience predated Windows 7.
In the first section points 16, 22, and 23 are relevant, in the second section look at points 8, 9, 10, 11, 12, 13, and 40.
Sorting based on Unicode strings is tricky, especially if you want cross platform compatibility. It's better today but there are still a lot of edge cases to consider if you have any hope of getting the same sort twice from "interesting" data.
That's one way to implement case-insensitive lookup. OTOH, for a programming language symbol table, maybe you want to look up with case sensitivity first, then a second time with case insensitivity so you can emit a warning about the discrepancy (for a case-insensitive language) or an error with suggested spelling correction (for a case-sensitive language) and you don't want to maintain two separate hash tables for every scope.
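A rough sketch of doing both lookups against a single table: hash on a case-folded key, but keep the exact spellings inside each bucket. `SymbolTable` and its return shape are invented for illustration, and `str.casefold()` stands in for whatever folding rule the language actually defines.

```python
class SymbolTable:
    def __init__(self):
        # folded name -> {exact name: value}
        self._buckets = {}

    def define(self, name, value):
        self._buckets.setdefault(name.casefold(), {})[name] = value

    def lookup(self, name):
        """Return (value, exact_match, candidates).

        On an exact hit, candidates is empty. On a miss, candidates lists the
        defined names that differ from `name` only by case, which a compiler
        could use for a warning or a spelling suggestion."""
        bucket = self._buckets.get(name.casefold(), {})
        if name in bucket:
            return bucket[name], True, []
        return None, False, sorted(bucket)

table = SymbolTable()
table.define("fooBar", 1)
print(table.lookup("fooBar"))
print(table.lookup("FOOBAR"))
```

A case-insensitive language could resolve the single candidate and warn; a case-sensitive one could report an error and suggest the candidate.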