About 20 years ago, I implemented a hash table that used binary trees for bucket...

nine_k · on Jan 8, 2019

Case transformations are locale-dependent. That is, in French lower case of "I" would be "i", and upper case of "i" would be "I". In Turkish, which uses largely the same letters, lower case of "I" would be "ı", and upper case of "i" would be "İ". Also, in German, upper case of "s" is "S", but upper case of "ß" would be "SS", and you have to guess what lower case of "SS" would be.

Universal case insensitivity is hard if not impossible. It's best to preserve both a "canonical case" version and the raw data in a search index, if different.
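The mappings described above can be checked in Python, whose built-in `str.upper()` and `str.lower()` apply Unicode's default, locale-independent case mappings. The helpers `turkish_lower` and `turkish_upper` are hypothetical names, not part of any library, and they are a simplified sketch that only handles the precomposed letters I, İ, i and ı:

```python
# Default Unicode case mappings, as implemented by Python's str methods.
# These do not take the user's locale into account.
default_lower_I = "I".lower()    # "i", correct for French, wrong for Turkish
eszett_upper = "ß".upper()       # "SS", a one-to-two character mapping
ss_lower = "SS".lower()          # "ss", so the round trip loses the "ß"


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lower-case text using the Turkish dotted/dotless i rules."""
    # Handle the two Turkish-specific capitals first, then use the defaults.
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(text: str) -> str:
    """Hypothetical helper: upper-case text using the Turkish dotted/dotless i rules."""
    return text.replace("i", "İ").replace("ı", "I").upper()
```

The point of the sketch is that the language has to be supplied from outside: nothing in the string "I" says whether it is French or Turkish.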

krylon · on Jan 8, 2019

There is an uppercase ß, actually: https://en.wikipedia.org/wiki/Capital_%E1%BA%9E

Not that it has any impact on your point. But as a German who only learned about it fairly recently, I have rather ambivalent feelings about it. ;-)

nine_k · on Jan 8, 2019

Nice! German has funnier problems, though, because a vowel with an umlaut can be represented as that vowel + e, that is, "fuer" is a legit representation of "für". How do you normalize that? Turning to one canonical representation works most of the time, but sometimes you also need letter-to-letter correspondence.
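One way to get both a canonical representation and the original letters is to index on a folded key while storing the raw strings next to it. This is only a sketch under assumptions: `german_search_key` and `build_index` are made-up names, and the umlaut expansion table is a plain hard-coded mapping rather than anything a library provides.

```python
import unicodedata
from collections import defaultdict

# Assumed expansion table for German: umlaut vowel -> vowel + "e".
_UMLAUT_EXPANSIONS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def german_search_key(text: str) -> str:
    """Hypothetical canonical key: compose, case-fold, then expand umlauts."""
    composed = unicodedata.normalize("NFC", text)  # "u" + U+0308 becomes "ü"
    folded = composed.casefold()                   # also turns "ß" into "ss"
    return folded.translate(_UMLAUT_EXPANSIONS)


def build_index(words):
    """Map each canonical key to the list of raw spellings that produced it."""
    index = defaultdict(list)
    for word in words:
        index[german_search_key(word)].append(word)
    return dict(index)


index = build_index(["für", "fuer", "Für", "Schröder", "Schroeder"])
```

The key is deliberately lossy ("fuer" and "für" collide), which is why the raw spellings are kept: the key answers "could these match?" and the raw data answers "what was actually written?".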

anyfoo · on Jan 8, 2019

It also gets interesting with proper names. It's perfectly legitimate to transliterate "Herr Schröder" to "Herr Schroeder" (and if you don't have ö at your disposal you have to), but a proper name that starts out with the transliteration, like "Dr. Oetker" can usually not be transliterated to "ö".

In some cases that might be because the "oe" was never an "ö" in this case, even if people might have shifted to pronouncing it that way. But in other cases I imagine that it was intended and did start out as an "ö" sound, but people just decided to write the name with "oe".

(Same with the other Umlauts.)

chousuke · on Jan 8, 2019

Is that really legit in German? I know Finnish umlauts (ä, ö) sometimes get mangled to ae or oe, but those are definitely not valid alternative spellings, nor are they pronounced anywhere close to the same.

jcranmer · on Jan 8, 2019

This brings up something that is perhaps easy to forget: the interpretation of diacritics is far from universal. The diacritic in the character 'ä' can be one of two semantically different diacritics. It can be a diaeresis, a diacritic whose function is to mark that the vowel starts the next syllable rather than existing as part of a diphthong; or it can be an umlaut, whose purpose is to indicate that it is a different vowel sound altogether. As far as any charset is concerned, though, despite those very different semantics, the two things are the same diacritical mark [1].

Even beyond the issue with two concepts using the same glyph, the interpretation of the same diacritics among different languages is inconsistent. English tends to drop diacritics to the point that many people think that English doesn't use them; German uses expansion (so ä becomes ae). As you mention, some languages are incomprehensible either way, so they need to be preserved. And sorting and collation are even more fun!

[1] This does mean that Unicode's insistence that characters represent semantic differences rather than graphical differences can come across as rather arbitrary. The original purpose of Unicode was to unify different character sets together, so it preserves character differentiation that existed in antecedent charsets but tends to otherwise unify characters in practice.
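That the diaeresis and the umlaut share one encoding can be seen by decomposing a word from each tradition and asking Unicode for the name of the combining mark. The helper name `combining_mark_names` is invented for this illustration; the `unicodedata` calls are from the Python standard library.

```python
import unicodedata


def combining_mark_names(word: str) -> list:
    """Return the Unicode names of all combining marks in the decomposed word."""
    decomposed = unicodedata.normalize("NFD", word)  # split base letter and mark
    return [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]


diaeresis_marks = combining_mark_names("naïve")    # the mark is a diaeresis here
umlaut_marks = combining_mark_names("Mädchen")     # the mark is an umlaut here
```

Both calls report the same character, U+0308 COMBINING DIAERESIS, so software cannot tell the two uses apart from the text alone.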

tedunangst · on Jan 9, 2019

Pretty much the only place you see diaeresis in English is in the New Yorker, whenever they use a word like coördination.

anyfoo · on Jan 8, 2019

Yeah, completely legit and not uncommon at all (though advances in locale support have probably made it less necessary in recent decades). As an official transliteration, documented and taught in school, it is definitely pronounced the same.

It seems like German could actually be at fault for your bogus transliteration issues in Finnish, then. Sorry about that! 8)

zaarn · on Jan 9, 2019

The handling of "ß" was recently changed; "ẞ" has been the official uppercase form of "ß" since 2017 (Unicode has had it since version 5.1, released in 2008).

leeter · on Jan 8, 2019

IIRC the newer Unicode collation sensitive comparison functions hadn't been implemented until Vista

Looks like that's the case: https://docs.microsoft.com/en-us/windows/desktop/api/stringa...

barrkel · on Jan 8, 2019

The thing I most remember is the surprise of intransitive collation order. I also recall implementing case insensitive lookup.

It might have been an early version of this one: http://codecentral.embarcadero.com/Item/15171 - from 2001 - but I've written quite a few hash tables over the years, and may be blending the different recollections. I never used Vista, and I'm pretty sure my experience predated Windows 7.

I also wrote JclStrHashMap to support JclExprEval: https://github.com/project-jedi/jcl/blob/master/jcl/source/c...

I also wrote the Delphi runtime library TDictionary generic implementation, but that was more recent.

Update: I found it. The original discussion, from the newsgroups, has been ported to the web: http://www.delphigroups.info/2/62/478610.html

It was from 2004, so more like 15 years ago. It's mildly painful to read myself from back then too :)

oconnor663 · on Jan 8, 2019

> I used the Windows comparison functions to figure out how to compare the keys, in order to handle case insensitive lookup IIRC.

Wouldn't you need to do case normalization before hashing, to make that work?
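A minimal sketch of that idea, assuming Python's `str.casefold()` as the normalization step: the key is folded before it is hashed, so strings that compare equal ignoring case always land in the same slot. `CaseInsensitiveDict` is an illustrative class, not the hash table discussed in this thread.

```python
class CaseInsensitiveDict:
    """Illustrative mapping that folds case before hashing the key."""

    def __init__(self):
        # folded key -> (key as originally written, value)
        self._items = {}

    def __setitem__(self, key, value):
        self._items[key.casefold()] = (key, value)

    def __getitem__(self, key):
        return self._items[key.casefold()][1]

    def __contains__(self, key):
        return key.casefold() in self._items

    def original_key(self, key):
        """Return the spelling that was used when the entry was stored."""
        return self._items[key.casefold()][0]


table = CaseInsensitiveDict()
table["Straße"] = "street"
```

Without the folding step, "Straße" and "STRASSE" would hash to unrelated buckets and an equality function alone would never get the chance to compare them.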

jandrese · on Jan 9, 2019

Case normalization is its own minefield in Unicode.

This article discusses some of the challenges: https://stackoverflow.com/questions/6162484/why-does-modern-...

In the first section, points 16, 22, and 23 are relevant; in the second section, look at points 8, 9, 10, 11, 12, 13, and 40.

Sorting based on Unicode strings is tricky, especially if you want cross platform compatibility. It's better today but there are still a lot of edge cases to consider if you have any hope of getting the same sort twice from "interesting" data.
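As a rough illustration of why plain code-point order is not enough, and of one way to get a repeatable order without depending on the platform's locale data, here is a sketch. The name `portable_sort_key` is hypothetical, and the key is a crude approximation (strip combining marks, fold case, tie-break on the NFC form); it is not the Unicode Collation Algorithm and is not linguistically correct for any particular language.

```python
import unicodedata


def portable_sort_key(text: str):
    """Hypothetical key: accent- and case-insensitive first, NFC form as tie-break."""
    decomposed = unicodedata.normalize("NFKD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), unicodedata.normalize("NFC", text))


words = ["zebra", "Zoo", "apple", "école"]
by_code_point = sorted(words)                    # capitals first, "é" after "z"
by_key = sorted(words, key=portable_sort_key)    # same result on every platform
```

Because the key uses only normalization and case folding from the Unicode character database, two machines running the same Python version should produce the same order, which is the property the comment above is after.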

barrkel · on Jan 8, 2019

That's one way to implement case-insensitive lookup. OTOH, for a programming language symbol table, maybe you want to look up with case sensitivity first, then a second time with case insensitivity so you can emit a warning about the discrepancy (for a case-insensitive language) or an error with suggested spelling correction (for a case-sensitive language) and you don't want to maintain two separate hash tables for every scope.
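The two-pass lookup can share one table if the hash is computed on the folded name while each bucket still remembers the exact spellings. The sketch below assumes that design; `SymbolTable`, `define`, and `lookup` are illustrative names and not taken from any of the libraries mentioned in this thread.

```python
class SymbolTable:
    """Illustrative scope: one table, hashed on the case-folded name."""

    def __init__(self):
        # folded name -> {exact name as declared: symbol}
        self._buckets = {}

    def define(self, name, symbol):
        self._buckets.setdefault(name.casefold(), {})[name] = symbol

    def lookup(self, name):
        """Return (symbol, declared_name).

        declared_name is None on an exact match; otherwise it is the spelling
        used at the declaration, so the caller can warn or suggest a fix.
        """
        candidates = self._buckets.get(name.casefold(), {})
        if name in candidates:                # pass 1: case-sensitive
            return candidates[name], None
        if candidates:                        # pass 2: case-insensitive
            declared, symbol = next(iter(candidates.items()))
            return symbol, declared
        raise KeyError(name)


scope = SymbolTable()
scope.define("Counter", 1)
exact = scope.lookup("Counter")
inexact = scope.lookup("counter")
```

A case-insensitive language would treat a non-`None` `declared_name` as a warning; a case-sensitive one would turn it into an error with a "did you mean" suggestion.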