Tuesday, October 16, 2007

Unicode implementer's guide part 5: Collation

What could be simpler than alphabetical order[1]? It's a simple lexicographic comparison of two strings, comparing the first character to see which is greater, then the second as a tiebreaker, then the third, and so on. Unicode makes a lot of things harder, but this shouldn't be any different. Unicode doesn't prohibit this kind of alphabetical order; you can have a 'Unicode-conformant' process which uses this order.

Great idea, but...

This kind of order, which we could call code point order, is useless. Well, it's not completely useless; it's fine if you need an arbitrary, consistent order. But for humans, it's useless. It puts Z before a. In Normalization Form D (which I'll assume strings are in for the rest of the article), it puts "äa" after "az". It puts "aa" after "a-z". It puts "æ" after "af". All of these are backwards for users in the American English locale and most other locales.

It's not just that the letters are in the wrong order. It also isn't sufficient to put all characters in a consistent case, as the R6RS Unicode module authors seem to think: that still gives unnatural behavior for punctuation, accent marks, ligatures, and other things. Removing all of these features won't do it either; they can't be completely ignored, because "foot ball" should come consistently after "football", and "Apple" should come consistently either before or after "apple", depending on the needs of the user.

It's crucial that things are sorted correctly: if a long alphabetized list is in a different order than the user expects, the user may conclude that an item on the list doesn't exist.

A better way for people

To get things right, the way humans expect them, you need another few levels of tiebreakers. When comparing two strings to see which one comes first, there are several steps: First, compare the two strings with all of these features stripped. If they're equal, compare the accent marks of the strings as a tiebreaker. If those are equal too, compare the capitalization as a second tiebreaker. And if those are still equal, compare the punctuation.
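
To make this concrete, here is a minimal sketch of the multi-level comparison in Python, using crude stand-ins built on unicodedata for the feature extraction; the helper names are my own, and a real implementation would use collation weight tables instead (more on that below):

    import unicodedata

    def decomposed(s):
        return unicodedata.normalize("NFD", s)

    def base_letters(s):
        # Level 1: only the base letters, case folded; accents and
        # punctuation are stripped.
        return "".join(c.lower() for c in decomposed(s)
                       if unicodedata.category(c)[0] == "L")

    def accents(s):
        # Level 2: the combining marks, in order of appearance.
        return tuple(c for c in decomposed(s)
                     if unicodedata.category(c) == "Mn")

    def cases(s):
        # Level 3: the pattern of capitalization.
        return tuple(c.isupper() for c in decomposed(s) if c.isalpha())

    def punctuation(s):
        # Level 4: whatever is neither a letter nor a combining mark.
        return tuple(c for c in decomposed(s)
                     if unicodedata.category(c)[0] not in ("L", "M"))

    def sort_key(s):
        # Python compares tuples left to right, which gives exactly the
        # tiebreaker behavior described above.
        return (base_letters(s), accents(s), cases(s), punctuation(s))

    words = ["foot ball", "football", "Apple", "apple", "az", "a\u0308a"]
    print(sorted(words, key=sort_key))
    # ['äa', 'apple', 'Apple', 'az', 'football', 'foot ball']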

I left out one thing: ligatures like æ. A ligature can be expanded into something that looks like "ae" in the first stage but looks different in the accent-mark tiebreaker. We'll call these expansions. In some languages, like Danish, there are also contractions like "aa" with their own special alphabetization; these contract into a single value for the initial sort if the Danish locale is used. In the Thai and Lao scripts, there is an additional problem of certain vowels and consonants reversing order for collation purposes; this can be achieved with a combination of the expansion and contraction patterns. There's also a special problem for Korean collation[2].
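
Here is a toy illustration of both patterns, with invented weights rather than real table data. The danish flag controls only the "aa" contraction; for simplicity the æ expansion keeps its default "ae" behavior, which a real Danish tailoring would override (Danish sorts æ after z, too):

    # One invented primary weight per letter; real weights come from a
    # collation table, possibly tailored to a locale.
    PRIMARY = {c: 10 + i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

    def primaries(s, danish=False):
        # Walk the string one collation element at a time, longest
        # match first.
        i = 0
        while i < len(s):
            if danish and s.startswith("aa", i):
                yield 100              # contraction: a single unit after "z"
                i += 2
            elif s[i] == "\u00e6":     # the ligature æ
                yield PRIMARY["a"]     # expansion: two elements, as if the
                yield PRIMARY["e"]     # string contained "ae"
                i += 1
            else:
                yield PRIMARY.get(s[i], 0)
                i += 1

    words = ["aagaard", "zebra", "afternoon", "\u00e6ble"]
    print(sorted(words, key=lambda w: tuple(primaries(w, danish=True))))
    # ['æble', 'afternoon', 'zebra', 'aagaard']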

UTS #10

Believe it or not, I didn't come up with this all by myself. It's part of the Unicode Standard, in Unicode Technical Standard #10: Collation. UTS #10 isn't something that's required of Unicode implementations, but it is specifically standardized and codified. It does basically what I described in the section above, with a slight change for efficiency. Because the folks at the Unicode Consortium did all this work, we don't have to! They've even given us a convenient table of data that makes it easier to implement the above tiebreakers.

UTS #10 suggests a simple way of implementing these tiebreakers that works efficiently for sorting large lists of strings: calculate a collation key, a flat sequence of weights, for each string, and compare these collation keys with a simple binary comparison. This collation key can be calculated in linear time and space with respect to the string, so a human-appropriate comparison is asymptotically no worse than a naive one and suffers only linear overhead.

Collation keys and the DUCET

But how do we make the collation key? Well, that table I mentioned earlier is called the DUCET, the Default Unicode Collation Element Table. It describes a neutral locale which is linguistically inaccurate for everyone. For most[3] characters in NFD, there is a three-tuple[4] of the primary, secondary, and tertiary collation weights. In assembling the collation key from a string, go through the string, first appending the primary weights of each character[5], then a separator of 0, then all of the secondary weights, followed by another separator, and finally the tertiary weights. If one of these weights for a particular character is 0, leave it out. This is all followed by a representation of the punctuation[6].
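
As a sketch, here is that assembly over a made-up fragment of a weight table; the numbers are merely shaped like real DUCET entries, and variable weighting of the punctuation is left out (see note [6]):

    # (primary, secondary, tertiary) weights per character; invented.
    WEIGHTS = {
        "a": (0x15EF, 0x0020, 0x0002),
        "A": (0x15EF, 0x0020, 0x0008),       # same letter, higher tertiary
        "b": (0x1605, 0x0020, 0x0002),
        "\u0308": (0x0000, 0x0047, 0x0002),  # combining diaeresis: no primary
    }

    def collation_key(s):
        key = []
        for level in range(3):
            if level:
                key.append(0)          # the 0 separator between levels
            for ch in s:
                w = WEIGHTS[ch][level]
                if w:                  # a 0 weight is simply left out
                    key.append(w)
        return tuple(key)

    # "ab" differs from "äb" at the secondary level, but from "Ab" only
    # at the tertiary level:
    for w in sorted(["Ab", "a\u0308b", "ab"], key=collation_key):
        print(w, collation_key(w))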

A simple binary comparison of the collation keys ends up being equivalent to what I described before. How? When the primary weights are appended, they form something equivalent to the characters all put in a consistent case, with accent marks and punctuation removed. In a binary comparison of collation keys, these are compared first. They are followed by a separator of 0 so that, if the strings' primary weights are equivalent up to the end of the shorter one, the longer string will be sorted second; this works because 0 is less than any collation weight. The secondary collation weights represent the accent marks, and the tertiary collation weights represent the capitalization. (See note [6] about punctuation.)
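
And if each weight is packed as a fixed-width big-endian value, a plain bytewise comparison (memcmp, or a database's binary index) gives the same order as comparing the tuples above. Continuing the sketch, and assuming every weight fits in 16 bits, as DUCET weights do:

    import struct

    def key_bytes(key):
        # Big-endian, fixed-width packing preserves lexicographic order.
        return struct.pack(">%dH" % len(key), *key)

    assert (key_bytes(collation_key("ab")) < key_bytes(collation_key("Ab"))) \
        == (collation_key("ab") < collation_key("Ab"))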

Implementation and tailoring

I wish I could give hints on implementation, but, umm... I haven't written a working one yet. It's quite an involved algorithm, at least compared to things I've done before. Luckily, the UTS gives lots of hints for how to do it. One thing that makes implementation easier is that the collation weights of many characters, mostly Han ideographs, can be derived algorithmically through a formula specified in UTS #10. In fact, this is somewhat unfortunate, as it is a symptom of the fact that ordering Han ideographs properly requires a more involved process, using some sort of dictionary lookup for pronunciation or radical index, depending on the context.
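
For what it's worth, the derivation looks roughly like this; the constants are taken from UTS #10's implicit-weight derivation as I read it, and the sketch glosses over everything beyond the two BASE values shown:

    def implicit_elements(cp):
        # Two collation elements computed from the code point alone.
        if 0x4E00 <= cp <= 0x9FFF:     # the core CJK Unified Ideographs block
            base = 0xFB40
        else:                          # other Han ideographs; unassigned
            base = 0xFB80              # code points use 0xFBC0 instead
        aaaa = base + (cp >> 15)
        bbbb = (cp & 0x7FFF) | 0x8000
        return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]

    print(implicit_elements(0x4E2D))   # U+4E2D, 中
    # [(64320, 32, 2), (52781, 0, 0)]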

But implementation comes with an additional challenge: tailoring to specific locales. The DUCET isn't linguistically accurate for any locale, and must always be modified somewhat. Differences in language, country, and organizational preference can demand slightly different collation orders that users depend on. Modifying the collation table to suit these is known as tailoring. I'm not yet sure what algorithm should be used to reshuffle the collation weights in the right way. But there is a ton of locale data in the Common Locale Data Repository (CLDR), a project run by the Unicode Consortium. This includes information about collation.

Conclusion

Collation isn't easy, but it's necessary to get right. UTS #10 lays out an algorithm for many aspects of collation, but still leaves some things out: there's no specified process for ignoring words like 'the' at the beginning of a book title, or padding numbers with zeros so that 2 comes before 10. Nevertheless, with this algorithm and appropriate locale data, it's possible to give most users a linguistically accurate alphabetical order which makes sense to humans.



[1] Actually, we should probably call it collation, since alphabetical order implies we're working with an alphabet. With Unicode, some scripts don't use an alphabet.

[2] There are a number of different ways to sort Korean Hangul syllables appropriately. The precomposed Hangul syllables are already in the correct collation order, but a problem arises in the representation of old Hangul texts, where multiple initial, medial, or final jamo can be used. (In modern Korean, only one is used per syllable.) There is more than one way to sort these. Because of an apparent bureaucratic argument between the Unicode Consortium and the ISO (which publishes the parallel standard ISO 10646), none of them is encoded in the DUCET. The one that sounds simplest to implement is to decompose all syllables and place a special terminator weight after the end of each Hangul syllable. More details are in UTS #10.

[3] I say 'most' because Han characters and Hangul syllables are left out, as their weights can be derived algorithmically. Though since most characters are in that group, maybe I shouldn't say 'most'.

[4] Actually, it's a four-tuple, but the last entry isn't very useful for anything. I think it's the same as the shifted strategy for variable weighting, described in note [6].

[5] Actually, to make expansions and contractions work, you shouldn't be iterating over characters; you should iterate over collation elements. For example, in Slovak, 'ch' forms a collation element with a distinct single primary weight.

[6] There are many different ways to represent the punctuation, but that is beyond the scope of this article. Generally, it's referred to not as 'punctuation' but as 'variable-weighted'. The most useful variant (though they all have applications) is called 'shifted', which adds an algorithmically derived fourth level of weight on which variable-weighted characters are compared.
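
A sketch of the shifting itself, as I understand it from UTS #10: a variable element's three weights are zeroed and its old primary weight reappears at the fourth level, while every other non-ignorable element gets the maximal fourth weight:

    def shift(element, variable):
        # element is a (primary, secondary, tertiary) tuple.
        p, s, t = element
        if variable:
            return (0, 0, 0, p)        # demoted to the fourth level
        return (p, s, t, 0xFFFF)       # maximal fourth weight

With this, "foot ball" and "football" compare equal through the first three levels and are ordered by the space's old primary weight at the fourth.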
