June 3, 2004
i18nSort CFC updated
we updated the i18nSort CFC to handle queries (single column sort key). you can eventually pick it up in the usual places (in the meantime you can grab it from here). if it's of any interest, we used collationKeys rather than straight up compare(). the collationKey creation overhead didn't make much difference with smaller queries but it pays off quite nicely with larger queries, since each value's key is built once and then compared bitwise instead of re-running the collation rules on every comparison.

you can read more details about collation here.
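
for the curious, here's a rough java sketch of the collationKey idea from this post. it is not the CFC's actual code: the class and method names (CollationKeySortSketch, sortWithKeys) are made up for illustration, and it sorts a plain string array rather than a query column. each string's key is built once up front, and the sort then compares keys instead of calling compare() on the raw strings.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// hypothetical helper, not part of the i18nSort CFC
class CollationKeySortSketch {

    // returns a sorted copy of values, ordered by the given locale's collation rules
    static String[] sortWithKeys(String[] values, Locale locale) {
        Collator collator = Collator.getInstance(locale);

        // pay the key creation cost once per string
        CollationKey[] keys = new CollationKey[values.length];
        for (int i = 0; i < values.length; i++) {
            keys[i] = collator.getCollationKey(values[i]);
        }

        // CollationKey is Comparable, so the sort compares keys directly
        Arrays.sort(keys);

        // pull the original strings back out in sorted order
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        // "\u00C4pfel" is "Äpfel", escaped to dodge source encoding issues
        String[] words = {"Zebra", "\u00C4pfel", "Apfel", "Banane"};
        System.out.println(Arrays.toString(sortWithKeys(words, Locale.GERMAN)));
    }
}
```

for a handful of strings, Arrays.sort(values, collator) is simpler. the keys only start to earn their keep as the list grows.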

May 26, 2004
icu4j beta/collation
ibm has released another beta version of its supercool icu4j. these betas are also released as an executable JAR (i only noticed this with the first beta for 3.0), so you can jump right into testing.

while i was perusing the icu4j site i stumbled across this interesting page: collation performance comparison. wow! icu4j beats the snot out of the plain java JDK for collation over most locales (except for ja_JP and ko_KR locales, note that locales <> collation). i know that collation is of some interest to many i18n folks, so this is kind of interesting news.

February 8, 2004
locale collation
hiroshi okugawa (mm) and i were working on an issue last week in the forums where one user was having trouble sorting a list using german phonebook (sometimes called DIN-2) collation (string sort order). the problem was that the listSort and arraySort cf functions sort based on straight up unicode codepoint values. while this will work for most folks, after all 'a' < 'b' is true for both lexicographical (dictionary) and unicode orders, it won't work for folks with characters like the german umlauts ÄÖÜ which have higher unicode values than the unadorned chars AOU, ie. ÄÖÜ will always sort as a group after AOU rather than the AÄOÖUÜ order which folks in that locale would expect. since i mainly use sql server as my db backend, which has a very nifty COLLATE clause that allows you to cast your resultset to a specific collation, this came as a bit of a surprise to me.

the solution to this, as usual for i18n issues in cf, is to dip down into the underlying java functionality, specifically the java.text.Collator class which allows you to "perform locale-sensitive String comparison". we developed a CFC, i18nSort, to wrap up this functionality. we also added a sort method based on IBM's ICU4J com.ibm.icu.text.Collator class. why? because ICU4J provides a much beefier set of collation locales (246 vs the java class's 134) including afrikaans, german phonebook, various european locales pre-euro (which would be useful for historical data), persian (both iran and afghanistan), traditional thai, etc.
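
to make the codepoint vs collator difference concrete, here's a minimal java sketch using only java.text.Collator (no ICU4J). the class and method names (LocaleSortSketch, sortForLocale) are hypothetical and this is not how the CFC is actually written, it just shows the underlying call the CFC wraps.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// hypothetical example class, not the i18nSort CFC itself
class LocaleSortSketch {

    // returns a sorted copy of words using the locale's collation rules
    static String[] sortForLocale(String[] words, Locale locale) {
        String[] copy = words.clone();
        // Collator implements Comparator, so it plugs straight into Arrays.sort
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00C4pfel" is "Äpfel", escaped to dodge source encoding issues
        String[] words = {"Zebra", "\u00C4pfel", "Apfel", "Banane"};

        // plain sort compares unicode values, so the umlaut lands after Z
        String[] byCodepoint = words.clone();
        Arrays.sort(byCodepoint);
        System.out.println(Arrays.toString(byCodepoint));
        // order: Apfel, Banane, Zebra, Äpfel

        // the german collator puts the umlaut next to its base letter
        System.out.println(Arrays.toString(sortForLocale(words, Locale.GERMAN)));
        // order: Apfel, Äpfel, Banane, Zebra
    }
}
```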

collation is a strange beast. it's pretty much a universal user requirement but is not consistent for the same chars (germans, french and swedes sort the same chars differently) nor within the same language (so-called phonebook collation vs dictionaries or book indices). and that's just the alphabet-based scripts--asian ideograph collation can be either phonetic or based on the appearance (strokes) of the character. then there's the special cases based on user preferences: ignore/consider punctuation, case ('A' before/after 'a'), etc. you're looking at thousands of years of people's collation baggage, so yes it's going to be complex. you can read more about unicode's take on collation here.

in java (both "plain" java and ICU4J) collation complexity is handled using three parameters: locale, strength, and decomposition. locale is obvious: a specific locale's collation data is used to order sorts (and searches). strength is used across locales (though exact strength assignments vary from locale to locale) and determines the level of difference considered significant in comparisons. there are four basic strengths (ICU4J adds a fifth, QUATERNARY, which distinguishes words with/without punctuation):

- PRIMARY: significant for base letter differences ('a' vs 'b').
- SECONDARY: significant for different accented forms of the same base letter ('o' vs 'ô').
- TERTIARY: significant for case differences such as 'a' vs 'A' (but again differs locale to locale).
- IDENTICAL: all differences are considered significant during comparison (control chars, pre-composed and combining accents, etc.).

taking an example from the java docs: in czech, "e" and "f" are considered primary differences, while "e" and "ě" are secondary differences, "e" and "E" are tertiary differences and "e" and "e" are identical.
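
a small java sketch of how strength changes what counts as "equal", using plain java.text.Collator with Locale.US. the class and helper names (StrengthSketch, sameAt) are made up for illustration, and since strength assignments vary by locale, other locales may answer differently.

```java
import java.text.Collator;
import java.util.Locale;

// hypothetical example class for poking at collator strengths
class StrengthSketch {

    // true if the two strings compare as equal at the given strength
    static boolean sameAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String oCirc = "\u00F4"; // 'ô', escaped to dodge source encoding issues

        System.out.println(sameAt(Collator.PRIMARY, "a", "b"));     // false: different base letters
        System.out.println(sameAt(Collator.PRIMARY, "o", oCirc));   // true: accents ignored
        System.out.println(sameAt(Collator.SECONDARY, "o", oCirc)); // false: accents now count
        System.out.println(sameAt(Collator.SECONDARY, "a", "A"));   // true: case ignored
        System.out.println(sameAt(Collator.TERTIARY, "a", "A"));    // false: case now counts
    }
}
```

a PRIMARY collator is handy for loose matching (search boxes and the like), while sorting usually wants at least TERTIARY so that results come out in a stable, predictable order.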

decomposition is just that, chars are decomposed for comparison. there are three basic decompositions (only two for ICU4J):

- NO_DECOMPOSITION: accented chars are not decomposed for collation. this is the fastest collation but will only give correct results for languages without accented, etc. chars.
- CANONICAL_DECOMPOSITION: chars that are canonical variants are decomposed for collation, ie. accents are handled.
- FULL_DECOMPOSITION: not only accented chars, but also chars that have special formats are decomposed (this decomposition doesn't exist in ICU4J, CANONICAL_DECOMPOSITION is used instead). basically un-normalized text is properly handled.
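
here's a hedged java sketch of what decomposition buys you, again with plain java.text.Collator. it compares the same accented letter written two ways: as one pre-composed code point and as a base letter plus a combining accent. the names (DecompositionSketch, sameWith) are hypothetical. only the canonical result is annotated. what NO_DECOMPOSITION returns for this pair depends on the collator's internals, so the sketch just prints it.

```java
import java.text.Collator;
import java.util.Locale;

// hypothetical example class for poking at decomposition modes
class DecompositionSketch {

    // true if the two strings compare as equal under the given decomposition mode
    static boolean sameWith(int decomposition, String a, String b) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        collator.setDecomposition(decomposition);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // 'é' as a single code point
        String combining = "e\u0301";  // 'e' followed by a combining acute accent

        // both spellings decompose to the same sequence, so they collate as equal
        System.out.println(sameWith(Collator.CANONICAL_DECOMPOSITION, precomposed, combining)); // true

        // no claim made about this one, run it and see what your JDK says
        System.out.println(sameWith(Collator.NO_DECOMPOSITION, precomposed, combining));
    }
}
```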

so now you know.