August 26, 2008
case shenanigans
i really feel for our turkish cf brethren, they always seem to be getting the short end of the stick. a couple of weeks ago there was an issue in the support forums with someone using turkish locale (tr_TR) that was having problems getting case right using coldfusion's uCase() & lCase() functions. there's a couple of special characters, "i" & "ı" (that's small letter i & small letter dotless i) that are special cases when it comes to case mappings (bad pun willfully intended) which cf's functions weren't handling correctly.

i was a bit perplexed by this, mainly as we usually deal with locales whose writing systems don't have a concept of case, but after poking around core java's String class it seems that cf wasn't using the overloaded versions of the toUpperCase()/toLowerCase() methods, the ones that take a locale to handle locale sensitive case. easy enough to fix in cf (i really love how easily coldfusion lets you work around these little issues):

<cffunction name="toLowerCase" output="false" returntype="string" access="public">
    <cfargument name="inString" required="true" type="string" hint="string to lower case">
    <cfargument name="locale" required="false" default="en_US" type="string" hint="java style locale identifier to use to lower case input string">
    <cfscript>
        var thisLocale="";
        var l=listFirst(arguments.locale,"_"); // language
        var c=""; // country, we'll ignore variants
        if (listLen(arguments.locale,"_") GT 1)
            c=uCase(listGetAt(arguments.locale,2,"_"));
        // build locale
        thisLocale=createObject("java","java.util.Locale").init(l,c);
        // call the locale aware overload on the underlying java String
        return arguments.inString.toLowerCase(thisLocale);
    </cfscript>
</cffunction>


<cffunction name="toUpperCase" output="false" returntype="string" access="public">
    <cfargument name="inString" required="true" type="string" hint="string to upper case">
    <cfargument name="locale" required="false" default="en_US" type="string" hint="java style locale identifier to use to upper case input string">
    <cfscript>
        var thisLocale="";
        var l=listFirst(arguments.locale,"_"); // language
        var c=""; // country, we'll ignore variants
        if (listLen(arguments.locale,"_") GT 1)
            c=uCase(listGetAt(arguments.locale,2,"_"));
        // build locale
        thisLocale=createObject("java","java.util.Locale").init(l,c);
        // call the locale aware overload on the underlying java String
        return arguments.inString.toUpperCase(thisLocale);
    </cfscript>
</cffunction>

<cfscript>
s="#chr(105)##chr(305)##chr(223)#";
upperS=toUpperCase(s,"tr_TR");
lowerS=toLowerCase(upperS,"TR_TR");
writeoutput("input string: #s#<br> upper case: #upperS#<br>lower case: #lowerS#");
</cfscript>

notice how i didn't have to mess with the core java String class, i could just use its methods on a cf string.

even if you're not using tr_TR locale, you should note that "ß" (small letter sharp s) is also a special case: upper casing it actually turns it into 2 letters, "SS". i think there might be issues with some Greek characters as well.
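for anyone who wants to poke at the underlying behaviour outside of cf, here's a minimal plain java sketch of the same idea. the class and method names (TurkishCase, parse, upper, lower) are made up for illustration; the only real API involved is java.util.Locale plus the locale aware String.toUpperCase(Locale)/toLowerCase(Locale) overloads mentioned above. like the cf functions, it only looks at language and country and ignores variants.

```java
import java.util.Locale;

public class TurkishCase {
    // turn a java style identifier such as "tr_TR" into a Locale (language + country only)
    static Locale parse(String id) {
        String[] parts = id.split("_");
        String language = parts[0].toLowerCase(Locale.ROOT);
        String country = parts.length > 1 ? parts[1].toUpperCase(Locale.ROOT) : "";
        return new Locale(language, country);
    }

    // locale sensitive upper casing
    static String upper(String s, String localeId) {
        return s.toUpperCase(parse(localeId));
    }

    // locale sensitive lower casing
    static String lower(String s, String localeId) {
        return s.toLowerCase(parse(localeId));
    }

    public static void main(String[] args) {
        // small letter i, small letter dotless i (U+0131), small letter sharp s (U+00DF)
        String s = "i\u0131\u00DF";
        // turkish rules: i becomes capital I with dot above (U+0130), sharp s becomes SS
        System.out.println(upper(s, "tr_TR"));
        // english rules: both i's collapse to a plain capital I
        System.out.println(upper(s, "en_US"));
        // turkish rules: plain capital I lower cases to dotless i (U+0131)
        System.out.println(lower("I", "tr_TR"));
    }
}
```

the point to take away is that the same input string gives different results depending on which locale you hand to the method, which is exactly what the no-argument versions can't know.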

September 15, 2007
icu4j 3.8 final released
the final version of icu4j version 3.8 has just been released. to recap what's in this release:

  • uses the latest cldr 1.5.0.1 locale data
  • the long discussed rule based timezone changes, which give us the ability to read and write timezone data in RFC2445 VTIMEZONE format as well as access to Olson timezone transitions! many people have been needing this for quite some time now, so it's going to be very useful
  • taiwanese calendar (a flavor of gregorian calendar that numbers years since 1912 AD)
  • the Indian National Calendar (more complicated flavor of the gregorian calendar, eg it's synched up with the gregorian calendar's leap years but the extra day is added to the first month, Chaitra which starts march 22 on gregorian calendar--so, yup, it's complicated)
  • charset conversion bugs were fixed and CESU-8, UTF-7 and ISCII converters have been added. also some conversion speed improvements. the UTF-7 one will be useful for email (bounce) handling
  • a new MessageFormat type for plurals was added
  • a pretty useful new DurationFormat class was added so you can format messages over a duration in time such as "2 days from now" or "3 hours ago"
  • also the MessageFormat class will now take named arguments instead of just arrays (too bad now that coldfusion 8's javacast got a shot of steroids)
  • new BIDI stuff (which i still need to investigate)

next month i'll be adding the new calendars as CFCs to the usual bits. i'll also be making some significant changes to most of the i18n formatting methods to take better advantage of the calendar, etc. keywords (en_GB@calendar=indian;currency=EUR) on the ULocale class (icu4j's super cool locale class).

unfortunately the persian calendar still appears to be in icu4c (C/C++) only.

August 9, 2007
icu4j 3.8 draft released
a draft of icu4j version 3.8 has just been released. what's so hot about this release? well a lot actually:

  • it uses the latest and greatest cldr 1.5 locale data
  • the long discussed rule based timezone changes, which give us the ability to read and write timezone data in RFC2445 VTIMEZONE format as well as access to Olson timezone transitions! this is stuff many people have been looking for, so it's going to be very useful
  • taiwanese calendar (which i never knew existed, looks like a flavor of gregorian calendar that numbers years since 1912 AD)
  • the Indian National Calendar (ditto though looks like a more complicated flavor of the gregorian calendar, eg it's synched up with the gregorian calendar's leap years but the extra day is added to the first month, Chaitra which starts march 22 on gregorian calendar--so, yup, it's complicated)
  • charset conversion bugs were fixed and CESU-8, UTF-7 and ISCII converters have been added. also some conversion speed improvements. i think the UTF-7 one looks pretty useful
  • a new MessageFormat type for plurals was added, looks like some eastern european languages have complicated rules for plurals
  • a new DurationFormat class so you can format messages over a duration in time such as "2 days from now" or "3 hours ago" (this one looks useful)
  • also the MessageFormat class will now take named arguments instead of just arrays (too bad now that coldfusion 8's javacast got a shot of steroids)
  • bunch of new BIDI stuff (which needs some investigating)

i'll be adding the new calendars as CFCs to the usual bits as soon as i do enough background research on them to understand any "quirks". i'll also be making some significant changes to most of the i18n formatting methods to take better advantage of the calendar, etc. keywords (en_GB@calendar=indian;currency=EUR) on the ULocale class (icu4j's super cool locale class).

looks like a persian calendar was also added but appears to be in icu4c (C/C++) only for the time being.

wow, fun times in the old town tonite (it's actually in the AM in bangkok but you get the idea).

August 8, 2007
wow, that was fast
i submitted a bug to sun about australian and new zealand time formats being wrong compared to the CLDR on 18-may (CLDR & some common experience says it should be "h:mm:ss a", ie 12 hour AM/PM format, while core java thinks it should be "H:mm:ss", ie 24hr format). according to this (might require login) it was fixed on 21-may--funny thing is that i was only informed via the bug parade just "now" (7-aug). also funny was that it attracted only 1 vote--what, are you guys down there all asleep?

i guess we can expect to see this in JDK 1.6 update 4 (latest update is 2). i wonder if i should just pile all the CLDR vs core java locale differences (there's a lot) into a single java bug report?
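to make the difference between the two patterns concrete, here's a small java sketch that formats the same time of day with each of them. it doesn't touch the australian or new zealand locale data at all (what those locales return depends on your JDK); it just applies the two pattern strings quoted above by hand. the class name TimePatterns is made up, and Locale.US is only there to keep the AM/PM marker predictable.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class TimePatterns {
    // format a time of day with an explicit pattern; Locale.US only fixes the AM/PM text
    static String format(LocalTime time, String pattern) {
        return DateTimeFormatter.ofPattern(pattern, Locale.US).format(time);
    }

    public static void main(String[] args) {
        LocalTime afternoon = LocalTime.of(13, 5, 9);
        // pattern the CLDR says these locales should use: 12 hour clock plus AM/PM marker
        System.out.println(format(afternoon, "h:mm:ss a")); // 1:05:09 PM
        // pattern core java was using: 24 hour clock, no marker
        System.out.println(format(afternoon, "H:mm:ss"));   // 13:05:09
    }
}
```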

May 10, 2006
norwegian locale A-Go-Go
some recent work has me again turning over the rocks where core java locales are hiding and once again a closer look at what crawled out reveals just how sweet icu4j's locale support really is. according to several resources, such as ethnologue and the odin archive (gotta love that name), norway has two main written languages, Bokmål and Nynorsk, with Bokmål being dominant. in core java there is one (well two if you include the plain norwegian language, no) locale and one variant for norway: no_NO and no_NO_NY. assuming core java meant Bokmål for plain Norwegian (no and no_NO), then i suppose the variant (no_NO_NY) is for Nynorsk. huh? but i thought Nynorsk was a language? why is it a variant here? in icu4j, which uses the CLDR for its locale data, we can see two locales (four if you count the plain nb/Bokmål and nn/Nynorsk languages): nb_NO (Bokmål Norwegian) and nn_NO (Nynorsk Norwegian). neat and tidy.

one of the side effects of this core java locale is that ColdFusion's old locale name Norwegian (Nynorsk) actually produces no_NO locale data. any legacy apps still using this locale identifier are probably telling people the wrong thing, for example:

setLocale('Norwegian (Bokmal)');
writeoutput('#lsDateFormat(now(),"DDDD")#');
produces: mandag

while
setLocale('Norwegian (Nynorsk)');
writeoutput('#lsDateFormat(now(),"DDDD")#');
also produces: mandag

icu4j on the other hand produces:
måndag for nn_NO
mandag for nb_NO

it looks like ColdFusion got tripped up on the "variant instead of language" locale.
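here's a rough java sketch of what "variant instead of language" looks like at the identifier level. the class name NorwegianIds and the describe() helper are made up for illustration. it only pulls the identifiers apart; the weekday names printed at the end come from whatever locale data your JDK ships, so i'm not going to claim what they'll be.

```java
import java.time.DayOfWeek;
import java.time.format.TextStyle;
import java.util.Locale;

public class NorwegianIds {
    // show the three parts that make up a core java locale identifier
    static String describe(Locale locale) {
        return "language=" + locale.getLanguage()
            + " country=" + locale.getCountry()
            + " variant=" + locale.getVariant();
    }

    public static void main(String[] args) {
        Locale[] locales = {
            new Locale("no", "NO"),       // core java: plain norwegian
            new Locale("no", "NO", "NY"), // core java: Nynorsk squeezed into the variant slot
            new Locale("nb", "NO"),       // CLDR style: Bokmal as its own language
            new Locale("nn", "NO")        // CLDR style: Nynorsk as its own language
        };
        for (Locale locale : locales) {
            // the weekday name depends on the JDK's locale data, so no expected output here
            String monday = DayOfWeek.MONDAY.getDisplayName(TextStyle.FULL, locale);
            System.out.println(locale + " -> " + describe(locale) + " -> " + monday);
        }
    }
}
```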

taking this a step further, doing a "FULL" date format shows up even larger differences between core java and icu4j:

core java
8. mai 2006 for no_NO
8. mai 2006 for no_NO_NY

icu4j
måndag 8. mai 2006 for nn_NO
mandag 8. mai 2006 for nb_NO

oops. to my way of thinking, a "FULL" date format should include the day name as well as the rest of the date (date in month, month and year). i really wish ColdFusion would use icu4j.

and the "A-Go-Go" reference? nothing to do with g11n or ColdFusion, just been listening to a lot of Dengue Fever lately and that song has just stuck in my head ;-)

February 7, 2005
blackstone locales
maybe i didn't look hard enough but i haven't seen any mention about locales in any of the blogs/articles/etc. concerning the release of blackstone (now officially known as ColdFusion MX 7). ditto during the beta pr period. no idea about why this was but it's sure like hiding your light under a bushel. if you're a g11n developer, Blackstone's going to be a real eye-opener. core Java's locales are now Blackstone's locales. from the measly 20 odd locales in cfmx 6.1, Blackstone gives us 130. the figure below compares locale support across different versions of cf. pretty cool, huh?

[figure: cf supported locales]

and you can now use Java style locale identifiers like ar_AE instead of the "pretty" locale name Arabic (United Arab Emirates), so now it's that much easier to synch up your calls to core Java's ResourceBundle class from cf. and you can buy into all that locale info using the super simple setLocale() function.

of course, as soon as i get what i've asked for after years of asking, i find some new plaything. as you might have read in this blog, icu4j's latest release (3.2) switched to the CLDR's locales, all 232 of them (with 60 more in beta). the graph below compares cf with and without icu4j.

[figure: cf w/icu4j supported locales]

gives you pause, which should i use for locale support? oh my. i'll be revisiting this issue again.

November 5, 2004
cldr 1.2 released
the unicode consortium has announced the release of version 1.2 of the Common Locale Data Repository (cldr). quoting the press release, the latest version contains "232 locales, covering 72 languages and 108 territories. There are also 63 draft locales in the process of being developed, covering an additional 27 languages and 28 territories". wow.

you can pick up the cldr here and read more about it here.

via the unicode mailing list.

October 21, 2004
cldr 1.2 in beta
the latest version of the cldr (1.2) has entered beta. of particular interest are the 'interim vetting charts' which give you a sneak preview of what's been changed & what's coming for the release version. many of these are "common" changes such as localized territory names, etc. but there is some local stuff that's been "fixed".

in case you're interested, there's also a cldr wiki.

October 18, 2004
when a locale isn't a Locale
there was a recent discussion concerning using farsi (persian) language with cf. my first reaction was to point out that farsi locales (fa_IR iran and fa_AF afghanistan) weren't supported java locales, so that was that.

at about the same time there was an announcement on the icu4j mailing list about the next version being built on CLDR data. so i asked if that meant that we'd be able to make use of all the "new" locales in CLDR like farsi, etc. one of the icu4j guys (steven loomis) replied "yes" and further pointed out that icu4j 2.8 was already making use of icu4c's locale data. further discussion with steven helped debunk one of my long held misconceptions, that a java "locale" was a real world "Locale" (ie. the locale bundled up with all its attendant resource data such as day/month names, etc.). "Locales are just identifiers" says steven, "duh!" says i. while it's convenient to think locales == Locales, formally in java "locale" refers to the identifier and not the data.
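a quick way to convince yourself of the "just identifiers" bit is the plain java sketch below. the class name LocaleIsIdentifier and the hasInstalledData() helper are made up for illustration, and "xx_YY" is a deliberately bogus identifier. core java will happily build a Locale for it; whether any data sits behind an identifier is a separate question, answered here by checking Locale.getAvailableLocales().

```java
import java.util.Arrays;
import java.util.Locale;

public class LocaleIsIdentifier {
    // true only if core java reports locale data installed for this identifier
    static boolean hasInstalledData(Locale locale) {
        return Arrays.asList(Locale.getAvailableLocales()).contains(locale);
    }

    public static void main(String[] args) {
        Locale farsi = new Locale("fa", "IR");
        Locale bogus = new Locale("xx", "YY"); // no such language or country

        // both are perfectly good identifiers as far as the Locale class is concerned
        System.out.println(farsi + " / " + bogus);

        // whether fa_IR has data depends on the JDK you run this on
        System.out.println("fa_IR data installed: " + hasInstalledData(farsi));
        System.out.println("xx_YY data installed: " + hasInstalledData(bogus));
    }
}
```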

so what? what that means, if you're using icu4j for your i18n work (and you should), is that you have access to all the nifty locales that icu4j has no matter what core java supports (or doesn't support in this case). so something like this becomes possible (and easy):

<cfscript>
fullFormat=javacast("int",0);
farsiLocale=createObject("java","java.util.Locale").init("fa","IR");
utcTZ=createObject("java","com.ibm.icu.impl.JDKTimeZone").getTimeZone("UTC");
aDateFormat = createObject("java","com.ibm.icu.text.DateFormat");
aCalendar =createObject("java","com.ibm.icu.util.GregorianCalendar").init(utcTZ,farsiLocale);
dF=aDateFormat.getDateInstance(aCalendar,fullFormat,farsiLocale);
writeoutput("#farsiLocale.getDisplayName(farsiLocale)# #dF.format(now())#<br>");
</cfscript>

which produces:

Persian (Iran) دوشنبه، ۱۸ اکتبر ۲۰۰۴

note that the core java getDisplayName method falls back on "Persian (Iran)" which while not perfect is better than nothing. icu4j 3.0's ULocale class would actually produce the correctly localized name.

the more i work with icu4j, the more impressed i am with how well thought out it is. it really is the bees' knees for i18n work.

thanks to steven for enlightening me.

October 1, 2004
cldr 1.2 alpha
unicode has just announced the public release of the alpha version of the cldr (Common Locale Data Repository). some of the highlights include:

  • better documentation for date/number format patterns (one of my favorites)
  • added stuff about references/validity/etc.
  • new timezone localization model
  • weekend data
  • added Oriya, Malayalam, Assamese, Welsh, Dzongkha, Bhutan, Khmer and Lao (woohoo se asian) locales
  • added more country, language, currency, and type display name data for ar, bg, cs, el, he, hr, hu, is, mk, pl, ro, ru, sk, sl, sr, tr, uk (the arabic stuff is way cool)

read more on the cldr website. you can compare the cldr versus platform data here. and you can report bugs here.

via the unicode mailing list.

June 8, 2004
Common Locale Data Repository 1.1 released
hot off the press, the unicode organization has released version 1.1 of the Common Locale Data Repository. it's got 50% more data than 1.0, 247 locales spread over combinations of 78 languages and 118 countries. the news article indicates that there are also "36 draft locales" in the queue. the repository access page further states that this is "a stable release and may be used as reference material or cited as a normative reference by other specifications." yeah, finally somebody to blame ;-) you can either do a web CVS or download a zip archive of the CLDR from that page.

i urge you to double check your locale's data & report any bugs you find. i'd say this is pretty good news for i18n folks.

reported via the unicode mailing list.

February 26, 2004
new W3C i18n faq
the W3C has just issued a new i18n faq related to language negotiation. it discusses the just about absolute need for language negotiation on good i18n web sites, examining the old standby of HTTP Accept-Language header (i use that in combination with geolocator CFC) as well as stressing the need for manual language swapping (couldn't agree more). another important but sometimes overlooked point is "navigation stickiness", basically remembering which language a user has selected (in cf via cookies or session vars) & always serving content in that language. another interesting point (to me anyway) was a trick to also look at User-Agent header which sometimes also contains language (besides all that boring browser version, etc. stuff). cool. i'm going to look at adding that to the geoLocator CFC when Accept-Language is empty.

so now you know.

November 1, 2003
locales march on
the common XML locale data repository (CLDR) has gone to beta. the purpose of this project, in case you can't recall, is two-fold (quoting the CLDR site):

1) devise a general XML format for the exchange of culturally sensitive (locale) information for use in application and system development

2) gather, store, and make available data generated in that format

this "kitchen sink" approach goes way beyond the simple HTML concept of locale (which is basically language as used in a location) and includes such groovy stuff as collation, calendars, timezones, measurements, delimiters, etc.

similarly, those cool ICU4J folks have just proposed a LocaleMisc class to be added to their nifty java library that would expose locale info such as exemplar characters, measurements, and paper size (never would have thought of that one).

onward and upward.

June 24, 2003
language negotiation
one important CGI variable from the G11N perspective is HTTP_ACCEPT_LANGUAGE. why? because it represents what language/locale the user wants as opposed to what cf might be able to deliver (via setEncoding(), cfProcessingDirective, cfcontent and the actual dynamic content). matching what the user wants and what your app can deliver is often called "language negotiation".

while HTTP_ACCEPT_LANGUAGE is usually a single locale or language (th-th or th for example) it can often be a list of languages/locales (especially w/Macs: some of the longest HTTP_ACCEPT_LANGUAGE lists i've ever seen came from Mac browsers, though the lists from browsers in internet cafes at major tourist destinations can get pretty long as well). language preferences are usually listed (comma delimited) in order, with most preferred first, and may contain a quality (q) value that represents an estimate of the user's preference for that language range. for instance, "en-us,ko;q=0.5" means i prefer US english but will also accept Korean. whether a value for HTTP_ACCEPT_LANGUAGE exists depends on the browser age and whether a user has set it (for IE that would be via tools, internet options, languages). it also may only contain a language (en) rather than a full locale (en-ca) and we all know how important locale is ;-) because of this i use geoLocator (which determines locale from a user's IP) along with HTTP_ACCEPT_LANGUAGE to find and fix a user's locale. more info on HTTP_ACCEPT_LANGUAGE can be found here.
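the header format described above is simple enough to pick apart by hand. here's a minimal java sketch, assuming a made-up AcceptLanguage class with a preferred() method, that splits the header on commas, reads the optional q value (defaulting to 1.0 when it's missing) and returns the language tags most preferred first. it's a sketch, not a full implementation of the HTTP spec: no wildcard handling, and entries with q=0 are simply dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AcceptLanguage {
    // return the language tags from an Accept-Language header, most preferred first
    static List<String> preferred(String header) {
        List<String> tags = new ArrayList<>();
        List<Double> weights = new ArrayList<>();
        if (header == null) {
            return tags; // header not sent at all
        }
        for (String entry : header.split(",")) {
            String[] pieces = entry.trim().split(";");
            String tag = pieces[0].trim().toLowerCase(Locale.ROOT);
            if (tag.isEmpty()) {
                continue;
            }
            double q = 1.0; // no q value means "fully preferred"
            for (int i = 1; i < pieces.length; i++) {
                String param = pieces[i].trim().toLowerCase(Locale.ROOT);
                if (param.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(param.substring(2));
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat a garbled q value as "not wanted"
                    }
                }
            }
            if (q <= 0.0) {
                continue; // q=0 means the user does not accept this language
            }
            // insert after every entry with an equal or higher weight,
            // so ties keep the order they had in the header
            int pos = tags.size();
            while (pos > 0 && weights.get(pos - 1) < q) {
                pos--;
            }
            tags.add(pos, tag);
            weights.add(pos, q);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(preferred("en-us,ko;q=0.5"));   // [en-us, ko]
        System.out.println(preferred("th;q=0.3, en-ca"));  // [en-ca, th]
        System.out.println(preferred(""));                 // []
    }
}
```

an empty list is the cue to fall back on something else, such as an IP based lookup.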