i was a bit perplexed by this, mainly as we usually deal with locales whose writing systems have no concept of case, but after poking around core java's String class it seems that cf wasn't using the overloaded versions of the toUpperCase()/toLowerCase() methods, the ones that take a locale and handle locale-sensitive casing. easy enough to fix in cf (i really love how easily coldfusion lets you work around these little issues):
<cffunction name="toLowerCase" output="false" returntype="string" access="public">
<cfargument name="inString" required="true" type="string" hint="string to lower case">
<cfargument name="locale" required="false" default="en_US" type="string" hint="java style locale identifier to use to lower case input string">
<cfscript>
var thisLocale="";
var l=listFirst(arguments.locale,"_"); // language
var c=""; // country, we'll ignore variants
if (listLen(arguments.locale,"_") GT 1)
c=uCase(listGetAt(arguments.locale,2,"_"));
// build locale
thisLocale=createObject("java","java.util.Locale").init(l,c);
return arguments.inString.toLowerCase(thisLocale);
</cfscript>
</cffunction>
<cffunction name="toUpperCase" output="false" returntype="string" access="public">
<cfargument name="inString" required="true" type="string" hint="string to upper case">
<cfargument name="locale" required="false" default="en_US" type="string" hint="java style locale identifier to use to upper case input string">
<cfscript>
var thisLocale="";
var l=listFirst(arguments.locale,"_"); // language
var c=""; // country, we'll ignore variants
if (listLen(arguments.locale,"_") GT 1)
c=uCase(listGetAt(arguments.locale,2,"_"));
// build locale
thisLocale=createObject("java","java.util.Locale").init(l,c);
return arguments.inString.toUpperCase(thisLocale);
</cfscript>
</cffunction>
<cfscript>
s="#chr(105)##chr(305)##chr(223)#";
upperS=toUpperCase(s,"tr_TR");
lowerS=toLowerCase(upperS,"tr_TR");
writeoutput("input string: #s#<br> upper case: #upperS#<br>lower case: #lowerS#");
</cfscript>
notice how i didn't have to mess with the core java String class, i could just use its methods on a cf string.
even if you're not using the tr_TR locale, you should note that "ß" (small letter sharp s) is also a special case: upper casing it actually turns it into 2 letters, "SS". i think there might be similar issues with some Greek characters as well.
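for anyone who'd rather poke at this from plain java, here's a minimal sketch of the same three-character test. the class and method names (TurkishCaseDemo, upper, lower) are made up for illustration; only String.toUpperCase(Locale), String.toLowerCase(Locale) and Locale come from core java.

```java
import java.util.Locale;

// hypothetical demo class, not part of cf or any library
final class TurkishCaseDemo {
    // same characters as the cf test: i, dotless i (U+0131), sharp s (U+00DF)
    static final String INPUT = "i\u0131\u00DF";

    // upper case using the casing rules of the given locale
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    // lower case using the casing rules of the given locale
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        String upperTr = upper(INPUT, turkish);
        System.out.println("tr_TR upper: " + upperTr);                 // İISS
        System.out.println("tr_TR lower: " + lower(upperTr, turkish)); // iıss

        // locale-neutral rules: plain "I" for both i's, sharp s still becomes "SS"
        System.out.println("root upper:  " + upper(INPUT, Locale.ROOT)); // IISS
    }
}
```

the thing to notice is that the turkish round trip keeps the dotted and dotless i's apart, while the locale-neutral rules collapse both into a plain "I".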
- uses the latest cldr 1.5.0.1 locale data
- the long discussed rule based timezone changes, which give us the ability to read and write timezone data in RFC2445 VTIMEZONE format and also provide access to Olson timezone transitions! this is something many people have been needing for quite some time now, so it's going to be very useful
- taiwanese calendar (a flavor of gregorian calendar that numbers years since 1912AD)
- the Indian National Calendar (more complicated flavor of the gregorian calendar, eg it's synched up with the gregorian calendar's leap years but the extra day is added to the first month, Chaitra which starts march 22 on gregorian calendar--so, yup, it's complicated)
- charset conversion bugs were fixed and CESU-8, UTF-7 and ISCII converters have been added. also some conversion speed improvements. the UTF-7 one will be useful for email (bounce) handling
- a new MessageFormat type for plurals was added
- a pretty useful new DurationFormat class was added so you can format messages over a duration in time such as "2 days from now" or "3 hours ago"
- also the MessageFormat class will now take named arguments instead of just arrays (too bad now that coldfusion 8's javacast got a shot of steroids)
- new BIDI stuff (which i still need to investigate)
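to make the "numbers years since 1912AD" bit concrete, here's a small sketch. note that it uses core java's own java.time.chrono.MinguoDate (java 8 and later) as a stand-in, NOT icu4j's new calendar class, and MinguoYearDemo/minguoYear are names i made up for the example.

```java
import java.time.LocalDate;
import java.time.chrono.MinguoDate;
import java.time.temporal.ChronoField;

// hypothetical demo class; uses core java's Minguo chronology, not icu4j
final class MinguoYearDemo {
    // year number in the taiwanese (minguo) calendar for a gregorian/ISO date
    static int minguoYear(LocalDate isoDate) {
        return MinguoDate.from(isoDate).get(ChronoField.YEAR);
    }

    public static void main(String[] args) {
        // 1912 is year 1, so the year is simply the gregorian year minus 1911
        System.out.println(minguoYear(LocalDate.of(1912, 1, 1))); // 1
        System.out.println(minguoYear(LocalDate.of(2007, 7, 1))); // 96
    }
}
```

months and days stay lined up with the gregorian calendar, only the year number shifts.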
next month i'll be adding the new calendars as CFCs to the usual bits. i'll also be doing some significant changes to most of the i18n formatting methods to take better advantage of the calendar, etc. keywords (en_GB@calendar=indian;currency=EUR) on the ULocale class (icu4j's super cool locale class).
unfortunately the persian calendar still appears to be in icu4c (C/C++) only.
- it uses the latest and greatest cldr 1.5 locale data
- the long discussed rule based timezone changes, which give us the ability to read and write timezone data in RFC2445 VTIMEZONE format and also provide access to Olson timezone transitions! this is stuff many people have been looking for, so it's going to be very useful
- taiwanese calendar (which i never knew existed, looks like a flavor of gregorian calendar that numbers years since 1912AD)
- the Indian National Calendar (ditto though looks like a more complicated flavor of the gregorian calendar, eg it's synched up with the gregorian calendar's leap years but the extra day is added to the first month, Chaitra which starts march 22 on gregorian calendar--so, yup, it's complicated)
- charset conversion bugs were fixed and CESU-8, UTF-7 and ISCII converters have been added. also some conversion speed improvements. i think the UTF-7 one looks pretty useful
- a new MessageFormat type for plurals was added, looks like some eastern european languages have complicated rules for plurals
- a new DurationFormat class so you can format messages over a duration in time such as "2 days from now" or "3 hours ago" (this one looks useful)
- also the MessageFormat class will now take named arguments instead of just arrays (too bad now that coldfusion 8's javacast got a shot of steroids)
- bunch of new BIDI stuff (which needs some investigating)
i'll be adding the new calendars as CFCs to the usual bits as soon as i do enough background research on them to understand any "quirks". i'll also be doing some significant changes to most of the i18n formatting methods to take better advantage of the calendar, etc. keywords (en_GB@calendar=indian;currency=EUR) on the ULocale class (icu4j's super cool locale class).
looks like a persian calendar was also added but it appears to be in icu4c (C/C++) only for the time being.
wow, fun times in the old town tonite (it's actually in the AM in bangkok but you get the idea).
i guess we can expect to see this in JDK 1.6 update 4 (latest update is 2). i wonder if i should just pile all the CLDR vs core java locale differences (there's a lot) into a single java bug report?
one of the side effects of this core java locale is that ColdFusion's old locale name Norwegian (Nynorsk) actually produces no_NO locale data. any legacy apps still using this locale identifier are probably telling people the wrong thing, for example:
writeoutput('#lsDateFormat(now(),"DDDD")#');
produces: mandag
while
setLocale('Norwegian (Nynorsk)');
writeoutput('#lsDateFormat(now(),"DDDD")#');
also produces: mandag
icu4j on the other hand produces:
måndag for nn_NO
mandag for nb_NO
it looks like ColdFusion got tripped up on the "variant instead of language" locale.
taking this a step further, doing a "FULL" date format shows up even larger differences between core java and icu4j:
core java
8. mai 2006 for no_NO
8. mai 2006 for no_NO_NY
icu4j
måndag 8. mai 2006 for nn_NO
mandag 8. mai 2006 for nb_NO
oops. to my way of thinking, a "FULL" date format should include the day name as well as the rest of the date (date in month, month and year). i really wish ColdFusion would use icu4j.
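if you want to see what your own JVM does with these locales, something like the following sketch will print the core java "FULL" format for each of them. NorwegianFullDateDemo and fullDate are hypothetical names, and i've deliberately left out expected output because it depends on the JDK version and on which locale data it ships with.

```java
import java.text.DateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// hypothetical demo class; only uses core java, no icu4j
final class NorwegianFullDateDemo {
    // format a date with core java's FULL date style for the given locale
    static String fullDate(Locale locale, Date when) {
        return DateFormat.getDateInstance(DateFormat.FULL, locale).format(when);
    }

    public static void main(String[] args) {
        Date when = new GregorianCalendar(2006, Calendar.MAY, 8).getTime();
        Locale[] locales = {
            Locale.forLanguageTag("no-NO"),
            // the old "variant instead of language" nynorsk locale
            // (this constructor is deprecated on recent JDKs but still works)
            new Locale("no", "NO", "NY"),
            Locale.forLanguageTag("nb-NO"),
            Locale.forLanguageTag("nn-NO")
        };
        for (Locale locale : locales) {
            System.out.println(locale + ": " + fullDate(locale, when));
        }
    }
}
```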
and the "A-Go-Go" reference? nothing to do with g11n or ColdFusion, just been listening to a lot of Dengue Fever lately and that song has just stuck in my head ;-)
and you can now use Java style locale identifiers like ar_AE instead of the "pretty" locale name Arabic (United Arab Emirates), so now it's that much easier to synch up your calls to core Java's ResourceBundle class from cf. and you can buy into all that locale info using the super simple setLocale() function.
of course, as soon as i get what i've asked for after years of asking, i find some new plaything. as you might have read in this blog, icu4j's latest release (3.2) switched to the CLDR's locales, all 232 of them (with 60 more in beta). the graph below compares cf with and without icu4j.
gives you pause, which should i use for locale support? oh my. i'll be revisiting this issue again.
you can pick up the cldr here and read more about it here.
via the unicode mailing list.
in case you're interested, there's also a cldr wiki.
at about the same time there was an announcement on the icu4j mailing list about the next version being built on CLDR data. so i asked if that meant that we'd be able to make use of all the "new" locales in CLDR like farsi, etc. one of the icu4j guys (steven loomis) replied "yes" and further pointed out that icu4j 2.8 was already making use of icu4c's locale data. further discussion with steven helped debunk one of my long held misconceptions, that a java "locale" was a real world "Locale" (i.e. the locale bundled up with all its attendant resource data such as day/month names, etc.). "Locales are just identifiers" says steven, "duh!" says i. while it's convenient to think locales == Locales, formally in java "locale" refers to the identifier and not the data.
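a quick way to convince yourself of the "just identifiers" point from plain java is sketched below. LocaleIdentifierDemo and hasCoreJavaData are hypothetical names, and what getAvailableLocales() returns varies from one JDK to the next, so the only results i'd bank on are the ones for the made-up locale.

```java
import java.util.Arrays;
import java.util.Locale;

// hypothetical demo class showing that a Locale is only an identifier
final class LocaleIdentifierDemo {
    // true if this JVM advertises locale data for the identifier
    static boolean hasCoreJavaData(Locale locale) {
        return Arrays.asList(Locale.getAvailableLocales()).contains(locale);
    }

    public static void main(String[] args) {
        // a completely made-up locale still constructs without complaint
        Locale madeUp = Locale.forLanguageTag("xx-YY");
        System.out.println(madeUp);                  // xx_YY
        System.out.println(hasCoreJavaData(madeUp)); // false

        // whether farsi has data behind it depends on the JDK you run this on
        Locale farsi = Locale.forLanguageTag("fa-IR");
        System.out.println(farsi + " data in core java? " + hasCoreJavaData(farsi));
    }
}
```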
so what? what that means, if you're using icu4j for your i18n work (and you should), is that you have access to all the nifty locales that icu4j has no matter what core java supports (or doesn't support in this case). so something like this becomes possible (and easy):
<cfscript>
fullFormat=javacast("int",0);
farsiLocale=createObject("java","java.util.Locale").init("fa","IR");
utcTZ=createObject("java","com.ibm.icu.impl.JDKTimeZone").getTimeZone("UTC");
aDateFormat = createObject("java","com.ibm.icu.text.DateFormat");
aCalendar =createObject("java","com.ibm.icu.util.GregorianCalendar").init(utcTZ,farsiLocale);
dF=aDateFormat.getDateInstance(aCalendar,fullFormat,farsiLocale);
writeoutput("#farsiLocale.getDisplayName(farsiLocale)# #dF.format(now())#<br>");
</cfscript>
which produces:
Persian (Iran) دوشنبه، ۱۸ اکتبر ۲۰۰۴
note that the core java getDisplayName method falls back on "Persian (Iran)" which while not perfect is better than nothing. icu4j 3.0's ULocale class would actually produce the correctly localized name.
the more i work with icu4j, the more impressed i am with how well thought out it is. it really is the bees' knees for i18n work.
thanks to steven for enlightening me.
- better documentation for date/number format patterns (one of my favorites)
- added stuff about references/validity/etc.
- new timezone localization model
- weekend data
- added Oriya, Malayalam, Assamese, Welsh, Dzongkha, Bhutan, Khmer and Lao (woohoo se asian) locales
- added more country, language, currency, and type display name data for ar, bg, cs, el, he, hr, hu, is, mk, pl, ro, ru, sk, sl, sr, tr, uk (the arabic stuff is way cool)
read more on the cldr website. you can compare the cldr versus platform data here. and you can report bugs here.
via the unicode mailing list.
i urge you to double check your locale's data & report any bugs you find. i'd say this is pretty good news for i18n folks.
reported via the unicode mailing list.
so now you know.
1) devise a general XML format for the exchange of culturally sensitive (locale) information for use in application and system development
2) gather, store, and make available data generated in that format
this "kitchen sink" approach goes way beyond the simple HTML concept of locale (which is basically language as used in a location) and includes such groovy stuff like collation, calendars, timezones, measurements, delimiters, etc.
similarly, those cool ICU4J folks have just proposed a LocaleMisc class to be added to their nifty java library that would expose locale info such as exemplar characters, measurements, and paper size (never would have thought of that one).
onward and upward.
while HTTP_ACCEPT_LANGUAGE is usually a single locale or language (th-th or th for example), it can often be a list of languages/locales. this is especially true with Macs: some of the longest HTTP_ACCEPT_LANGUAGE lists i've ever seen came from Mac browsers, though the lists from browsers in internet cafes at major tourist destinations can get pretty long as well. language preferences are comma delimited and usually listed in order, most preferred first, and each may carry a quality (q) value that represents an estimate of the user's preference for that language range. for instance, "en-us,ko;q=0.5" means i prefer US english but will also accept Korean. whether a value for HTTP_ACCEPT_LANGUAGE exists at all depends on the browser's age and whether the user has set it (for IE that would be via tools, internet options, languages). it may also contain only a language (en) rather than a full locale (en-ca), and we all know how important locale is ;-) because of this i use geoLocator (which determines locale from a user's IP) along with HTTP_ACCEPT_LANGUAGE to find and fix a user's locale. more info on HTTP_ACCEPT_LANGUAGE can be found here.
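as a rough sketch of the parsing side in plain java (8 or later), core java's Locale.LanguageRange can sort the header by q value and Locale.lookup can pick the first supported match. AcceptLanguageDemo and bestMatch are names i made up, and the fallback parameter is just a stand-in for whatever locale something like geoLocator hands back. this is not how geoLocator itself works.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// hypothetical helper, not part of cf or geoLocator
final class AcceptLanguageDemo {
    // pick the most preferred supported locale from an Accept-Language value,
    // or the fallback when the header is missing, malformed or matches nothing
    static Locale bestMatch(String header, List<Locale> supported, Locale fallback) {
        if (header == null || header.trim().isEmpty()) {
            return fallback;
        }
        try {
            // parse() orders the ranges by q value, highest first
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(header);
            Locale match = Locale.lookup(ranges, supported);
            return match != null ? match : fallback;
        } catch (IllegalArgumentException malformedHeader) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        List<Locale> supported = Arrays.asList(Locale.KOREAN, Locale.US);
        Locale fromGeo = Locale.UK; // pretend this came from an IP lookup

        System.out.println(bestMatch("en-us,ko;q=0.5", supported, fromGeo)); // en_US
        System.out.println(bestMatch("th", supported, fromGeo));             // en_GB
        System.out.println(bestMatch(null, supported, fromGeo));             // en_GB
    }
}
```

using the geo-derived locale purely as the fallback is one design choice among several. you could just as well use it to fill in the country when the header only carries a language.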

