Viewing By Category : I18N / Main
January 16, 2009
icu4j 4.01 released
the icu4j project has just released version 4.01. it's a regular maintenance release with the following changes (common across all flavors):
  • Unicode 5.1
  • locale data: Common Locale Data Repository (CLDR) 1.6
  • charset converter file size improvement
  • date interval formatting (note only gregorian calendar is supported in this release)
  • improved plural support

specific icu4j changes include:

  • charset
    • ICU2022 converter
    • HZ converter
    • SCSU/BOCU-1 converter
    • charset converter callback
  • thai dictionary break iterator (yeah)
  • JDK TimeZone support (this is pretty decent as you can now share tz IDs between coldfusion/core java & icu4j)
  • locale service provider
  • more convenient formatting of year+month, day+month, and other combinations
  • simple duration formatting
i guess it's time to update the icu4j CFCs for the new formatting bits. as usual you can download the new version from here. btw you can still get a hold of the icu4j tools here.

December 9, 2008
cook up some java style I18N
eirik rude (made i18n famous for his rokuyo calculations) has cooked up a new i18n reference site called, strangely enough, i18n cookbook ;-) he covers pretty much all the main i18n points including:
  • locales
  • date and time formatting
  • numerical formatting
  • resource bundles
  • unicode, transliteration, and character sets

right now it's all java content but as you know from reading this blog, it's still very applicable to coldfusion. so have a look, it's worth a visit.

ps: he's promised to add flex and coldfusion content. so let's all hold his feet to the fire ;-)

August 26, 2008
case shenanigans
i really feel for our turkish cf brethren, they always seem to be getting the short end of the stick. a couple of weeks ago there was an issue in the support forums with someone using turkish locale (tr_TR) that was having problems getting case right using coldfusion's uCase() & lCase() functions. there's a couple of special characters, "i" & "ı" (that's small letter i & small letter dotless i) that are special cases when it comes to case mappings (bad pun willfully intended) which cf's functions weren't handling correctly.

i was a bit perplexed by this, mainly as we usually deal with locales which have writing systems that don't have a concept of case, but after poking around core java's String class it seems that cf wasn't using the overloaded versions of the toUpperCase()/toLowerCase() methods, which take a locale to handle locale sensitive case. easy enough to fix in cf (i really love how easily coldfusion lets you work around these little issues):

<cffunction name="toLowerCase" output="false" returntype="string" access="public">
   <cfargument name="inString" required="true" type="string" hint="string to lower case">
   <cfargument name="locale" required="false" default="en_US" type="string" hint="java style locale identifier to use to lower case input string">
   <cfscript>
      var thisLocale="";
      var l=listFirst(arguments.locale,"_"); // language
      var c=""; // country, we'll ignore variants
      if (listLen(arguments.locale,"_") GT 1)
         c=uCase(listGetAt(arguments.locale,2,"_"));
      // build locale
      thisLocale=createObject("java","java.util.Locale").init(l,c);
      return arguments.inString.toLowerCase(thisLocale);
   </cfscript>
</cffunction>


<cffunction name="toUpperCase" output="false" returntype="string" access="public">
   <cfargument name="inString" required="true" type="string" hint="string to upper case">
   <cfargument name="locale" required="false" default="en_US" type="string" hint="java style locale identifier to use to upper case input string">
   <cfscript>
      var thisLocale="";
      var l=listFirst(arguments.locale,"_"); // language
      var c=""; // country, we'll ignore variants
      if (listLen(arguments.locale,"_") GT 1)
         c=uCase(listGetAt(arguments.locale,2,"_"));
      // build locale
      thisLocale=createObject("java","java.util.Locale").init(l,c);
      return arguments.inString.toUpperCase(thisLocale);
   </cfscript>
</cffunction>

<cfscript>
s="#chr(105)##chr(305)##chr(223)#";
upperS=toUpperCase(s,"tr_TR");
lowerS=toLowerCase(upperS,"TR_TR");
writeoutput("input string: #s#<br> upper case: #upperS#<br>lower case: #lowerS#");
</cfscript>

notice how i didn't have to mess with the core java String class, i could just use its methods on a cf string.

even if you're not using tr_TR locale, you should note that "ß" (small letter sharp s) is also a special case, upper casing it actually turns it into 2 letters, "SS". i think there might also be some issues with some Greek characters as well.
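for anyone who wants to poke at this outside of cf, here's a minimal plain java sketch of the same locale-sensitive case mapping. the class and method names (TurkishCaseDemo, upper, lower) are made up for illustration, the only real API used is String.toUpperCase(Locale)/toLowerCase(Locale) and Locale.forLanguageTag():

```java
import java.util.Locale;

public class TurkishCaseDemo {
    // small i, small dotless i, small sharp s (same code points as the cf test string)
    static final String INPUT = "\u0069\u0131\u00DF";

    // upper case using the rules of the given locale (BCP 47 tag, eg "tr-TR")
    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    // lower case using the rules of the given locale
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        String trUpper = upper(INPUT, "tr-TR");
        System.out.println("tr-TR upper: " + trUpper);
        System.out.println("tr-TR lower: " + lower(trUpper, "tr-TR"));
        // same input, english rules: the dotted/dotless distinction is lost
        System.out.println("en-US upper: " + upper(INPUT, "en-US"));
    }
}
```

under turkish rules "i" upper cases to capital dotted İ (U+0130) and "ı" to plain "I", while under english rules both end up as "I". the sharp s becomes "SS" either way, so the string also gets one character longer.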

July 11, 2008
icu4j 4.0 hits the streets
the latest version of the super cool icu4j i18n library has been released. the big changes (to me) are:
  • that it has upgraded its resource data to Unicode 5.1 and CLDR 1.6
  • added date interval formatting (ie Jan 10, 2008 to Jan 20, 2008 becomes Jan 10-20, 2008, 10:10am to 11:10am becomes 10:10-11:10am, etc.). downside is that currently it's gregorian calendar only.
  • added DurationFormat so you can now format over a duration in time such as "2 days from now" or "3 hours ago".
  • added "Locale Service Provider" support for core java's new locale service--many folks just want the filthy-rich and frequently-updated locale data that icu4j has and not the whole library. i wonder if there is a way to backdoor this into coldfusion's locales?

you can grab the jar files/api docs and read more about the new stuff here.

June 17, 2008
whose june 17th was that again?
june 17th has been touted as "firefox download day". while i'm a long term firefox user, this june 17th business just annoys me no end. june 17th where? what time? what timezone? i've looked fairly hard for any details but all i see is this standalone, kind of useless date of june 17th.

how on earth do you think you can coordinate a global project by not giving folks useful info? geez.

i've been hitting the download link throughout the day, thinking maybe the mozilla folks were all east US coasters (really no idea, just a WAG) & i'd see something around noon here in bangkok. nope. nothing. bupkis. just version 2.0.0.14.

oh well. in case anybody's missed the link, go here: http://www.spreadfirefox.com/.

September 15, 2007
icu4j 3.8 final released
the final version of icu4j version 3.8 has just been released. to recap what's in this release:

  • uses the latest cldr 1.5.0.1 locale data
  • the long discussed rule based timezone changes which gives us the ability to read and write timezone data in RFC2445 VTIMEZONE format as well as also providing access to Olson timezone transitions! this is something many people have been needing for quite some time now, this is going to be very useful
  • taiwanese calendar (a flavor of gregorian calendar that numbers years since 1912AD)
  • the Indian National Calendar (more complicated flavor of the gregorian calendar, eg it's synched up with the gregorian calendar's leap years but the extra day is added to the first month, Chaitra which starts march 22 on gregorian calendar--so, yup, it's complicated)
  • charset conversion bugs were fixed and CESU-8, UTF-7 and ISCII converters have been added. also some conversion speed improvements. the UTF-7 one will be useful for email (bounce) handling
  • a new MessageFormat type for plurals was added
  • a pretty useful new DurationFormat class was added so you can format messages over a duration in time such as "2 days from now" or "3 hours ago"
  • also the MessageFormat class will now take named arguments instead of just arrays (too bad now that coldfusion 8's javacast got a shot of steroids)
  • new BIDI stuff (which i still need to investigate)

next month i'll be adding the new calendars as CFCs to the usual bits. i'll also be doing some significant changes to most of the i18n formatting methods to take better advantage of the calendar, etc. keywords (en_GB@calendar=indian,currency=EUR) on the ULocale class (icu4j's super cool locale class).

unfortunately the persian calendar still appears to be in icu4c (C/C++) only.

August 9, 2007
icu4j 3.8 draft released
a draft of icu4j version 3.8 has just been released. what's so hot about this release? well a lot actually:

  • it uses the latest and greatest cldr 1.5 locale data
  • the long discussed rule based timezone changes which gives us the ability to read and write timezone data in RFC2445 VTIMEZONE format as well as also providing access to Olson timezone transitions! this is stuff many people have been looking for, this is going to be very useful
  • taiwanese calendar (which i never knew existed, looks like a flavor of gregorian calendar that numbers years since 1912AD)
  • the Indian National Calendar (ditto though looks like a more complicated flavor of the gregorian calendar, eg it's synched up with the gregorian calendar's leap years but the extra day is added to the first month, Chaitra which starts march 22 on gregorian calendar--so, yup, it's complicated)
  • charset conversion bugs were fixed and CESU-8, UTF-7 and ISCII converters have been added. also some conversion speed improvements. i think the UTF-7 one looks pretty useful
  • a new MessageFormat type for plurals was added, looks like some eastern european languages have complicated rules for plurals
  • a new DurationFormat class so you can format messages over a duration in time such as "2 days from now" or "3 hours ago" (this one looks useful)
  • also the MessageFormat class will now take named arguments instead of just arrays (too bad now that coldfusion 8's javacast got a shot of steroids)
  • bunch of new BIDI stuff (which need some investigating)

i'll be adding the new calendars as CFCs to the usual bits as soon as i do enough background research on them to understand any "quirks". i'll also be doing some significant changes to most of the i18n formatting methods to take better advantage of the calendar, etc. keywords (en_GB@calendar=indian,currency=EUR) on the ULocale class (icu4j's super cool locale class).

looks like a persian calendar was also added but appears to be in icu4c (C/C++) only for the time being.

wow, fun times in the old town tonite (it's actually in the AM in bangkok but you get the idea).

August 8, 2007
wow, that was fast
i submitted a bug to sun about australian and new zealand time formats being wrong compared to the CLDR on 18-may (CLDR & some common experience says it should be "h:mm:ss a", ie 12 hour AM/PM format, while core java thinks it should be "H:mm:ss", ie 24hr format). according to this (might require login) it was fixed on 21-may--funny thing is that i was only informed via the bug parade just "now" (7-aug). also funny was that it attracted only 1 vote--what, are you guys down there all asleep?

i guess we can expect to see this in JDK 1.6 update 4 (latest update is 2). i wonder if i should just pile all the CLDR vs core java locale differences (there's a lot) into a single java bug report?

August 1, 2007
PHP i18n
normally i would say that PHP's unicode/i18n support is fairly lame compared to coldfusion (actually i'd call it a joke but i'm not trying to be controversial here). well i stumbled on an interesting line on the ICU site concerning how PHP 6 would be using the ICU library (icu4j's sister C/C++ library). i was sort of shocked that PHP was considering this (hey PHP is lame after all), so thinking maybe this was market-speak or just plain wishful thinking, i googled it and turned up plenty of references including this article.

first this article confirms that PHP's unicode/i18n support really is lame (also see this article for a bit older take on PHP's unicode/i18n support, i especially liked the Unicode should have been in PHP five years ago quote). but more importantly, and what's surprising to me, is that they're actually doing something about it by adopting ICU. going from being an i18n joke to fully supporting unicode/i18n via the ICU project. i know next to nothing about the PHP world so i have no idea if this is really happening (or has already happened) or is just hot air but it looks like they're on the right track with ICU.

wonder if there's a lesson here?

July 26, 2007
scorpio's i18n changes
in case you were wondering, the main i18n changes for scorpio (coldfusion 8) really revolved around upgrading coldfusion's JDK to version 6. what did that buy us? well core Java's first set of locales based on CLDR data:

  • zh_SG - Chinese (Simplified), Singapore
  • en_MT - English, Malta
  • en_PH - English, Philippines
  • en_SG - English, Singapore
  • el_CY - Greek, Cyprus
  • id_ID - Indonesian, Indonesia
  • ga_IE - Irish, Ireland
  • ms_MY - Malay, Malaysia
  • mt_MT - Maltese, Malta
  • pt_BR - Portuguese, Brazil
  • pt_PT - Portuguese, Portugal
  • es_US - Spanish, United States

hopefully this trend will continue.

beyond the new locale data it also provides support for the Japanese Imperial Calendar which we can tap into for date conversion and formatting simply by setting coldfusion's locale to the new JP variant:

<cfscript>
// set appropriate locale
setLocale("ja_JP_JP");
// Japanese Imperial Calendar date format
writeoutput("#lsDateFormat(now(),"FULL")#");
</cfscript>

which should give you something like: 平成19年7月26日 how cool is that?

for more details on the new i18n bits in JDK 6 see this.
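if you're curious what's going on underneath, here's a rough plain java sketch of the same ja_JP_JP trick. ImperialCalendarDemo and its helper names are hypothetical, the point is just that asking core java for a calendar with that locale hands back the Japanese Imperial Calendar instead of the gregorian one:

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class ImperialCalendarDemo {
    // ja_JP_JP: language ja, country JP, variant JP
    // (the Locale constructors are deprecated in newer JDKs but still work)
    @SuppressWarnings("deprecation")
    static Locale imperialLocale() {
        return new Locale("ja", "JP", "JP");
    }

    // convert a gregorian date (1-based month) into an imperial calendar instance
    static Calendar imperialCalendarFor(int year, int month, int day) {
        TimeZone tokyo = TimeZone.getTimeZone("Asia/Tokyo");
        Calendar gregorian = new GregorianCalendar(tokyo);
        gregorian.clear();
        gregorian.set(year, month - 1, day, 12, 0, 0);

        Calendar imperial = Calendar.getInstance(tokyo, imperialLocale());
        imperial.setTimeInMillis(gregorian.getTimeInMillis());
        return imperial;
    }

    public static void main(String[] args) {
        Calendar cal = imperialCalendarFor(2007, 7, 26);
        System.out.println("calendar type: " + cal.getCalendarType());
        System.out.println("era year: " + cal.get(Calendar.YEAR));
    }
}
```

for 26-jul-2007 the era year comes back as 19, ie heisei 19, which matches the cf output above. getCalendarType() needs java 8 or later.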

May 19, 2007
God helps those who help themselves
since it looks like they'll be playing ice hockey in hell before ColdFusion makes use of the very cool icu4j library, i figure we better start helping core java get its locale resource act together. so let's start somewhere near my neighborhood, australia & new zealand.

core java's locale data for en_AU (Australia) and en_NZ (New Zealand) time formats is a bit off. it uses a format of H:mm:ss where the "H" stands for 24 hour clock, ie 5:00 PM would be formatted as 17:00. the CLDR (common locale data repository) however states that the time format for en_AU & en_NZ locales is h:mm:ss a (well actually it's proposed to include the timezone, "h:mm:ss a z" see the en_AU time format entry here). while most users in those locales are smart enough to get that 17:00 is 5:00 PM when your ColdFusion app outputs time values, it would play havoc when ColdFusion tries to parse what those same folks would normally input for a time value.
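to make the "havoc" part concrete, here's a small java sketch with the two patterns hard coded (so it doesn't depend on whatever locale data your JDK ships). TimePatternDemo and its helpers are made-up names, and i'm using Locale.US only so the AM/PM marker text is predictable:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class TimePatternDemo {
    static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // formatter for an explicit pattern, pinned to UTC
    static SimpleDateFormat formatter(String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.US);
        f.setTimeZone(UTC);
        return f;
    }

    // 19-may-2007 17:00:00 UTC
    static Date fivePm() {
        Calendar c = new GregorianCalendar(UTC, Locale.US);
        c.clear();
        c.set(2007, Calendar.MAY, 19, 17, 0, 0);
        return c.getTime();
    }

    // parse user input with the given pattern and report the hour (0-23) java understood
    static int parsedHourOfDay(String pattern, String input) {
        try {
            Calendar c = new GregorianCalendar(UTC, Locale.US);
            c.setTime(formatter(pattern).parse(input));
            return c.get(Calendar.HOUR_OF_DAY);
        } catch (ParseException e) {
            throw new IllegalArgumentException("can't parse: " + input, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formatter("H:mm:ss").format(fivePm()));
        System.out.println(formatter("h:mm:ss a").format(fivePm()));
        // what a user would naturally type, parsed with each pattern
        System.out.println(parsedHourOfDay("H:mm:ss", "5:00:00 PM"));
        System.out.println(parsedHourOfDay("h:mm:ss a", "5:00:00 PM"));
    }
}
```

the nasty bit is the third line: parsing "5:00:00 PM" with the 24 hour pattern doesn't fail, it quietly stops reading before the "PM" and hands back 5 in the morning.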

so hey en_AU and en_NZ locale people, time to start helping yourselves. Sun has accepted this as a new bug, go vote for it (you have to be a member of the Sun Developer Network to vote but these days, who isn't).

October 14, 2006
machine translation goes to war
anyone who knows me knows that i think machine translation is so much humbug. round tripping machine translated text through pretty much any of the publicly available translation engines will more often than not give you back garbage, usually indicating the original translation is also garbage. my most frequent example is the phrase “This side towards enemy” that was helpfully placed on the business end of claymore mines. well maybe IBM is about to prove me wrong, as they are deploying 35 laptops equipped with their Mastor (Multilingual Automatic Speech-to-Speech Translator) product to Iraq. Mastor employs a “combination of speech recognition, machine translation and text-to-speech rules” which “allowed IBM to develop a common translation engine that is independent of languages.” the system has library data for english to Iraqi Arabic, modern standard Arabic and Mandarin translations--i find the language choices interesting, though no Farsi (Persian).

i just hope it works for the sake of those using it.

July 27, 2006
me too....scorpio i18n wishlist
i've never seen a bandwagon i didn't want to jump on, so i'm jumping aboard this round of ColdFusion wishlist blog articles with my own i18n one. but unlike the other wishlists, my i18n list is rather short and sweet. why? over the years ColdFusion has more or less answered the majority of my i18n needs. unicode capability? got it. java locales? yup. and the introduction of CFCs pretty much plugged in the other i18n holes (non-gregorian calendars, locale based collation, etc.). so what do i think ColdFusion still needs in terms of i18n?

  • native resource bundles (heck flex 2.0 got them, but frankly that's about all it got in terms of i18n)
  • setTimeZone() function that might allow me to find my way out of timezone hell
  • use icu4j library (used in a modular/plugin fashion, one of the really sweet things about this project is how often it's updated with new functionality and improved locale data from the CLDR). this would buy us better locale data, offer easier access to non-gregorian calendars, etc.

and that's it.

i guess you can take this posting as a stealthy compliment to the good work the CF team has done over the years to get ColdFusion to its current i18n state.

March 6, 2006
"remote" classpath revisited
i seem to have gotten myself into the habit of calling spike's cool "Loading java class files from a relative path" technique as the "remote classpath" technique--i guess i can blame christian cantrell for that. in any case, this technique works very well in most cases where you don't have access to a server's classpath (most shared hosts for example). where it tends not to work is, from my experience, with java classes that don't have "blind" constructors, ie where no arguments are required to initialize that class. classes like icu4j calendars, formatters, etc. usually work just fine but classes like icu4j's ULocale or MessageFormat don't as these require something to be passed to their constructors. for these classes (which are darned important to me) something like this fails:

<cfscript>
// remote init
jarFile=jarLocation & "icu4j.jar";
URLObject = createObject('java','java.net.URL');
URLObject.init("file:" & jarFile);
URLArray = createObject("java","java.lang.reflect.Array").
newInstance(URLObject.getClass(),1);
arrayClass = createObject("java","java.lang.reflect.Array");
arrayClass.set(URLArray,0,URLObject);
loader = createObject("java","java.net.URLClassLoader");
loader.init(URLArray);
uLocale=loader.loadClass("com.ibm.icu.util.ULocale").newInstance();   
</cfscript>
<cfdump var="#uLocale#">

while i've managed to workaround this issue (ULocales are everywhere in icu4j, most classes that deal with locales have a getAvailableULocales() method) it's always kind of nagged at me. after a bit of poking and prodding i started looking into ways to get at the actual constructors for a given class:

<cfscript>
// remote init
jarFile=jarLocation & "icu4j.jar";
URLObject = createObject('java','java.net.URL');
URLObject.init("file:" & jarFile);
URLArray = createObject("java","java.lang.reflect.Array").
newInstance(URLObject.getClass(),1);
arrayClass = createObject("java","java.lang.reflect.Array");
arrayClass.set(URLArray,0,URLObject);
loader = createObject("java","java.net.URLClassLoader");
loader.init(URLArray);
uLocale=loader.loadClass("com.ibm.icu.util.ULocale"); // don't init
c=uLocale.getConstructors();
for (j=1; j LTE arrayLen(c); j=j+1) {
   params=c[j].getParameterTypes();
   for (i=1; i LTE arrayLen(params); i=i+1) {
      writeoutput("ULocale[#j#]: #i# #params[i].getName()#<br>");
   }
   writeoutput("<br>");
}   
</cfscript>

which in this case returned 3 constructors (just like the API says but not in the javadocs order):

ULocale[1]: 1 java.lang.String
ULocale[1]: 2 java.lang.String
ULocale[1]: 3 java.lang.String

ULocale[2]: 1 java.lang.String

ULocale[3]: 1 java.lang.String
ULocale[3]: 2 java.lang.String

which i can easily match to the one i want (ULocale("th_TH")):

<cfscript>
// remote init
jarFile=jarLocation & "icu4j.jar";
URLObject = createObject('java','java.net.URL');
URLObject.init("file:" & jarFile);
URLArray = createObject("java","java.lang.reflect.Array").
newInstance(URLObject.getClass(),1);
arrayClass = createObject("java","java.lang.reflect.Array");
arrayClass.set(URLArray,0,URLObject);
loader = createObject("java","java.net.URLClassLoader");
loader.init(URLArray);
uLocale=loader.loadClass("com.ibm.icu.util.ULocale");   
c=uLocale.getConstructors();
// the newInstance method wants an array
obj=listToArray("th_TH");
// we want the 2nd constructor
thaiLocale=c[2].newInstance(obj.toArray());
</cfscript>

<cfdump var="#thaiLocale#">

which indeed returns an object of com.ibm.icu.util.ULocale.

since in most cases, i only use one way to init a given class, this technique will work OK for us. my only question is will the order of constructors remain the same? can i always count on the 2nd constructor to be ULocale("th_TH")? or should i build metadata functionality to probe the constructors to see which one matches?
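one way around the ordering worry, sketched in plain java: ask for the constructor by its parameter types instead of by its position in the array. as far as i know the order of the array that getConstructors() hands back isn't something you can count on. ConstructorBySignature and its newInstance helper are hypothetical names, and java.util.Locale stands in for ULocale here only so the sketch runs without the icu4j jar (a class handed back by loader.loadClass() should work the same way):

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;

public class ConstructorBySignature {
    // find the public constructor taking exactly args.length Strings and call it
    static Object newInstance(Class<?> cls, String... args) {
        Class<?>[] types = new Class<?>[args.length];
        Arrays.fill(types, String.class);
        try {
            // lookup is by signature, so array order never enters into it
            Constructor<?> ctor = cls.getConstructor(types);
            return ctor.newInstance((Object[]) args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no usable constructor on " + cls.getName(), e);
        }
    }

    public static void main(String[] args) {
        // stand-in for loader.loadClass("com.ibm.icu.util.ULocale")
        Class<?> cls = java.util.Locale.class;
        System.out.println(newInstance(cls, "th", "TH"));
        System.out.println(newInstance(cls, "th"));
    }
}
```

the same idea should carry over to cf by building the array of parameter classes and calling getConstructor() on the loaded class, though i haven't shown that here.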

ps: i did indeed learn my lesson, notice how i passed the coldfusion array using toArray() ;-)

February 21, 2006
good i18n practices really are good
an i18n-related issue popped up on the cfeclipse list yesterday that reinforced (at least to me) that good i18n practices really are good. a user had their eclipse encoding set up as UTF-8 yet was getting their unicode coldfusion pages garbled. my first look at this used code from our existing codebase and of course it worked. for the life of me, well for 2-3 hours anyway, i couldn't see how this was going wrong. it wasn't until i whipped up a simple dummy page that just had unicode text and nothing else that i was able to see the problem. the issue is simple but clearly illustrates a good i18n practice.

eclipse (not cfeclipse) doesn't add a BOM to UTF-8 encoded files. why? well

  • the BOM isn't actually required as part of the definition of UTF-8 (and i know of plenty of s/w that either doesn't write one out or in fact strips them from files)
  • in the past (i think) the java compiler wouldn't compile a file w/a BOM & since that's what eclipse was originally meant for, NOT having a BOM makes perfect sense (from a very quick test i just ran it seems this is no longer true, at least from within eclipse)

so why was our cfeclipse-edited UTF-8 encoded code working? because we follow our own good i18n practices and liberally use encoding hinting starting with the cfprocessingdirective. each of our coldfusion pages starts with:

<cfprocessingdirective pageencoding="utf-8">

BOM or no BOM, this ensures your code will always be interpreted as UTF-8. for more good i18n practices grab a copy of the advanced coldfusion book.
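for the curious, a tiny java sketch of what's at stake. it checks a byte array for the UTF-8 BOM (the three bytes EF BB BF) and shows what happens when UTF-8 bytes get read with the wrong encoding, which is more or less what a missing encoding hint invites. BomDemo and its method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // true if the data starts with the UTF-8 byte order mark EF BB BF
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    // decode with an explicit hint, the moral equivalent of cfprocessingdirective
    static String decodeAsUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // decode as if nobody told us the encoding and we guessed latin-1
    static String decodeAsLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] eAcute = "\u00E9".getBytes(StandardCharsets.UTF_8); // 2 bytes: C3 A9
        System.out.println("has BOM: " + hasUtf8Bom(eAcute));
        System.out.println("as utf-8: " + decodeAsUtf8(eAcute));
        System.out.println("as latin-1: " + decodeAsLatin1(eAcute));
    }
}
```

read as latin-1, the single "é" turns into the two characters "Ã©", the classic garbage you see when a UTF-8 page is misread.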

see? good i18n practices really are good.

May 29, 2005
turkish cf forum
the turkish CFUG has just started up a turkish language cf forum. as you might know turkish is particularly difficult to handle. this is probably a good place to look for help when those difficulties rear up and head-butt you.

via Oğuz Demirkapı's blog.

April 28, 2005
new sun i18n content
sun has released the latest version of its eGADC Newsletter for folks "who want to know about the latest internationalization and localization developments at Sun". among the more interesting content: you can find sun's g11n site here. and if you're so inclined, you can subscribe to the newsletter here.

March 5, 2005
persianCalendar update
a few days ago Dr. Ghasem Kiani updated his persianCalendar class to be "more" icu4j like. i wrapped it up in a CFC and added it to the i18nCalendars package (which now contains 7, count 'em, 7 calendars). you can see it on its own in a simple testbed here. you can download the persian calendar class from Dr. Ghasem's sourceforge project.

note that this version of the persian calendar uses a "well-known arithmetic algorithm for calculating the leap years" rather than astronomical calculations.

i'd like to publicly thank Dr. Ghasem Kiani for his work on this project, we've been waiting quite a while for a persian calendar to round off our i18n calendars. thanks.

January 6, 2005
back to our regularly scheduled i18n programming
a couple more non-tsunami i18n bits of information.

that i18n guy about town, tex texin, has put together a good document concerning the use of RFC 3066 language identifiers. you might lend a hand by perusing the table for any funny business (maybe like sinhalese in thailand--but hey, what do i know).

and just when i thought i knew everything about encoding (maybe because i actually think all you really have to know is Just Use Unicode), i find out something new. while doing some research in the java i18n forums i stumbled onto a really nifty java encoding resource, part of a java and internet glossary. i especially liked the term armouring (which i had never heard used in this context before): Converting binary data into printable gibberish so that data transport systems will not corrupt it. so that's what it's called.

December 15, 2004
two new i18n tidbits
first, the latest version of the Unicode Standard (4.1.0) which is due out in march, 2005 is now in beta. some of the new stuff i find interesting are:
  • newly added complete scripts such as new Tai Lue script (it's used in the yunnan area of southern china and south to northern thailand) among others
  • "very significant extensions to the repertoire for the Arabic script"
  • new chars were added to support "roundtrip mapping support for HKSCS and GB 18030"
  • i also find it interesting that "106 CJK compatibility ideographs has been added to support roundtrip mapping to the DPRK standard"--you know, north korea

now, i guess i'm going to have to rework my uBlock CFC. you can read more about the new unicode beta here.

next, since i'm always ragging on core java's i18n support, i thought i'd point out a nifty new tech tip at Core Java Technologies Tech Tips dealing with resource bundles. this tech tip examines when and where you should be using ListResourceBundle vs PropertyResourceBundle. we normally use PropertyResourceBundle when applications can't access the classpath (ala the javaRB CFC) and plain ResourceBundle when they can (with rbJava CFC). as an added benefit this article gets into some testing using java 5.0's (or 1.5's) new nanoTime() method (as in nanoseconds) as well as offering a link to a java one presentation on how not to write a benchmark.
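a minimal java sketch of the "no classpath" route, ie loading a PropertyResourceBundle straight from a stream instead of letting ResourceBundle.getBundle() go hunting on the classpath. this is not the javaRB CFC's actual code, just the general idea, and BundleFromStream and load are made-up names. the Reader constructor needs java 6 or later:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class BundleFromStream {
    // build a bundle from any Reader (a file, a url stream, a string...) and close it
    static ResourceBundle load(Reader source) {
        try (Reader r = source) {
            return new PropertyResourceBundle(r);
        } catch (IOException e) {
            throw new IllegalStateException("couldn't read resource bundle", e);
        }
    }

    public static void main(String[] args) {
        // in real life this would be a reader over a .properties file outside the
        // classpath, opened with an explicit encoding; a string keeps the sketch runnable
        ResourceBundle rb = load(new StringReader("greeting=hello\nfarewell=goodbye\n"));
        System.out.println(rb.getString("greeting"));
        System.out.println(rb.getString("farewell"));
    }
}
```

the trade-off is that you lose the automatic locale fallback chain getBundle() gives you (th_TH to th to the base bundle), so you'd have to pick the right file yourself.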

both are pretty good reading.

October 26, 2004
persian calendar
persian calendars in cf seem to have come up a bit lately and since monday was a holiday here in the big mango i had a few hours to put into slapping together something. you can see the first cut at a persian calendar CFC here.

it doesn't do much except format/convert gregorian dates to the persian calendar and back again (right now it can only parse medium/short persian date formats). still lacks calendar math, real persian date string parsing, arabic-hindic digits date formats, etc.

so what's a persian (or iranian) calendar? why it's the formal calendar in general use in iran, also known as the solar hijri calendar and sometimes as the jalali calendar. i've also seen it described as the shamsi calendar. frankly i have no idea which is correct so i'll stick with "persian". since it's one of the few calendars designed in the era of accurate positional astronomy, it's probably the most accurate solar calendar around. you can read more here or here.

i've also been looking at this java calendar class. it has a boatload of calendars (besides persian it has mayan, nepali, hindu, coptic and believe it or not a french revolutionary calendar).

October 1, 2004
what you don't know about latin-1 might hurt you
french cf users might want to pay attention to this...

there is an on-going discussion on the unicode list about "internationalization assumption" which simplistically goes something along the lines of if latin-1 is tested ok can we assume all latin-1 languages are "a-ok"? as it turns out, "no". some of the folks participating in this discussion have pointed out that, for example, not all french chars are found in latin-1. my first thought on reading that was, "oh yeah, the euro" but as it turns out there are a couple of french chars (no idea of their frequency of use but they are used in the french words for eye, egg, beef and heart) that are not in latin-1 but are in latin-9. for example see jukka korpela's excellent latin-1/latin-9 comparison page. these chars are also found in windows 1252 code page (which i guess helps support the idea that it's actually a superset of latin-1).
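you can check this sort of thing yourself with a few lines of java. the sketch below asks each charset whether it can encode a given string; Latin1Check and canEncode are hypothetical names, and i'm assuming your JDK ships the ISO-8859-15 and windows-1252 charsets (the usual ones do). the test word is "cœur" (heart), with the œ ligature at U+0153:

```java
import java.nio.charset.Charset;

public class Latin1Check {
    // can every character of the text be represented in the named charset?
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String heart = "c\u0153ur"; // french for heart, with the oe ligature
        String euro = "\u20AC";     // euro sign
        String[] charsets = {"ISO-8859-1", "ISO-8859-15", "windows-1252", "UTF-8"};
        for (String cs : charsets) {
            System.out.println(cs + ": heart=" + canEncode(cs, heart)
                + " euro=" + canEncode(cs, euro));
        }
    }
}
```

latin-1 (ISO-8859-1) says no to both, while latin-9 (ISO-8859-15), windows-1252 and of course UTF-8 handle them fine.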

the moral of the story? just use unicode

September 17, 2004
new version of rbManager
i just discovered IBM's released a new (minor version upgrade) version of its nifty rbManager tool. you can pick up version 0.7.1 here (scroll to the bottom of the page).

i'm not exactly sure what was changed but i suspect it was a few bugs we encountered with the initial 0.7 release. anyways it's "new".

September 9, 2004
xml owes success to i18n
interesting article about xml on cnet. the article quotes tim bray "One of the reasons XML took off is because it solved a lot of those issues with Unicode, which was fairly new at that point.". "those issues" being diverse languages and character sets. i guess one of those diverse languages is spoken by meat packers (hey it's another cf site).

by way of web globalization news.

July 25, 2004
Turkish i18n
if you do i18n work you know that Turkish is often a ticking timebomb, and Turkish s/w end-users are certainly a long suffering lot (for example cf still doesn't run 100% on servers in Turkish locales).

tex texin has a pretty good explanation of the main issues. his article includes:

  • an overview of Turkish characters and encodings,
  • a brief discussion of the Turkish language problem and solutions,
  • and just for fun, a brief history of the Turkish language is also included.

pretty good reading.

June 8, 2004
Common Locale Data Repository 1.1 released
hot off the press, the unicode organization has released version 1.1 of the Common Locale Data Repository. it's got 50% more data than 1.0, 247 locales spread over 78 languages and 118 countries. the news article indicates that there are also "36 draft locales" in the queue. the repository access page further states that this is "a stable release and may be used as reference material or cited as a normative reference by other specifications." yeah, finally somebody to blame ;-) you can either do a web CVS or download a zip archive of the CLDR from that page.

i urge you to double check your locale's data & report any bugs you find. i'd say this is pretty good news for i18n folks.

reported via the unicode mailing list.

May 31, 2004
i18n good practices: resource bundles
one of the dreariest bits of i18n work is dealing with strings, especially for retro-fitting existing apps. you'll have to comb thru the existing code substituting resource bundle (rb) keys for existing strings. while regex filters, etc. help, nothing beats a pair of "mark IV eyeballs". in order to keep this task within the bounds of tolerable cruelty, there are a few simple things you might keep in mind when developing cf applications:
  • case: not every language has case, Thai for instance doesn't, so PERMISSIONS, Permissions and permissions would be represented by the same string. in languages that do have case, those kinds of case permutations are plainly cosmetic (i was going to say cosmetic nonsense but thought better). if there's a real application need for this sort of thing, say to accent some heading, it should be handled via CSS and not hardcoded. hardcoded case strings make the difficult i18n process even more so. think twice before you get carried away with case, especially if you find yourself writing complex <cfif> blocks to handle it.
  • pluralization: not every language deals with plurals the same as English, simply adding a letter ("s" for instance) hardly ever cuts it and in some instances the language structure is completely different (the English phrase "five wood blocks" becomes something like "block of wood five units" in Thai). while you can blow off quite a few CPU cycles with complicated logic to handle plurals, i contend that item(s) is just as understandable as

    <cfif someQ.recordCount NEQ 1>items<cfelse>item</cfif>

    and has the added benefit of i18n simplicity. otherwise you'll have to add another set of rb keys (plural forms vs singular forms) and logic to handle pluralization.

  • compound strings: compound strings are, besides being my pet peeve, strings that contain substituted values. for example, "You owe me #dollarFormat(amountDue)#. Please pay by #dateFormat(normalDueDate)# or I will be forced to shoot you with #numberFormat(budgetQ.bulletsPerDeadbeat)# bullets. Thank you." if you do much i18n research you'll often see folks recommending you avoid compound strings like the plague (for instance, the API for the java MessageFormat class comes right out and says this). why? because they're hard to handle. first you have to figure out the logic and in some cases it's not going to be trivial. then you have to rework the rb string to use place holders for localization ("You owe me {1}. Please pay by {2} or I will be forced to shoot you with {3} bullets. Thank you."). finally you have to substitute the intended values at runtime--newer versions of my javaRB and RBjava CFC have methods for this. it's often much easier to simply rewrite the compound string.
  • floating prepositions: these are perhaps a form of compound string but often can't be handled like them. i sometimes encounter extremely complicated output logic/displays or HTML form elements separated by a preposition (usually "at", "by" or "in"). in its simplest form it might be "dateValue at timeValue" (which actually can be handled as a compound string) but more often than not it's much more complicated. if i can get my way, we normally send floating prepositions to the garbage dump, i mean most folks would have no problem understanding "dateValue timeValue".

i suppose many folks might find this trivial but it adds time and complexity to an already time-consuming and complicated process.
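
for the compound string case, here's a minimal sketch of the placeholder substitution step using java.text.MessageFormat. the class name, the fill helper and the two patterns are made up for illustration (they aren't the javaRB/RBjava CFC methods), and the values are passed in as pre-formatted strings to keep things small. note that MessageFormat itself counts placeholders from {0}.

```java
import java.text.MessageFormat;
import java.util.Locale;

class CompoundStringSketch {

    // fill the numbered placeholders of an rb pattern with runtime values.
    // hypothetical helper; MessageFormat numbers its arguments from {0}.
    static String fill(String pattern, Locale locale, Object... values) {
        return new MessageFormat(pattern, locale).format(values);
    }

    public static void main(String[] args) {
        // hypothetical rb entries: same values, different word order per "locale"
        String amountFirst = "You owe me {0}. Please pay by {1}.";
        String dateFirst = "By {1}, please pay {0}.";

        System.out.println(fill(amountFirst, Locale.US, "$5.00", "31-May-2004"));
        System.out.println(fill(dateFirst, Locale.US, "$5.00", "31-May-2004"));
    }
}
```

the point of the numbered placeholders is that a translator can reorder them freely without anybody touching the calling code.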

May 10, 2004
three new papers on HTML/XHTML i18n
the GEO task force has published three "First Working Drafts" dealing with characters, encodings and the ever happy-go-lucky BIDI ;-)

http://www.w3.org/TR/i18n-html-tech-char/

http://www.w3.org/TR/i18n-html-tech-lang/

http://www.w3.org/TR/i18n-html-tech-bidi/

pretty good reading.


rtl test blog
now that i can fully control my blog, i decided to test some of the new i18n stuff. i was particularly interested to see how rtl (right-to-left writing system, BIDI) would work. you can see the results here.

the first thing to note is that i didn't translate anything into arabic, just told the blog that it was ar_EG locale ;-) you can clearly see some of the BIDI issues with neutral text like punctuation (parentheses for instance). it also uses a gregorian calendar rather than an islamic one (and yes, non-gregorian calendars are on the top of my to-do list for this blog).

the original code for this blog can be found on ray camden's blog.

April 22, 2004
seiyaku.com is born!
i dare you to find a more specialized i18n website than seiyaku.com. it's a site devoted to "Western style weddings in Japan" (apparently a highly popular way to get hitched in japan these days) and was recently cleaved from tex texin's i18nGuy website. there's some other interesting stuff on that site including 六曜 or rokuyo (lucky/unlucky days of the japanese calendar).

cool.

February 11, 2004
locale currency info
i've been helping out a friend do a quick and dirty currency app. the bank supplying him with currency info jumped up and down on our toes by supplying currency symbols in codepage encodings (a boatload of them) rather than unicode--they were geared towards one feed per locale and made the silly codepage encoding choice based on that. this turned a reasonably simple app into a medium-sized monster thumper management one. this datafeed, i guess since i don't have a lot of experience with these, also dropped the ball on us by not supplying more info about each currency. while there is such a thing as one-half (0.50) of a dollar there is no such thing as one-half of a yen. when and where do we round? oh boy.

if you read this blog with any regularity, you know what's coming ;-) another dip in the java pool under cfmx. we built a quick and dirty (but hey it works) CFC that makes use of the locale currency info contained in java.util.Currency class. you can see it in action here.

i'd appreciate any feedback. note that this shouldn't be used to replace the currency formatting/parsing functions in the i18nFunction CFC; this CFC isolates the currency info for easier, specific access.
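
for anyone curious what that dip in the java pool looks like, here's a minimal sketch (not the CFC itself) of pulling currency info out of java.util.Currency. the class and method names are made up for illustration, and the HALF_UP rounding mode is purely an assumption, since the right rounding rule depends on your business rules.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Currency;
import java.util.Locale;

class CurrencyInfoSketch {

    // ISO 4217 code for a locale's currency, e.g. JPY for japan
    static String codeFor(Locale locale) {
        return Currency.getInstance(locale).getCurrencyCode();
    }

    // number of minor-unit digits: 2 for dollars, 0 for yen
    static int fractionDigits(String isoCode) {
        return Currency.getInstance(isoCode).getDefaultFractionDigits();
    }

    // round an amount to what the currency can actually represent.
    // HALF_UP is an assumption for this sketch, not a recommendation.
    static BigDecimal round(BigDecimal amount, String isoCode) {
        return amount.setScale(fractionDigits(isoCode), RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(codeFor(Locale.JAPAN) + " digits: " + fractionDigits("JPY"));
        System.out.println(round(new BigDecimal("100.5"), "JPY"));   // no half yen
        System.out.println(round(new BigDecimal("100.505"), "USD")); // cents only
    }
}
```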

February 1, 2004
the superbowl and "I18N"
i'm in bangkok (thailand) watching the "international" broadcast via a live feed from a sports network that starts with an "E" and ends with an "N" of superbowl XXXVIII (38), which this year is sort of unusual (as far as i can recall). the two american announcers are explaining everything, and i mean everything. what a punt is, how to get a first down, what zone defense is, what "play action" is, why the players wear helmets and pads (yes, really), etc. i suppose if this were the very first football or superbowl game being broadcast internationally that might be appropriate but since my neighbors & i got up at 4:00am to watch this game, maybe we know a thing or two already? i'll guess this is the situation in many places around the world.

they are also converting measurements into the SI (metric) system. one of my Thai neighbors laughingly asked me "when was the last time you heard an NFL linebacker referred to in kilograms and meters?" these guys are also peppering their announcing with references to that other football (soccer to us Americans) and even referring to this as "American" football. the local (Thai language) announcers are ignoring all that goop and announcing the game knowing their audience. there's a lesson here i guess.

one of the interesting things about watching sports "overseas" is that many of the NFL games we get here are raw live feeds. these are really raw, stripped down broadcasts without the special features (sideline interviews, half-time reports, etc.) you'd get from normal network broadcasts. the plus side to this is that we get to see the producer/director shots & hear live mics when they break for commercials (there are no ads permitted on our local cable TV) and during half-time. we'll see the cameras zooming in on hotties in the stands, preview in-game presentations (the replays, analysis, highlights, etc.) and hear what the announcers really think of the game, officiating, etc. (which can sometimes be exactly opposite of what they say when they're "officially live") and every once in a while hear some announcer going berserk (once heard one former QB announcer doing an expletive laden tirade at somebody over the phone). now that's good TV ;-)

December 24, 2003
new i18n stuff on w3c
new should-read i18n content on the w3c site:

once again, i'd like to recommend the w3c internationalization activity website to i18n folks. well worth a bookmark.

November 16, 2003
hebrew numbers
tex texin's got a very nifty explanation of the hebrew numbering system (still in use for calendars and religious texts). quoting from his article, "each letter in the hebrew alphabet (or aleph-bet) has a numerical value". there's no zero (the way hebrew numbers are formed it doesn't matter; western numbers, being position-based, would be a mess without a zero value). the first 10 letters of the hebrew alphabet are also the numbers 1-10, with the next 9 letters representing the values 20, 30, 40, 50, 60, 70, 80, 90 and 100. the remaining letters represent 200, 300 and 400. i find the way numbers are formed quite interesting--but i leave you to read that in tex's article.
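
as a taste of how those letter values get used, here's a minimal sketch that reads a hebrew numeral by simply adding up its letters. that additive reading is my own summary rather than something taken from tex's article, the class and method names are made up, and the sketch ignores final letter forms, punctuation marks and any special spellings. the letters are written as unicode escapes so the source file's encoding doesn't matter.

```java
class HebrewNumeralSketch {

    // the 22 letters, alef through tav, in alphabetical order (no final forms)
    private static final String LETTERS =
        "\u05D0\u05D1\u05D2\u05D3\u05D4\u05D5\u05D6\u05D7\u05D8\u05D9"   // 1-10
      + "\u05DB\u05DC\u05DE\u05E0\u05E1\u05E2\u05E4\u05E6\u05E7"         // 20-100
      + "\u05E8\u05E9\u05EA";                                            // 200-400

    private static final int[] VALUES = {
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
        20, 30, 40, 50, 60, 70, 80, 90, 100,
        200, 300, 400
    };

    // sum the values of the letters; anything that isn't one of the 22 is skipped
    static int valueOf(String numeral) {
        int total = 0;
        for (int i = 0; i < numeral.length(); i++) {
            int index = LETTERS.indexOf(numeral.charAt(i));
            if (index >= 0) {
                total += VALUES[index];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // tav (400) + shin (300) + samekh (60) + dalet (4)
        System.out.println(valueOf("\u05EA\u05E9\u05E1\u05D3"));
    }
}
```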

October 23, 2003
timeZone CFC bug fixed
Jean-Baptiste Clot found a bug in the timeZoneCFC concerning the inDST function (tells you whether a date is in daylight savings time). it was one of those java bah humbugs. the CFC makes use of a gregorian calendar object (the original java one not ICU4J) where MONTHs are zero-based, that is january is 0. btw the other fields in that object aren't zero-based and that's my excuse for this one. i was constructing the calendar object using year, month, day, hour, minute pulled from the argument date by the equivalent cf functions, where months aren't zero-based. so the dates that were actually being tested were one month in the future.

it's fixed and you can find the testbed here. the file in the devnet gallery will be available soon, in the meantime you can find the fixed CFC here.
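
for the record, here's a minimal java sketch of the trap (not the CFC's actual code; the class name, method name and the new york timezone are just picked for illustration). the month coming from cf is 1-based, so it has to be shifted down by one before it goes into the gregorian calendar object.

```java
import java.util.GregorianCalendar;
import java.util.TimeZone;

class ZeroBasedMonthSketch {

    // cfMonth is 1-based (january = 1) the way cf's month() reports it;
    // java.util.Calendar wants january = 0, hence the "- 1".
    static boolean inDST(TimeZone tz, int year, int cfMonth, int day, int hour, int minute) {
        GregorianCalendar cal = new GregorianCalendar(tz);
        cal.clear();
        cal.set(year, cfMonth - 1, day, hour, minute);
        return tz.inDaylightTime(cal.getTime());
    }

    public static void main(String[] args) {
        TimeZone newYork = TimeZone.getTimeZone("America/New_York");

        // 15-oct-2003 is still daylight time in new york
        System.out.println(inDST(newYork, 2003, 10, 15, 12, 0));

        // forget the "- 1" and you'd really be testing this date instead
        System.out.println(inDST(newYork, 2003, 11, 15, 12, 0));
    }
}
```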

October 15, 2003
sunrise, sunset
besides being a song in fiddler on the roof, sunrise and sunset are an important part of some calendar calculations. i have been using the ICU4J astronomical calendar but once i got around to double checking the sunrise/sunset times it produced for bangkok i found it to be off by almost 3 minutes. that's not a big deal over a day for a calendar, but it is kind of meaningful for other stuff like twilight calculations (and i just like to get stuff as correct as i can). so i hunted high and low and found a pretty accurate java package. btw, the majority of stuff i looked at was wrong, some of it laughably wrong (not to say my port to CF is perfect). in any case it's posted to the devnet exchange where it will be available eventually. the test bed is here.

i guess this would be all sort of ho-hum so i spiced up the CFC a bit by including over 2,500 locations world wide. the access database accompanying the CFC contains names, locality, country, longitude, latitude, and raw GMT offset. the actual timezone info (as used in java) is a bit harder to come by. the next version of this CFC should hopefully have that info plus more detailed data in the US and europe.

October 11, 2003
joel on unicode
joel on software has quite an extensive article on unicode. the article does a nice review of UTF-8 and encoding issues, and debunks some nonsense about unicode. it's also got a bunch of good resource links peppered throughout the article. all in all, a very good read.

he points out one interesting issue about PHP, which i never knew because i don't use it: it doesn't natively support unicode. it's got a couple of functions to encode/decode UTF-8 but all i can say about that is "bah, humbug".

October 9, 2003
cutting loose: chineseCalendar CFC
ok, now for the chinese calendar. like the preceding calendars, it's based on ICU4J.

the traditional Chinese calendar is a lunisolar calendar (the same type as the Hebrew calendar). months start with a new moon, with each month numbered according to solar events. why? to guarantee that month # 11 will always contain the winter solstice. how? leap months are inserted in certain years (i feel another non-gregorian calendar induced headache coming on). these leap months are numbered the same as the month they follow. which month is a leap month? depends entirely on the movements of the sun and moon (i.e. i can't follow the math very far). the normal ERA field differs from other calendars as it holds the 60 year "cycle" number, right now we're in the 78th cycle which began in 1984 AD. years are counted sequentially, numbering from the 61st year of the reign of Huang Di, 2637 BC, which is designated year 1 on the Chinese calendar (yes, that's right, this calendaring system is over 4,000 years old). let's look at an example:

星期三 20x78-9-13

where 20 is the year in the current cycle, 78 is the cycle for this calendar (ERA in other calendars), 9 is the month and 13 is the day.

since ICU4J's ChineseCalendar defines an additional field (for leap month) and redefines the way the ERA field (no longer AD,BC, etc.) is used, this CFC has to use a different date format class, ChineseDateFormat.

this CFC adds 4 generic functions (i forgot that some calendars need special date logic):

  • isBefore compares two dates to tell if one is before the other
  • isAfter compares two dates to tell if one is after the other
  • getJulianDay returns the true Julian day for a given date
  • getExtendedYear returns the extended year, i.e. years since calendar start (in this case, current year + 2637)

i'll retrofit these to the other non-gregorian calendars. the date logic is probably more useful to the calendars that use calendar math different from the gregorian calendar (chinese, hebrew, islamic).

and 7 functions that are specific to this calendar (though i guess some can be applied to other calendars):

  • isLeapMonth determines if a given date is in a leap month
  • getCycle returns cycle for given date
  • getCycleYear returns year in cycle for given date
  • getMonth returns month in cycle year for given date
  • getDay returns day in month for given date
  • getDayOfYear returns day of cycle year for given date
  • getWeek returns week of cycle year for given date
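
the cycle/year arithmetic can be sketched in a few lines of plain java. this is only back-of-the-envelope math built on the "current year + 2637" figure above, not what ICU4J does internally: it ignores the fact that the chinese new year falls in january or february, so gregorian dates before it still belong to the previous chinese year. the class and method names are made up.

```java
class ChineseCycleSketch {

    // offset taken from the post: extended year = gregorian year + 2637
    static final int EPOCH_OFFSET = 2637;

    static int extendedYear(int gregorianYear) {
        return gregorianYear + EPOCH_OFFSET;
    }

    // which 60 year cycle an extended year falls in (cycles count from 1)
    static int cycle(int extendedYear) {
        return (extendedYear - 1) / 60 + 1;
    }

    // position within the cycle, 1 through 60
    static int yearInCycle(int extendedYear) {
        return (extendedYear - 1) % 60 + 1;
    }

    public static void main(String[] args) {
        int extended = extendedYear(2003); // late 2003, after the chinese new year
        System.out.println(yearInCycle(extended) + "x" + cycle(extended));
    }
}
```

for late 2003 this works out to year 20 of cycle 78, which matches the 20x78 in the example date above.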

the CFC's testbed is here. posted to the devnet gallery where i guess it will become available sooner or later.

next the astronomical calendar. this one is quite tricky, it's also somewhat in a state of flux (the ICU4J team's working on this code) but since it forms the basis of some of the existing calendars might as well give it a shot.


astronomicalCalendar beta
compared with the other five calendars, this one's turning into a real barn burner.

astronomicalCalendarCFC determines the positions of the sun and moon, the time of sunrise and sunset, moonrise and moonset, moon phases (full, new, etc.), vernal equinox, summer solstice, etc. for the most part, the CFC seems to work OK but there are a few sticky issues or at least things i don't quite get. the getSunrise/getSunset functions are supposed to return the GMT time of sunrise/sunset on the local date to which this calendar is currently set (i construct each astronomicalCalendar object with a location, lat-long and then set a date). for Bangkok, where the testbed server is, the returned sunrise, etc. times seem reasonable enough. however for sites in north america, like Philly, Scranton or Saskatoon the sunrise/sunset times appear reversed. i can't tell (yet) whether these are the GMT times for their local sunrise/sunset or the local times for the testbed server (GMT+7). or something else entirely.

this ICU4J calendar class is sort of experimental, so the docs, etc. aren't the clearest. need more testing before this thing can be shipped to the devnet gallery.

the testbed is here. if you want to play around w/the CFC as it now stands, you can download it here.

ho hum. well at least this posting hasn't mentioned the EOLAS Patent ruckus (until now ;-)

October 7, 2003
that was easy: japaneseCalendar CFC
as promised on the 5th, a japaneseCalendar CFC based on ICU4J.

the Japanese calendar, sometimes called the Japanese Emperor Era calendar, is identical to the Gregorian calendar except for the year and era (which is why it was so easy to turn into a CFC). each emperor's ascension to the throne begins a new era. each new era's years are numbered starting with 1 (the year of ascension). what could be simpler?

the "modern" eras:

  • Meiji: January 8, 1868 AD
  • Taisho: July 30, 1912 AD
  • Showa: December 25, 1926 AD
  • Heisei: January 8, 1989 AD (current era)

you can find the testbed here. note i've added a function to determine the day the week starts (for use in some calendaring components i'm working on). it actually depends on your locale. in Thailand & the US, a week starts on sunday. in France, Poland, etc. it starts on monday. the calendar used (as far as ICU4J is concerned) doesn't matter much. i'll update the other non-gregorian calendar CFCs after i'm thru with the next two calendars: chinese and astronomical.
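
if you don't have ICU4J handy, both ideas can be poked at with stock java. this is a stand-in sketch, not the CFC: it swaps in java.time.chrono.JapaneseDate (java 8 and up) for the era math and java.util.Calendar for the first day of the week, and the class and method names are made up.

```java
import java.time.chrono.JapaneseDate;
import java.time.chrono.JapaneseEra;
import java.time.temporal.ChronoField;
import java.util.Calendar;
import java.util.Locale;

class JapaneseEraSketch {

    // era for a gregorian year/month/day
    static JapaneseEra era(int year, int month, int day) {
        return JapaneseDate.of(year, month, day).getEra();
    }

    // year within that era; the month and day stay exactly as in gregorian
    static int yearOfEra(int year, int month, int day) {
        return JapaneseDate.of(year, month, day).get(ChronoField.YEAR_OF_ERA);
    }

    // locale dependent first day of the week, as a Calendar.SUNDAY..SATURDAY constant
    static int firstDayOfWeek(Locale locale) {
        return Calendar.getInstance(locale).getFirstDayOfWeek();
    }

    public static void main(String[] args) {
        System.out.println(era(2003, 10, 7) + " " + yearOfEra(2003, 10, 7));
        System.out.println("US week starts on sunday: "
            + (firstDayOfWeek(Locale.US) == Calendar.SUNDAY));
        System.out.println("France week starts on monday: "
            + (firstDayOfWeek(Locale.FRANCE) == Calendar.MONDAY));
    }
}
```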

this CFC should appear on the devnet gallery soon enough.

October 5, 2003
an Islamic Calendar CFC
so far i've built CFCs for the Hebrew and Buddhist calendars, both based on IBM's standup ICU4J java lib. now it's time for the Islamic calendar. and whether you like it or not, here comes the usual background info dump. resistance is futile, you will be globalized.

the Islamic calendar (also known as "Hijri" since it starts at the time of Mohammed's emigration or "hijra" to Medinah on thursday, july 15, 622 AD) is the civil calendar used by most of the Arab world and is the religious calendar of the Islamic faith. it is a strict lunar calendar. an Islamic year of twelve lunar months therefore does not correspond to the solar year used by the Gregorian calendar system. an Islamic year averages about 354 days, so viewed from the Gregorian calendar, each subsequent Islamic year starts about 11 days earlier.

the civil Islamic calendar uses a fixed cycle of alternating 29 and 30 day months, with a leap day added to the last month of 11 out of every 30 years (oh joy, 11 days shorter and now this--i've run out of fingers and toes). that makes the calendar predictable so it is used as the civil calendar in a number of Arab countries. the Islamic religious calendar is based on the observation of the crescent moon. sounds simple enough. but that observation varies with where you are when you look (your geography), when you look (sunset varies by season you know), moon orbit "eccentricities" (i'll take the astronomer's word for that), and even the weather (too cloudy and you obviously can't see the moon). all this makes it impossible to calculate in advance, so the start of a month in the religious calendar might differ from the civil calendar by up to three days. that makes knowing which calendar variant folks use very important. in any case, ICU4J short cuts all this, for the sake of speed, by using approximations of the astronomical calculations.
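
the civil variant's arithmetic is simple enough to sketch in plain java. the 29/30 day alternation and the 11-in-30 leap years come straight from the description above, but exactly which 11 years get the leap day is an assumption here: the sketch uses one widely quoted tabular rule and other conventions exist. the class and method names are made up and this is not the CFC or ICU4J code.

```java
class CivilIslamicSketch {

    // assumed leap year pattern: one common tabular rule that marks
    // 11 years out of every 30 year cycle as leap years
    static boolean isLeapYear(int year) {
        return (14 + 11 * year) % 30 < 11;
    }

    // odd months have 30 days, even months 29; the leap day goes on month 12
    static int monthLength(int year, int month) {
        if (month == 12 && isLeapYear(year)) {
            return 30;
        }
        return (month % 2 == 1) ? 30 : 29;
    }

    // 354 days in a normal year, 355 in a leap year
    static int yearLength(int year) {
        int days = 0;
        for (int month = 1; month <= 12; month++) {
            days += monthLength(year, month);
        }
        return days;
    }

    public static void main(String[] args) {
        int leapYears = 0;
        int totalDays = 0;
        for (int year = 1; year <= 30; year++) {
            if (isLeapYear(year)) {
                leapYears++;
            }
            totalDays += yearLength(year);
        }
        System.out.println(leapYears + " leap years, " + totalDays + " days per 30 year cycle");
        System.out.println("average year length: " + (totalDays / 30.0));
    }
}
```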

the islamicCalendarCFC test bed is here. if you've looked at the other two calendar CFCs you should notice i've tried to maintain function and argument conventions across these CFCs. the islamicCalendar CFC differs in that it has an optional boolean "useCivil" argument to tell the CFC which calendar variant to use. this CFC will bubble up in the devnet gallery soon enough.

next up is the Japanese calendar.

October 2, 2003
silent but deadly
ran into a nasty issue with dateDiff returning the equivalent of a 32-bit signed integer (a java int, not a 64-bit long). that's fine if you use dateDiff for years, months, days, etc. however, if you want the minutes or seconds dateparts and the date difference is large, say the difference in seconds between 1-jan-1970 and 1-feb-2038, your app will silently kill itself: the value the dateDiff function returns will quietly wrap to negative. i lost quite a bit of sleep tracking this down for my non-gregorian calendar CFCs (based on icu4j). you can see the issue quite clearly here.

so if you're dealing with small dateparts and large date differences, watch out for this.
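
here's the same wraparound reproduced in plain java, as a minimal sketch (the class and method names are made up, and this is only an illustration of 32-bit overflow, not of how dateDiff is implemented). the seconds count fits comfortably in a 64-bit long but not in a 32-bit int.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

class DateDiffOverflowSketch {

    // the honest answer, held in a 64-bit long
    static long secondsBetween(LocalDateTime from, LocalDateTime to) {
        return ChronoUnit.SECONDS.between(from, to);
    }

    // what you get if that answer is squeezed into 32 bits
    static int squeezedInto32Bits(long value) {
        return (int) value;
    }

    public static void main(String[] args) {
        LocalDateTime from = LocalDateTime.of(1970, 1, 1, 0, 0);
        LocalDateTime to = LocalDateTime.of(2038, 2, 1, 0, 0);

        long seconds = secondsBetween(from, to);
        System.out.println("as a long: " + seconds);
        System.out.println("largest int: " + Integer.MAX_VALUE);
        System.out.println("as an int: " + squeezedInto32Bits(seconds)); // negative
    }
}
```

if you have to narrow a value like this in java, Math.toIntExact will throw an ArithmeticException instead of wrapping quietly.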

i have to thank andrew tyrone for first finding the bug in the hebrewCalendarCFC & steven r. loomis for working with me to track down the problem with icu4j.

August 26, 2003
resourceBundleCFC
i've been using resourceBundles in one form or another for some time now. while my idea of resourceBundles is not always confined to file based resources, i have had a UDF--now CFC--resourceBundle file function for a time now. mainly because getProfileString's never properly handled unicode--cf really needs a native resourceBundle-like function.

on the off chance somebody's wondering, a resourceBundle is a file holding text label key/value pairs separated into locale files--the reason for this is to completely separate text from code and text presentation. for instance:

testMsg_th_TH.properties (thai locale) contains welcomeMSG=สวัสดีคะ while testMsg_en_US.properties (american locale) contains welcomeMSG=Well hello there.

the application would determine which locale was required (session or application based depending on how you rolled out your application) and then load the relevant resourceBundle. the welcomeMSG text label would then show up in the proper language. simple, easy, scalable.

in any case, i've put the resource file CFC in the devnet gallery (should be available sooner or later). you can see an example and download it here if you're in a hurry to trash it.

again wandering off-topic, i've been trying to make use of native java resourceBundle (getBundle, etc.) functionality with cf, no dice so far. getBundle never seems able to find the resourceBundle. no idea if this would function any better than the way i'm doing it now but i'd sure like to find out. any ideas?
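
for what it's worth, here's a minimal java sketch of the file based approach that sidesteps getBundle entirely. it is not the CFC's code: the class and method names are made up, and it reads the properties text from a Reader (a StringReader here, so the sketch runs without any files on disk). getBundle searches the class path for the bundle, so one thing worth checking is whether the .properties files sit somewhere cf's class loader can see, though that's only a guess at the cause.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class ResourceBundleSketch {

    // build the per-locale file name, e.g. testMsg_th_TH.properties
    static String fileNameFor(String baseName, Locale locale) {
        return baseName + "_" + locale + ".properties";
    }

    // parse key/value pairs from any Reader; for a real file you'd wrap
    // the stream in a reader using the file's actual encoding
    static ResourceBundle load(Reader reader) {
        try {
            return new PropertyResourceBundle(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Locale thai = Locale.forLanguageTag("th-TH");
        System.out.println(fileNameFor("testMsg", thai));
        System.out.println(fileNameFor("testMsg", Locale.US));

        // stand-in for the contents of testMsg_en_US.properties
        ResourceBundle rb = load(new StringReader("welcomeMSG=Well hello there."));
        System.out.println(rb.getString("welcomeMSG"));
    }
}
```

keep in mind that keys are case sensitive, so welcomeMSG and welcomeMsg would be two different labels.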