- Unicode 5.1
- locale data: Common Locale Data Repository (CLDR) 1.6
- charset converter file size improvement
- date interval formatting (note: only the gregorian calendar is supported in this release)
- improved plural support
specific icu4j changes include:
- charset
- ISO-2022 converter
- HZ converter
- SCSU/BOCU-1 converter
- charset converter callback
- thai dictionary break iterator (yeah)
- JDK TimeZone support (this is pretty decent as you can now share tz IDs between coldfusion/core java & icu4j)
- locale service provider
- more convenient formatting of year+month, day+month, and other combinations
- simple duration formatting
- upgraded its resource data to Unicode 5.1 and CLDR 1.6
- added date interval formatting (ie Jan 10, 2008 to Jan 20, 2008 becomes Jan 10-20, 2008, 10:10am to 11:10am becomes 10:10-11:10am, etc.). downside is that currently it's gregorian calendar only.
- added DurationFormat so you can now format over a duration in time such as "2 days from now" or "3 hours ago".
- added "Locale Service Provider" support for core java's new locale service--many folks just want the filthy-rich and frequently-updated locale data that icu4j has and not the whole library. i wonder if there is a way to backdoor this into coldfusion's locales?
you can grab the jar files/api docs and read more about the new stuff here.
- updated to use CLDR version 1.5.1
- updates to timezone formatting and parsing (haven't checked if the bananas tz update is included)
- some bug fixes detailed here in the readme file
download the lib here: icu4j 3.8.1
- uses the latest cldr 1.5.0.1 locale data
- the long discussed rule based timezone changes, which give us the ability to read and write timezone data in RFC2445 VTIMEZONE format as well as access to Olson timezone transitions! many people have been needing this for quite some time now and it's going to be very useful
- taiwanese calendar (a flavor of gregorian calendar that numbers years since 1912AD)
- the Indian National Calendar (more complicated flavor of the gregorian calendar, eg it's synched up with the gregorian calendar's leap years but the extra day is added to the first month, Chaitra which starts march 22 on gregorian calendar--so, yup, it's complicated)
- charset conversion bugs were fixed and CESU-8, UTF-7 and ISCII converters have been added. also some conversion speed improvements. the UTF-7 one will be useful for email (bounce) handling
- a new MessageFormat type for plurals was added
- a pretty useful new DurationFormat class was added so you can format messages over a duration in time such as "2 days from now" or "3 hours ago"
- also the MessageFormat class will now take named arguments instead of just arrays (too bad now that coldfusion 8's javacast got a shot of steroids)
- new BIDI stuff (which i still need to investigate)
next month i'll be adding the new calendars as CFCs to the usual bits. i'll also be doing some significant changes to most of the i18n formatting methods to take better advantage of the calendar, etc. keywords (en_GB@calendar=indian;currency=EUR) on the ULocale class (icu4j's super cool locale class).
unfortunately the persian calendar still appears to be in icu4c (C/C++) only.
- it uses the latest and greatest cldr 1.5 locale data
- the long discussed rule based timezone changes, which give us the ability to read and write timezone data in RFC2445 VTIMEZONE format as well as access to Olson timezone transitions! this is stuff many people have been looking for and it's going to be very useful
- taiwanese calendar (which i never knew existed, looks like a flavor of gregorian calendar that numbers years since 1912AD)
- the Indian National Calendar (ditto though looks like a more complicated flavor of the gregorian calendar, eg it's synched up with the gregorian calendar's leap years but the extra day is added to the first month, Chaitra which starts march 22 on gregorian calendar--so, yup, it's complicated)
- charset conversion bugs were fixed and CESU-8, UTF-7 and ISCII converters have been added. also some conversion speed improvements. i think the UTF-7 one looks pretty useful
- a new MessageFormat type for plurals was added, looks like some eastern european languages have complicated rules for plurals
- a new DurationFormat class so you can format messages over a duration in time such as "2 days from now" or "3 hours ago" (this one looks useful)
- also the MessageFormat class will now take named arguments instead of just arrays (too bad now that coldfusion 8's javacast got a shot of steroids)
- bunch of new BIDI stuff (which need some investigating)
i'll be adding the new calendars as CFCs to the usual bits as soon as i do enough background research on them to understand any "quirks". i'll also be doing some significant changes to most of the i18n formatting methods to take better advantage of the calendar, etc. keywords (en_GB@calendar=indian;currency=EUR) on the ULocale class (icu4j's super cool locale class).
looks like a persian calendar was also added but appears to be in icu4c (C/C++) only for the time being.
wow, fun times in the old town tonite (it's actually in the AM in bangkok but you get the idea).
first this article confirms that PHP's unicode/i18n support really is lame (also see this article for a bit older take on PHP's unicode/i18n support, i especially liked the Unicode should have been in PHP five years ago quote). but more importantly, and what's surprising to me, is that they're actually doing something about it by adopting ICU. going from being an i18n joke to fully supporting unicode/i18n via the ICU project. i know next to nothing about the PHP world so i have no idea if this is really happening (or has already happened) or is just hot air but it looks like they're on the right track with ICU.
wonder if there's a lesson here?
and yes, even though it helps "date" me, i am still a fan of Prince's 1999.
core java's locale data for en_AU (Australia) and en_NZ (New Zealand) time formats is a bit off. it uses a format of H:mm:ss where the "H" stands for 24 hour clock, ie 5:00 PM would be formatted as 17:00. the CLDR (common locale data repository) however states that the time format for en_AU & en_NZ locales is h:mm:ss a (well actually it's proposed to include the timezone, "h:mm:ss a z" see the en_AU time format entry here). while most users in those locales are smart enough to get that 17:00 is 5:00 PM when your ColdFusion app outputs time values, it would play havoc when ColdFusion tries to parse what those same folks would normally input for a time value.
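to make the "havoc" bit concrete, here's a minimal core java sketch of what happens when 12 hour input meets a 24 hour pattern. the class and method names are made up for illustration, and it pins things to Locale.US and UTC (an assumption, so the output doesn't depend on any server's locale data) rather than using the real en_AU data:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// hypothetical demo class, not part of ColdFusion or icu4j
class TimePatternDemo {
    // formatter pinned to UTC and Locale.US so results don't depend on the server
    static SimpleDateFormat formatter(String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f;
    }

    // parse user input and return the hour of day (0-23) that the pattern understood
    static int parsedHour(String pattern, String input) {
        try {
            Date d = formatter(pattern).parse(input);
            // with no date fields in the pattern the result lands on 1-jan-1970
            return (int) (d.getTime() / 3600000L);
        } catch (ParseException e) {
            throw new IllegalArgumentException("could not parse: " + input, e);
        }
    }

    public static void main(String[] args) {
        Date fivePm = new Date(17L * 3600000L); // 17:00:00 UTC on 1-jan-1970
        System.out.println(formatter("H:mm:ss").format(fivePm));   // 24 hour style
        System.out.println(formatter("h:mm:ss a").format(fivePm)); // 12 hour style with am/pm marker
        System.out.println(parsedHour("H:mm:ss", "5:00:00 PM"));   // what the 24 hour pattern makes of it
        System.out.println(parsedHour("h:mm:ss a", "5:00:00 PM")); // what the 12 hour pattern makes of it
    }
}
```

the nasty part is that the 24 hour pattern doesn't complain: it reads the 5, stops at the space, ignores the trailing PM and hands back 5 in the morning.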
so hey en_AU and en_NZ locale people, time to start helping yourselves. Sun has accepted this as a new bug, go vote for it (you have to be a member of the Sun Developer Network to vote but these days, who isn't).
if you're still on older versions of icu4j, you should be ok as this is a new bug introduced in 3.6.
- supports unicode 5.0
- common locale data repository (CLDR) 1.4
- globalization preferences, flexible container for locale data was added
- a preview of the flexible date/time format generator (allowing multiple date and time format patterns to be generated) was added
- a preview of the ICU4J implementation of the java.nio.charset.Charset API was added
and as the project site notes, be careful using the preview stuff in production.
- support for Unicode 5.0
- 25% more CLDR locale data in 245 locales in ICU
- a flexible date/time format generator has been added, allowing for multiple date and time format patterns to be generated that are valid for specific locales (sounds interesting)
- under "Globalization Preferences", a new flexible container for locale data was added
- for more charset conversion bang-for-your-buck, a preview of the ICU4J implementation of the java.nio.charset.Charset API was added
addendum: apparently the nifty timezone bits proposed earlier this year didn't make it into this release. too bad, so sad, could have been very useful.
to recap:
- i18n is a zero level goal (that is the project won't leave home without it).
- it will be based on icu4j java library and by based i mean every single i18n function, except some parts of the resource bundle CFC and (probably) the Gregorian calendar will be derived from it.
- besides the basic Gregorian calendar most ColdFusion developers are familiar with, this project will also include Buddhist, Chinese, Japanese, Islamic, and Hebrew calendars to handle that tricky calendar math.
- user centric timezone, users will see datetimes in their individual timezones--and yes, even this functionality will come out of icu4j. by divorcing this functionality from core Java, the project will be able to take advantage of icu4j's more frequent updates.
- locale based collation (sorting).
- strict use of resource bundles (rb), you will be able to l10n skin this puppy, though we haven't 100% decided on the "recommended" rb management tool yet. besides icu4j's rb manager, any ideas?
- standard localized date/numeric/currency formatting, all hail CLDR.
- the project will make use of the super cool JavaLoader in order to load the icu4j from off the server classpath (shared hosts will not be a problem). this also allows for painless updating of the icu4j jar file.
so, have we missed anything? some i18n related functionality we've overlooked? any rb management tool you particularly like? if you have any ideas please submit them here as comments or better yet via the UI preview. we'd really appreciate it. thanks.
for more information on the project see the "BoardFusion News Page" and the Project Wiki.
one of the side effects of this core java locale is that ColdFusion's old locale name Norwegian (Nynorsk) actually produces no_NO locale data. any legacy apps still using this locale identifier are probably telling people the wrong thing, for example:
writeoutput('#lsDateFormat(now(),"DDDD")#');
produces: mandag
while
setLocale('Norwegian (Nynorsk)');
writeoutput('#lsDateFormat(now(),"DDDD")#');
also produces: mandag
icu4j on the other hand produces:
måndag for nn_NO
mandag for nb_NO
it looks like ColdFusion got tripped up on the "variant instead of language" locale.
taking this a step further, doing a "FULL" date format shows up even larger differences between core java and icu4j:
core java
8. mai 2006 for no_NO
8. mai 2006 for no_NO_NY
icu4j
måndag 8. mai 2006 for nn_NO
mandag 8. mai 2006 for nb_NO
oops. to my way of thinking, a "FULL" date format should include the day name as well as the rest of the date (date in month, month and year). i really wish ColdFusion would use icu4j.
and the "A-Go-Go" reference? nothing to do with g11n or ColdFusion, just been listening to a lot of Dengue Fever lately and that song has just stuck in my head ;-)
- com.ibm.icu.util.ZoneRule: an abstract class representing a tz transition rule. this class represents basic properties of zone rule such as raw UTC offset and DST offset and abstract methods to access onset information.
- com.ibm.icu.util.TimeListZoneRule: a concrete class extending ZoneRule. this class represents zone transition point(s) defined by UTC millis.
- com.ibm.icu.util.RecurrentZoneRule: a concrete class extending ZoneRule. this class represents recurrent zone transitions defined by a rule, such as first Sunday in April. the way to define a recurrent rule is pretty similar to SimpleTimeZone.
- com.ibm.icu.util.RuleBasedTimeZone: a class extending TimeZone. this class aggregates one or more ZoneRule instances. using this class and ZoneRule instances, you can create a custom TimeZone which supports any historical zone transitions.
- com.ibm.icu.util.VTimeZone: a class extending TimeZone, wraps either RuleBasedTimeZone or OlsonTimeZone (default TimeZone implementation used by ICU4J). this class would have two constructor methods for creating a new VTimeZone instance from 1) TZID such as "America/New_York" and 2) RFC2445 VTIMEZONE component. this class also provides some methods to write out underlying zone rules into VTIMEZONE format.
in addition to the new classes mentioned above, he also proposes some modifications to existing classes:
- com.ibm.icu.util.TimeZone: an additional method - "List getZoneRules()", which returns a list of ZoneRule instances for the TimeZone. the implementation in TimeZone class just throws UnsupportedOperationException.
- com.ibm.icu.util.SimpleTimeZone / com.ibm.icu.impl.OlsonTimeZone: overrides "List getZoneRules()" to return actual ZoneRule instances for these TimeZone implementations.
the javadocs for the proposed changes have been (temporarily) put up here. if you want to participate in the discussion regarding these changes hop on over to the ICU sourceforge site and subscribe to the mailing list.
jitter bug references: 4577, 5012
to me these seem like some decent improvements and i know several folks in the ColdFusion community are interested in timezones, especially their rules.
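to give a feel for what "access to zone transitions" buys you, here's a rough sketch. note the assumption: it does NOT use the proposed icu4j classes above (they were only a proposal at the time of writing), it uses core java's later java.time.zone API, which exposes the same two ideas, recurring rules and concrete transition points. the class and method names are made up for illustration:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.zone.ZoneOffsetTransition;
import java.time.zone.ZoneRules;

// hypothetical demo class built on java.time, not on the proposed icu4j API
class ZoneTransitionDemo {
    // next offset change after the given instant, or null if the zone never changes again
    static ZoneOffsetTransition nextChange(String tzId, Instant after) {
        ZoneRules rules = ZoneId.of(tzId).getRules();
        return rules.nextTransition(after);
    }

    public static void main(String[] args) {
        ZoneRules rules = ZoneId.of("America/New_York").getRules();
        // recurring rules, comparable in spirit to the proposed recurrent zone rule
        rules.getTransitionRules().forEach(System.out::println);
        // one concrete transition point, comparable to a time-list style rule
        System.out.println(nextChange("America/New_York", Instant.parse("2006-01-01T00:00:00Z")));
    }
}
```

asking for the first change after 1-jan-2006 in New York gives the start of DST on 2-apr-2006, while a fixed zone like UTC simply returns null.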
default="#tzObj.getDefault().ID#"
icu4j:
default="#variables.timeZone.getDefault().getDisplayName()#"
the tz that the core java default method was returning wasn't understood by icu4j, but instead of throwing an error it silently returned the UTC tz. whoops.
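one way to guard against this kind of silent fallback is to check the id against the list of known ids before asking for the zone. the sketch below is core java only (java.util.TimeZone has the same habit, it quietly hands back GMT for an id it doesn't understand); the class and method names are made up, and applying the same check to icu4j's TimeZone via its getAvailableIDs() is my assumption of how you'd port it:

```java
import java.util.Arrays;
import java.util.TimeZone;

// hypothetical helper, shown against java.util.TimeZone
class SafeTimeZone {
    // true only when the id is one the JDK actually knows about
    static boolean isKnown(String tzId) {
        return Arrays.asList(TimeZone.getAvailableIDs()).contains(tzId);
    }

    // look up a zone but fail loudly instead of quietly falling back to GMT
    static TimeZone lookup(String tzId) {
        if (!isKnown(tzId)) {
            throw new IllegalArgumentException("unknown timezone id: " + tzId);
        }
        return TimeZone.getTimeZone(tzId);
    }

    public static void main(String[] args) {
        // the unguarded call: no error, just a different zone than you asked for
        System.out.println(TimeZone.getTimeZone("Not/AZone").getID());
        // the guarded call returns the zone that was actually requested
        System.out.println(lookup("Asia/Bangkok").getID());
    }
}
```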
you can pick up the new version here.
icu4j on the other hand, has had this and other updated timezone info for some time now.
<cfscript>
// remote init
jarFile=jarLocation & "icu4j.jar";
URLObject = createObject('java','java.net.URL');
URLObject.init("file:" & jarFile);
URLArray = createObject("java","java.lang.reflect.Array").
newInstance(URLObject.getClass(),1);
arrayClass = createObject("java","java.lang.reflect.Array");
arrayClass.set(URLArray,0,URLObject);
loader = createObject("java","java.net.URLClassLoader");
loader.init(URLArray);
uLocale=loader.loadClass("com.ibm.icu.util.ULocale").newInstance();
</cfscript>
<cfdump var="#uLocale#">
while i've managed to work around this issue (ULocales are everywhere in icu4j, most classes that deal with locales have a getAvailableULocales() method) it's always kind of nagged at me. after a bit of poking and prodding i started looking into ways to get at the actual constructors for a given class:
<cfscript>
jarFile=jarLocation & "icu4j.jar";
URLObject = createObject('java','java.net.URL');
URLObject.init("file:" & jarFile);
URLArray = createObject("java","java.lang.reflect.Array").
newInstance(URLObject.getClass(),1);
arrayClass = createObject("java","java.lang.reflect.Array");
arrayClass.set(URLArray,0,URLObject);
loader = createObject("java","java.net.URLClassLoader");
loader.init(URLArray);
// don't init
uLocale=loader.loadClass("com.ibm.icu.util.ULocale");
c=uLocale.getConstructors();
for (j=1; j LTE arrayLen(c); j=j+1) {
params=c[j].getParameterTypes();
for (i=1; i LTE arrayLen(params); i=i+1) {
writeoutput("ULocale[#j#]: #i# #params[i].getName()#<br>");
}
writeoutput("<br>");
}
</cfscript>
which in this case returned 3 constructors (just like the API says but not in the javadocs order):
ULocale[1]: 1 java.lang.String
ULocale[1]: 2 java.lang.String
ULocale[1]: 3 java.lang.String
ULocale[2]: 1 java.lang.String
ULocale[3]: 1 java.lang.String
ULocale[3]: 2 java.lang.String
which i can easily match to the one i want (ULocale("th_TH")):
<cfscript>
// remote init
jarFile=jarLocation & "icu4j.jar";
URLObject = createObject('java','java.net.URL');
URLObject.init("file:" & jarFile);
URLArray = createObject("java","java.lang.reflect.Array").
newInstance(URLObject.getClass(),1);
arrayClass = createObject("java","java.lang.reflect.Array");
arrayClass.set(URLArray,0,URLObject);
loader = createObject("java","java.net.URLClassLoader");
loader.init(URLArray);
uLocale=loader.loadClass("com.ibm.icu.util.ULocale");
c=uLocale.getConstructors();
// the newInstance method wants an array
obj=listToArray("th_TH");
// we want the 2nd constructor
thaiLocale=c[2].newInstance(obj.toArray());
</cfscript>
<cfdump var="#thaiLocale#">
which indeed returns an object of com.ibm.icu.util.ULocale.
since in most cases, i only use one way to init a given class, this technique will work OK for us. my only question is will the order of constructors remain the same? can i always count on the 2nd constructor to be ULocale("th_TH")? or should i build metadata functionality to probe the constructors to see which one matches?
ps: i did indeed learn my lesson, notice how i passed the coldfusion array using toArray() ;-)
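on the "can i count on the order" question: the javadocs for Class.getConstructors() say the returned array isn't sorted and isn't in any particular order, so picking c[2] is a gamble. the safer route is to ask for the constructor by its parameter types. below is a minimal java sketch of that idea. the class name is made up and, as an assumption to keep it runnable without the icu4j jar, it uses java.util.Locale, which has the same three all-String constructor shapes as ULocale:

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;
import java.util.Locale;

// hypothetical helper: find an all-String constructor by signature, not by position
class ConstructorLookup {
    static Object construct(Class<?> type, String... args) {
        // one String.class entry per argument describes the constructor we want
        Class<?>[] signature = new Class<?>[args.length];
        Arrays.fill(signature, String.class);
        try {
            Constructor<?> ctor = type.getConstructor(signature);
            return ctor.newInstance((Object[]) args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no usable constructor on " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        // in the remote loading case, "type" would be the Class returned by loader.loadClass(...)
        System.out.println(construct(Locale.class, "th", "TH"));
    }
}
```

from coldfusion the fiddly part is building the Class[] signature, so probing getParameterTypes() for a match (the metadata idea above) amounts to the same thing.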
<cfscript>
ozLocale="en_AU@calendar=gregorian";
thisPattern="On {0,date,short} at {0,time,short}, I left {1} for the {2}. I took {3,number,currency}";
thisLocale=createObject("java","com.ibm.icu.util.ULocale").init(ozLocale);
args=arrayNew(1);
args[1]=now();
args[2]="the office";
args[3]="microbrewery";
args[4]=javacast("int",100);
mf=createObject("java","com.ibm.icu.text.MessageFormat").
init(thisPattern,thisLocale);
thisMsg=mf.format(args);
</cfscript>
<cfdump var="#thisMSG#">
coldfusion would always throw an error at the thisMsg=mf.format(args) bit along the lines of: Error casting an object of type to an incompatible type. This usually indicates a programming error in Java, although it could also mean you have tried to use a foreign object in a different way than it was designed. which for some reason made me think it was because the format() method is overloaded and i couldn't figure out the right combination of argument classes to get it to work. my knee jerk reaction to this is to build a wrapper class and move on, which i promptly did.
i was puttering around with something this weekend (a method to count business days using icu4j's Holiday class) when i actually got the overloaded method error (while trying to add my birthday as a national holiday in the US virgin islands, en_VI). re-visiting the format() method errors it finally dawned on me that the error message was perfectly accurate and the real issue (besides me being a knee jerk reactionist and thick as a brick) was with the args array. coldfusion arrays aren't exactly java Arrays (if i recall correctly they're java.util.Vectors). back in the Triassic era, christian cantrell's blog had an entry concerning this problem where he pointed out a simple solution using the inherited toArray() method. so changing thisMsg=mf.format(args) to thisMsg=mf.format(args.toArray()) made that method work plenty fine. initial benchmarks show this java-based method to be considerably faster than our in-house one, not to mention saving all the locale formatting code we had to use prior to substituting the actual data. we'll be releasing updates to our resource bundle CFCs incorporating this new method sometime this week.
the sharp-eyed among you probably noticed the peculiar way i defined the locale en_AU@calendar=gregorian. icu4j locales (ULocales to be precise) have, besides the usual language, country, variant identifiers, keywords. keywords allow you to create a locale using a specific calendar, collation or currency (see the ICU user guide for details). in practice that means you can control the way MessageFormat formats your dates and currencies without having to mess around with them prior to submitting the data to the format() method. you can use any of the seven odd calendars that icu4j knows about, for instance en_AU@calendar=buddhist would produce dates formatted using the Buddhist calendar (BE), en_AU@calendar=islamic-civil would format dates using the civil version of the Islamic calendar, etc. very cool if you ask me. this is another area where icu4j kind of glances in the rear-view mirror as it blows by core java's i18n bits ;-)
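for comparison, core java later grew a similar idea: icu4j's @calendar=buddhist keyword corresponds to the "-u-ca-buddhist" unicode extension in a BCP 47 language tag. the sketch below is core java only (an assumption made to keep it runnable without the icu4j jar; it is not the ULocale API) and the class name is made up:

```java
import java.time.chrono.Chronology;
import java.util.Locale;

// hypothetical demo of calendar keywords on java.util.Locale
class LocaleKeywordDemo {
    // the calendar keyword carried by a locale, or null if it has none
    static String calendarType(Locale locale) {
        return locale.getUnicodeLocaleType("ca");
    }

    public static void main(String[] args) {
        Locale oz = Locale.forLanguageTag("en-AU-u-ca-buddhist");
        System.out.println(calendarType(oz));
        // java.time picks its calendar system from the same keyword
        System.out.println(Chronology.ofLocale(oz).getId());
    }
}
```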
- Olson 2006a time zone data (just in time to get ready for the new DST in the US)
- corrects mistakes in the CLDR data found in icu4j 3.4.2
- MessageFormat (like core java's but it can use icu4j's super cool ULocale class) upgraded to "@stable"
- fixed bugs in DateFormat, SimpleDateFormat, etc.
- and a bit more trivial (to me) but should make some folks happy this release no longer tags "@draft" APIs with "@deprecated" by default--though why they ever did that in the first place is a bit of a mystery to me
the MessageFormat class is kind of cool in that it handles compound rb strings (which i'd rather have never learned about) such as: "At {1} on {2}, there was {3} on planet {4}". in the past, we normally handled this with in-house methods which are somewhat cumbersome in that we needed to do any date/numeric/currency formatting on the substituted values for the message's placeholders (the bits in between the {}) prior to formatting the message. now using the com.ibm.icu.text.MessageFormat you could do something like:
<cfscript>
mfObject=createobject("java","com.ibm.icu.text.MessageFormat");
args=arrayNew(1);
args[1]=now();
args[2]="the office";
args[3]="microbrewery";
// pass in the message string and substitution arguments
thisMsg=mfObject.format("On {0,date,full} at {0,time,full}, I left {1} for the {2}.",args);
writeoutput(thisMsg);
</cfscript>
which would produce something like (in the en_US locale) "On Wednesday, March 1, 2006 at 8:44:22 PM GMT+07:00, I left the office for the microbrewery.".
to explain a bit more : {0,date,full} is a placeholder that takes the first element in the args array (java arrays start at 0) and applies localized date formatting with the "full" style. {0,time,full} ditto but uses time formatting and {1} and {2} are placeholders for simple strings.
however in order to make this more flexible (ie. use locales other than the server's default), you'll have to use a simple java wrapper class--the MessageFormat format method is overloaded and coldfusion can't easily use its other "flavors" which require StringBuffer and FieldPosition classes.
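a minimal sketch of what such a wrapper could look like follows. the class name and method signature are made up for illustration, and to keep it runnable without the icu4j jar it uses core java's java.text.MessageFormat (an assumption; icu4j's MessageFormat has a pattern-plus-locale constructor of the same shape, so swapping the import should be most of the work). note it takes a BCP 47 style tag like "en-AU" rather than "en_AU":

```java
import java.text.MessageFormat;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// hypothetical wrapper class, shown with core java's MessageFormat
class MessageFormatWrapper {
    // format a pattern for an explicit locale; args can be any List
    public static String format(String pattern, String languageTag, List<?> args) {
        MessageFormat mf = new MessageFormat(pattern, Locale.forLanguageTag(languageTag));
        // MessageFormat wants a plain Object[], so copy the list into one
        return mf.format(args.toArray());
    }

    public static void main(String[] args) {
        System.out.println(format("I left {0} for the {1}.", "en-AU",
                Arrays.asList("the office", "microbrewery")));
    }
}
```

taking a List and calling toArray() inside the wrapper means the caller never has to worry about handing over a real java array.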
what has this got to do with g11n? well even if you do use those Big math classes, core java's NumberFormat class doesn't understand its own BigDecimal/BigInteger classes (ie it casts everything back to double/long). so when you come to display these values you're back in the same situation that mark's post describes. what to do? use icu4j of course (everybody knew that was coming). its NumberFormat class understands BigDecimal/BigInteger plenty fine. for example:
<cfscript>
theNumber="9123456789123456789.123";
//use server default locale
nF=createObject("java","com.ibm.icu.text.NumberFormat").getInstance();
cNF=createObject("java","java.text.NumberFormat").getInstance();
bigDecimal=createObject("java","java.math.BigDecimal").init(theNumber);
formattedNumber=nf.format(bigDecimal);
coreJavaFormattedNumber=cNF.format(bigDecimal);
writeoutput("original number:=#theNumber#<br>
big decimal representation:=#bigDecimal#<br>
icu4j number Formatted:=#formattedNumber#<br>
core java number Formatted:=#coreJavaFormattedNumber#");
</cfscript>
which outputs:
original number:=9123456789123456789.123
big decimal representation:=9123456789123456789.123
icu4j number Formatted:=9,123,456,789,123,456,789.123
core java number Formatted:=9,123,456,789,123,457,000
i really wish coldfusion would use icu4j. it would make i18n work much easier and as a side effect help w/problems like this.
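the rounding in the core java output above is what you get whenever a big value is squeezed through a double on its way to the formatter. here's a small core java sketch that shows the loss directly, no formatter involved. the class and method names are made up, and whether a given JDK's NumberFormat actually takes the double route depends on its version, so treat this as an illustration of the mechanism only:

```java
import java.math.BigDecimal;

// hypothetical demo of what a trip through double does to a big number
class BigDecimalPrecision {
    // true when converting the value to a double and back changes it
    static boolean losesPrecisionAsDouble(BigDecimal value) {
        return new BigDecimal(value.doubleValue()).compareTo(value) != 0;
    }

    public static void main(String[] args) {
        BigDecimal big = new BigDecimal("9123456789123456789.123");
        System.out.println(big.toPlainString());                               // the exact value
        System.out.println(new BigDecimal(big.doubleValue()).toPlainString()); // after the double round trip
        System.out.println(losesPrecisionAsDouble(big));
    }
}
```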
<cfscript>
tz=createObject("java","com.ibm.icu.util.TimeZone");
//get TZ based on country
zones=tz.getAvailableIDs("TH");
</cfscript>
<cfdump var="#zones#">
how cool is that?
anyway, grab it to keep in lock step w/IBM's ICU project.
among my favorites that apply in one way or another to coldfusion (i've yakked about these in various articles/books/blog entries but good stuff usually bears repeating):
- Unicode encodes characters, not glyphs: U+0067 » ggggggg
- Unicode does not encode characters by language: French, German, English j have the same code point even though all have different pronunciations; Chinese 大 (da) has the same code point as Japanese 大 (dai).
- Length in bytes may not be N * length in characters
- Not all text is correctly tagged with its charset, so character detection may be necessary. But remember, it's always a guess.
- Use properties such as Alphabetic, not hard-coded lists: isAlphabetic(), \p{Alphabetic} in regex
- Transliteration (Ελληνικά ↔ Ellēniká) is not the same as Translation (Ελληνικά ↔ Greek)--users of my transliteration CFC please take note
- Unicode ≠ Globalization. Unicode provides the basis for software globalization, but there's more work to be done...
- Don't simply concatenate strings to make messages: the order of components differs by language. Use Java MessageFormat or equivalent. (like the rbJava or javaRB CFCs)
- Don't put any translatable strings into your code; make sure those are separated into a resource file.
- Don't assume everyone can read the Latin alphabet. Don't assume icons and symbols mean the same around the world.
- Tag all data explicitly. Trying to algorithmically determine character encoding and language isn't easy, and can never be exact.
- Formatting and parsing of dates, times, numbers, currencies, ... are locale-dependent. Use globalization APIs that use appropriate data.
- If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. make sure the user can override that and pick an explicit value. (ie be automagical about locale choice, etc. but allow the user to manually pick what they want)
- Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names.
- Java globalization support is pretty outdated: use ICU to supplement it. (cf developers should use ICU4J)
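a couple of the tips above (bytes vs characters, properties instead of hard-coded lists) are easy to see in a few lines of core java. this is just an illustrative sketch, the class and method names are made up, and the non-ASCII text is written with unicode escapes so the source file's encoding can't get in the way:

```java
import java.nio.charset.StandardCharsets;

// hypothetical demo for the byte-length and Alphabetic-property tips
class UnicodeTipsDemo {
    // number of bytes the string needs in UTF-8
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // true when every code point has the unicode Alphabetic property
    static boolean allAlphabetic(String s) {
        return s.codePoints().allMatch(Character::isAlphabetic);
    }

    public static void main(String[] args) {
        String da = "\u5927"; // the chinese/japanese character from the list above
        System.out.println(da.length() + " char vs " + utf8Bytes(da) + " bytes");
        String greek = "\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC"; // the greek word from the list above
        System.out.println(allAlphabetic(greek));
        // the same property test in a regex (java spells it IsAlphabetic)
        System.out.println(greek.matches("\\p{IsAlphabetic}+"));
    }
}
```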
if you're interested in using icu4j's new AcceptLanguage method, you'll need to wrapper it. this method makes use of an 'out-parameter' method to return a boolean as to whether the method used a fallback locale (ie. it couldn't find a suitable locale among the server's installed locales, so it returns a fallback locale instead). coldfusion won't pick up on that returned boolean array. below find some java code for this (it returns a structure with the selected locale and whether or not it was a fallback locale):
import java.util.HashMap;
import com.ibm.icu.util.ULocale;
public class ULocaleAcceptLanguage {
/*
class: ULocaleAcceptLanguage
version: 15-jul-2005
author: Paul Hastings paul@sustainableGIS.com
notes: simple wrapper class for ICU4J acceptLanguage
*/
public final static HashMap getULocale(String httpAcceptLanguage){
HashMap results = new HashMap();
boolean[] fallback = new boolean[1];
ULocale thisLocale = ULocale.acceptLanguage(httpAcceptLanguage,fallback);
Boolean fallB= new Boolean(fallback[0]);
results.put("locale",thisLocale.toString());
results.put("fallback",fallB.toString());
return results;
}
}
compile this and drop it in your cfinstall classes dir. you can then make use of it:
<cfscript>
aL=createObject("java","ULocaleAcceptLanguage");
acceptLanguageStr="en-us,th;q=0.7,ar;q=0.3";
uL=aL.getULocale(acceptLanguageStr);
</cfscript>
<cfdump var="#uL#">
do yourself a favor, get this library.
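if you can't get the icu4j jar onto a box, newer core java can do a rough version of the same negotiation. the sketch below is an assumption-laden stand-in for the wrapper above, not a replacement: the class and method names are made up, it matches against a list of supported locales you pass in, and its "fallback" flag only means "nothing matched, so the supplied default was used", which is not the same flag icu4j's acceptLanguage returns:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// hypothetical core java stand-in for the icu4j wrapper
class AcceptLanguageLookup {
    // pick the best supported locale for an Accept-Language header value
    static Map<String, String> choose(String header, List<Locale> supported, Locale dflt) {
        // ranges come back ordered by their q weights, highest first
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(header);
        // RFC 4647 lookup; null when none of the ranges match
        Locale match = Locale.lookup(ranges, supported);
        Map<String, String> results = new HashMap<>();
        results.put("locale", (match != null ? match : dflt).toLanguageTag());
        results.put("fallback", String.valueOf(match == null));
        return results;
    }

    public static void main(String[] args) {
        List<Locale> supported = Arrays.asList(Locale.forLanguageTag("th"), Locale.FRENCH);
        System.out.println(choose("en-us,th;q=0.7,ar;q=0.3", supported, Locale.ENGLISH));
    }
}
```

like the wrapper above it hands back a simple map of strings, which coldfusion sees as a structure.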
- updated to Unicode 4.1
- collation engine updated to UCA 4.1
- fully conformant with CLDR 1.3
- charset detection framework (which looks very useful)
- message formatting apostrophe solution
- additional usability APIs
- new currency listing API
- more API for accessing CLDR data
- Coptic and Ethiopic calendars (that makes 8 icu4j calendars and Dr. Ghasem Kiani's persian calendar for a total of 9, count 'em 9, calendars)
- more efficient data loading
and in case you were wondering, today (2-jul-2005) is October 25, 1721 in the Coptic calendar and October 25, 7497 (Amete Alem Era) in the Ethiopic calendar system.
- a complete set of POSIX-format data generated, along with a tool to generate different platform versions.
- the addition of new data to support localization of timezones
- the addition of data for UN M.49 regions, including continents and regions
- the canonicalization (data in many forms converted to a "standard" form) of the data files, including the consolidation of inherited data
- currency codes are restricted to ISO 4217 codes (historical as well)
- number and data tests to verify LDML implementations
- metadata for LDML
- mappings from language to script and territory
- various other fixes and additions of data, and extensions to the specification
for more details see the press blurb and the version information page.
as a reminder, icu4j makes use of the CLDR for its locale data. hubba hubba.
which brings us to the point of this blog entry, this method expects the year argument to be a persian calendar "year" (right now it's 1383 in the persian calendar). which i didn't quite grasp at first, as the other calendars (gregorian, buddhist and japanese) with leap years have an isLeapYear method that expects a gregorian year (yes, even the buddhist and japanese calendar classes expect a gregorian year, i imagine this is because these calendars extend the gregorian calendar class). and that's the way i expected the new persian calendar to behave (my own cultural bias--i use the buddhist and gregorian calendars on a daily basis). but it doesn't and why the heck would it? it is a persian calendar after all. so that got me to thinking about the other calendars and the way these "should" work and what other cultural biases have leaked into our code and test harnesses--especially the tests.
first thing i did was to rewrite the i18nIsLeapYear functions across all the calendars to expect a year argument in that calendar's system (it converts to gregorian year as needed and now automagically returns false for calendars lacking the concept of a "leap year").
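the "year in that calendar's own system" convention is the same one core java's later java.time.chrono package settled on, which makes for a handy illustration. to be clear about the assumptions: this is not the i18nIsLeapYear CFC code and not icu4j, the class and method names are made up, and it only covers calendars that java.time ships with:

```java
import java.time.chrono.Chronology;
import java.time.chrono.IsoChronology;
import java.time.chrono.MinguoChronology;
import java.time.chrono.ThaiBuddhistChronology;

// hypothetical demo: leap year tests that take the calendar's own year number
class LeapYearDemo {
    static boolean isLeapYear(Chronology calendar, long yearInThatCalendar) {
        return calendar.isLeapYear(yearInThatCalendar);
    }

    public static void main(String[] args) {
        // three spellings of the same leap year
        System.out.println(isLeapYear(IsoChronology.INSTANCE, 2004));          // gregorian
        System.out.println(isLeapYear(ThaiBuddhistChronology.INSTANCE, 2547)); // buddhist era
        System.out.println(isLeapYear(MinguoChronology.INSTANCE, 93));         // taiwanese (minguo)
        // passing a gregorian year to the buddhist calendar asks about a different year entirely
        System.out.println(isLeapYear(ThaiBuddhistChronology.INSTANCE, 2004));
    }
}
```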
then i went a hunting for any other places where my cultural bias might have leaked thru....and promptly found it in the getYear function. the getYear function takes a gregorian year value and returns the year in that calendar's system. i was doing that by creating a date:
(and just in case you were wondering, the 2 for the day value is to make sure the date value created fell into that year, given that we're using UTC as the time zone standard for all the calendars). and then setting the calendar object to that date and returning the value for that calendar object's YEAR field:
return tCalendar.get(tCalendar.YEAR);
simple and worked swell for the gregorian, buddhist and japanese calendars because these calendars' year started at the same time. but after looking at the year values of formatted dates from the other calendars i realized that the getYear function was returning horrible nonsense for the other 4 calendars. without realizing it, i'd let my calendar bias creep in and assumed the calendars were all the same as far as years were concerned. gregorian 2-jan actually falls into different calendar years depending on the calendar (of course, they're different freaking calendars). and the tests were only reporting whether the getYear function "worked" by checking if the year was a positive integer, no eyeball comparisons against the year bits of the formatted date strings. there's a lesson here somewhere.
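a quick way to see the problem is to ask two different calendars what year a couple of gregorian dates fall in. the sketch below uses core java's java.time.chrono (an assumption, it is not the icu4j calendar code or the getYear function discussed here, and the class and method names are made up). note that java's islamic calendar is the umm al-qura variant, so other islamic calendar flavors may disagree near a year boundary:

```java
import java.time.LocalDate;
import java.time.chrono.ChronoLocalDate;
import java.time.chrono.Chronology;
import java.time.chrono.HijrahChronology;
import java.time.chrono.ThaiBuddhistChronology;
import java.time.temporal.ChronoField;

// hypothetical demo: the year of one gregorian date as seen by another calendar
class YearInCalendar {
    static int yearOf(Chronology calendar, LocalDate gregorianDate) {
        ChronoLocalDate converted = calendar.date(gregorianDate);
        return converted.get(ChronoField.YEAR);
    }

    public static void main(String[] args) {
        LocalDate jan2 = LocalDate.of(2005, 1, 2);
        LocalDate mar1 = LocalDate.of(2005, 3, 1);
        // buddhist years start with the gregorian ones, so both dates agree
        System.out.println(yearOf(ThaiBuddhistChronology.INSTANCE, jan2));
        System.out.println(yearOf(ThaiBuddhistChronology.INSTANCE, mar1));
        // islamic years don't, so one gregorian year straddles two of them
        System.out.println(yearOf(HijrahChronology.INSTANCE, jan2));
        System.out.println(yearOf(HijrahChronology.INSTANCE, mar1));
    }
}
```

which is also the kind of check the tests should have been making: compare against known year values, not just "is it a positive integer".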
so better grab the new code and maybe give the calendars a good poking at to make sure no other cultural bias is left in it.
note that this version of the persian calendar uses a "well-known arithmetic algorithm for calculating the leap years" rather than astronomical calculations.
i'd like to publicly thank Dr. Ghasem Kiani for his work on this project, we've been waiting quite a while for a persian calendar to round off our i18n calendars. thanks.
a lunar calendar was used in japan from the 14th to the 19th century. that calendar had a six day week and those six days were known as rokuyo. and like any other calendar system, each day had a name and a particular meaning (you do know that the english weekdays are named after one of the seven "planets" of ancient times?). and of course, each day had a significance:
- sakigachi: good luck in the morning, bad luck in the afternoon
- tomobiki: good luck all day, except at noon
- sakimake: bad luck in the morning, good luck in the afternoon
- butsumetsu: unlucky all day, as it is the day Buddha died
- taian: 'the day of great peace', a good day for ceremonies
- shakku: bad luck all day, except at noon
while i'd guess few people would admit to closely adhering to this system, it does invoke some strange "better safe than sorry" behaviors. for instance, some hospital patients in japan won't agree to be discharged on butsumetsu day, as it's regarded as being very unlucky. rather they'd stay the extra 24 hours to be discharged on a lucky taian day.
the calculations for determining rokuyo turn out to be surprisingly difficult. in fact, the only published code i ever saw for this was developed by Eirik Rude, a cf developer (at that time living in japan). the complexity comes from the need to calculate lunar months (remember the old japanese calendar?). since i wanted to integrate this functionality with our existing icu4j-based calendars, i poked thru the lunar calendars (chinese, islamic and hebrew) that i knew about to see if we could use any of these. of course, the old japanese lunar calendar was basically the lunisolar chinese calendar. using Eirik's basic logic and the icu4j library i was able to considerably reduce the code's complexity (the complexity's still there, but i pushed it down into the icu4j java library where smarter people than i have already dealt with it).
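for the curious, a rough sketch of the very last step. it assumes the commonly cited rule that the rokuyo is (lunar month + lunar day) mod 6, and that the lunar month and day have already been worked out by a lunisolar calendar (the hard part, which is what gets pushed down into icu4j). a leap month is assumed to reuse the number of the month it repeats. the class, enum and method names are hypothetical; this is neither Eirik's code nor the CFC's:

```java
// hypothetical sketch; names are illustrative only
final class Rokuyo {

    // ordered so that the ordinal equals (lunarMonth + lunarDay) % 6
    enum Day { TAIAN, SHAKKU, SAKIGACHI, TOMOBIKI, SAKIMAKE, BUTSUMETSU }

    // lunarMonth: 1..12, lunarDay: 1..30, both in the old lunisolar calendar
    static Day of(int lunarMonth, int lunarDay) {
        if (lunarMonth < 1 || lunarMonth > 12 || lunarDay < 1 || lunarDay > 30) {
            throw new IllegalArgumentException(
                    "not a lunar month/day: " + lunarMonth + "/" + lunarDay);
        }
        return Day.values()[(lunarMonth + lunarDay) % 6];
    }

    public static void main(String[] args) {
        // first day of the first lunar month
        System.out.println(of(1, 1));
    }
}
```

under this rule the cycle simply restarts at a fixed point on the first day of every lunar month, which is why everything hinges on getting the lunar month boundaries right.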
the rokuyo testbed is here and the i18n calendars package incorporates this new functionality (pick japanese calendar from the select). and this is a good resource if you want to read more about rokuyo.
time:= {ts '2005-02-20 16:56:03'}
cf epoch:=38403.7055903 (days since 31-dec-1899)
universal time from cf time:=632,447,745,630,000,000
universal time to cf time:= 38403.7055903
coldfusion timescale:=38403.7055903 (days since 31-dec-1899)
excel timescale:=38403.7055903 (days since 31-dec-1899)
db2 timescale:=38403.7055903 (days since 31-dec-1899)
windows timescale:=6.3244774563E+017 (ticks (100 nanoseconds) since 1-jan-0001)
windowsfile timescale:=1.2753478563E+017 (ticks (100 nanoseconds) since 1-jan-1601)
mac timescale:=130697763 (second since 1-jan-2001)
oldmac timescale:=3191849763 (seconds since 1-jan-1904)
unix timescale:=1109005407 (seconds since 1-jan-1970)
java timescale:=1.109004963E+012 (milliseconds since 1-jan-1970)
the CFC will be in the usual places in a bit.
- ICU main page
- library's download page
- ICU documentation page, with the icu4j API docs now here
- icu4j FAQ
- RB manager
- additional docs
on the topic of icu4j, i knocked off a couple of pages to explore its new ULocale class (after somebody asked me how many new locales there were for India and i had no idea). i was surprised by the answer.
if that doesn't surprise you, try the United Kingdom or Ethiopia.
the code is also considerably improved, it's now based on ICU4J version 3.2 and its ULocale class (232 locales, 100 more than blackstone). several of the more commonly used functions have been re-written and we're seeing 3x-4x speed improvement over the older versions. frankly, i'm a bit baffled why. for instance:
following the ICU4J API and some examples, we initialized date formatting objects with the calendar class (Buddhist, Chinese, Gregorian, Hebrew, Islamic, Japanese) we were working with:
var thisCalendar=aCalendar.init(utcTZ,thisLocale);
// return formatted date
return aDateFormat.getDateInstance(thisCalendar,tDateFormat,
thisLocale).format(dateConvert("utc2local",arguments.thisDate));
was reworked into this:
var tDateFormatter=aDateFormat.getDateInstance(tDateFormat,thisLocale);
// swap calendars
tDateFormatter.setCalendar(aCalendar.init(utcTZ,thisLocale));
return tDateFormatter.format(dateConvert("utc2local",arguments.thisDate));
this builds the date formatter object with the default calendar, then we swap it to the calendar we want to use (the tDateFormatter.setCalendar bit). that sped up this function 3x-4x! while it "seems" less efficient it actually worked quite a bit faster.
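the same build-then-swap pattern, sketched in plain java. this uses the JDK's java.text.DateFormat as a stand-in for icu4j's com.ibm.icu.text.DateFormat (an assumption, made so the snippet runs without the icu4j jar). the class and method names are hypothetical, and no timing claim is made for the JDK classes:

```java
import java.text.DateFormat;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// hypothetical demo class; JDK classes stand in for the icu4j ones
final class SwapCalendarDemo {

    static String formatFullDate(Date date, Locale locale) {
        // 1. build the formatter with whatever calendar the locale defaults to
        DateFormat formatter = DateFormat.getDateInstance(DateFormat.FULL, locale);
        // 2. swap in the calendar we actually want (here: gregorian, pinned to UTC)
        formatter.setCalendar(new GregorianCalendar(TimeZone.getTimeZone("UTC"), locale));
        return formatter.format(date);
    }

    public static void main(String[] args) {
        System.out.println(formatFullDate(new Date(0L), Locale.US));
    }
}
```

the formatter keeps its locale-specific pattern and only the calendar underneath it changes, so the calling code stays the same whichever calendar gets swapped in.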
you can see the testbed and download the CFC package here. any comments appreciated.
and now we all know why there's no persian calendar in icu4j....those rotten klingons are blocking it.
- icu4j locale data is now 100% built from the CLDR 1.2 data, and has data for 232 locales!
- the user guide got a major overhaul (not that anybody reads user guides but hey, they did overhaul it)
- Universal Timescale conversions have been added that allow you to swap between binary datetimes on different platforms
- Accept-Language: icu4j now provides a mechanism for parsing http_accept_language vars and matching against locales--no more parsing these ourselves, and i can tell you the ones from Apple boxes used to give me the dry heaves. oops, this didn't make it into the final release (so apple http_accept_language vars are still making me sick)
- RFC 3066 locale ID support has been added
- and of course bug fixes
if you do any i18n work, you should pick up this release. you'll find it here.
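since the Accept-Language helper didn't ship, here's roughly what the do-it-yourself version can look like. this sketch leans on the JDK's Locale.LanguageRange and Locale.lookup (java 8 and later, so a good deal newer than the release discussed here). the header value, the list of supported locales and the class name are all made up for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// hypothetical demo class, JDK only
final class AcceptLanguageDemo {

    // returns the best supported locale for the header, or the fallback if nothing matches
    static Locale pick(String acceptLanguage, List<Locale> supported, Locale fallback) {
        // parse() returns the ranges ordered by their q weights, highest first
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
        Locale best = Locale.lookup(ranges, supported);
        return best != null ? best : fallback;
    }

    public static void main(String[] args) {
        List<Locale> supported = Arrays.asList(
                Locale.forLanguageTag("en-US"),
                Locale.forLanguageTag("th-TH"),
                Locale.forLanguageTag("ja-JP"));
        System.out.println(pick("th-TH,th;q=0.8,en;q=0.5", supported, Locale.US).toLanguageTag());
    }
}
```

real-world headers can be malformed, and LanguageRange.parse throws an IllegalArgumentException when they are, so production code would want to catch that and drop to the fallback.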
this is a pretty significant release. to the already nifty features it adds:
- icu4j locale data is now completely built from the CLDR 1.2 data which includes interesting locales like en_US_POSIX English (United States, Computer), eo Esperanto, fa_AF Persian (Afghanistan), kl_GL Kalaallisut (Greenland), kw_GB Cornish (United Kingdom) and a whole bunch more. that's 230 icu4j locales vs 134 locales in core java!
- icu4j now overloads its methods that accept locales to take either java locales or its own ULocales
- Universal Timescale conversions
- DateTimeFormat object initialization performance improvement!!
- and of course bug fixes ;-)
there's also an eclipse how-to for icu4j.
all in all, it's pretty cool.
in case you're interested, there's also a cldr wiki.
at about the same time there was an announcement on the icu4j mailing list about the next version being built on CLDR data. so i asked if that meant that we'd be able to make use of all the "new" locales in CLDR like farsi, etc. one of the icu4j guys (steven loomis) replied "yes" and further pointed out that icu4j 2.8 was already making use of icu4c's locale data. further discussion with steven helped debunk one of my long held misconceptions, that a java "locale" was a real world "Locale" (ie. the locale bundled up with all its attendant resource data such as day/month names, etc.). "Locales are just identifiers" says steven, "duh!" says i. while it's convenient to think locales == Locales, formally in java "locale" refers to the identifier and not the data.
so what? what that means, if you're using icu4j for your i18n work (and you should), is that you have access to all the nifty locales that icu4j has no matter what core java supports (or doesn't support in this case). so something like this becomes possible (and easy):
<cfscript>
fullFormat=javacast("int",0);
farsiLocale=createObject("java","java.util.Locale").init("fa","IR");
utcTZ=createObject("java","com.ibm.icu.impl.JDKTimeZone").getTimeZone("UTC");
aDateFormat = createObject("java","com.ibm.icu.text.DateFormat");
aCalendar =createObject("java","com.ibm.icu.util.GregorianCalendar").init(utcTZ,farsiLocale);
dF=aDateFormat.getDateInstance(aCalendar,fullFormat,farsiLocale);
writeoutput("#farsiLocale.getDisplayName(farsiLocale)# #dF.format(now())#<br>");
</cfscript>
which produces:
Persian (Iran) دوشنبه، ۱۸ اکتبر ۲۰۰۴
note that the core java getDisplayName method falls back on "Persian (Iran)" which while not perfect is better than nothing. icu4j 3.0 ULocale class would actually produce the correctly localized name.
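steven's "locales are just identifiers" point can be seen with core java alone. in this sketch (class and method names hypothetical; Locale.forLanguageTag needs java 7 or later), building a locale always succeeds because it's only a label. whether any resource data stands behind it is a separate question that has to be asked explicitly:

```java
import java.util.Arrays;
import java.util.Locale;

// hypothetical demo class, JDK only
final class LocaleIsJustAnId {

    // true if this JVM lists the locale among those it ships data for
    static boolean hasCoreJavaData(Locale locale) {
        return Arrays.asList(Locale.getAvailableLocales()).contains(locale);
    }

    public static void main(String[] args) {
        Locale farsi = Locale.forLanguageTag("fa-IR");
        // the identifier exists whether or not the JVM has data for it
        System.out.println(farsi.getLanguage() + "_" + farsi.getCountry());
        System.out.println("core java data available: " + hasCoreJavaData(farsi));
    }
}
```

the second line of output depends on the JVM in use, which is rather the point: the identifier is portable, the data behind it isn't.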
the more i work with icu4j, the more impressed i am with how well-thought it is. it really is the bees' knees for i18n work.
thanks to steven for enlightening me.
IBM's found & fixed these, but not yet updated the jar.
you can see the bug in action here. beyond that bug, that page also shows off spike's oh so cool relative classpath technique. it's actually loading & using two different versions of icu4j, neither of which is in mx server's classpath. yeah i know, i'm easily impressed, but to my mind spike's technique is cool. it works around a whole lot of dependency issues we have had to live with.
in more icu4j news, IBM's also just announced the release of a new version of rbManager. we use this tool a lot--it's the cat's pajamas of rb tools.
while i was perusing the icu4j site i stumbled across this interesting page: collation performance comparison. wow! icu4j beats the snot out of the plain java JDK for collation over most locales (except for ja_JP and ko_KR locales, note that locales <> collation). i know that collation is of some interest to many i18n folks, so this is kind of interesting news.
| Source | Datatype | Unit | Epoch |
|---|---|---|---|
| JAVA_TIME | int64 | milliseconds | Jan 1, 1970 |
| UNIX_TIME | int32 | seconds | Jan 1, 1970 |
| ICU4C | double64 | milliseconds | Jan 1, 1970 |
| WINDOWS_FILE_TIME | int64 | ticks (100 nanoseconds) | Jan 1, 1601 |
| WINDOWS_DATE_TIME | int64 | ticks (100 nanoseconds) | Jan 1, 0001 |
| MAC_OLD_TIME | int32 | seconds | Jan 1, 1904 |
| MAC_TIME | ? | seconds | Jan 1, 2001 |
| EXCEL_TIME | ? | days | Dec 31, 1899 |
| DB2_TIME | ? | days | Dec 31, 1899 |
java and Unix, while having the same epoch (origin), differ in datatype and units, so they differ in accuracy and range. windows' time scales differ internally for OS vs file system (no snickering). at the current state of this proposal, he's chosen to use Windows datetime as a "universal 'pivot'". that gives a time scale range from 29,000 BC to 29,000 AD. i guess IBM really does take the long term view ;-)
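the pivot idea boils down to offset-and-scale arithmetic. the sketch below is not icu4j's implementation. it derives the epoch offset with java.time (java 8+, proleptic gregorian calendar assumed throughout) and converts java milliseconds to and from windows datetime ticks. class and method names are hypothetical:

```java
import java.time.LocalDate;

// hypothetical sketch of pivoting through windows datetime ticks
final class TimescaleSketch {

    static final long TICKS_PER_MILLI = 10_000L;    // 1 tick = 100 nanoseconds
    static final long MILLIS_PER_DAY = 86_400_000L;

    // days from 1-jan-0001 to 1-jan-1970 (toEpochDay counts from 1970, so negate)
    static final long DAYS_0001_TO_1970 = -LocalDate.of(1, 1, 1).toEpochDay();

    // java time (ms since 1970) -> windows datetime (ticks since 1-jan-0001)
    static long javaToWindowsTicks(long javaMillis) {
        return Math.multiplyExact(
                Math.addExact(javaMillis, DAYS_0001_TO_1970 * MILLIS_PER_DAY),
                TICKS_PER_MILLI);
    }

    // windows datetime ticks -> java time, dropping anything finer than a millisecond
    static long windowsTicksToJava(long ticks) {
        return Math.floorDiv(ticks, TICKS_PER_MILLI) - DAYS_0001_TO_1970 * MILLIS_PER_DAY;
    }

    // how many years fit on either side of the epoch in a signed 64-bit tick count
    static long pivotRangeInYears() {
        long ticksPerYear = 365L * MILLIS_PER_DAY * TICKS_PER_MILLI; // ignoring leap days
        return Long.MAX_VALUE / ticksPerYear;
    }

    public static void main(String[] args) {
        System.out.println(javaToWindowsTicks(0L));
        System.out.println(pivotRangeInYears());
    }
}
```

the range helper comes out at a bit over 29 thousand years each way, in line with the figure quoted above. the other scales in the table would follow the same pattern with a different epoch offset and unit multiplier.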
if you want to provide feedback i guess you'll have to join the ICU mailing list.
so now you know.
- historical timezones: "where daylight savings time rules or other related data have changed after the date in question". cool.
- updated locales and more locale methods (to access stuff like paper page sizes, measurement systems, etc.). cool.
- improved sorting (now does proper Thai Royal Dictionary order). way cool for me ;-)
- XLIFF conversion tool (in case you're developing your own resource data)
- a how-to for using eclipse with ICU4J
- bug fixes, performance improvements, etc.
it's available from this page.
once again, ibm's icu4j comes through. its com.ibm.icu.text.RuleBasedNumberFormat class has a nifty format method with spellout rulesets for some locales (in this case we're only interested in thai but there are others available in the class). once i slapped a wrapper class around its format method it was good to go. you can see it in action on this testbed. i'll make it and the wrapper class available once i get currency formatting set up and tested, as well as figure out how to add other locales' rulesets (and get other rulesets' data; for instance i'd really like to see rulesets for arabic locales).
one bone i have to pick w/mx's java support is the constant need to write wrapper classes to handle (dumb down) various format() methods. it makes distributing and maintaining some i18n CFCs more of a pain than need be. i was hoping some java guru might explain the whys and the wherefores, any takers?
quoting the ICU4J site:
list of significant changes for the 2.6.1 release:
- UCA 4.0: ICU has been updated to use the latest version of UCA - 4.0.
- Thai Royal Dictionary Collation: Thai collation tailoring has been updated to reflect the Thai Royal Dictionary ordering. Changes have been made to collation code in order to properly support invalid Thai sequences.
- Collation: parser/builder bug fixes: Several bugs in collation rule parser and builder have been fixed.
- Unicode character properties data has been synched with ICU4C
- Other bug fixes: Bugs have been fixed in layout engine (jitterbug number 3041), BiDi (3174), string functions (3243) and platform support (3097).

