given the timezone hell i recently passed through and the recent US and Australia DST changes, i plan on beefing up the section on timezones. and in keeping w/ben's idea to slim things down, we'll be pushing most of those "boring" locale table comparisons out to on-line appendices. might also add a wee bit on using flex in g11n cf apps.
so before i really begin the excruciating process of revising that chapter, i'm looking for feedback on it. anything missing? anything not too clear? you can respond here or simply email me with your suggestions.
thanks.
among my favorites that apply in one way or another to coldfusion (i've yakked about these in various articles/books/blog entries but good stuff usually bears repeating):
- Unicode encodes characters, not glyphs: U+0067 » ggggggg (one code point, however a font chooses to draw it)
- Unicode does not encode characters by language: French, German, English j have the same code point even though all have different pronunciations; Chinese 大 (da) has the same code point as Japanese 大 (dai).
- Length in bytes may not be N * length in characters
- Not all text is correctly tagged with its charset, so character detection may be necessary. But remember, it's always a guess.
- Use properties such as Alphabetic, not hard-coded lists: isAlphabetic(), \p{Alphabetic} in regex
- Transliteration (Ελληνικά ↔ Ellēniká) is not the same as Translation (Ελληνικά ↔ Greek)--users of my transliteration CFC please take note
- Unicode ≠ Globalization. Unicode provides the basis for software globalization, but there's more work to be done...
- Don't simply concatenate strings to make messages: the order of components differs by language. Use Java MessageFormat or equivalent (like the rbJava or javaRv CFCs).
- Don't put any translatable strings into your code; make sure those are separated into a resource file.
- Don't assume everyone can read the Latin alphabet. Don't assume icons and symbols mean the same around the world.
- Tag all data explicitly. Trying to algorithmically determine character encoding and language isn't easy, and can never be exact.
- Formatting and parsing of dates, times, numbers, currencies, ... are locale-dependent. Use globalization APIs that use appropriate data.
- If you heuristically compute territory IDs, timezone IDs, currency IDs, etc., make sure the user can override that and pick an explicit value. (i.e. be automagical about locale choice, etc., but allow the user to manually pick what they want)
- Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names.
- Java globalization support is pretty outdated: use ICU to supplement it. (cf developers should use ICU4J)
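a few of these tips are easy to see for yourself from plain java, which is where cf developers end up anyway. what follows is only a minimal sketch and not part of the original list: the class name G11nTips and its helper methods are made up for illustration, and it sticks to the core JDK rather than ICU4J. one thing to watch: java.util.regex spells the Alphabetic property as \p{IsAlphabetic}.

```java
import java.nio.charset.StandardCharsets;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class G11nTips {

    // Property-based test: one or more characters carrying the Unicode Alphabetic property.
    private static final Pattern ALPHABETIC = Pattern.compile("\\p{IsAlphabetic}+");

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points, which can differ from String.length().
    static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }

    // True when every character is alphabetic in any script, with no hard-coded a-z list.
    static boolean isAllAlphabetic(String text) {
        return ALPHABETIC.matcher(text).matches();
    }

    // Builds a message from a pattern, so a translator can reorder {0} and {1} freely.
    static String message(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        String da = "\u5927"; // the CJK character used in the list above
        String greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"; // "Ellenika" in Greek script

        System.out.println(codePointLength(da) + " code point, " + utf8Length(da) + " bytes");
        System.out.println(codePointLength(greek) + " code points, " + utf8Length(greek) + " bytes");
        System.out.println(isAllAlphabetic(greek) + " / " + isAllAlphabetic("abc123"));

        // Same arguments, two word orders: only the pattern (from a resource file) changes.
        System.out.println(message("{0} sent you {1} files", Locale.ENGLISH, "ana", 3));
        System.out.println(message("{1} files from {0}", Locale.ENGLISH, "ana", 3));
    }
}
```

in a real app the patterns would come out of a resource bundle rather than sitting in the code, which is the whole point of the next tip in the list.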
- how to make utf-8 HTML pages: a good read, even if it does contain a bizarre note about windows notepad and the BOM.
- determining a file's encoding: most notable for its advice, which is basically to use a browser ;-)
- some i18n sun blogs (none of which i knew about):
- i18n G.A.L.: for all things international, only some of them software...
- norbert lindenberg's blog: sun's technical lead for java i18n (he doesn't like these kinds of abbreviations, which is too bad because i do)
- tim forster's blog: mostly about translation tools
but you already know all that....
- icu4j: i literally couldn't do g11n work without this java library. while much of its pioneering i18n functionality has been absorbed into the java core, it still offers hard/impossible-to-duplicate functionality like non-gregorian calendars, holidays & super-sized collations. it is the bee's knees of i18n s/w. and of course, it's free.
- unicode: after watching folks' codepage encoding antics in the user forums, what can i say, just use unicode ©.
- Common Locale Data Repository: while still in beta, the CLDR is going to be the locale reference. it was thought to be so important that its maintenance was handed off to the unicode organization by the openi18n org. need to know the currency used in Thailand? short weekday names used in Turkish? writing system direction in Afghanistan? this repository is the place to look first. all the info is contained in one XML file per locale (not that i enjoy parsing XML files but i can put up with that chore for the goldmine of locale info it provides).
- rbManager: if you do g11n work, you build resource bundles (well, you should be doing this anyway). if you build resource bundles (rb), then you need a tool. i've looked at and played around with a bunch of rb tools & still haven't found anything as easy to use or as sophisticated as rbManager. the price (free) is pretty good too. i18nEdit gets an honorable mention for its nifty unicode char picker for those days when you're too lazy to load another locale.
- SC UniPad: need a unicode text editor that can handle inuktitut and braille at the same time? look no further than the plenty fine SC UniPad. i get a kick out of just using it. also a nice tool to double check rb files.
- unifier: if you have to batch convert text/html docs from codepage encodings to unicode (and who doesn't) this will probably be the best 15 bucks you'll ever spend.
- javaInetLocator: i built my geoLocator CFC around nigel wetter's javainetlocator class. if you need to know the country and locale of a user (well their IP anyway), this is probably the best non-commercial tool around (and i can say it's probably better than many commercial ones i've looked at). it's fast (i have another geoLocator tool built around db-based IP range queries and nigel's class beats the pants & socks off of it) and free.
- iText: i've used this java library quite a bit to burn PDFs. it offers really fine control that we often need (municipal tax receipts for instance) & is a piece of cake to use.
- cfstudio 5: what can i say, i'm old and in the way. while my colleagues laugh that i still use this "antique", i keep reminding them that muscle memory means more and more as you get older (i've literally pounded the alt f & s keys off of several keyboards over the years, while i've had the same industrial-strength ms mouse for almost 10 years). and nope, no reference, as i couldn't for the life of me tell you where to buy it these days. that said, i'm trying to give cfEclipse a fair trial (it would help a whole bunch though if it had better docs, hint hint spike).
- java i18n forums: while i don't spend much time there these days, these forums are still a valuable i18n info source. if you do serious i18n work with cf, you know you have to dip down into java quite a bit and if you get stumped as much as i did, these forums are often a life saver. another good java library/info site is of course IBM's developerWorks. just a for instance: i wanted to learn how to do i18n string searches & "Efficient text searching in java" turned up (yes, that article is a bit dated).
- books-on-line (BoL): i do a lot of work with ms sql server (frankly i prefer it) and the BoL has come to be my constant companion (my cat neutron uses the pile of sql books i've bought over the years as a spot to cat nap--speaking of cats, i still get a great kick out of the my cat hates you site). you really can't do better than this for an ms sql server reference.
at first glance it looks like it just does text localization, which, while not the only part of i18n work, is the dreariest. MAT also really won't help apps that aren't at least somewhat i18n-ready (at least according to the public FAQ). from the public site, i'm not really sure if it does web apps. not sure what smaller localization shops will make of this. it might lose them their marginal/low end business. is nothing safe ;-)
now if we could just get mm to provide native resourceBundle functionality....
and oops! in my zeal over the cookie encoding issue posted a few days ago, i failed to double-check whether the setEncoding function actually works with the cookie scope. it doesn't of course. sorry about that. i wonder if it should?
you can find some more interesting reading on g11n business aspects here. the article on chinese whispers is particularly cool.
ps: yes i know canada is bilingual and a very compelling case for "backyard globalization" too but i just couldn't resist ;-)

