September 18, 2008
unicode reading list
the unicode consortium just released a recommended reading list. there's even a couple in french & german.

i see one i don't have, so it's off to amazon. you can never have enough books on unicode (or coldfusion for that matter) ;-)

May 5, 2008
the death of codepages?
mark davis, via the unicode mailing list, mentioned an official google blog posting that shows that unicode "was the most frequent encoding found on web pages" since dec-2007 (unicode, utf-8, is the blue line on the graph below). wow. i guess people really do get it :-)

reference: Moving to Unicode 5.1

June 25, 2007
derby does unicode
no idea why but i only just now got around to testing coldfusion 8's embedded derby database for unicode support. first thing i did was create a db using ben's going-to-get-even-easier advice to create a db in derby. i got a bit freaked out when i tried to create a table to test unicode strings using an Nvarchar data type and derby spit back a Feature not implemented: NATIONAL CHAR VARYING error. that started me scratching my head for a few minutes while it slowly dawned on me that derby is a java based db and unicode would be native to it. changed the Nvarchar to a plain varchar datatype and bob's your uncle.

very nice indeed.

September 18, 2006
analysis of the olmec hieroglyphs
michael everson, a virtual language encoding machine and leading light in the unicode world, has just posted a brief analysis of the recently discovered "olmec hieroglyphs". while the analysis isn't a "decipherment", i find the way michael attacked the analysis fascinating.

February 21, 2006
good i18n practices really are good
an i18n-related issue popped up on the cfeclipse list yesterday that reinforced (at least to me) that good i18n practices really are good. a user had their eclipse encoding set up as UTF-8 yet was getting their unicode coldfusion pages garbaged. my first look at this used code from our existing codebase and of course it worked. for the life of me, well for 2-3 hours anyway, i couldn't see how this was going wrong. it wasn't until i whipped up a simple dummy page that just had unicode text and nothing else that i was able to see the problem. the issue is simple but clearly illustrates a good i18n practice.

eclipse (not cfeclipse) doesn't add a BOM to UTF-8 encoded files. why? well

  • the BOM isn't actually required as part of the definition of UTF-8 (and i know of plenty of s/w that either doesn't write one out or in fact strips them from files)
  • in the past (i think) the java compiler wouldn't compile a file w/a BOM & since that's what eclipse was originally meant for, NOT having a BOM makes perfect sense (from a very quick test i just ran it seems this is no longer true, at least from within eclipse)

so why was our cfeclipse-edited UTF-8 encoded code working? because we follow our own good i18n practices and liberally use encoding hinting starting with the cfprocessingdirective. each of our coldfusion pages starts with:

<cfprocessingdirective pageencoding="utf-8">

BOM or no BOM, this ensures your code will always be interpreted as UTF-8. for more good i18n practices grab a copy of the advanced coldfusion book.
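as a rough illustration of what's at stake: a UTF-8 BOM is nothing more than the three bytes EF BB BF at the very start of a file, so a reader with neither a BOM nor an explicit hint is left guessing at the encoding. the java sketch below checks for those bytes and then decodes explicitly as UTF-8 either way. the class and method names are made up for this example; this is not how eclipse or coldfusion actually do it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Eclipse or ColdFusion.
final class Utf8BomSketch {

    // True when the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Decode as UTF-8 whether or not a BOM is present, dropping the BOM if it is.
    static String decodeUtf8(byte[] data) {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }
}
```

the point is the same as with the cfprocessingdirective: the encoding is stated explicitly rather than inferred from whatever the editor happened to write.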

see? good i18n practices really are good.

February 18, 2006
unicode font madness
ever needed a font to handle the Berber language? or Khmer? while i most often use the massive Arial Unicode MS for our i18n work there are some rare occasions where it doesn't contain the glyphs we need. and other occasions where i simply like the way a font looks (like the Tifinagh abjad used to write Berber).

well, look no further. the Unicode Font Guide For Free/Libre Open Source Operating Systems has put together a super cool collection of free/cheap fonts covering pretty much every language in the world. the content is organized regionally which to me makes a boat load of sense.

the main site also has some excellent font/web related resources including an XHTML and CSS guide for middle school students (which i dare say some cf developers, like me for example, could make good use of).

this is another excellent i18n resource to add to your bookmarks.

just for fun, below is an example of Tifinagh abjad. tell me this doesn't look so cool, there's something almost alien about it.
[image: Tifinagh abjad]

October 24, 2005
g11n gotchas
a couple-three emails i got recently prompted me to think (again) about what globalization means to the average coldfusion developer. coincidentally mark davis, IBM's front man for g11n and president of the Unicode Consortium, is putting together a presentation for the next Unicode conference dealing with "Globalization Gotchas". i highly recommend that cf developers doing i18n/g11n work review these, it's certainly worth the effort.

among my favorites that apply in one way or another to coldfusion (i've yakked about these in various articles/books/blog entries but good stuff usually bears repeating):

  • Unicode encodes characters, not glyphs: U+0067 » ggggggg
  • Unicode does not encode characters by language: French, German, English j have the same code point even though all have different pronunciations; Chinese 大 (da) has the same code point as Japanese 大 (dai).
  • Length in bytes may not be N * length in characters
  • Not all text is correctly tagged with its charset, so character detection may be necessary. But remember, it's always a guess.
  • Use properties such as Alphabetic, not hard-coded lists: isAlphabetic(), \p{Alphabetic} in regex
  • Transliteration (Ελληνικά ↔ Ellēniká) is not the same as Translation (Ελληνικά ↔ Greek)--users of my transliteration CFC please take note
  • Unicode ≠ Globalization. Unicode provides the basis for software globalization, but there's more work to be done...
  • Don't simply concatenate strings to make messages: the order of components differs by language. Use Java MessageFormat or equivalent. (like the rbJava or javaRB CFCs)
  • Don't put any translatable strings into your code; make sure those are separated into a resource file.
  • Don't assume everyone can read the Latin alphabet. Don't assume icons and symbols mean the same around the world.
  • Tag all data explicitly. Trying to algorithmically determine character encoding and language isn't easy, and can never be exact.
  • Formatting and parsing of dates, times, numbers, currencies, ... are locale-dependent. Use globalization APIs that use appropriate data.
  • If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. make sure the user can override that and pick an explicit value. (i.e. be automagical about locale choice, etc. but allow the user to manually pick what they want)
  • Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names.
  • Java globalization support is pretty outdated: use ICU to supplement it. (cf developers should use ICU4J)
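
two of the gotchas above are easy to show in a few lines of java. this is only a sketch: the class and method names are made up for illustration, and it sticks to core java (Character.isAlphabetic and java.text.MessageFormat) rather than ICU4J.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical helper for illustration only.
final class GotchaSketch {

    // Use the Unicode Alphabetic property instead of a hard-coded a-z list.
    static boolean allAlphabetic(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(Character::isAlphabetic);
    }

    // Let a (translatable) pattern decide the order of the components
    // instead of concatenating strings in code.
    static String format(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }
}
```

with the pattern "{1} sent {0}" and the arguments "a file" and "Ann", format returns "Ann sent a file"; a translator can reorder the placeholders in the pattern without anyone touching the code. in a real app the pattern would come out of a resource file, not a string literal.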

June 23, 2005
as you might already know, utf-7 is not a supported java (and hence cf) charset. it does however exist in the wild, mainly as part of bounced email systems and sometimes in webmail like hotmail (well mainly hotmail, i've never seen it anywhere else to tell you the truth) as well as MS Exchange. folks have been complaining off and on about this for years, many mistakenly blaming macromedia for a sun java bug. votes have piled up in sun's java bugparade but alas and alack, nothing's been done about it. until now. there's a very persistent thread (it's been running since feb-2004) in the cf support forums concerning this issue. a few days ago somebody (gdbezona) posted a link to an open source utf-7 charset, JCharset. if you drop that jar (jcharset.jar) into the cfinstall/runtime/jre/lib dir and stop/restart the cf server service, cf will pick up that utf-7 charset fine. we've exercised this jar pretty thoroughly over the last two days and it has yet to blow up in our faces. it works with cfpop/cfmail/cffile and shows up in the server's available charsets via our charset CFC.

if you're experiencing this issue, you might want to give this thing a whirl.

May 5, 2005
goowy does unicode
the newest webmail kid on the block goowy has just implemented unicode in their super-cool flash based webmail. originally the beta didn't support unicode and as usual, i was complaining a blue streak about them not supporting unicode and even gave them a public "bah humbug" for that. gary benitt, one of the founders of goowy, publicly stated they would be implementing unicode this week and by golly they sure did (i imagine it was in the works for weeks, unless they are the world's fastest flash coders). i just ran my standard unicode test against it and it passed "a-ok". this is one of the very rare occasions that i have to take back a "bah humbug" (not that many people pay attention one way or the other but i like to keep the record straight).

so, if you're looking for a modern flash-based webmail, then certainly give goowy a spin.

March 24, 2005
charsets galore
after researching charsets for the [expletive deleted] time to help somebody on the forums, i decided it was time to create a tool to do away with some of that kind of tedious labor. so building on the API for java.nio.charset.Charset i whipped out a small CFC to poke and prod the charsets available on a given server (or to be more precise, charsets supported by cf's JRE). you can see it here. it can be used to deliver the available charsets on a cf server, determine if a charset is supported, and find out if one charset contains another.
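
for the curious, the three things the CFC does map fairly directly onto java.nio.charset.Charset. the sketch below is not the CFC's code, just a plain java approximation of the same idea; the class and method names are invented for this example.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not the actual charset CFC.
final class CharsetProbe {

    // Canonical names of every charset this JVM can use.
    static List<String> available() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    // True when the JVM supports the named charset; false for unknown or illegal names.
    static boolean supported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) { // illegal name syntax, or null
            return false;
        }
    }

    // True when every character of "inner" can also be represented in "outer".
    // Throws if either name is not a supported charset.
    static boolean contains(String outer, String inner) {
        return Charset.forName(outer).contains(Charset.forName(inner));
    }
}
```

one caveat: Charset.contains() is conservative by design, so a false result is not proof that one charset is missing characters from the other.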

oh yeah, once again, in case you haven't been paying attention: Just Use Unicode. it will save you a lot of trouble over the long run.

on another note, this CFC (100+ lines) was also the first piece of code i wrote from start to finish with cfeclipse. while it wasn't an entirely unpleasant experience, i think it will take me quite a bit more "getting used to" before i give up cfstudio for good.

March 21, 2005
diversity as wallpaper
starting off with the idea of printing all of unicode's characters on a 36 inch by 36 inch poster, ian albert ends up with 6 foot by 12 foot wallpaper printed at Kinko's. imagine that, most of humanity's writing systems printed at Kinko's for 20 bucks. i wonder what the clerk made of it?

December 31, 2004
iso-8859-1 vs ms windows latin-1
just as a relief from the near constant news and grief about the tsunami in this part of the world, here's some of this blog's normally technical content.

while digging around on jguru i stumbled on this quite old, but still relevant, comparison between the iso-8859-1 and ms windows latin-1 charsets. if you scroll down a bit you will see a table of entities with the "extra" ms windows latin-1 characters highlighted in green. now you know why i'm always harping on about non-unicode encodings--Just Use Unicode.
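
to see the difference for yourself, decode the same byte with both charsets. a minimal java sketch, with made-up class and method names, assuming your JVM ships the windows-1252 charset (the mainstream ones do, but the java spec doesn't require it):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper; the class and method names are made up for this sketch.
final class Latin1VsCp1252 {

    // Decode a single byte with the given charset and return the resulting code point.
    static int decodeByte(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs).codePointAt(0);
    }

    static int asLatin1(int b) {
        return decodeByte(b, StandardCharsets.ISO_8859_1);
    }

    // Assumption: windows-1252 is available on this JVM.
    static int asWindows1252(int b) {
        return decodeByte(b, Charset.forName("windows-1252"));
    }
}
```

asWindows1252(0x80) gives 0x20AC, the euro sign, while asLatin1(0x80) gives 0x80, a control character. for a byte like 0xE9 both agree, which is exactly why this kind of mixup goes unnoticed until one of the "extra" characters turns up.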

December 15, 2004
two new i18n tidbits
first, the latest version of the Unicode Standard (4.1.0), which is due out in march 2005, is now in beta. some of the new stuff i find interesting:
  • newly added complete scripts such as the New Tai Lue script (it's used in the yunnan area of southern china and south to northern thailand) among others
  • "very significant extensions to the repertoire for the Arabic script"
  • new chars were added to support "roundtrip mapping support for HKSCS and GB 18030"
  • i also find it interesting that "106 CJK compatibility ideographs has been added to support roundtrip mapping to the DPRK standard"--you know, north korea

now, i guess i'm going to have to rework my uBlock CFC. you can read more about the new unicode beta here.

next, since i'm always ragging on core java's i18n support, i thought i'd point out a nifty new tech tip at Core Java Technologies Tech Tips dealing with resource bundles. this tech tip examines when and where you should be using ListResourceBundle vs PropertyResourceBundle. we normally use PropertyResourceBundle when applications can't access the classpath (a la the javaRB CFC) and plain ResourceBundle when they can (with the rbJava CFC). as an added benefit this article gets into some testing using java 5.0's (or 1.5's) new nanoTime() method (as in nanoseconds) as well as offering a link to a javaone presentation on how not to write a benchmark.
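
to make the distinction concrete, here's a small java sketch of the two bundle types. it is not taken from the tech tip or from either CFC; the class names and the "hello" key are invented, and the properties text is fed in from a string where a real app would read a .properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ListResourceBundle;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical example classes for illustration only.
final class BundleSketch {

    // ListResourceBundle: the entries live in compiled java code.
    static final class Greetings extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {{"hello", "hello from a class"}};
        }
    }

    static ResourceBundle fromClass() {
        return new Greetings();
    }

    // PropertyResourceBundle: the entries come from key=value text,
    // so the bundle can be built from any reader, no classpath lookup needed.
    static ResourceBundle fromProperties(String text) {
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

either way the calling code just asks for getString("hello"); where the text actually lives is the bundle's business.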

both are pretty good reading.

October 1, 2004
what you don't know about latin-1 might hurt you
french cf users might want to pay attention to this...

there is an on-going discussion on the unicode list about "internationalization assumption" which, simplistically, goes something along the lines of: if latin-1 tests ok, can we assume all latin-1 languages are "a-ok"? as it turns out, no. some of the folks participating in this discussion have pointed out that, for example, not all french chars are found in latin-1. my first thought on reading that was, "oh yeah, the euro" but as it turns out there are a couple of french chars (no idea of their frequency of use but they are used in the french words for eye, egg, beef and heart) that are not in latin-1 but are in latin-9. for example see jukka korpela's excellent latin-1/latin-9 comparison page. these chars are also found in the windows 1252 code page (which i guess helps support the idea that it's actually a superset of latin-1).
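
you can check this from java by asking a charset's encoder whether it can encode a given string. a minimal sketch with a made-up class name, assuming the JVM has ISO-8859-15 (latin-9) and windows-1252 available, which the mainstream ones do:

```java
import java.nio.charset.Charset;

// Illustrative helper; the class name is made up for this sketch.
final class EncodableCheck {

    // True when every character of text can be encoded in the named charset.
    // Throws if the charset name is not supported by this JVM.
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }
}
```

the french word for heart, written with the oe ligature (U+0153), fails for "ISO-8859-1" but passes for "ISO-8859-15" and "windows-1252". the same goes for the euro sign (U+20AC).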

the moral of the story? just use unicode

cldr 1.2 alpha
unicode has just announced the public release of the alpha version of the cldr (Common Locale Data Repository). some of the highlights include:

  • better documentation for date/number format patterns (one of my favorites)
  • added stuff about references/validity/etc.
  • new timezone localization model
  • weekend data
  • added Oriya, Malayalam, Assamese, Welsh, Dzongkha, Bhutan, Khmer and Lao (woohoo se asian) locales
  • added more country, language, currency, and type display name data for ar, bg, cs, el, he, hr, hu, is, mk, pl, ro, ru, sk, sl, sr, tr, uk (the arabic stuff is way cool)

read more on the cldr website. you can compare the cldr versus platform data here. and you can report bugs here.

via the unicode mailing list.

May 5, 2004
bow wow
besides endlessly arguing about such things as "Arid Canaanite Wasteland" or "palaeo-Hebrew" the unicode folks also hand out awards for "outstanding personal contributions to the philosophy and dissemination of the Unicode Standard". they call that one the bulldog award. that reference is actually to thomas huxley's comment in the 1870's:

You know I have to take care of him [Darwin] -- in fact, I have always been Darwin's bull dog.

well this year's award winner is none other than tex texin the debonair i18nguy about town. congratulations.

April 17, 2004
enable unicode dsn option's secret revealed
let me first ask what you think this label means: Enable Unicode for data sources configured for non-Latin characters? if you're like me, you'd probably think it meant that enabling this puppy would force the db driver (sql server in my case) to use Unicode for all text to/from the database. and as it turns out you (and i) would be wrong. now let me backtrack a bit and explain that with sql server you'd normally use unicode hinting--the "N" in N'test'--to let the db know that a particular chunk of text is actually Unicode. so if you understood Enable Unicode to actually enable unicode you might be tempted to not use Unicode hinting in your cf sql code. and of course you'd garbage your text data as a result. you might be asking why i'd never picked up on this before now (heck blackstone's just around the corner)? because i'm what some folks might call nutso about Unicode, i always go the extra mile in dealing with Unicode--i always use Unicode hinting even when i've enabled the Unicode option for a given dsn. so this is an issue i'd never noticed before.

so what exactly does Enable Unicode enable? why, it controls how cfqueryparam deals with Unicode text. if you turn it on, cfqueryparam handles Unicode text correctly. turn it off and cfqueryparam turns your Unicode text into a mound of garbage.

let me thank figleaf's steve drucker for bringing this issue up in the first place (in the forums) and mm's hiroshi okugawa for digging up an mx 6.0 box to test that Enable Unicode has always worked this way.

so now you know.

update: for ray's blog users, here's what you should be looking for in your cfadmin (under the advanced menu for the blog DSN):

[image: look at that]

January 1, 2004
unicode compression
unicode is good. unicode is great. it's by far the best choice for g11n work (i "bite my thumb" at codepage encodings). but like everything else in the real world it has its seamier side. in order to encode all the world's scripts unicode must often use more than 1 or even 2 bytes for many characters. ASCII-only folks with an occasional need for non-ASCII characters wouldn't give this a second thought; the rest of the world (especially CJK folks) however aren't so fortunate. "unicode bloat" is a sad but true fact of g11n life. unicode compression is therefore a fairly interesting topic to many unicode users and developers. doug ewell has posted a pretty understandable article on unicode compression. the article provides a nice background to unicode and compression. it's a good read and well worth the time.
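
if you want to put numbers on the "bloat", count the encoded bytes. the java sketch below (class and method names invented for this example) only measures plain utf-8 and utf-16 sizes; it is not one of the compression schemes doug's article covers.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper; names are made up for this sketch.
final class EncodedSize {

    // Number of bytes the string takes in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string takes in UTF-16 (big-endian, no BOM).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }
}
```

for "abc" that's 3 bytes in utf-8 and 6 in utf-16; for the single CJK character U+5927 it's 3 bytes in utf-8 and 2 in utf-16. so which encoding "bloats" depends entirely on what you're writing.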

and by now i hope everyone's had a safe and happy new year's eve celebration. for those of us already past that and wondering "what the heck went on", i refer you to the jean-luc ponty tune (composed and arranged by frank zappa) and ask the question "how would you like to have a head like that".