May 5, 2008
the death of codepages?
mark davis, via the unicode mailing list, mentioned an official google blog posting that shows that unicode "was the most frequent encoding found on web pages" since dec-2007 (unicode, i.e. utf-8, is the blue line on the graph in that posting). wow. i guess people really do get it :-)

reference: Moving to Unicode 5.1

February 21, 2006
good i18n practices really are good
an i18n-related issue popped up on the cfeclipse list yesterday that reinforced (at least to me) that good i18n practices really are good. a user had their eclipse encoding set up as UTF-8 yet was getting their unicode coldfusion pages garbled. my first look at this used code from our existing codebase and of course it worked. for the life of me, well for 2-3 hours anyway, i couldn't see how this was going wrong. it wasn't until i whipped up a simple dummy page that just had unicode text and nothing else that i was able to see the problem. the issue is simple but clearly illustrates a good i18n practice.

eclipse (not cfeclipse) doesn't add a BOM to UTF-8 encoded files. why? well

  • the BOM isn't actually required as part of the definition of UTF-8 (and i know of plenty of s/w that either doesn't write one out or in fact strips them from files)
  • in the past (i think) the java compiler wouldn't compile a file w/a BOM & since that's what eclipse was originally meant for, NOT having a BOM makes perfect sense (from a very quick test i just ran it seems this is no longer true, at least from within eclipse)
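
if you want to see for yourself whether an editor wrote a BOM, you can look at the first three bytes of the file. a minimal sketch in java; the class and method names are made up for illustration, and it only looks at bytes already in memory rather than opening a real file:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BomCheck {
    // U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True when the data starts with the UTF-8 BOM bytes.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Returns the data without a leading BOM (unchanged if there is none).
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] noBom = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a'};

        System.out.println(hasUtf8Bom(noBom));            // false
        System.out.println(hasUtf8Bom(withBom));          // true
        System.out.println(stripUtf8Bom(withBom).length); // 1
    }
}
```

to check a real file you would read its bytes first (for example with java.nio.file.Files.readAllBytes) and pass them to hasUtf8Bom.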

so why was our cfeclipse-edited UTF-8 encoded code working? because we follow our own good i18n practices and liberally use encoding hinting starting with the cfprocessingdirective. each of our coldfusion pages starts with:

<cfprocessingdirective pageencoding="utf-8">

BOM or no BOM, this ensures your code will always be interpreted as UTF-8. for more good i18n practices grab a copy of the advanced coldfusion book.
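
to see why the hint matters, here is a rough java sketch of what happens when the same BOM-less UTF-8 bytes are decoded with and without the right charset. it assumes the "wrong guess" is ISO-8859-1, which is only one possible default; the class and method names are hypothetical and this is not how coldfusion itself is implemented:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingHint {
    // Decode raw file bytes using whatever charset the reader assumes.
    static String decode(byte[] fileBytes, Charset assumed) {
        return new String(fileBytes, assumed);
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, saved as UTF-8 without a BOM:
        // the accented letter becomes the two bytes C3 A9.
        byte[] fileBytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String hinted = decode(fileBytes, StandardCharsets.UTF_8);
        String guessed = decode(fileBytes, StandardCharsets.ISO_8859_1);

        System.out.println(hinted.length());                    // 4
        System.out.println(guessed.length());                   // 5
        System.out.println(guessed.equals("caf\u00c3\u00a9"));  // true: classic mojibake
    }
}
```

the bytes never change; only the reader's assumption does, which is why an explicit hint beats relying on a BOM.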

see? good i18n practices really are good.

February 13, 2006
more on encoding
encoding issues just never seem to end. after another week's worth of helping folks slog thru their encoding problems, i recalled that sun has recently published a pretty decent article on their SDN about encoding (even includes a nice mojibake example).

and while it's mainly java/jsp it's worth the read for us cf folks.

June 23, 2005
utf-7
as you might already know, utf-7 is not a supported java (and hence cf) charset. it does however exist in the wild, mainly as part of bounced email systems and sometimes in webmail like hotmail (well, mainly hotmail; i've never seen it anywhere else, to tell you the truth) as well as MS Exchange. folks have been complaining off and on about this for years, many mistakenly blaming macromedia for a sun java bug. votes have piled up in sun's java bugparade but alas and alack, nothing's been done about it. until now.

there's a very persistent thread (it's been running since feb-2004) in the cf support forums concerning this issue. a few days ago somebody (gdbezona) posted a link to an open source utf-7 charset, JCharset. if you drop that jar (jcharset.jar) into the cfinstall/runtime/jre/lib dir and stop/restart the cf server service, cf will pick up that utf-7 charset fine. we've exercised this jar pretty thoroughly over the last two days and it has yet to blow up in our faces. it works with cfpop/cfmail/cffile and shows up in the server's available charsets via our charset CFC.
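
a quick way to confirm the jvm actually picked up a newly installed charset is to ask it directly and push some text through it. a minimal sketch, assuming the provider registers the charset under the name "UTF-7" (an assumption, check the jar's own docs); the class and method names are made up for illustration:

```java
import java.nio.charset.Charset;

class CharsetSmokeTest {
    // Returns true only if the named charset is installed, can encode,
    // and the text survives an encode/decode round trip.
    static boolean roundTrips(String charsetName, String text) {
        if (!Charset.isSupported(charsetName)) {
            return false; // the JVM does not know this charset
        }
        Charset cs = Charset.forName(charsetName);
        if (!cs.canEncode()) {
            return false; // decode-only charsets cannot be round-tripped
        }
        byte[] encoded = text.getBytes(cs);
        return text.equals(new String(encoded, cs));
    }

    public static void main(String[] args) {
        // The result depends on whether a UTF-7 provider is on this JVM.
        System.out.println("UTF-7 usable: " + roundTrips("UTF-7", "hello, world"));
        System.out.println("UTF-8 usable: " + roundTrips("UTF-8", "hello, world"));
    }
}
```

run it with the same jre that cf uses, otherwise you are testing a different set of charsets.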

if you're experiencing this issue, you might want to give this thing a whirl.

March 24, 2005
charsets galore
after researching charsets for the [expletive deleted] time to help somebody on the forums, i decided it was time to create a tool to do away with some of that kind of tedious labor. so building on the API for java.nio.charset.Charset i whipped out a small CFC to poke and prod the charsets available on a given server (or to be more precise, charsets supported by cf's JRE). you can see it here. it can be used to deliver the available charsets on a cf server, determine if a charset is supported, and find out if one charset contains another.
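
for the curious, the three things the CFC does map fairly directly onto java.nio.charset.Charset. this is not the CFC itself, just a rough java sketch of the same idea with hypothetical class and method names:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.ArrayList;
import java.util.List;

class CharsetProbe {
    // Canonical names of every charset this JVM can see.
    static List<String> availableNames() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    // True if the JVM supports the named charset; malformed names count as unsupported.
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    // True if every character representable in 'inner' is also representable in 'outer'.
    // Both names must be supported, otherwise Charset.forName throws.
    static boolean contains(String outer, String inner) {
        return Charset.forName(outer).contains(Charset.forName(inner));
    }

    public static void main(String[] args) {
        System.out.println(availableNames().size() + " charsets available");
        System.out.println(isSupported("UTF-8"));          // true
        System.out.println(contains("UTF-8", "US-ASCII")); // true
        System.out.println(contains("US-ASCII", "UTF-8")); // false
    }
}
```

the number of available charsets will differ from server to server, which is the whole point of having a tool to check.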

oh yeah, once again in case you haven't been paying attention Just Use Unicode. it will save you a lot of trouble over the long run.

on another note, this CFC (100+ lines) was also the first piece of code i wrote from start to finish with cfeclipse. while it wasn't an entirely unpleasant experience, i think it will take me quite a bit more "getting used to" before i give up cfstudio for good.

January 6, 2005
back to our regularly scheduled i18n programming
a couple more non-tsunami i18n bits of information.

that i18n guy about town, tex texin, has put together a good document concerning the use of RFC 3066 language identifiers. you might lend a hand by perusing the table for any funny business (maybe like sinhalese in thailand--but hey, what do i know).

and just when i thought i knew everything about encoding (maybe because i actually think all you really have to know is Just Use Unicode), i find out something new. while doing some research in the java i18n forums i stumbled onto a really nifty java encoding resource, part of a java and internet glossary. i especially liked the term armouring (which i had never heard used in this context before): Converting binary data into printable gibberish so that data transport systems will not corrupt it. so that's what it's called.

December 31, 2004
iso-8859-1 vs ms windows latin-1
just as a relief from the near constant news and grief about the tsunami in this part of the world, here's some of this blog's normally technical content.

while digging around on jguru i stumbled on this quite old, but still relevant, comparison between the iso-8859-1 and ms windows latin-1 charsets. if you scroll down a bit you will see a table of entities with the "extra" ms windows latin-1 characters highlighted in green. now you know why i'm always harping on about non-unicode encodings--Just Use Unicode.
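
you can reproduce the gist of that table yourself. the two charsets disagree in the 0x80-0x9F range, where iso-8859-1 has control codes and ms windows latin-1 (windows-1252) puts printable characters like the euro sign. a small java sketch, assuming the jvm ships the windows-1252 charset (most do); the class and method names are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1VsCp1252 {
    // Decode a single byte value in the given charset and return the resulting code point.
    static int decodeByte(int value, Charset cs) {
        return new String(new byte[] {(byte) value}, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset cp1252 = Charset.forName("windows-1252");

        // Walk the range where the two charsets differ and print both decodings.
        for (int b = 0x80; b <= 0x9F; b++) {
            System.out.printf("0x%02X  iso-8859-1: U+%04X  windows-1252: U+%04X%n",
                    b, decodeByte(b, latin1), decodeByte(b, cp1252));
        }

        // Outside that range the two agree, e.g. 0xE9 is e-acute in both.
        System.out.println(decodeByte(0xE9, latin1) == decodeByte(0xE9, cp1252)); // true
    }
}
```

so a byte like 0x80 is a euro sign to one reader and an invisible control code to another, which is exactly the kind of ambiguity unicode avoids.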

December 13, 2003
supported encodings
while this is sort of an older bit of information, a few encoding issues recently popped up in the forums, so i guess it bears repeating once again.

the latest sun jre (1.4.2) that ships with mx installs only a few encodings by default (latin-1, latin-9, greek, eastern european, cyrillic, unicode, etc). there are no arabic, hebrew, asian, etc. encodings. for those you need to do a custom install. if you try to use an encoding from the custom (or international) set with a default install, you will see "UnsupportedEncodingException" errors.
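
at the java level that error comes from asking for a charset by name when the jre doesn't have it. a minimal sketch of where it surfaces and one way to guard against it; the class and method names are hypothetical, the charset name is deliberately bogus, and falling back to UTF-8 is just one possible policy:

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class EncodingGuard {
    // True when asking the JRE for this encoding fails with UnsupportedEncodingException.
    static boolean isMissing(String charsetName) {
        try {
            "probe".getBytes(charsetName);
            return false;
        } catch (UnsupportedEncodingException e) {
            return true;
        }
    }

    // Encode with the requested charset, or with UTF-8 if the JRE lacks it.
    static byte[] encodeOrFallback(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            return text.getBytes(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        System.out.println(isMissing("x-no-such-charset")); // true
        System.out.println(isMissing("UTF-8"));             // false
        System.out.println(encodeOrFallback("abc", "x-no-such-charset").length); // 3
    }
}
```

swap in the name of the encoding you actually need to find out whether your install has it.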

so if you want to use codepage encodings beyond the default installs you will need to do an international install (unless of course the installer recognizes these locales on your server during setup). you can read more about this here.

so now you know (again).