February 21, 2006
good i18n practices really are good
an i18n-related issue popped up on the cfeclipse list yesterday that reinforced (at least to me) that good i18n practices really are good. a user had their eclipse encoding set up as UTF-8 yet was getting their unicode coldfusion pages garbled. my first look at this used code from our existing codebase and of course it worked. for the life of me, well for 2-3 hours anyway, i couldn't see how this was going wrong. it wasn't until i whipped up a simple dummy page that just had unicode text and nothing else that i was able to see the problem. the issue is simple but clearly illustrates a good i18n practice.

eclipse (not cfeclipse) doesn't add a BOM to UTF-8 encoded files. why? well

  • the BOM isn't actually required as part of the definition of UTF-8 (and i know of plenty of s/w that either doesn't write one out or in fact strips them from files)
  • in the past (i think) the java compiler wouldn't compile a file w/a BOM & since that's what eclipse was originally meant for, NOT having a BOM makes perfect sense (from a very quick test i just ran it seems this is no longer true, at least from within eclipse)
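
if you want to see for yourself whether an editor wrote a BOM, a minimal sketch like the following can help. the UTF-8 BOM is the three bytes EF BB BF at the very start of the file; the class and method names here (BomCheck, hasUtf8Bom) are just made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class BomCheck {
    // Returns true when the data starts with the UTF-8 byte order mark (EF BB BF).
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
            && (bytes[0] & 0xFF) == 0xEF
            && (bytes[1] & 0xFF) == 0xBB
            && (bytes[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomCheck <file>");
            return;
        }
        // Read the raw bytes; decoding them first would hide the BOM question.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hasUtf8Bom(bytes) ? "BOM present" : "no BOM");
    }
}
```

run it against a page saved from eclipse and against one saved from an editor that does write a BOM to compare.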

so why was our cfeclipse-edited UTF-8 encoded code working? because we follow our own good i18n practices and liberally use encoding hinting starting with the cfprocessingdirective. each of our coldfusion pages starts with:

<cfprocessingdirective pageencoding="utf-8">

BOM or no BOM, this ensures your code will always be interpreted as UTF-8. for more good i18n practices grab a copy of the advanced coldfusion book.

see? good i18n practices really are good.

Comments

But isn't the default encoding for .cfm pages anyway UTF-8, even if you don't specify it with cfprocessingdirective?

Hmm, reading CF7 dev guide PDF quickly I saw mentions of default output in UTF-8, and that CF looks for BOM when loading .cfm files, but nothing about default encoding if no BOM is in place.

But there is a way to set/override the default encoding:

"Default to the JVM default file character encoding. By default, this is the operating system default character encoding. To specify the JVM default file character encoding, use the -Dfile.encoding= switch in the JVM Arguments field of the ColdFusion MX Administrator Java and JVM Settings page."


default encoding for cf is controlled by the defaultCharset item in cf_root/lib/neo-runtime.xml file which is usually utf-8. the cfdocs outline how cf determines encoding (especially where there is no BOM):

=========

1) Use the BOM, if specified on the page. Macromedia recommends that you use BOM characters in your files.

2) Use the pageEncoding attribute of the cfprocessingdirective tag, if specified.

3) Default to the JVM default file character encoding. By default, this is the operating system default character encoding. To specify the JVM default file character encoding, use the -Dfile.encoding= switch in the JVM Arguments field of the ColdFusion MX Administrator Java and JVM Settings page

=========
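
the order above can be sketched roughly as follows. this is only an illustration of the documented lookup order, not coldfusion's actual implementation: PageEncodingOrder and resolve are hypothetical names, only the UTF-8 BOM is checked (UTF-16 BOMs are left out for brevity), and Charset.defaultCharset() stands in for the JVM default file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PageEncodingOrder {
    // pageEncodingAttr is the value of the pageEncoding attribute, or null if the page has none.
    static Charset resolve(byte[] page, String pageEncodingAttr) {
        // 1) a BOM wins if the page has one (only the UTF-8 BOM is handled in this sketch)
        if (page.length >= 3
                && (page[0] & 0xFF) == 0xEF
                && (page[1] & 0xFF) == 0xBB
                && (page[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // 2) otherwise use the pageEncoding attribute, if one was given
        if (pageEncodingAttr != null && !pageEncodingAttr.isEmpty()) {
            return Charset.forName(pageEncodingAttr);
        }
        // 3) otherwise fall back to whatever the JVM considers its default
        return Charset.defaultCharset();
    }
}
```

a page saved from eclipse with no BOM and no cfprocessingdirective falls straight through to step 3.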

for number 3, that could be pretty much anything depending on the localized OS but is often windows-1252, a superset of iso-8859-1 (latin-1) and a real PITA for many people.
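
to see what that fallback does to a UTF-8 page, here is a small sketch that takes the UTF-8 bytes of a string and decodes them as windows-1252, which is what happens when there is no BOM and no encoding hint on such a system. MojibakeDemo and misread are made-up names for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode as UTF-8, then (wrongly) decode those bytes as windows-1252.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) is C3 A9 in UTF-8; read as windows-1252
        // those two bytes become the two characters U+00C3 and U+00A9.
        String garbled = misread("\u00e9");
        System.out.println(garbled.length() + " characters instead of 1");
    }
}
```

plain ascii text survives this untouched, which is why a page can look fine until the first non-ascii character shows up.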


Paul,

Regarding the use of the tag <cfprocessingdirective pageencoding="utf-8">:

If using a framework that runs your site from one file (i.e. index.cfm), can you just place that tag once at the top of the first file (either index.cfm, Application.cfm, or Application.cfc) or should it be in every include and custom tag? What about CFCs (for people who output within their methods... Of course I don't output within methods, but methods still do allow outputting)?

And what about those people who output within Application.cfm (I shudder at the thought)? Should they place that tag within the file?

I ask these questions because you said "each of our coldfusion pages starts with <cfprocessingdir.../>"


it has no effect in the application.cfm (it's a compile time tag so it also can't work w/conditional logic or variables). it should be on every page including cfincludes, CFC, etc. (though i sometimes forget for CFC as we normally don't output stuff from a CFC).

as we also don't normally follow the practices you described i can't really comment on them except to repeat using cfprocessingdirective is a "good practice".


> as we also don't normally follow the practices you described i can't really comment on them except to repeat using cfprocessingdirective is a "good practice".

Hehe. I don't use those methods either. But as a developer I'm just trying to know the limitations and uses for the tag (you never know what kind of existing code a client will have you modify).

So if I'm understanding you correctly, we should use this tag anywhere we plan to have HTML output. And best practice is to use it in every cfm and cfc file (except Application.cfm/cfc).


I've written an article on how to convert ASCII representations of Unicode to Unicode using java.nio.charset.Charset via ColdFusion. I'll have more about how to test your charset conversion for unmappable characters. Of course, best practice is to always store in ntext or nvarchar so you don't have to go through all that ;-). However, for on-the-fly charset conversions, this is the best way to go about it.

http://www.phillipholmes.com/?p=46

Enjoy!
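
As a rough idea of what a test for unmappable characters might look like (this is a sketch, not the code from the article), a CharsetEncoder from java.nio.charset can report whether a string is representable in a target charset before you convert. UnmappableCheck and canRepresent are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class UnmappableCheck {
    // Returns true if every character of text can be encoded in the named charset.
    static boolean canRepresent(String text, String charsetName) {
        // Encoders are stateful, so create a fresh one for each check.
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(canRepresent("caf\u00e9", "ISO-8859-1"));      // Latin-1 text
        System.out.println(canRepresent("\u65e5\u672c", "ISO-8859-1"));   // Japanese text
    }
}
```

The same objects can be created from ColdFusion with CreateObject("java", ...), since these are plain JDK classes.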


phillip, thanks for the link.

by "ascii representations of Unicode" do you mean of the form "\udddd"? if so, running them thru the ResourceBundle class might be easier. you'll also have to know the original encoding intent, which is not always easy; some folks tend to get hysterical when it comes to that sort of stuff. you might have a look at com.ibm.icu.text.CharsetDetector (part of icu4j) to help kick start that kind of sleuthing.
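
a minimal sketch of the ResourceBundle idea, assuming the input really is in the "\udddd" form: wrap the escaped text in a one-entry properties document and let PropertyResourceBundle decode the escapes. EscapeDecoder, decode and the "text" key are made up for illustration, and the Reader constructor needs java 6 or later. properties parsing also interprets other backslash sequences and line breaks, so this only suits single-line input without stray backslashes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;

final class EscapeDecoder {
    // Decodes \\uXXXX escapes by parsing the text as the value of a single property.
    static String decode(String escaped) {
        try {
            PropertyResourceBundle bundle =
                new PropertyResourceBundle(new StringReader("text=" + escaped));
            return bundle.getString("text");
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The literal below is the twelve ascii characters \u65e5\u672c.
        String decoded = decode("\\u65e5\\u672c");
        System.out.println(decoded.length() + " characters after decoding");
    }
}
```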