eclipse (not cfeclipse) doesn't add a BOM to UTF-8 encoded files. why? well
- the BOM isn't actually required as part of the definition of UTF-8 (and i know of plenty of s/w that either doesn't write one out or in fact strips them from files)
- in the past (i think) the java compiler wouldn't compile a file w/a BOM & since that's what eclipse was originally meant for, NOT having a BOM makes perfect sense (from a very quick test i just ran it seems this is no longer true, at least from within eclipse)
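for anyone who wants to check their own files, here's a minimal sketch of what "has a BOM" means for UTF-8: the file starts with the three bytes EF BB BF. the class and method names (BomCheck, hasUtf8Bom, decodeUtf8) are made up for illustration, they aren't part of eclipse, cfeclipse or coldfusion.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of eclipse or coldfusion.
class BomCheck {

    // U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    // Decode the bytes as UTF-8, skipping the BOM when one is present.
    static String decodeUtf8(byte[] data) {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(hasUtf8Bom(withBom));
        System.out.println(hasUtf8Bom(withoutBom));
        // Both inputs decode to the same text once the BOM is skipped.
        System.out.println(decodeUtf8(withBom).equals(decodeUtf8(withoutBom)));
    }
}
```

with or without the BOM the decoded text is the same, which is why s/w that strips it isn't breaking anything.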
so why was our cfeclipse-edited UTF-8 encoded code working? because we follow our own good i18n practices and liberally use encoding hinting starting with the cfprocessingdirective. each of our coldfusion pages starts with:
<cfprocessingdirective pageencoding="utf-8">
BOM or no BOM, this ensures your code will always be interpreted as UTF-8. for more good i18n practices grab a copy of the advanced coldfusion book.
see? good i18n practices really are good.
Hmm, reading CF7 dev guide PDF quickly I saw mentions of default output in UTF-8, and that CF looks for BOM when loading .cfm files, but nothing about default encoding if no BOM is in place.
But there is a way to set/override the default encoding:
"Default to the JVM default file character encoding. By default, this is the operating system default character encoding. To specify the JVM default file character encoding, use the -Dfile.encoding= switch in the JVM Arguments field of the ColdFusion MX Administrator Java and JVM Settings page."
=========
1) Use the BOM, if specified on the page. Macromedia recommends that you use BOM characters in your files.
2) Use the pageEncoding attribute of the cfprocessingdirective tag, if specified.
3) Default to the JVM default file character encoding. By default, this is the operating system default character encoding. To specify the JVM default file character encoding, use the -Dfile.encoding= switch in the JVM Arguments field of the ColdFusion MX Administrator Java and JVM Settings page
=========
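the lookup order quoted above can be sketched roughly like this. it's only an illustration of the three documented steps, not coldfusion's actual implementation: the names (PageEncodingResolver, resolve) are hypothetical, and only the UTF-8 BOM is checked to keep it short.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the documented lookup order; not ColdFusion's real code.
class PageEncodingResolver {

    // fileStart: the first bytes of the page.
    // pageEncoding: value of the cfprocessingdirective attribute, or null if absent.
    static Charset resolve(byte[] fileStart, String pageEncoding) {
        // 1) a BOM on the page wins (assumption: only the UTF-8 BOM is checked here)
        if (fileStart.length >= 3
                && (fileStart[0] & 0xFF) == 0xEF
                && (fileStart[1] & 0xFF) == 0xBB
                && (fileStart[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // 2) otherwise use the pageEncoding attribute, if specified
        if (pageEncoding != null && !pageEncoding.isEmpty()) {
            return Charset.forName(pageEncoding);
        }
        // 3) otherwise fall back to the JVM default (affected by -Dfile.encoding=)
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        byte[] noBom = {'<', 'c', 'f'};
        System.out.println(resolve(noBom, "utf-8"));
        System.out.println(resolve(noBom, null));
    }
}
```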
for number 3, that could be pretty much anything depending on the localized OS but is often windows-1252, a superset of iso-8859-1 (latin-1), and a real PITA for many people.
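to see why that fallback hurts, here's a small sketch of what happens when UTF-8 bytes get read as windows-1252. the class and method names are made up, and it assumes your JVM ships the windows-1252 charset (mainstream JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; shows UTF-8 bytes decoded with the wrong charset.
class WrongDefaultDemo {

    // Encode text as UTF-8, then decode those bytes as windows-1252.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with e-acute; in UTF-8 that letter is the bytes C3 A9.
        String garbled = misread("caf\u00e9");
        // Each of the two bytes becomes its own windows-1252 character.
        System.out.println(garbled.length());
        System.out.println(garbled.equals("caf\u00c3\u00a9"));
    }
}
```

one accented character turns into two junk characters, and the encoding hint on the page is what prevents it.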
Regarding the use of the tag <cfprocessingdirective pageencoding="utf-8">:
If using a framework that runs your site from one file (i.e. index.cfm), can you just place that tag once at the top of the first file (either index.cfm, Application.cfm, or Application.cfc) or should it be in every include and custom tag? What about CFCs (for people who output within their methods... Of course I don't output within methods, but methods still do allow outputting)?
And what about those people who output within Application.cfm (I shudder at the thought)? Should they place that tag within the file?
I ask these questions because you said "each of our coldfusion pages starts with <cfprocessingdir.../>"
as we also don't normally follow the practices you described i can't really comment on them except to repeat using cfprocessingdirective is a "good practice".
Hehe. I don't use those methods either. But as a developer I'm just trying to know the limitations and uses for the tag (you never know what kind of existing code a client will have you modify).
So if I'm understanding you correctly, we should use this tag anywhere we plan to have HTML output. And best practice is to use it in every cfm and cfc file (except Application.cfm/cfc).
http://www.phillipholmes.com/?p=46
Enjoy!
by "ascii representations of Unicode" do you mean of the form "\udddd"? if so running them thru the ResourceBundle class might be easier. you'll also have to know the original encoding intent, not always easy, some folks tend to get hysterical when it comes to that sort of stuff. you might have a look at com.ibm.icu.text.CharsetDetector (part of icu4j) to help kick start that kind of sleuthing.
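a minimal sketch of the ResourceBundle idea, using only the JDK: java.util.PropertyResourceBundle decodes "\udddd" style escapes when it reads properties text. the class name, the lookup helper and the sample key are hypothetical, and it assumes your escaped text is (or can be wrapped as) key=value lines.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical demo class; relies on the properties format decoding \\uXXXX escapes.
class EscapedBundleDemo {

    // Parse properties-format text and return the decoded value for the key.
    static String lookup(String propertiesText, String key) {
        try {
            ResourceBundle bundle =
                new PropertyResourceBundle(new StringReader(propertiesText));
            return bundle.getString(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as literal ASCII text in the input.
        String escaped = "greeting=caf\\u00e9";
        String decoded = lookup(escaped, "greeting");
        // The six ASCII characters \u00e9 have become a single character.
        System.out.println(decoded.length());
        System.out.println(decoded.equals("caf\u00e9"));
    }
}
```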

