August 26, 2008
case shenanigans
i really feel for our turkish cf brethren, they always seem to be getting the short end of the stick. a couple of weeks ago there was an issue in the support forums with someone using turkish locale (tr_TR) who was having problems getting case right using coldfusion's uCase() & lCase() functions. there are a couple of special characters, "i" & "ı" (that's small letter i & small letter dotless i), that are special cases when it comes to case mappings (bad pun willfully intended) and which cf's functions weren't handling correctly.

i was a bit perplexed by this, mainly because we usually deal with locales whose writing systems don't have a concept of case. but after poking around core java's String class, it seems that cf wasn't using the overloaded versions of the toUpperCase()/toLowerCase() methods, which take a locale to handle locale-sensitive case. easy enough to fix in cf (i really love how easily coldfusion lets you work around these little issues):

<cffunction name="toLowerCase" output="false" returntype="string" access="public">
<cfargument name="inString" required="true" type="string" hint="string to lower case">
<cfargument name="locale" required="false" default="en_US" type="string" hint="java style locale identifier to use to lower case input string">
<cfscript>
var thisLocale="";
var l=listFirst(arguments.locale,"_"); // language
var c=""; // country, we'll ignore variants
if (listLen(arguments.locale,"_") GT 1)
      c=uCase(listGetAt(arguments.locale,2,"_"));
// build locale
thisLocale=createObject("java","java.util.Locale").init(l,c);
return arguments.inString.toLowerCase(thisLocale);
</cfscript>
</cffunction>


<cffunction name="toUpperCase" output="false" returntype="string" access="public">
<cfargument name="inString" required="true" type="string" hint="string to upper case">
<cfargument name="locale" required="false" default="en_US" type="string" hint="java style locale identifier to use to upper case input string">
<cfscript>
var thisLocale="";
var l=listFirst(arguments.locale,"_"); // language
var c=""; // country, we'll ignore variants
if (listLen(arguments.locale,"_") GT 1)
      c=uCase(listGetAt(arguments.locale,2,"_"));
// build locale
thisLocale=createObject("java","java.util.Locale").init(l,c);
return arguments.inString.toUpperCase(thisLocale);
</cfscript>
</cffunction>

<cfscript>
s="#chr(105)##chr(305)##chr(223)#";
upperS=toUpperCase(s,"tr_TR");
lowerS=toLowerCase(upperS,"TR_TR");
writeoutput("input string: #s#<br> upper case: #upperS#<br>lower case: #lowerS#");
</cfscript>

notice how i didn't have to mess with the core java String class, i could just use its methods on a cf string.

even if you're not using tr_TR locale, you should note that "ß" (small letter sharp s) is also a special case, upper casing it actually turns it into 2 letters, "SS". i think there might also be some issues with some Greek characters as well.
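for anyone who wants to poke at the same behavior outside of cf, here's a minimal plain-java sketch of the locale-sensitive case mapping described above. the class and method names (LocaleCase, toLocale, upper, lower) are made up for illustration; the only real API calls are String.toUpperCase(Locale), String.toLowerCase(Locale) and Locale.forLanguageTag(). the characters are written as unicode escapes to sidestep source file encoding problems.

```java
import java.util.Locale;

final class LocaleCase {
    // turn a java style identifier such as "tr_TR" into a Locale;
    // forLanguageTag() expects "tr-TR" and normalizes letter case itself
    static Locale toLocale(String id) {
        return Locale.forLanguageTag(id.replace('_', '-'));
    }

    // upper case using the rules of the given locale
    static String upper(String s, String localeId) {
        return s.toUpperCase(toLocale(localeId));
    }

    // lower case using the rules of the given locale
    static String lower(String s, String localeId) {
        return s.toLowerCase(toLocale(localeId));
    }

    public static void main(String[] args) {
        // small letter i, small letter dotless i, small letter sharp s
        String s = "i\u0131\u00DF";
        String upperTr = upper(s, "tr_TR");
        System.out.println("turkish upper: " + upperTr);
        System.out.println("turkish lower: " + lower(upperTr, "TR_TR"));
        System.out.println("english upper: " + upper(s, "en_US"));
    }
}
```

under turkish rules the dotted "i" upper cases to capital I with dot above (U+0130), while under english rules both "i" and "ı" collapse to a plain "I". in both cases the sharp s comes back as two letters.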

July 24, 2007
Pepsi Brings Your Ancestors Back From the Grave
the wonderfully named "moronland" site has a page for the 13 worst translation mistakes. most of these should be familiar if you follow tex texin's (the unicode bulldog) Marketing Translation Mistakes.

the "moronland" page is kind of unique in that it coins the term Babelfished, as in I wonder if these companies just Babelfished the slogans into another language.

another set of examples why you should use human beings for translation ;-)

October 14, 2006
machine translation goes to war
anyone who knows me knows that i think machine translation is so much humbug. round tripping machine translated text through pretty much any of the publicly available translation engines will more often than not give you back garbage, usually indicating the original translation is also garbage. my most frequent example is the phrase “This side towards enemy” that was helpfully placed on the business end of claymore mines. well maybe IBM is about to prove me wrong, as they are deploying 35 laptops equipped with their Mastor (Multilingual Automatic Speech-to-Speech Translator) product to Iraq. Mastor employs a “combination of speech recognition, machine translation and text-to-speech rules” which “allowed IBM to develop a common translation engine that is independent of languages.” the system has library data for english to Iraqi Arabic, modern standard Arabic and Mandarin translations--i find the language choices interesting, though no Farsi (Persian).

i just hope it works for the sake of those using it.

September 18, 2006
analysis of the olmec hieroglyphs
michael everson, a virtual language encoding machine and leading light in the unicode world, has just posted a brief analysis of the recently discovered "olmec hieroglyphs". while the analysis isn't a "decipherment", i find the way michael attacked the analysis fascinating.

August 9, 2006
help localize ColdFusion info!
tim buntel's looking for folks who want to help localize ColdFusion information including datasheets, whitepapers, and developer center articles. somebody in the Taiwanese user community has already submitted some Chinese (traditional and simplified) stuff.

and i'd also like to remind folks that dean harmon's looking for help localizing the cfreport builder application.

so what are you waiting for?

October 25, 2005
language matters?
you bet it does. just ask the 20 poor slobs who had to cough up 100 new turkish lira (about $76US) each for using the letters "Q" and "W" in kurdish language placards in turkey. it seems that these letters aren't in the turkish alphabet and there's a 1928 law ("Law on the Adoption and Application of Turkish Letters") that requires all signs and what not to only use turkish letters. in case you don't already know, turkey moved from an arabic to a "modified" latin script in the 1920's, pretty gutsy thing to do. i guess they needed tough laws to push this kind of reform through.

this is all news to me....i wonder how they advertise windows there? i know we have a good group of turkish coldfusion users, care to shed some light on this guys?

from CNN.

September 8, 2005
help localize cfreport builder
dean harmon, who looks after cfreport, has reported on his blog that you can easily localize cf report builder into your own language. the language files (located under the cf report builder install dir in the Languages dir) are sort of simple key/value pairs with the values being utf-8 encoded. the key/value pairs are delimited using "tab-equal sign-tab" for instance
dragAndDrop[tab]=[tab]drag and drop
that style's not my cup of tea (i would of course have used java style rb files), but at least you can localize report builder. shows some pretty good foresight. so you folks that spell color with a "u", here's your chance to right that terrible wrong.
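if you want to sanity check a translated language file before dropping it back into the Languages dir, a rough java sketch of reading the "tab-equal sign-tab" format might look like the following. the LanguageFile class is hypothetical, and the only thing it knows about the format is what's described above (one key/value pair per line, utf-8 values); whether the real files also carry comments or a byte order mark is an assumption i haven't checked.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LanguageFile {
    // the "tab-equal sign-tab" delimiter between key and value
    private static final String DELIMITER = "\t=\t";

    // parse already-decoded lines into key/value pairs, keeping file order
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String line : lines) {
            int at = line.indexOf(DELIMITER);
            if (at <= 0) {
                continue; // skip blank lines and lines without a key or delimiter
            }
            String key = line.substring(0, at);
            // everything after the first delimiter is the value, even if it contains "="
            String value = line.substring(at + DELIMITER.length());
            entries.put(key, value);
        }
        return entries;
    }
}
```

the lines themselves could come from something like Files.readAllLines(path, StandardCharsets.UTF_8), which takes care of the utf-8 decoding.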

January 19, 2005
Puijilittatuq?
interesting BBC news article on the "extinction" of minority languages. according to the article, one of the world's 6,000 languages will be lost every two weeks. and what's lost is sometimes irreplaceable even by one of the world's steam roller languages (english, chinese, french, etc.). for instance the Inuit language has a bunch of verbs for the word "know", covering various "flavors" of "knowing something"--"utsimavaa" - meaning somebody "knows" from direct experience to something like "nalunaiqpaa" meaning someone's "no longer unaware of something".

the article goes on to claim that welsh is a "great example", citing the existence of welsh porn. i guess they forgot about the irish.

anyway something to think about.

Puijilittatuq? why that's an Inuktitut (eskimo) word meaning "he does not know which way to turn because of the many seals he has seen come to the ice surface". man that's some kind of efficient communication.

January 6, 2005
back to our regularly scheduled i18n programming
a couple more non-tsunami i18n bits of information.

that i18n guy about town, tex texin, has put together a good document concerning the use of RFC 3066 language identifiers. you might lend a hand by perusing the table for any funny business (maybe like sinhalese in thailand--but hey, what do i know).

and just when i thought i knew everything about encoding (maybe because i actually think all you really have to know is Just Use Unicode), i find out something new. while doing some research in the java i18n forums i stumbled onto a really nifty java encoding resource, part of a java and internet glossary. i especially liked the term armouring (which i had never heard used in this context before): Converting binary data into printable gibberish so that data transport systems will not corrupt it. so that's what it's called.

October 1, 2004
what you don't know about latin-1 might hurt you
french cf users might want to pay attention to this...

there is an on-going discussion on the unicode list about "internationalization assumption" which simplistically goes something along the lines of if latin-1 is tested ok can we assume all latin-1 languages are "a-ok"? as it turns out, "no". some of the folks participating in this discussion have pointed out that, for example, not all french chars are found in latin-1. my first thought on reading that was, "oh yeah, the euro" but as it turns out there are a couple of french chars (no idea of their frequency of use but they are used in the french words for eye, egg, beef and heart) that are not in latin-1 but are in latin-9. for example see jukka korpela's excellent latin-1/latin-9 comparison page. these chars are also found in windows 1252 code page (which i guess helps support the idea that it's actually a superset of latin-1).
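a quick way to see this for yourself is to ask java whether a given charset can encode a character at all. the sketch below is only an illustration: the CharsetCheck class is made up, and it assumes your JVM ships the ISO-8859-15 and windows-1252 charsets (only ISO-8859-1 is guaranteed on every java platform, and Charset.forName() throws if a name isn't supported). the test character is small ligature oe (U+0153), the one in the french words for egg and heart.

```java
import java.nio.charset.Charset;

final class CharsetCheck {
    // true if every character in text can be represented in the named charset
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String oe = "\u0153";   // small ligature oe
        String euro = "\u20AC"; // euro sign
        for (String name : new String[] {"ISO-8859-1", "ISO-8859-15", "windows-1252"}) {
            System.out.println(name + " oe=" + canEncode(name, oe)
                    + " euro=" + canEncode(name, euro));
        }
    }
}
```

latin-1 should come back false for both characters, while latin-9 and windows-1252 should come back true.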

the moral of the story? just use unicode

September 16, 2004
YAKWS
oh my, Deutsche Welle has produced yet another klingon website (yakws). this is a decidedly bad idea. everybody knows klingons are bad drunks with long memories and i just know they're going to try to do something about unicode rejecting their encoding proposal now that they have media attention again.

i guess it's a good thing that JD's off to china. i heard they're still mad about his blogging about tengwar.

via CNN

June 16, 2004
mapping languages
wow a topic that combines my two favorite things, languages and maps. the Modern Language Association has produced an arcIMS driven site that allows interactive mapping of language down to the county/zip code level (from census data). you can compare one language against another, map language use by political boundary (GIS-speak for state, county, zip code), etc. oh it's so cool, it's making me giddy ;-) it's not "public" until wednesday and it looks like it's getting hammered right now. in any case, if you live in the US, it's certainly worth a peek.

off the CNN website.

February 26, 2004
new W3C i18n faq
the W3C has just issued a new i18n faq related to language negotiation. it discusses the just about absolute need for language negotiation on good i18n web sites, examining the old standby of HTTP Accept-Language header (i use that in combination with geolocator CFC) as well as stressing the need for manual language swapping (couldn't agree more). another important but sometimes overlooked point is "navigation stickiness", basically remembering which language a user has selected (in cf via cookies or session vars) & always serving content in that language. another interesting point (to me anyway) was a trick to also look at User-Agent header which sometimes also contains language (besides all that boring browser version, etc. stuff). cool. i'm going to look at adding that to the geoLocator CFC when Accept-Language is empty.

so now you know.

June 24, 2003
language negotiation
one important CGI variable from the G11N perspective is HTTP_ACCEPT_LANGUAGE. why? because it represents what language/locale the user wants as opposed to what cf might be able to deliver (via setEncoding(), cfProcessingDirective, cfcontent and the actual dynamic content). matching what the user wants and what your app can deliver is often called "language negotiation".

while HTTP_ACCEPT_LANGUAGE is usually a single locale or language (th-th or th for example) it can often be a list of languages/locales (especially w/Macs, some of the longest HTTP_ACCEPT_LANGUAGE lists i've ever seen came from Mac browsers, though lists from browsers in internet cafes at major tourist destinations can get pretty long as well). language preferences are usually listed (comma delimited) in order, with most preferred first, and may contain a quality (q) value that represents an estimate of the user's preference for that language range. for instance, "en-us,ko;q=0.5" means i prefer US english but will also accept Korean. whether a value for HTTP_ACCEPT_LANGUAGE exists depends on the browser age and whether a user has set it (for IE that would be via tools, internet options, languages). it also may only contain a language (en) rather than a full locale (en-ca) and we all know how important locale is ;-) because of this i use geoLocator (which determines locale from a user's IP) along with HTTP_ACCEPT_LANGUAGE to find and fix a user's locale. more info on HTTP_ACCEPT_LANGUAGE can be found here.
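to make the negotiation step concrete, here's a small java sketch (java 8 or later) that matches an Accept-Language value against the languages an app can actually deliver. the LanguageNegotiator class and its negotiate() method are invented for this example, and it is not how geoLocator works; it only leans on the standard Locale.LanguageRange.parse() and Locale.lookup() calls, which sort ranges by q value and fall back from a full locale (en-us) to the bare language (en). an empty or malformed header just returns the fallback, which is where something like an IP lookup could step in.

```java
import java.util.Collection;
import java.util.List;
import java.util.Locale;

final class LanguageNegotiator {
    // pick the best supported locale for an Accept-Language header value,
    // or the fallback when the header is missing, malformed or has no match
    static Locale negotiate(String acceptLanguage, Collection<Locale> supported, Locale fallback) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return fallback; // older browsers or users who never set a preference
        }
        List<Locale.LanguageRange> ranges;
        try {
            // parse() returns the ranges sorted by descending q value
            ranges = Locale.LanguageRange.parse(acceptLanguage);
        } catch (IllegalArgumentException e) {
            return fallback; // header did not follow the expected syntax
        }
        // lookup() tries each range in turn, truncating "en-us" to "en" as needed
        Locale match = Locale.lookup(ranges, supported);
        return match != null ? match : fallback;
    }
}
```

one thing to watch: the lookup truncates the user's range, not your supported list, so a user asking for plain "ko" won't match an app that only lists "ko-KR". listing the bare language among the supported locales avoids that.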