Why writing truly international software is hard
You cannot trust a thing. Forget intuition. As I recently posted, you can't just go about thinking every date value is representable under the current culture. Well, there's a lot more, and I'm learning some of it the hard way. You just cannot assume a case-insensitive comparison between "bit" and "BIT" would be true. "What?" you say with utter disbelief, and with good reason. That's what I said at first, too.
It's the Turkish-I problem. In most Latin alphabets the small letter i capitalizes to I. In Turkish it's not so. The small letter i capitalizes to İ (capital I with a dot, a glyph never even seen in English or Finnish!). And you guessed it, "our" capital I isn't i in lowercase; it's rather ı, a small i without a dot.
As long as you're comparing user input, this probably doesn't make a difference. But what about your internal data structures such as configuration keys, table names and so on? Right… If those are compared using the rules of the user's current culture, you may have buried bugs and even security holes by using string comparisons incorrectly.
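To make the failure concrete, here is a minimal sketch in Java, which has the same locale-sensitive casing behaviour as .NET. It is an illustration only, not code from the Microsoft article: the class `TurkishIDemo` and the helpers `naiveEquals` and `normalizeKey` are hypothetical names made up for this example. It assumes the usual fix of folding internal identifiers with a fixed, culture-neutral locale (`Locale.ROOT`) instead of whatever culture the user happens to run under.

```java
import java.util.Locale;

// Hypothetical demo class; the names here are illustrative assumptions.
class TurkishIDemo {

    // Case-insensitive comparison done the naive way: lowercase both
    // sides using the rules of the given locale, then compare.
    static boolean naiveEquals(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    // Safer normalization for internal identifiers (configuration keys,
    // table names): always fold case with the culture-neutral root locale.
    static String normalizeKey(String key) {
        return key.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Under Turkish rules, "BIT" lowercases to b + dotless i + t,
        // so it no longer matches "bit".
        System.out.println(naiveEquals("bit", "BIT", turkish));     // false
        System.out.println(naiveEquals("bit", "BIT", Locale.ROOT)); // true

        // Both spellings map to the same key, whatever the user's culture.
        System.out.println(normalizeKey("BIT").equals(normalizeKey("bit"))); // true
    }
}
```

The design choice is simply to keep culture out of anything that is not shown to or typed by a human: user-facing text gets the user's locale, internal keys get a fixed one.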
Microsoft has an article on this and a few other issues in regard to .NET 2.0 – that's a good read. But once you know it, there's always another i18n surprise around the corner. Paranoia rules the day.
Next up: Hacking the stemming algorithm of a full-text indexer to properly handle Chinese. <sigh>
September 27, 2005 · Jouni Heikniemi
Posted in: Misc. programming