UTF-8 preamble is a problem when you concatenate files
You’re just changing a couple of words in an XML file with Notepad. Your data modifications are guaranteed to be valid by schema. That couldn’t possibly break anything, could it?
<insert the ugly buzzer sound>
It quite likely couldn’t, unless you were editing an XML file that happened to be using UTF-8. Because while Notepad certainly looks like a very innocent, raw data text editor, it really isn’t when it comes down to UTF-8 encoding.
Files encoded in UTF-8 can contain a byte order mark (BOM), also known as a preamble or a signature. It consists of the bytes 0xEF, 0xBB and 0xBF right at the start of the file, and identifies the encoding of the text file. If you ever see “ï»¿”, it’s the usual visual interpretation of an unparsed BOM (the three bytes read as Windows-1252), although other character sets can lead to other kinds of misrepresentations.
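As a rough illustration (in Python rather than .NET, purely for brevity), checking for the signature is a simple prefix test on the raw bytes. The helper name has_utf8_bom and the sample data are made up for this sketch.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

# Hypothetical sample: a tiny XML document saved with a BOM.
sample = codecs.BOM_UTF8 + "<root/>".encode("utf-8")

print(has_utf8_bom(sample))      # True
print(has_utf8_bom(b"<root/>"))  # False

# Misreading the same bytes as Windows-1252 shows the BOM as three visible characters.
print(sample.decode("cp1252"))   # ï»¿<root/>
```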
Why is this a problem?
Normally, it’s not. Most modern UTF-aware consumers (XML parsers, text editors etc.) understand the BOM just fine, although some problems exist particularly in Unix environments. But if files get concatenated together as binary, the BOM gets embedded in the middle of the file – turning into just normal data.
So, we had a strange application that somebody had written a long time ago. It created XML files by concatenating together various strings and XML files. The files were pushed into the ASP.NET Response stream by simple Response.Writes and Response.WriteFiles.
At this point, you probably guessed the rest. Somebody went ahead and edited one of the XML files (changing those classic “just two words”) that got added through Response.WriteFile, which is a binary operation… And boom, you have invalid data in your XML file. In this case, the file had previously always been edited in a text editor that didn’t add the preamble, but Notepad did.
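The effect is easy to reproduce outside ASP.NET. The following Python sketch only mimics the pattern described above (a string written first, then a file's raw bytes appended). The header and fragment contents are hypothetical, not the actual application's data.

```python
import codecs

# Hypothetical pieces: a header produced as a string, and a fragment file
# saved by an editor that prepends the UTF-8 BOM.
header = '<?xml version="1.0" encoding="utf-8"?>\n'.encode("utf-8")
fragment_file = codecs.BOM_UTF8 + "<root>data</root>".encode("utf-8")

# Binary concatenation: the fragment's bytes are appended untouched.
combined = header + fragment_file

# After decoding, the BOM is no longer at position 0. It survives as the
# character U+FEFF sitting in the middle of the document.
text = combined.decode("utf-8")
bom_index = text.find("\ufeff")
print(bom_index > 0)
print(repr(text[bom_index:bom_index + 7]))
```

A consumer only treats U+FEFF as a signature at the very start of the stream, so here it is handed on as ordinary character data in front of the root element.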
Removing the BOM
It’s really as trivial as removing the first three bytes of the file. If you don’t have a tool for that at your disposal, paste the contents into an editor that does not add the BOM. Alternatively, use a more sophisticated editor that allows you to choose whether you want a preamble or not.
For example, in Visual Studio, you can just choose File > Save As, then drop down the Save button and choose “Save with Encoding”. After that, you’ll have a dialog with lots of options, including “Unicode (UTF-8 without signature)” as well as a “Unicode (UTF-8 with signature)” one.
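If you ever need to do this in your own code, the .NET UTF8Encoding class has a constructor that lets you choose whether or not to emit the BOM, and you can pass the resulting encoding to a StreamWriter. Note that the static Encoding.UTF8 instance does emit a BOM, whereas a StreamWriter created without an explicit encoding writes UTF-8 without one. Since StreamReader skips a leading BOM when reading, BOMs get removed by just reading the data in and then writing it back out with such a default writer.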
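For comparison, here is a minimal Python sketch of the two removal approaches: slicing off the three bytes directly, or decoding with the standard "utf-8-sig" codec (which drops a leading BOM if one is present) and re-encoding as plain "utf-8". The function names are invented for illustration.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop the three BOM bytes if, and only if, the data starts with them."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def reencode_without_bom(data: bytes) -> bytes:
    """Decode with 'utf-8-sig' (skips a leading BOM), re-encode as plain UTF-8."""
    return data.decode("utf-8-sig").encode("utf-8")

# Hypothetical input: a small XML document saved with a BOM.
with_bom = codecs.BOM_UTF8 + "<root>päivää</root>".encode("utf-8")

print(strip_utf8_bom(with_bom) == reencode_without_bom(with_bom))
print(strip_utf8_bom(with_bom).startswith(codecs.BOM_UTF8))
```

Both functions leave data without a BOM unchanged, so they are safe to run over files that were never touched by Notepad.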
December 21, 2009
· Jouni Heikniemi · 4 Comments
Tags: charset · Posted in: .NET, Misc. programming
4 Responses
Timo Laak - December 21, 2009
Rule #1: Do not use Notepad. It sucks. Use Notepad++, UltraEdit or some other more advanced editor instead.
And it is very strange that Visual Studio does not use Unicode by default. MS really should abandon the Windows-1252 encoding.
Aki Björklund - December 21, 2009
Creating XML by concatenating strings is pretty stupid (but also pretty common). I would not blame the tool in this one.
Henri Sivonen: HOWTO Avoid Being Called a Bozo When Producing XML
Jouni Heikniemi - December 21, 2009
I definitely agree XML shouldn't be considered a string and that the code sucks.
On the other hand, the BOM-in-the-middle-of-the-file issue is not limited to XML: The same would appear in plain text as well.
Aki Björklund - December 21, 2009
There is no such thing as plain text.