Thursday, November 16, 2006

Omitting the Byte Order Mark When Saving an XML Document

Have you ever opened up an XML file in Notepad and noticed those little gobbledygook characters preceding the XML declaration?  Well, my friend, you have encountered the ever-elusive Byte Order Mark, or BOM for short.  I prefer Byte Order Mark because BOM sounds dirty, kind of like scrum.  Eck, dirty. 

The Byte Order Mark is basically a few bytes (three for UTF-8: EF BB BF; two for UTF-16: FF FE or FE FF) that are added to the beginning of XML files to denote their encoding.  I know... I know...  You can declare the encoding in the XML declaration.  That's what it is there for, right?  To declare stuff. 
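If you want to see which Byte Order Mark a file starts with, here is a minimal sketch that compares the first bytes against the preambles .NET itself reports through GetPreamble().  The class name BomSniffer and the method DetectBom are made up for this illustration, and UTF-32 is left out to keep it short.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns a label for the Byte Order Mark at the start of the data,
    // or "none" when none of the preambles checked here is present.
    public static string DetectBom(byte[] head)
    {
        if (StartsWith(head, new UTF8Encoding(true).GetPreamble())) return "UTF-8";              // EF BB BF
        if (StartsWith(head, new UnicodeEncoding(false, true).GetPreamble())) return "UTF-16 LE"; // FF FE
        if (StartsWith(head, new UnicodeEncoding(true, true).GetPreamble())) return "UTF-16 BE";  // FE FF
        return "none";
    }

    // True when data begins with every byte of a non-empty prefix.
    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (prefix.Length == 0 || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Feed it the first few bytes of a file, for example the start of what File.ReadAllBytes returns, and it tells you which of the three marks, if any, is sitting in front of the XML declaration.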

Well, that is only partly true.  The W3C defines the following three rules that decide how an XML parser should read an XML file:

  1. If there is a Byte Order Mark the Byte Order Mark defines the file encoding.
  2. If there is no Byte Order Mark, then the encoding attribute in the XML declaration is definitive.
  3. If there is neither of these, then assume the XML document is UTF-8 encoded.
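The three rules above can be sketched as a simple decision chain.  This is only an illustration under some loud assumptions: XmlEncodingSketch and ResolveEncodingName are hypothetical names, only the UTF-8 and UTF-16 marks are recognized, and the declaration is read as ASCII, which only works when the file's encoding is ASCII-compatible.  A real parser does considerably more.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlEncodingSketch
{
    // Applies the three rules in order and returns a lowercase encoding name.
    public static string ResolveEncodingName(byte[] head)
    {
        // Rule 1: a Byte Order Mark, when present, decides the encoding.
        if (HasPrefix(head, new byte[] { 0xEF, 0xBB, 0xBF })) return "utf-8";
        if (HasPrefix(head, new byte[] { 0xFF, 0xFE })) return "utf-16";
        if (HasPrefix(head, new byte[] { 0xFE, 0xFF })) return "utf-16";

        // Rule 2: no mark, so look for encoding="..." in the XML declaration.
        // Assumption: decoding as ASCII is good enough to read the declaration.
        string text = Encoding.ASCII.GetString(head);
        Match m = Regex.Match(
            text,
            "^<\\?xml[^>]*\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");
        if (m.Success) return m.Groups[1].Value.ToLowerInvariant();

        // Rule 3: nothing declared anywhere, so assume UTF-8.
        return "utf-8";
    }

    // True when data begins with every byte of prefix.
    private static bool HasPrefix(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Notice that the mark is checked first, so a file that starts with EF BB BF is treated as UTF-8 here even if its declaration claims something else.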

I think that I remember reading an article once that claimed the Byte Order Mark was born out of Windows NT, but I can't find it now.  Either way, you are bound to come across some service somewhere that doesn't like it, probably because the service thinks that it is dirty.  All strings in .NET are UTF-16 in memory.  If you build an XmlDocument and save it with one of the stock encodings, such as Encoding.UTF8 or Encoding.Unicode, there will be a silly little Byte Order Mark by default.  Visual Studio, like most current XML editors, won't show it to you, but it's there. 

Here's how I prevent the Byte Order Mark from appearing in my generated XML files.

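public void WriteXmlFile(XmlDocument xdoc) {   // needs: using System.Text; using System.Xml;
    // false = do not emit the UTF-8 Byte Order Mark
    System.Text.Encoding enc = new UTF8Encoding(false);
    using (XmlWriter w = new XmlTextWriter("NewFile.xml", enc))
        xdoc.Save(w);   // write the document; disposing the writer flushes and closes the file
}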

When I create the UTF8Encoding object, I pass in false for the encoderShouldEmitUTF8Identifier parameter.  The "identifier" is the Byte Order Mark itself, so false means "do not emit it", and the mark is omitted from the NewFile.xml file. 
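If you would rather check than take my word for it, here is a small sketch that saves the same document into memory twice, once with each value of the boolean, so you can look at the first bytes.  BomCheck and SaveToBytes are names I made up for this example; the document used below is just a placeholder.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class BomCheck
{
    // Serializes the document into memory with the given encoding
    // and returns the raw bytes exactly as they would land in a file.
    public static byte[] SaveToBytes(XmlDocument xdoc, Encoding enc)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            using (XmlTextWriter w = new XmlTextWriter(ms, enc))
            {
                xdoc.Save(w);
            }
            // ToArray still works after the writer has closed the stream.
            return ms.ToArray();
        }
    }
}
```

A quick way to use it:

```csharp
XmlDocument doc = new XmlDocument();
doc.LoadXml("<root/>");
byte[] withMark = BomCheck.SaveToBytes(doc, new UTF8Encoding(true));   // begins with EF BB BF
byte[] noMark = BomCheck.SaveToBytes(doc, new UTF8Encoding(false));    // begins with 3C, the '<' character
```

Passing true emits the three-byte mark; passing false leaves it out, so the file starts directly with the markup.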

1 comment:

Anonymous said...

Shouldn't it set to true and not to false?
Since false!= omit