
RE: UTF-8 BOM



Microsoft Notepad saves a BOM at the start of a UTF-8 file (as someone else
points out in another post).

As much as I'd like to dispense with the BOM since XML has no use for it, I
think the case of Notepad points out a valid use case for keeping it. One of
the strengths of XML is that it is just text and any text editor can be used
to compose XML. A non-XML-aware text editor, though, has no reliable way of
recognizing the character encoding of a file without a BOM. None of the
filesystems I know of support extended attributes that can identify the
character encoding of a text file, and every one of them is saddled with a
non-Unicode legacy for character encoding (ASCII, ISO-8859-1, whatever). If
we preclude the presence of a BOM in an XML entity, then we undermine the
utility of such generalized text editors for composing XML.
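To make the point concrete, here is a minimal sketch (in Python, purely for illustration) of the only check a generic editor can make without guessing: look for a BOM at the start of the file. The function name sniff_encoding and the codec names it returns are assumptions of this sketch, not anything Notepad or any particular editor is known to do.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data):
    """Return a Python codec name if data starts with a known BOM, else None.

    The returned codecs ("utf-8-sig", "utf-16", "utf-32") consume the BOM
    when decoding, so it does not end up in the decoded text.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: the editor has to fall back on a legacy default
```

With a BOM, sniff_encoding(b"\xef\xbb\xbf<doc/>") gives "utf-8-sig"; without one, sniff_encoding(b"<doc/>") gives None, which is exactly the case where the editor is left to guess.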

I think that use case is a strong argument for keeping the BOM, in spite of
the complications it poses for current XML parsers that don't support it.
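The complication for a parser can be sketched the same way. The snippet below uses Python's standard codecs only to show the two possible readings of the leading bytes; it says nothing about how Crimson or any other parser behaves, and the helper starts_with_markup is a made-up name for this illustration.

```python
data = b"\xef\xbb\xbf<doc/>"

# Reading 1: the bytes are an ordinary character, U+FEFF, in the text.
as_character = data.decode("utf-8")

# Reading 2: the bytes are an encoding signature and are not part of the text.
as_signature = data.decode("utf-8-sig")


def starts_with_markup(text):
    """True if the entity text begins directly with '<'."""
    return text.startswith("<")
```

Under the first reading the text begins with U+FEFF rather than "<", so starts_with_markup(as_character) is False, and an XML declaration would no longer be at the very start of the entity. Under the second reading the text is just "<doc/>".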

Unfortunately, I don't think Sun's Crimson parser supports the BOM. I
remember running into this problem with JAXP 1.0. I'll try to recreate a
test case this afternoon and pass on the results (I have to prepare for a
meeting right now).

> -----Original Message-----
> From: Richard Tobin [mailto:richard@cogsci.ed.ac.uk]
> Sent: Thursday, June 14, 2001 4:24 AM
> To: xml-dev@lists.xml.org
> Subject: UTF-8 BOM
> 
> 
> The W3C XML Core WG is considering the question of whether a UTF-8
> byte-order mark (BOM) is allowed at the start of an XML entity.  This
> question was raised a few weeks ago in a thread on comp.text.xml
> starting at article
> 
>   <180520011620538217%andreas.prilop@altavista.net>
> 
> We would like to determine how existing parsers handle the byte
> sequence #xEF #xBB #xBF when it appears at the start of an XML
> document or other entity.  Is it treated as a BOM (and not part
> of the text of the entity) or as a zero-width non-breaking space
> character?
> 
> We have placed a number of test cases at
> 
>   http://www.cogsci.ed.ac.uk/~richard/bomtest/
> 
> and would be grateful for feedback on how parsers handle them.  Please
> post results here in xml-dev to avoid unnecessary duplication.
> 
> We would also like to know of any editors (or similar tools) that
> generate XML documents starting with a UTF-8 BOM.
> 
> -- Richard (on behalf of the XML Core WG)
> 
> ------------------------------------------------------------------
> The xml-dev list is sponsored by XML.org, an initiative of OASIS
> <http://www.oasis-open.org>
> 
> The list archives are at http://lists.xml.org/archives/xml-dev/
> 
> To unsubscribe from this elist send a message with the single word
> "unsubscribe" in the body to: xml-dev-request@lists.xml.org
>