


   Re: expat whitespace weirdness?

  • From: Lars Marius Garshol <larsga@garshol.priv.no>
  • To: "'xml-dev@xml.org'" <xml-dev@xml.org>
  • Date: Mon, 17 Jul 2000 14:16:52 +0200

* Tim Crook
| I was looking around to see if there might have been a particular
| reason why expat was implemented such that no leading white space is
| allowed before the standard <?xml version="1.0" ?> line. 

The reason is that the XML recommendation requires it. :-)
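
The behaviour is easy to reproduce. Here is a minimal sketch using
Python's xml.parsers.expat binding; the helper name parses_ok is made
up for this illustration and is not part of expat:

```python
from xml.parsers import expat

def parses_ok(data: bytes) -> bool:
    """Return True if expat accepts `data` as a complete document."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except expat.ExpatError:
        return False
    return True

DOC = b'<?xml version="1.0" ?><doc/>'

print(parses_ok(DOC))         # declaration is the very first thing
print(parses_ok(b" " + DOC))  # one leading space before the declaration
```

The first call should be accepted and the second rejected, since the
declaration is no longer at the start of the entity.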

| From my understanding of things, the Byte Order Mark is what allows
| an XML parser to determine which character set is in use. 

Not really. It allows a parser to determine whether UTF-16 was used,
and if so which variety of UTF-16 (BE or LE). However, if UTF-16 is
not used then the encoding can basically be anything.

| (see Appendix F, Autodetection of Character Encodings in
| http://www.w3.org/TR/REC-xml) If the Byte Order Mark is not found,
| shouldn't the starting content of the data stream be discarded until
| the Byte Order Mark is located?

If the BOM is not at the beginning of the data stream then there most
likely isn't one, for example because iso-8859-1 was used. This is
what makes it so handy that the XML declaration must appear first in
the document if it appears at all.

The rules then become something like:

 a) does the stream begin with a BOM? if yes, assume UTF-16
 b) does the stream begin with an XML declaration (in some encoding
    that the parser is able to figure out)? if yes, see what the
    encoding pseudo-attribute says.
 c) assume UTF-8
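
As a rough sketch, the three rules could look something like this in
Python. The function name guess_encoding and the regular expression
are assumptions for illustration only. Step b) here handles only
ASCII-compatible encodings, so it is much cruder than the
autodetection in Appendix F (it ignores UTF-16 without a BOM, EBCDIC
and so on):

```python
import codecs
import re

# Hypothetical pattern: picks the encoding pseudo-attribute out of an
# XML declaration written in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def guess_encoding(data: bytes) -> str:
    """Guess the encoding of an XML byte stream from its first bytes."""
    # a) a BOM at the very start means UTF-16; it also gives the byte order
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # b) an XML declaration at the very start: trust its encoding
    #    pseudo-attribute, if it has one
    if data.startswith(b"<?xml"):
        match = _DECL.match(data)
        if match:
            return match.group(1).decode("ascii").lower()
    # c) nothing else to go on
    return "utf-8"

print(guess_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>'))
print(guess_encoding(b"<doc/>"))
```

Both checks look only at the very first bytes of the stream, which is
why leading whitespace would break the scheme.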

--Lars M.

