Unicode BOM as document separator [was: RE: [xml-dev] "Introducing MicroXML, Part 1: Explore the basic principles of ...]

From: Jim DeLaHunt <from.xml-dev@jdlh.com>
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date: Sun, 15 Jul 2012 14:36:20 -0700

David:

I'm not sure how important this is to your usage, but the Unicode
Standard already defines the meaning of a Byte Order Mark (BOM) code
point in the midst of data. Up until Unicode 3.2, the code point
U+FEFF had Byte Order Mark semantics at the start of a text
stream, and Zero Width No-Break Space (ZWNBSP) semantics
within a text stream. As such, your "<data>" element could validly
include a U+FEFF code point.

As of Unicode 3.2, the ZWNBSP semantics for U+FEFF are deprecated, and
a different code point, U+2060 WORD JOINER, is available for that
purpose. But the old use of U+FEFF as ZWNBSP will not have disappeared,
and you might encounter it in the wild.
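
As a quick illustration of the start-of-stream versus mid-stream
distinction, here is a minimal sketch in Python. The payload is
invented for the example; the point is only that the standard
"utf-8-sig" codec removes a single leading BOM and leaves any U+FEFF
inside the text untouched.

```python
import codecs

# U+FEFF encodes to the three bytes EF BB BF in UTF-8.
BOM = "\ufeff"
assert BOM.encode("utf-8") == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Hypothetical payload: a leading BOM plus a legacy U+FEFF inside <data>.
raw = codecs.BOM_UTF8 + "<data>foo\ufeffbar</data>".encode("utf-8")

# "utf-8-sig" strips only one leading BOM; the interior U+FEFF survives.
text = raw.decode("utf-8-sig")
interior_count = text.count(BOM)
```

So a consumer that treats every U+FEFF as a boundary would see one
here that the document's author never intended as a boundary.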

Reference:
http://unicode.org/faq/utf_bom.html#bom6
http://en.wikipedia.org/wiki/Byte_order_mark#Usage

At 2:42 PM +0000 7/15/12, David Lee wrote:
...
>I had an "Ah Ha" Moment last week when I realized that the UTF8 BOM 
>could serve as such a separator.
...
>Then I realized that if I used BOM as a separator it might actually 
>work and plain XML parsers could read the degenerate case of 1 
>document.
>If every document started like
>BOM <data>
>BOM <data>
>
>Then by themselves they are valid XML documents
>If you concatenate them they become
>BOM <data> BOM <data>
>
>which an XDM Serialized capable parser could parse, and in some cases
>"dumb" parsers might just see this as 1 document and stop.
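
To make the concern concrete for the concatenation scheme quoted
above, here is a rough sketch in Python. The helper split_on_bom and
the sample documents are hypothetical, not anything from David's
implementation; the sketch assumes the stream is UTF-8 and splits
naively at every BOM byte sequence.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"

def split_on_bom(stream: bytes) -> list[bytes]:
    """Naively split a concatenated stream at every UTF-8 BOM (hypothetical helper)."""
    return [part for part in stream.split(BOM) if part]

# Two well-behaved documents, each starting with a BOM.
doc_a = BOM + b"<data>one</data>"
doc_b = BOM + b"<data>two</data>"
clean = split_on_bom(doc_a + doc_b)

# A document whose content contains U+FEFF is cut in the wrong place.
doc_c = BOM + "<data>foo\ufeffbar</data>".encode("utf-8")
broken = split_on_bom(doc_a + doc_c)
```

With the first pair the split recovers two documents. With the second
pair it yields three fragments, because the U+FEFF inside the <data>
element is indistinguishable from a separator at the byte level.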

-- 
     --Jim DeLaHunt, jdlh@jdlh.com     http://blog.jdlh.com/ (http://jdlh.com/)
       multilingual websites consultant

       157-2906 West Broadway, Vancouver BC V6K 2G8, Canada
          Canada mobile +1-604-376-8953
