Re: CDATA sections in W3C XML Infoset

From: John Cowan <jcowan@reutershealth.com>
To: Bob Kline <bkline@rksystems.com>, "xml-dev@xml.org" <xml-dev@xml.org>
Date: Fri, 30 Mar 2001 11:37:10 -0500

Bob Kline wrote:

> No?  We have quite a bit of code in our XML repository which uses XML
> commands over sockets for its client-server interface to the rest of the
> world.  Most of the commands embed an XML document being stored in or
> retrieved from the repository.  The embedded documents are wrapped in
> CDATA sections.

And when the embedded document already contains a CDATA section?  Bzzzzt,
not well-formed: CDATA sections do not nest, so the inner "]]>" closes
the outer section early.
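
A quick way to see the failure, as a minimal sketch using Python's
standard xml.etree.ElementTree.  The element names "cmd" and "doc" are
made up for illustration; they are not taken from the repository
being discussed.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# An embedded document that itself uses a CDATA section.
inner = "<doc><![CDATA[a < b]]></doc>"

# Naive wrapping: put the whole embedded document inside another CDATA section.
command = "<cmd><![CDATA[" + inner + "]]></cmd>"

print(is_well_formed(inner))    # the embedded document alone
print(is_well_formed(command))  # the wrapped command
```

The first "]]>" the parser meets ends the outer section, so what
follows it ("</doc>]]></cmd>") is parsed as markup and fails.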

>  The logic for extracting a document from an incoming
> client command is essentially:
> 
>    Find the element containing the CDATA section.
>    Find the CDATA child of the element.
>    Hand the value of the CDATA section to the parser.

I admit this is an easy DOM-based hack.  But it shouldn't be
*that* much harder to know what element you are looking for,
pull out a Text child (initially there should be only one,
or you can normalize), and do the conversion below.
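
The extraction step might look roughly like this with Python's
xml.dom.minidom; treat it as a sketch under assumptions, not as the
repository's actual code.  The function name extract_embedded and the
element name "body" are hypothetical.  Note that a conforming parser
has already resolved "&lt;" and "&amp;" by the time you read the Text
node's value, so no hand-written unescaping is needed on this path.

```python
from xml.dom import minidom, Node

def extract_embedded(command_xml, element_name):
    """Return the character data of the first element_name element."""
    dom = minidom.parseString(command_xml)
    elem = dom.getElementsByTagName(element_name)[0]
    elem.normalize()  # merge adjacent Text nodes, as suggested above
    # Accept plain Text and CDATA children alike; the caller need not care
    # which syntax the sender used.
    return "".join(
        node.data
        for node in elem.childNodes
        if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )

# The embedded document "<a>x &amp; y</a>", escaped rather than CDATA-wrapped.
command = "<cmd><body>&lt;a>x &amp;amp; y&lt;/a></body></cmd>"
embedded = extract_embedded(command, "body")

# Hand the recovered text to the parser as a document in its own right.
embedded_dom = minidom.parseString(embedded)
```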

> Before you even think about suggesting how easy it would be to restore
> the angle brackets in the embedded document, let me point out that the
> &lt; and &gt; which are not delimiters for the element tags in the
> embedded document cannot be "restored" to < and >, and I submit that it
> is impossible in some cases to distinguish which those were.  Therefore
> information has been lost.

Not so if you encode properly.  By changing every "&" in the embedded
document to "&amp;" and every "<" to "&lt;" (conceptually in that order),
you get this result:

	Original	Embedding
	<		&lt;
	&		&amp;
	&lt;		&amp;lt;
	&amp;		&amp;amp;
	&amp;lt;	&amp;amp;lt;

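Etc. etc.  No information is lost: change every "&lt;" to "<" and
every "&amp;" to "&" (conceptually in that order) and the exact
original is restored.  In this encoding, ">" characters need not
be changed, except that XML forbids a literal "]]>" in character
data, so a ">" in that sequence should be written as "&gt;" (and
"&gt;" changed back to ">" before the "&amp;" step).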
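
The mapping can be sketched in a few lines of Python.  The function
names embed and restore are made up for illustration.  One addition
beyond the table: XML 1.0 does not allow a literal "]]>" in character
data, so this sketch also writes the ">" of that sequence as "&gt;"
and undoes it on the way back.

```python
def embed(text):
    """Escape arbitrary text so it can sit in XML element content."""
    text = text.replace("&", "&amp;")    # must come first
    text = text.replace("<", "&lt;")
    text = text.replace("]]>", "]]&gt;")  # "]]>" is not allowed in content
    return text

def restore(text):
    """Invert embed(); "&amp;" must be handled last."""
    text = text.replace("&lt;", "<")
    text = text.replace("&gt;", ">")
    text = text.replace("&amp;", "&")
    return text

# An embedded document that contains both an entity reference and a
# CDATA section survives the round trip unchanged.
original = "<d>&lt;<![CDATA[x & y]]></d>"
encoded = embed(original)
decoded = restore(encoded)
```

The ordering matters in both directions: escaping "&" last, or
unescaping "&amp;" first, would turn an original "&lt;" into "<".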

> Before you suggest that the embedded document should not have been
> wrapped in a CDATA section in the first place, let me say that:

[points snipped]

These points basically say that your embedded documents are text,
not necessarily XML.  The safe way to encode text in an XML document
is to use the mapping above.

-- 
There is / one art             || John Cowan <jcowan@reutershealth.com>
no more / no less              || http://www.reutershealth.com
to do / all things             || http://www.ccil.org/~cowan
with art- / lessness           \\ -- Piet Hein
