OASIS Mailing List Archives

Re: CDATA sections in W3C XML Infoset

Bob Kline wrote:

> No?  We have quite a bit of code in our XML repository which uses XML
> commands over sockets for its client-server interface to the rest of the
> world.  Most of the commands embed an XML document being stored in or
> retrieved from the repository.  The embedded documents are wrapped in
> CDATA sections.

And when the embedded document already contains a CDATA section?  Bzzzzt,
not well-formed.
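
The failure can be reproduced with a short sketch. It assumes Python's standard xml.etree.ElementTree parser; the wrapper element name "put" and both helper functions are hypothetical, not part of the repository code under discussion. A CDATA section ends at the first "]]>", so an embedded document that carries its own CDATA section closes the wrapper early:

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(document: str) -> str:
    # Hypothetical command format: the embedded document sits in one CDATA section.
    return "<put><![CDATA[" + document + "]]></put>"


def is_well_formed(xml_text: str) -> bool:
    # ElementTree raises ParseError on input that is not well-formed.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


plain = "<doc>a &lt; b</doc>"
nested = "<doc><![CDATA[a < b]]></doc>"

print(is_well_formed(wrap_in_cdata(plain)))   # True
print(is_well_formed(wrap_in_cdata(nested)))  # False: the inner "]]>" ends the wrapper
```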

>  The logic for extracting a document from an incoming
> client command is essentially:
>    Find the element containing the CDATA section.
>    Find the CDATA child of the element.
>    Hand the value of the CDATA section to the parser.

I admit this is an easy DOM-based hack.  But it shouldn't be
*that* much harder to know what element you are looking for,
pull out a Text child (initially there should be only one,
or you can normalize), and do the conversion below.
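
A minimal sketch of that extraction, assuming Python's xml.dom.minidom; the function name, the element names, and the sample command are hypothetical. Note that a conforming parser resolves the entity references while building the Text node, so the node's data is already the decoded document:

```python
from xml.dom import minidom


def extract_embedded(command_xml: str, element_name: str) -> str:
    # Parse the command and locate the element known to carry the document.
    dom = minidom.parseString(command_xml)
    elem = dom.getElementsByTagName(element_name)[0]
    # Merge adjacent Text nodes so the content is read as a single run.
    elem.normalize()
    # The parser has already turned &lt; and &amp; back into < and &.
    return "".join(
        node.data for node in elem.childNodes if node.nodeType == node.TEXT_NODE
    )


command = "<put><doc>&lt;a>x &amp;amp; y&lt;/a></doc></put>"
print(extract_embedded(command, "doc"))  # <a>x &amp; y</a>
```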

> Before you even think about suggesting how easy it would be to restore
> the angle brackets in the embedded document, let me point out that the
> &lt; and &gt; which are not delimiters for the element tags in the
> embedded document cannot be "restored" to < and >, and I submit that it
> is impossible in some cases to distinguish which those were.  Therefore
> information has been lost.

Not so if you encode properly.  By changing every "&" in the embedded
document to "&amp;" and every "<" to "&lt;" (conceptually in that order),
you get this result:

	Original	Embedding
	<		&lt;
	&		&amp;
	&lt;		&amp;lt;
	&amp;		&amp;amp;
	&amp;lt;	&amp;amp;lt;

Etc. etc.  No information is lost: change every "&lt;" to "<" and
every "&amp;" to "&" (conceptually in that order) and the exact
original is restored.  In this encoding, ">" characters need not
be changed.
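
The mapping and its inverse can be sketched in a few lines of Python. The function names embed and extract are hypothetical, and plain string replacement is one assumed way to apply the substitutions in the stated order:

```python
def embed(text: str) -> str:
    # Escape "&" first, so the "&" introduced by "&lt;" is not escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;")


def extract(encoded: str) -> str:
    # Reverse order: restore "<" first, then "&".
    return encoded.replace("&lt;", "<").replace("&amp;", "&")


samples = ["<", "&", "&lt;", "&amp;", "&amp;lt;", "<a>x &lt; y</a>"]
for original in samples:
    encoded = embed(original)
    # Every sample survives the round trip unchanged.
    assert extract(encoded) == original
    print(original, "->", encoded)
```

Swapping the order in extract would break the round trip: "&amp;lt;" would first become "&lt;" and then "<", instead of the original "&lt;".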

> Before you suggest that the embedded document should not have been
> wrapped in a CDATA section in the first place, let me say that:

[points snipped]

These points basically say that your embedded documents are text,
not necessarily XML.  The safe way to encode text in an XML document
is to use the mapping above.

There is / one art             || John Cowan <jcowan@reutershealth.com>
no more / no less              || http://www.reutershealth.com
to do / all things             || http://www.ccil.org/~cowan
with art- / lessness           \\ -- Piet Hein