To: XML Dev <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Pure syntax vs the Infoset permathread (was Re: [xml-dev] The subsetting has begun)
From: "Simon St.Laurent" <simonstl@simonstl.com>
Date: Tue, 25 Feb 2003 12:18:28 -0500
In-reply-to: <oprk5p7u0yezizxn@smtp.comcast.net>

mc@xegesis.org (Mike Champion) writes:
>But it further strengthens the argument that essentially nobody except
>Simon :-) and the proverbial desperate Perl hacker actually works with
>XML at the pure syntax level.

If only it were so simple.  There is a large set of problems that the
Infoset and even the PSVI do very poorly at expressing, though I think
in many ways it goes back to bad layering (more precisely no layering)
in the XML 1.0 specification.

The problems largely have to do with information that comes from the
DOCTYPE declaration (or other annotative source) and is inserted into
the document between the reading of the bytes and the presentation of
an infosettish API to the application.  Many applications, if asked to
round-trip an XML document, save out an infosettish document rather than
the original.  The DOCTYPE is gone, entities are flattened into the
text, defaulted attributes appear as though they had been specified,
etc.  If any of that information mattered to you, you're stuck.
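
As a rough illustration of that round trip, here is a minimal Python
sketch using the standard library's xml.etree.ElementTree.  The sample
document, the "euro" entity, and the "status" attribute are made up for
the example, and other toolkits may behave differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an internal subset declaring one entity and one
# defaulted attribute. None of these names come from a real document.
SOURCE = """<!DOCTYPE doc [
  <!ENTITY euro "&#x20AC;">
  <!ATTLIST doc status CDATA "draft">
]>
<doc>Price: &euro;100</doc>"""


def round_trip(source: str) -> str:
    """Parse the document, then serialize what the API handed back."""
    root = ET.fromstring(source)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Compare the printed result with SOURCE by eye.
    print(round_trip(SOURCE))
```

With this parser the saved text should have no DOCTYPE, the &euro;
reference should come back as the bare character, and status="draft"
should appear on the element as if the author had typed it.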

This does happen pretty easily.  Recently, I accidentally overwrote a
book.xml file that referenced a huge number of chapter files.  That
was a serious mess, but fortunately I still had the original in a zip.

(Of course, if the tool didn't read external resources and then saved
the document without the DOCTYPE, it might be even worse, but I haven't
seen that case much.)

Entities are probably the case where staying close to the syntax
matters most.  I may well not want my special characters as numeric character
references or straight Unicode text.  In the case of books with
chapters, I may want to retain the ability to edit chapters without
digging into the whole @#X! book file.  We do have some nifty tools,
notably catalogs, which simplify dealing with these things, but they're
not much good when the DOCTYPE's just plain stripped.
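
One way to see what a round trip would put at risk is to look at the
raw text before any parser touches it.  The following Python sketch is
a desperate-hacker-grade check, not a parser: it assumes the references
of interest are not hidden in comments or CDATA sections, and the
function names and the sample book are hypothetical.

```python
import re

# Rough pattern for a general entity reference such as &chap1;
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")

# The five references every XML parser knows without a DOCTYPE.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def has_doctype(raw_xml: str) -> bool:
    """Report whether the raw text carries a DOCTYPE declaration."""
    return "<!DOCTYPE" in raw_xml


def named_entity_refs(raw_xml: str) -> list:
    """List non-predefined entity names in order of first appearance."""
    seen = []
    for name in ENTITY_REF.findall(raw_xml):
        if name not in PREDEFINED and name not in seen:
            seen.append(name)
    return seen


# Hypothetical book file that pulls chapters in through entities.
BOOK = """<!DOCTYPE book [
  <!ENTITY chap1 SYSTEM "chap1.xml">
  <!ENTITY chap2 SYSTEM "chap2.xml">
]>
<book>&chap1;&chap2; &amp; more</book>"""

if __name__ == "__main__":
    print(has_doctype(BOOK), named_entity_refs(BOOK))
```

Running the same check on a file before and after a tool saves it
would show whether the DOCTYPE and the chapter references survived.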

Default attributes are less of a problem, though I have heard of people
who change document processing context (different kinds of editors, for
instance) using different DOCTYPE declarations.  If it weren't that
DOCTYPE-sniffing has become such a common part of browsers I might write
this off as an odd approach (stylesheets seem more appropriate), but
there's something there.

A lot of the people dealing with these problems are users of software,
not programmers, and I worry that a lot of them are just giving up.
"How I Learned to Stop Worrying and Love the &#x20AC;" or something like
that.  

I wrote a piece on some of this a long time ago:
http://simonstl.com/articles/layering/layered.htm

I'm only just now getting to implementation, unfortunately.

-- 
Simon St.Laurent
Ring around the content, a pocket full of brackets
Errors, errors, all fall down!
http://simonstl.com -- http://monasticxml.org

References:
- Pure syntax vs the Infoset permathread (was Re: [xml-dev] The subsetting has begun)
  - From: Mike Champion <mc@xegesis.org>
