OASIS Mailing List Archives



Re: A simple guy with a simple problem

3/15/01 3:53:48 AM, "Steven E. Harris" <sharris@speakeasy.org> wrote:

>Relying upon preservation of the source input stream's peculiarities
>simply doesn't constitute a "robust" usage - or expectation of -
>XML. The XML you'd get out the other side of a SAX-level filter would
>still produce the same results if re-parsed¹, so why impose this
>syntactic preservation requirement?
>¹ Unless we're trying to do some "compression" by using entities to
>  avoid repeating long strings inline. But I digress.

In general, there *is* a class of applications that do have to preserve, as 
much as possible, the lexical properties of the input stream.  I'll 
generically refer to them as "editors," though this shouldn't be understood 
only as humans-type-on-a-screen applications; it would also encompass "stream" 
or "batch" editors (think sed).  They account for the minority of 
applications; most applications are naturally "structure driven" as the 
SGMLers put it or "infoset-driven" in modern terms, and therefore *must* not, 
as you say, be sensitive to lexical details (remember WML, where the syntactic 
role of a dollar sign depends on whether it was expressed literally or as a 
numeric character reference?  Ick!).
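
As a small illustration of what "infoset-driven" means in practice, here is a
sketch using Python's standard xml.sax module (my choice of tool for the
illustration, not anything from the thread).  The TextCollector class and the
sax_text helper are hypothetical names.  A SAX-level consumer gets the same
character data whether the dollar sign was typed literally or written as a
numeric character reference, so it has no way to tell the two apart:

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects the character data reported by the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so accumulate them.
        self.chunks.append(content)


def sax_text(document: bytes) -> str:
    """Return all character data a SAX consumer sees for the document."""
    handler = TextCollector()
    xml.sax.parseString(document, handler)
    return "".join(handler.chunks)


literal = sax_text(b"<p>cost: $5</p>")        # dollar sign typed literally
reference = sax_text(b"<p>cost: &#36;5</p>")  # same character as a reference
print(literal == reference)  # True: the lexical difference is gone
```

That is the right behavior for a structure-driven application, and it is also
why an editor-type tool cannot be built on this level of API alone.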

But editor-type applications do exist, and they need more information than 
something like SAX can provide.  That's not an argument for burdening SAX with 
requirements to report all sorts of lexical details; as I said, most 
applications are going to be structure driven, and they need a *simple* API 
for XML.  But editor applications are a different matter.  Here's my favorite 
example: let's say we have a tool that looks up abbreviated bibliographic 
references in a document and replaces them with full references.  If I have a 
book, physically organized into entities corresponding to chapters, I'd like 
to be able to run it through the tool without losing my chapter organization; 
I do *not* want the thing to come out as one giant lump of text.  For 
applications like that, it would be nice to be able to work with the document 
in a way that isn't completely structure-blind (like pure regex processing).  
I'm not sure, though, what sort of representation would be appropriate.
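
To make the bibliography example concrete, here is one possible middle
ground, offered as a rough sketch and not as an answer to the representation
question.  It splits the source into lexical tokens (tags, comments, entity
and character references) and plain text runs, edits only the text runs, and
copies every other token through untouched, so references such as &chap1;
survive.  Everything here is an assumption made for illustration: the TOKEN
pattern, the edit_text_runs and expand_refs helpers, and the FULL_REFS table
are hypothetical, and the tokenizer is deliberately naive (it does not handle
CDATA sections, a DOCTYPE with an internal subset, or ">" inside attribute
values).

```python
import re

# One token class for comments, tags, and entity/character references.
# Anything between two tokens is treated as a plain text run.
TOKEN = re.compile(r"<!--.*?-->|<[^>]*>|&[^;\s]+;", re.DOTALL)


def edit_text_runs(source, edit):
    """Apply edit() to text runs only; copy every other token unchanged."""
    out = []
    pos = 0
    for match in TOKEN.finditer(source):
        out.append(edit(source[pos:match.start()]))  # text before the token
        out.append(match.group(0))                   # token copied verbatim
        pos = match.end()
    out.append(edit(source[pos:]))                   # trailing text, if any
    return "".join(out)


# Hypothetical abbreviation table for the example.
FULL_REFS = {
    "[XML10]": "Extensible Markup Language (XML) 1.0, W3C Recommendation",
}


def expand_refs(text):
    """Replace each abbreviated reference with its full form."""
    for short, full in FULL_REFS.items():
        text = text.replace(short, full)
    return text


book = '<book>&chap1;<p ref="[XML10]">See [XML10].</p>&chap2;</book>'
result = edit_text_runs(book, expand_refs)
print(result)
```

The entity references and the attribute value come out exactly as they went
in; only the text run inside the paragraph is expanded.  The obvious weakness
is that the tool still knows nothing about element structure, so it is at
best a step up from pure regex processing.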