Re: A simple guy with a simple problem
- From: Eric Bohlman <email@example.com>
- To: "Steven E. Harris" <firstname.lastname@example.org>, email@example.com
- Date: Fri, 16 Mar 2001 02:20:55 -0600
3/15/01 3:53:48 AM, "Steven E. Harris" <firstname.lastname@example.org> wrote:
>Relying upon preservation of the source input stream's peculiarities
>simply doesn't constitute a "robust" usage - or expectation of -
>XML. The XML you'd get out the other side of a SAX-level filter would
>still produce the same results if re-parsed¹, so why impose this
>syntactic preservation requirement?
>¹ Unless we're trying to do some "compression" by using entities to
> avoid repeating long strings inline. But I digress.
In general, there *is* a class of applications that do have to preserve, as
much as possible, the lexical properties of the input stream. I'll
generically refer to them as "editors," though this shouldn't be understood
only as humans-type-on-a-screen applications; it would also encompass "stream"
or "batch" editors (think sed). They account for the minority of
applications; most applications are naturally "structure driven" as the
SGMLers put it or "infoset-driven" in modern terms, and therefore *must* not,
as you say, be sensitive to lexical details (remember WML, where the syntactic
role of a dollar sign depends on whether it was expressed literally or as a
numeric character reference? Ick!).
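
To make the "structure-driven" point concrete, here is a minimal sketch in
Python using the standard xml.sax module. The sample documents and the names
TextCollector and sax_text are made up for illustration. It compares what a
SAX consumer sees when a dollar sign is typed literally, written as a numeric
character reference, or pulled in through an internal entity:

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects only what SAX reports through characters()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces; keep them all.
        self.chunks.append(content)


def sax_text(xml_source):
    """Return the character data a SAX application would see."""
    handler = TextCollector()
    xml.sax.parseString(xml_source.encode("utf-8"), handler)
    return "".join(handler.chunks)


# Three lexically different spellings of the same content (hypothetical data).
literal = "<p>cost: $5</p>"
char_ref = "<p>cost: &#36;5</p>"
with_entity = '<!DOCTYPE p [<!ENTITY price "cost: $5">]><p>&price;</p>'

print(sax_text(literal) == sax_text(char_ref) == sax_text(with_entity))
```

A structure-driven application is happy with this, since it never has to care
which spelling was used. An editor-type application has already lost the
information it would need to write the document back out the way it came in.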
But editor-type applications do exist, and they need more information than
something like SAX can provide. That's not an argument for burdening SAX with
requirements to report all sorts of lexical details; as I said, most
applications are going to be structure driven, and they need a *simple* API
for XML. But for editor applications it would be nice to be able to work with
the document in a way that isn't completely structure-blind (like pure regex
processing). Here's my favorite example: let's say we have a tool that looks
up abbreviated bibliographic references in a document and replaces them with
full references. If I have a book, physically organized into entities
corresponding to chapters, I'd like to be able to run it through the tool
without losing my chapter organization; I do *not* want the thing to come out
as one giant lump of text. I'm not sure, though, what sort of representation
would be appropriate.
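
As one possible starting point, and only as a sketch, a parser interface below
the SAX level can hand back the lexical tokens untouched. The example below
assumes Python's xml.parsers.expat module, whose DefaultHandler is documented
to receive otherwise-unhandled input and to inhibit expansion of internal
entities. The function name lexical_chunks and the sample book are
hypothetical, and this is a pass-through, not the representation the question
above is asking for:

```python
import xml.parsers.expat


def lexical_chunks(xml_source):
    """Return the raw pieces of the document as the parser tokenized them."""
    chunks = []

    def keep(data):
        # With no other handlers installed, everything lands here verbatim.
        chunks.append(data)

    parser = xml.parsers.expat.ParserCreate()
    # DefaultHandler (unlike DefaultHandlerExpand) leaves references to
    # internal entities unexpanded and passes them through as text.
    parser.DefaultHandler = keep
    parser.Parse(xml_source, True)
    return chunks


# Hypothetical book whose chapter lives in an entity.
source = (
    '<!DOCTYPE book [<!ENTITY chap1 "<chapter>One</chapter>">]>'
    "<book>&chap1;<note>&#36;5</note></book>"
)

roundtrip = "".join(lexical_chunks(source))
print(roundtrip == source)
```

The pieces still parse as XML on the way through, so this is not entirely
structure-blind, but the chapter reference and the character reference both
survive. A real tool would still need some way to tell text apart from markup
before rewriting anything.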