   Re: (more) extensible SAX

  • From: Eric van der Vlist <vdv@dyomedea.com>
  • To: David Brownell <david-b@pacbell.net>
  • Date: Thu, 07 Dec 2000 00:50:05 +0100

David Brownell wrote:
> Summary:  I don't see a problem here.  No federal issue, as it were;
> layering works fine already.

Sure, SAX1 and SAX2 both work, but, like everything else, can't
they be improved?

> > In most of the papers I can read, SAX is opposed to DOM as a pull
> > versus push.
> >
> > While this is certainly an important difference, I don't see it as the
> > main difference, but I'd rather say that the main difference is that SAX
> > and DOM are acting at different levels and that SAX is the most
> > "neutral" interface, DOM being more biased by a specific interpretation
> > of what an XML document is.
> I see the functional difference as being that SAX is a callback
> API, while DOM is basically a data structure -- and often one
> that's not particularly task-appropriate.  There are also some
> differences in the data/infoset exposed, and very significant
> ones in portability.  (DOM still has no portable bootstrap API.)

We agree on 80% of my preliminaries, then...
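
To make the callback versus data structure distinction concrete, here is a minimal sketch, assuming Python's standard xml.sax and xml.dom.minidom modules. The sample document, the ItemCollector class and the two helper functions are invented for this illustration only.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, used only for this illustration.
DOC = b"<order><item>tea</item><item>milk</item></order>"


class ItemCollector(xml.sax.ContentHandler):
    """Callback style: the parser pushes events and we keep only what we need."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it until endElement.
        if self._in_item:
            self._buf.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buf))
            self._in_item = False


def items_via_sax(data):
    handler = ItemCollector()
    xml.sax.parseString(data, handler)
    return handler.items


def items_via_dom(data):
    # Data-structure style: build the whole tree first, then query it.
    tree = minidom.parseString(data)
    return [node.firstChild.data for node in tree.getElementsByTagName("item")]
```

Both helpers return the same list for DOC. The difference is that the SAX version never holds more than one item's text in memory, while the DOM version builds the complete tree before the application sees anything.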

> > Now, I'd like to go on by explaining what I think are the two weaknesses
> > of SAX.
> >
> > The first of them is that the information isn't raw enough for some
> > applications and that there is still an information loss in the
> > interpretation that is done ...
> Having looked at that issue in excruciating detail, I think it's
> typically fair to say that "some applications" want an API that
> presents lexical processing data.  SAX is a parser API, that's not
> what it was designed to address -- but a SAX2 extension could let a
> parser expose lexical data, if it wanted to go there.
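
For illustration, a minimal sketch of a parser exposing some lexical data through such an extension, assuming Python's xml.sax binding with its default expat driver, which accepts a lexical handler through the property_lexical_handler property. The LexicalRecorder class and the lexical_events helper are names made up for this example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler


class LexicalRecorder:
    """Hypothetical handler that records the lexical events it receives."""

    def __init__(self):
        self.events = []

    def comment(self, content):
        self.events.append(("comment", content))

    def startCDATA(self):
        self.events.append(("startCDATA",))

    def endCDATA(self):
        self.events.append(("endCDATA",))

    def startDTD(self, name, public_id, system_id):
        self.events.append(("startDTD", name))

    def endDTD(self):
        self.events.append(("endDTD",))


def lexical_events(data):
    parser = xml.sax.make_parser()
    recorder = LexicalRecorder()
    parser.setContentHandler(ContentHandler())  # ignore the ordinary events
    # The lexical handler is an optional extension set through a property.
    parser.setProperty(property_lexical_handler, recorder)
    parser.parse(io.BytesIO(data))
    return recorder.events
```

This only covers comments, CDATA boundaries and the DTD boundaries. It does not expose entity references or the original whitespace inside tags, so it stays well short of a full lexical API.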
> > This second (and almost opposite) one is that in some cases, there isn't
> > enough interpretation. The way SAX1 has needed to be modified to support
> > the namespaces is a good example for this and the problem is likely to
> > happen again as long as new features are added through modularization to
> > XML 1.0.
> Actually, SAX1 did not _need_ to be modified that way.  There were
> examples of doing such processing in layers above SAX1, even before
> the one that got bundled into SAX2.  That was a design choice, not
> a structural imperative.

Yes, and it has been implemented as a layer above the "core" parser
layer in AElfred 2 (as a separate class).
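
A rough sketch of that kind of layering (not AElfred 2's actual code), assuming Python's xml.sax, whose default parser reports namespace-unaware startElement events with the xmlns attributes left in place. The NamespaceLayer and NSRecorder classes and the parse_with_namespace_layer helper are hypothetical names for this illustration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase
from xml.sax.xmlreader import AttributesNSImpl


class NamespaceLayer(XMLFilterBase):
    """Turns namespace-unaware element events into namespace-aware ones."""

    def __init__(self, parent=None):
        super().__init__(parent)
        # Stack of prefix -> URI maps, one entry per open element.
        self._scopes = [{"xml": "http://www.w3.org/XML/1998/namespace"}]
        self._declared = []  # prefixes declared by each open element

    def _resolve(self, qname, is_attribute=False):
        prefix, sep, local = qname.partition(":")
        if not sep:
            # Unprefixed attributes are in no namespace; elements use the default.
            uri = None if is_attribute else self._scopes[-1].get("")
            return (uri or None, qname)
        # An unbound prefix yields None here; a real layer would report an error.
        return (self._scopes[-1].get(prefix), local)

    def startElement(self, name, attrs):
        scope = dict(self._scopes[-1])
        declared, plain = [], {}
        for qname in attrs.getNames():
            value = attrs.getValue(qname)
            if qname == "xmlns":
                scope[""] = value
                declared.append("")
            elif qname.startswith("xmlns:"):
                scope[qname[6:]] = value
                declared.append(qname[6:])
            else:
                plain[qname] = value
        self._scopes.append(scope)
        self._declared.append(declared)
        for prefix in declared:
            self._cont_handler.startPrefixMapping(prefix or None, scope[prefix])
        ns_attrs = {self._resolve(q, True): v for q, v in plain.items()}
        qnames = {self._resolve(q, True): q for q in plain}
        self._cont_handler.startElementNS(
            self._resolve(name), name, AttributesNSImpl(ns_attrs, qnames))

    def endElement(self, name):
        self._cont_handler.endElementNS(self._resolve(name), name)
        for prefix in reversed(self._declared.pop()):
            self._cont_handler.endPrefixMapping(prefix or None)
        self._scopes.pop()


class NSRecorder(xml.sax.ContentHandler):
    """Records the namespace-aware events produced by the layer."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startPrefixMapping(self, prefix, uri):
        self.events.append(("map", prefix, uri))

    def startElementNS(self, name, qname, attrs):
        self.events.append(("start", name, qname))


def parse_with_namespace_layer(data):
    layer = NamespaceLayer(xml.sax.make_parser())
    recorder = NSRecorder()
    layer.setContentHandler(recorder)
    layer.parse(io.BytesIO(data))
    return recorder.events
```

The core parser knows nothing about namespaces here. All the prefix bookkeeping lives in one separate class that can be added or left out.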

> > I think that both are coming from a quest to find a balance and to
> > define an API that will meet most of the needs (I could call it the "one
> > fits all" utopia) and that this issue should be addressed by adding more
> > modularity and layering rather than by adding more complexity to
> > existing methods.
> I agree about layering and modularity, but can't quite see why there
> would be any problem achieving either of those with the current SAX.
> Perhaps you're really wanting to see new layers get standardized?  :-)

That's one of my points, yes.
> > Last point, why do I call it a layered interface ?
> >
> > Because we could define on top of this a layered architecture where a
> > single event would get richer by each layer it comes through.
> >
> > The first layer could be the recognition of the basic XML productions.
> Which productions -- the lexical ones, or the grammatical ones?  I count
> two layers there.  (Evidently from its SGML heritage, XML doesn't have
> the cleanest of distinctions between those layers, but it exists.)  The
> SAX API is basically a grammatical layer.

Isn't namespace support mixing things up here?

And isn't that a reason to try for a cleanly layered approach?

> > A second layer could include entity processing and well-formedness
> > checks.
> Actually some of the XML rules require WF checks at a lexical level,
> while some are purely grammatical or content-based.  Entities are
> basically processed in the boundary between lexical and syntactical
> processing -- "&foo;" or "%bar;" need lexical exposure, but basically
> they're invisible otherwise.  (Yes, I'm partitioning the infoset into
> classic categories there.)
> > Next layers would include namespaces and scoped attributes.
> Hmm, you omitted validation.  Though it's known that validation can
> basically be done as a layer over SAX2 ... and that any such layers
> don't actually need to be "SAX (tm)" branded.

Not necessarily; it can come just after the first very raw layer.

I know I will probably be called a heretic, but exposing this as an
interface would make it possible to parse "not badly formed HTML",
including the mixture that MS Office exports as HTML files.

> > I don't see anything but advantages, one of them being the extensibility:
> > with this architecture, SAX2 would just have been a layer on top of
> > SAX1.
> >
> > Have I missed something?
> Well, there are already SAX2 wrappers of SAX1 parsers that work
> exactly that way -- except for "optional" features.

Yes, what I propose would be a generalization of this architecture.
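
To give an idea of what "an event getting richer at each layer" could look like, here is a minimal sketch, assuming Python's xml.sax and its XMLFilterBase class. The two layers, the layer:depth and layer:path attribute names, and the helper function are all invented for this illustration and are not part of any SAX specification.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase
from xml.sax.xmlreader import AttributesImpl


class DepthLayer(XMLFilterBase):
    """Adds the nesting depth to each start-element event."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._depth = 0

    def startElement(self, name, attrs):
        enriched = dict(attrs.items())
        enriched["layer:depth"] = str(self._depth)  # hypothetical attribute name
        self._depth += 1
        super().startElement(name, AttributesImpl(enriched))

    def endElement(self, name):
        self._depth -= 1
        super().endElement(name)


class PathLayer(XMLFilterBase):
    """Adds the element path to each start-element event."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._stack = []

    def startElement(self, name, attrs):
        self._stack.append(name)
        enriched = dict(attrs.items())
        enriched["layer:path"] = "/" + "/".join(self._stack)  # hypothetical name
        super().startElement(name, AttributesImpl(enriched))

    def endElement(self, name):
        self._stack.pop()
        super().endElement(name)


class Recorder(xml.sax.ContentHandler):
    """Application at the top of the stack: sees the fully enriched events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append((name, dict(attrs.items())))


def parse_through_layers(data):
    # Raw parser at the bottom, each layer wrapping the one below it.
    stack = PathLayer(DepthLayer(xml.sax.make_parser()))
    recorder = Recorder()
    stack.setContentHandler(recorder)
    stack.parse(io.BytesIO(data))
    return recorder.events
```

An application that only wants the raw events can attach its handler directly to the parser, and one that wants more can attach higher up. Each layer only has to know about the interface of the layer just below it.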

Thanks for your feedback.

> - Dave

See you at XML 2000
Eric van der Vlist       Dyomedea                    http://dyomedea.com
http://xmlfr.org         http://4xt.org              http://ducotede.com

