xml-dev - SAX and Pull options: was: Penance for misspent attributes


To: "'Dennis Sosnoski'" <dms@sosnoski.com>,<xml-dev@lists.xml.org>
Subject: SAX and Pull options: was: Penance for misspent attributes
From: Bill de hÓra <dehora@eircom.net>
Date: Tue, 21 May 2002 02:38:11 +0100
Importance: Normal
In-reply-to: <3CE85539.6080703@sosnoski.com>

 
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Hi Dennis,

Thanks for the article pointer, good read.

> -----Original Message-----
> From: Dennis Sosnoski [mailto:dms@sosnoski.com]
> Sent: 20 May 2002 02:45
> To: Bill de hÓra
> Cc: xml-dev@lists.xml.org
> Subject: Re: [xml-dev] Penance for misspent attributes
> 
> 
> SAX is great for generic XML handling - it's easy to hook up a
> handler for building a document representation using DOM or some
> other model, for instance. It's very awkward for direct processing
> by an application, though, and I think autogenerating state
> machines just add another layer of complexity.

I like to think that autogeneration, done well, encapsulates
complexity.

 
> The only real problem with using pull parsers right now is
> limited availability.

I cite two other problems (maybe just nits) and two other
processing options. 


First problem: event based architectures are likely to become a
basis for building application servers, particularly as we stumble
into an era of machine to machine XML processing. Apache Axis is a
baby step in that direction, possibly a leap if and when it moves to
the non-blocking IO available in the 1.4 JDK. The problem with
placing pull based /parsing/ on top of event oriented servers is
that after working so hard to increase server throughput, you then
re-insert the processing bottleneck by virtue of the parsing being
the equivalent of blocking requests. The point isn't made against a
pull oriented /API/ per se, but if the processing must block, let
it block as late as possible, that is, just below the application. 

One way to deal with this inflection is to insert queues/buffers
between the event generation and application layer. This is a
pattern sometimes known as half sync, half async and is common
enough in operating systems where application level services
encapsulate asynchronous interrupts inside processes (which also
manage the state). The application client is provided with a higher
level view of the data and the need for the developer to manage
state in custom data structures is avoided. To some degree, part 2
of your series implements this pattern over SAX events. 
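
The half sync, half async arrangement can be sketched roughly as
follows. This is only an illustration under stated assumptions, not
code from the article or from any parser: the Event class, the
EventQueueDemo name, the end-of-stream marker and the sample data are
all hypothetical. A producer thread stands in for the event-driven
parser and pushes events into a bounded queue; the application side
pulls them with a blocking take(), so the only blocking happens just
below the application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical event type; a real system would carry SAX-style data here.
final class Event {
    static final Event END = new Event("END_DOCUMENT", "");
    final String type;
    final String value;
    Event(String type, String value) { this.type = type; this.value = value; }
}

class EventQueueDemo {
    // Async half: stands in for an event-driven parser pushing events.
    static Thread startProducer(BlockingQueue<Event> queue) {
        Thread t = new Thread(() -> {
            try {
                queue.put(new Event("START_TAG", "symbol"));
                queue.put(new Event("TEXT", "IBM"));
                queue.put(new Event("END_TAG", "symbol"));
                queue.put(Event.END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Sync half: the application pulls events and blocks only on the queue.
    static List<String> consume(BlockingQueue<Event> queue) throws InterruptedException {
        List<String> seen = new ArrayList<>();
        for (Event e = queue.take(); e != Event.END; e = queue.take()) {
            seen.add(e.type + ":" + e.value);
        }
        return seen;
    }

    static List<String> run() {
        BlockingQueue<Event> queue = new ArrayBlockingQueue<>(16);
        Thread producer = startProducer(queue);
        try {
            List<String> seen = consume(queue);
            producer.join();
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The bounded queue is the design choice that matters here: when the
application falls behind, the producer blocks on put() instead of
buffering without limit.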


Second problem: exposing conditional logic based on switch blocks
instead of visiting is a lost opportunity. I have this static
binding and external iteration:

    public void processDocument(XmlPullParser xpp)
        throws XmlPullParserException, IOException
    {
        int eventType = xpp.getEventType();
        do {
            if(eventType == xpp.START_DOCUMENT) {
                System.out.println("Start document");
            } else if(eventType == xpp.END_DOCUMENT) {
                System.out.println("End document");
            } else if(eventType == xpp.START_TAG) {
                processStartElement(xpp);
            } else if(eventType == xpp.END_TAG) {
                processEndElement(xpp);
            } else if(eventType == xpp.TEXT) {
                processText(xpp);
            }
            eventType = xpp.next();
        } while (eventType != xpp.END_DOCUMENT);
    }

when I could have had a runtime binding based on the types of the
visitor and visitee and internal iteration (presumably the parser
is best placed to know the token type) via double dispatch. 

Essentially, the control code inside the while(true) blocks in
PullWrapper is in the wrong place; it results from an
implementation detail (typecodes). It's notable that while SAX is
not quite a visitor, it doesn't require typecodes to identify
events. Also worth mentioning is that XML Pull defines 11 event
types, not 5, i.e. a full coverage of typecodes will look more like this:

    public void processDocument(XmlPullParser xpp)
        throws XmlPullParserException, IOException
    {
        int eventType = xpp.getEventType();
        do {
            if(eventType == xpp.START_DOCUMENT) {
                System.out.println("Start document");
            } else if(eventType == xpp.END_DOCUMENT) {
                System.out.println("End document");
            } else if(eventType == xpp.START_TAG) {
                processStartElement(xpp);
            } else if(eventType == xpp.END_TAG) {
                processEndElement(xpp);
            } else if(eventType == xpp.TEXT) {
                processOther(xpp);
            } else if(eventType == xpp.CDSECT) {
                processOther(xpp);
            } else if(eventType == xpp.ENTITY_REF) {
                processOther(xpp);
            } else if(eventType == xpp.IGNORABLE_WHITESPACE) {
                processOther(xpp);
            } else if(eventType == xpp.PROCESSING_INSTRUCTION) {
                processOther(xpp);
            } else if(eventType == xpp.COMMENT) {
                processOther(xpp);
            } else if(eventType == xpp.DOCDECL) {
                processOther(xpp);
            }
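            // nextToken() reports the low-level events; next() only
            // returns START_TAG, TEXT, END_TAG and END_DOCUMENT
            eventType = xpp.nextToken();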
        } while (eventType != xpp.END_DOCUMENT);
    }

Most of the XMLPULL examples I've seen today don't have a default
else{} block. Yet it's not hard to imagine futures for the API with
typecode creep, such as infoset/psvi extensions. 

There's potential to use polymorphism in a pull based API, rather
than type codes as seen in the DOM (no slight intended toward the
DOM designers, who did not have a clean slate to work from, to say
the least). If XMLPULL were a bespoke framework instead of a proposed
public API, replacing typecodes with polymorphism is one of the first
refactorings that would come to mind. You could have something like
this:

        XmlPullParser xpp = factory.newPullParser();
        xpp.setInput (reader);
        xpp.accept(new XMLPullVisitorImpl());

where:

    class XMLPullVisitorImpl implements XMLPullVisitor
    {
      public void visit(StartTag s)
      {System.out.println("START_TAG:"+s.getName());}

      public void visit(EndTag s)
      {System.out.println("END_TAG:"+s.getName());}

      public void visit(Text s)
      {System.out.println("TEXT:"+s.getText());}
    }

instead of this:

        XmlPullParser xpp = factory.newPullParser();
        xpp.setInput (reader);
        int eventType;
        while ((eventType = xpp.next()) != xpp.END_DOCUMENT) {
            if(eventType == xpp.START_TAG) {
                System.out.println("START_TAG "+xpp.getName());
            } else if(eventType == xpp.END_TAG) {
                System.out.println("END_TAG   "+xpp.getName());
            } else if(eventType == xpp.TEXT) {
                System.out.println("TEXT      "+xpp.getText());
            }
        }

Standardizing on switch blocks just doesn't seem like a good idea
when you've got objects available.
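
To make the double dispatch concrete, here is a minimal
self-contained sketch. Every name in it (PullVisitor, Token,
TokenSource, RecordingVisitor) is hypothetical and none of it exists
in XMLPULL; the token list stands in for whatever a real parser would
produce. Each token's accept() calls back the visit() overload for
its own type, and the source does the iterating, so the client never
sees a typecode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// All names below are hypothetical; none of them exist in XMLPULL.
interface PullVisitor {
    void visit(StartTag t);
    void visit(EndTag t);
    void visit(Text t);
}

interface Token {
    void accept(PullVisitor v);
}

final class StartTag implements Token {
    final String name;
    StartTag(String name) { this.name = name; }
    // Second dispatch: the overload is picked from this class's own type.
    public void accept(PullVisitor v) { v.visit(this); }
}

final class EndTag implements Token {
    final String name;
    EndTag(String name) { this.name = name; }
    public void accept(PullVisitor v) { v.visit(this); }
}

final class Text implements Token {
    final String text;
    Text(String text) { this.text = text; }
    public void accept(PullVisitor v) { v.visit(this); }
}

// Stand-in for a parser: it owns the iteration (internal iteration).
final class TokenSource {
    private final List<Token> tokens;
    TokenSource(Token... tokens) { this.tokens = Arrays.asList(tokens); }
    void accept(PullVisitor v) {
        for (Token t : tokens) {
            t.accept(v); // first dispatch: on the runtime type of the token
        }
    }
}

final class RecordingVisitor implements PullVisitor {
    final List<String> lines = new ArrayList<>();
    public void visit(StartTag t) { lines.add("START_TAG:" + t.name); }
    public void visit(EndTag t) { lines.add("END_TAG:" + t.name); }
    public void visit(Text t) { lines.add("TEXT:" + t.text); }
}

class VisitorDemo {
    public static void main(String[] args) {
        RecordingVisitor v = new RecordingVisitor();
        new TokenSource(new StartTag("symbol"), new Text("IBM"), new EndTag("symbol")).accept(v);
        v.lines.forEach(System.out::println);
    }
}
```

Adding a new token type then means adding a class and a visit()
overload, and the compiler flags every visitor that has not caught up,
which an if/else chain without a default branch will not do.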


First processing option: there are other ways to make SAX tractable
without pull or visitation and in a lightweight manner. 

I'm not altogether convinced that event-oriented is such an
unintuitive programming style to developers, though I do
acknowledge that pull-based might have more traction (the control
flow is highly visible, however that might be a warning signal in
an OO program, cf the aforementioned switch blocks). The problem is
in the associated bookkeeping and cognitive overhead of state
management. Essentially application input is chunked, as you say:

[[[
However, the framework doesn't eliminate event-driven programming's
complexity. That complexity springs from SAX2's divided
control-parsing approach: your application passes control to the
parser, which then calls handler methods within your code. The
handler methods usually need to accumulate data piece-by-piece
before they can finally do anything with the data. 
]]] 

Here's a simple outline, that I think provides most of what part 2
of your article requires in terms of SAX scaffolding:

import java.util.HashMap;
import java.util.Map;
import java.util.Stack;

// Base class with no-op defaults; subclasses override what they need.
class TagManager {
public void startTag(Stack s) {}
public void endTag(Stack s) {}
public void textFrag(String text, Stack s) {}
...
}

class OptionTradeManager extends TagManager {
 public void startTag(Stack s) {
   s.push(new OptionTrade());
 }
}

class SymbolManager extends TagManager {
 public void textFrag(String text, Stack s) {
   OptionTrade opt = (OptionTrade)s.peek();
   opt.setSymbol(append(text));
 }
 public String append(String text) {...}
}

class TrackingManager extends TagManager {
 public void startTag(Stack s) {
   s.push(new Tracking());
 }
}

class TradeHandler 
extends org.xml.sax.helpers.DefaultHandler {
 Map m = new HashMap();   // element name -> TagManager
 Stack s = new Stack();   // objects under construction
 TagManager tm;           // manager for the current element
 public Map initManager() {  
   m.put("option-trade", new OptionTradeManager());
   m.put("symbol", new SymbolManager());
   m.put("tracking", new TrackingManager());
   ...
   return m;
 }
 public TagManager getManager() {
   return tm; 
 }
 public void setManager(String name) {
   tm = (TagManager)m.get(name); 
 }
 public void startElement( ... ) {
    setManager( ... );
    getManager().startTag(s);
 }
 public void characters( char ch[], int start, int len) {
   getManager().textFrag(new String(ch, start, len), s);
 }
 ...
}

It's not difficult to see how TradeHandler could be factored to a
more generally useful object. 
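
One possible factoring, as a hedged sketch rather than a finished
design: the element-to-manager map is passed in, so the handler knows
nothing about trades. DispatchingHandler and DispatchDemo are
hypothetical names, TagManager is adapted from the outline above, and
a StringBuilder stands in for the OptionTrade object, which is not
defined here. The SAX calls themselves (SAXParserFactory,
DefaultHandler, InputSource) are the standard JAXP ones.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical base class with no-op defaults, adapted from the outline.
class TagManager {
    void startTag(Deque<Object> stack) {}
    void textFrag(String text, Deque<Object> stack) {}
    void endTag(Deque<Object> stack) {}
}

// Generic handler: all application knowledge lives in the injected map.
class DispatchingHandler extends DefaultHandler {
    private static final TagManager NO_OP = new TagManager();
    private final Map<String, TagManager> managers;
    // Managers of the currently open elements, innermost first.
    private final Deque<TagManager> active = new ArrayDeque<>();
    final Deque<Object> data = new ArrayDeque<>();

    DispatchingHandler(Map<String, TagManager> managers) { this.managers = managers; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        TagManager tm = managers.getOrDefault(qName, NO_OP);
        active.push(tm);
        tm.startTag(data);
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        if (!active.isEmpty()) {
            active.peek().textFrag(new String(ch, start, len), data);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        active.pop().endTag(data);
    }
}

class DispatchDemo {
    static String parseSymbol(String xml) {
        Map<String, TagManager> managers = new HashMap<>();
        managers.put("option-trade", new TagManager() {
            @Override void startTag(Deque<Object> stack) { stack.push(new StringBuilder()); }
        });
        managers.put("symbol", new TagManager() {
            @Override void textFrag(String text, Deque<Object> stack) {
                // characters() may arrive in several chunks, so append.
                ((StringBuilder) stack.peek()).append(text);
            }
        });
        DispatchingHandler handler = new DispatchingHandler(managers);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.data.peek().toString();
    }

    public static void main(String[] args) {
        System.out.println(parseSymbol("<option-trade><symbol>IBM</symbol></option-trade>"));
    }
}
```

Keeping a stack of active managers, rather than the single tm field
of the outline, means text that follows a nested element is routed
back to the parent's manager.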


Second processing option: you could just bind the XML to objects:

 TradeHistory data = new TradeHistory(new FileReader("stockdata.xml"));
 OptionTrade[] opt = data.getAll("option-trade"); // or data.getOptionTrades();

I imagine this approach might become popular in web services
programming environments and IDEs.


The interesting thing nonetheless is that in an event based
framework for XML processing, the state management for an
application can be declared and then generated using a rules based
format or simple mappings from symbols to behaviour (condition
action pairs in logic programming speak). That means we can blow
out an application's SAX handlers /or/ standardized pull logic.
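
As a rough illustration of condition action pairs, assuming nothing
beyond the JDK: the RuleTable and RuleDemo names, the
"event-type element-name" condition format and the hand-fed events
are all hypothetical. The table only maps a condition to an action;
whether fire() is called from a SAX callback or from a pull loop is
left open, which is the point.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical rule table: condition -> action.
class RuleTable {
    // Key is "eventType elementName"; the action gets the text payload and an output list.
    private final Map<String, BiConsumer<String, List<String>>> rules = new LinkedHashMap<>();

    RuleTable on(String eventType, String name, BiConsumer<String, List<String>> action) {
        rules.put(eventType + " " + name, action);
        return this;
    }

    // Fire the matching rule, if any. A SAX callback or a pull loop could call this.
    void fire(String eventType, String name, String payload, List<String> out) {
        BiConsumer<String, List<String>> action = rules.get(eventType + " " + name);
        if (action != null) {
            action.accept(payload, out);
        }
    }
}

class RuleDemo {
    static List<String> run() {
        RuleTable table = new RuleTable()
            .on("start", "option-trade", (payload, out) -> out.add("new trade"))
            .on("text", "symbol", (payload, out) -> out.add("symbol=" + payload));

        List<String> out = new ArrayList<>();
        // Hand-fed events standing in for parser output.
        table.fire("start", "option-trade", "", out);
        table.fire("start", "symbol", "", out); // no rule registered: ignored
        table.fire("text", "symbol", "IBM", out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```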

Bill de hÓra


-----BEGIN PGP SIGNATURE-----
Version: PGP 7.0.4

iQA/AwUBPOmk/uaWiFwg2CH4EQLYVQCg0a25wR4D9s6w7MF3rua3+ziXyH8An3rL
dnJGTSToPpSYYki6ggPjWOOO
=cwUJ
-----END PGP SIGNATURE-----

Follow-Ups:
- RE: [xml-dev] SAX and Pull options: was: Penance for misspent attributes
  - From: "Miles Sabin" <miles@milessabin.com>
- Re: [xml-dev] SAX and Pull options: was: Penance for misspent attributes
  - From: Dennis Sosnoski <dms@sosnoski.com>

References:
- Re: [xml-dev] Penance for misspent attributes
  - From: Dennis Sosnoski <dms@sosnoski.com>
