xml-dev - Re: [xml-dev] DESIGN PROPOSAL: Java XMLIterator

To: "John Cowan" <jcowan@reutershealth.com>, <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] DESIGN PROPOSAL: Java XMLIterator
From: "James Clark" <jjc@jclark.com>
Date: Tue, 18 Dec 2001 12:34:10 +0700
References: <3C1E6D9E.3070401@reutershealth.com>
> This is a first design for XMLIterator, a third base-level API
> which allows an application to pull content from XML.  This
> avoids the memory demand and navigation issues of DOM, and
> is a more straightforward programming model than SAX, which
> requires magic data connections between the event handlers in
> order to maintain application state.  XMLIterator extends
> the familiar Iterator interface, so it models an XML document
> as a linear collection of partially specified nodes.

I very much agree that we need such an API.  SAX works great for some
kinds of application.  In particular, it works well for generic XML
applications which do not have to parse a particular XML vocabulary.
However, SAX is really awkward for some applications, particularly
applications that parse a particular XML vocabulary with a complex,
highly nested structure.

As it happens, I have been working on a similar API for the last few
months.  One impetus for doing this was my experience in implementing
Jing. I was struck by how painful it was to parse a RELAX NG schema
into an internal form using SAX.  The equivalent non-XML syntax was
easily parsed using a straightforward recursive descent parser.  By
contrast, the parser for the XML syntax was a warped and twisted mess.

My API is currently called "pullax" (pull API for XML). This is still
very much work in progress.  I hadn't been planning to release for a
month or two yet.  But since you have started this discussion, I think
the most constructive thing I can do is to release what I have now.  I
do have quite a comprehensive API and I do have a fairly complete
sample implementation.  I have made this available at

  http://www.thaiopensource.com/pullax/

I chose to do my initial sample implementation on top of Xerces 2
because it provides a native interface (XNI) with a "pull" parser
API. (I would call it a "controlled push" rather than a "pull"
API. Roughly, it has a variant of XMLReader.parse which you call
multiple times; on each call, it parses some portion of the document
making SAX-like callbacks on handlers.)  This allows an implementation
that neither requires the whole document in memory (as would an
implementation on top of DOM), nor the use of threads (as would an
implementation on top of SAX).  XNI also provides a very rich set of
information. You'll need Xerces 2 Beta 3 if you want to play with my
implementation.  See

   http://xml.apache.org/xerces2-j/index.html

Obviously, SAX and DOM adapters are on my list of things to do.

The bad news is that the API documentation is pretty pathetic at the
moment and still needs a lot of work. This message will have to serve
as an overview of the API for now.

In designing pullax, I have tried to follow modern Java best
practices, for example, in favoring immutability and using classes for
type-safe enumerations. One of my main guides here has been Joshua
Bloch's book "Effective Java"
(http://java.sun.com/docs/books/effective/).  This is a truly
excellent book done by the guy who designed several of the better
recent Java platform APIs (including the Collections API).

Perhaps the most fundamental decision in designing a pull API is
whether the properties for each node are provided

(a) by methods on some sort of node object returned by the
scanner/parser/iterator object

(b) by methods on the scanner/parser object itself; the scanner/parser
object has methods to move to the next node

You've chosen (a).  A couple of notable pull APIs use (b):

- the XmlReader API in .NET; this is the principal XML parser API for
.NET (see
http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemxmlxmlreaderclasstopic.asp)

- XML Pull Parser (http://www.extreme.indiana.edu/soap/xpp/)

I tried it both ways in pullax.  I ended up, like you, with (a), for
the following reasons:

1. Handling attributes in (b) is messy

2. (a) works more like the java.util.Iterator and
java.util.Enumeration that are familiar to every Java programmer

3. (a) makes it much easier to construct filters/processing pipelines;
for example, writing a RELAX NG validator that wraps around a
non-validating parser.
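
To make reason 3 concrete, here is a minimal sketch of a filter written
in style (a).  Every name in it (Item, ItemScanner, CursorScanner,
ListScanner, DropBlankText, PipelineDemo) is invented for illustration
and is not taken from pullax or XMLIterator.  The filter is just another
scanner wrapped around a scanner, so stages compose without knowing
about each other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

// All names here are hypothetical illustrations, not the pullax API.
final class Item {
    enum Type { START_ELEMENT, END_ELEMENT, TEXT }
    final Type type;
    final String value; // element name, or character data for TEXT
    Item(Type type, String value) { this.type = type; this.value = value; }
}

// Style (a): each call to next() returns a self-contained item object.
interface ItemScanner {
    boolean hasNext();
    Item next();
}

// Style (b), shown only for contrast: properties are read from the scanner
// itself, so a filter has to re-expose every accessor and track cursor state.
interface CursorScanner {
    boolean advance();
    Item.Type currentType();
    String currentValue();
}

// A trivial style (a) source backed by a list.
final class ListScanner implements ItemScanner {
    private final List<Item> items;
    private int pos = 0;
    ListScanner(List<Item> items) { this.items = items; }
    public boolean hasNext() { return pos < items.size(); }
    public Item next() { return items.get(pos++); }
}

// A style (a) filter: wraps any ItemScanner and drops whitespace-only TEXT.
final class DropBlankText implements ItemScanner {
    private final ItemScanner in;
    private Item pending; // next item to hand out, or null if none fetched yet
    DropBlankText(ItemScanner in) { this.in = in; }
    public boolean hasNext() {
        while (pending == null && in.hasNext()) {
            Item candidate = in.next();
            boolean blank = candidate.type == Item.Type.TEXT
                    && candidate.value.trim().isEmpty();
            if (!blank) pending = candidate;
        }
        return pending != null;
    }
    public Item next() {
        if (!hasNext()) throw new NoSuchElementException();
        Item result = pending;
        pending = null;
        return result;
    }
}

final class PipelineDemo {
    static List<String> run() {
        ItemScanner source = new ListScanner(Arrays.asList(
                new Item(Item.Type.START_ELEMENT, "doc"),
                new Item(Item.Type.TEXT, "  \n"),
                new Item(Item.Type.TEXT, "hello"),
                new Item(Item.Type.END_ELEMENT, "doc")));
        ItemScanner filtered = new DropBlankText(source);
        List<String> seen = new ArrayList<>();
        while (filtered.hasNext()) {
            Item item = filtered.next();
            seen.add(item.type + ":" + item.value);
        }
        return seen;
    }
}
```

PipelineDemo.run() yields the start tag, the "hello" text and the end
tag; the blank text item is dropped by the filter stage.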

The main argument against (a) is that it involves more object
creation, which, according to Java folklore, is a performance killer.

Now, you've minimized object creation by having next() implicitly
invalidate any previously returned nodes. I don't think this is an
acceptable design for an API intended for widespread public use:

1. It's a common requirement to look ahead in the document when
deciding how to process the current node.  Your design makes this
awkward.  It also makes it very awkward to write a filter that needs
lookahead in doing its filtering (imagine a filter that merges
adjacent text nodes).

2. This behavior would be a big surprise to the average Java user.
The Iterators and Enumerations which a typical Java user will be
familiar with just don't work like this.

3. It's the kind of API that leads to "Write Once, Debug Everywhere"
rather than "Write Once, Run Everywhere".  A typical scenario is that
a user writes an application that needs lookahead; they incorrectly
access an XMLNode object after another call to next(); they test their
application with an implementation that allocates a new XMLNode object
for each next() call; their application appears to work fine. Then
somebody else tries to use the application with a parser
implementation that reuses XMLNode objects and the application
mysteriously and silently gives the wrong results.

In summary, this design does not promote reliability.  I believe
priority should be given to reliability over performance.
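
The scenario in point 3 can be reproduced in a few lines.  This is only
a sketch: Node, AllocatingIterator, ReusingIterator and LookaheadDemo
are made-up stand-ins, not classes from XMLIterator or pullax.  The same
application code runs against two implementations that both satisfy an
"earlier nodes may be invalidated" contract.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins; not XMLIterator's or pullax's real classes.
final class Node {
    String text; // mutable on purpose, so an implementation may reuse the object
    Node(String text) { this.text = text; }
}

// Implementation 1: allocates a fresh Node for every call to next().
final class AllocatingIterator implements Iterator<Node> {
    private final Iterator<String> source;
    AllocatingIterator(List<String> texts) { this.source = texts.iterator(); }
    public boolean hasNext() { return source.hasNext(); }
    public Node next() { return new Node(source.next()); }
}

// Implementation 2: reuses one Node, so next() overwrites the previous result.
final class ReusingIterator implements Iterator<Node> {
    private final Iterator<String> source;
    private final Node shared = new Node(null);
    ReusingIterator(List<String> texts) { this.source = texts.iterator(); }
    public boolean hasNext() { return source.hasNext(); }
    public Node next() {
        shared.text = source.next();
        return shared;
    }
}

final class LookaheadDemo {
    // Application code with the bug: 'first' is read after next() was called again.
    static String firstTwo(Iterator<Node> it) {
        Node first = it.next();
        Node second = it.next();
        return first.text + second.text;
    }

    static String withAllocating() {
        return firstTwo(new AllocatingIterator(Arrays.asList("4", "2")));
    }

    static String withReusing() {
        return firstTwo(new ReusingIterator(Arrays.asList("4", "2")));
    }
}
```

withAllocating() returns "42", while withReusing() returns "22" without
any exception or warning.  Nothing in the application code changed
between the two runs.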

My "solution" is simply to accept the object creation.  Modern Java
VMs (like Hotspot) do a fantastic job of efficient allocation of
short-lived objects; object creation has much less performance
overhead with modern VMs than it used to with classic VMs.  In any
case, a user that is prepared to sacrifice programming convenience for
an extra ounce of performance can use SAX. (Also, since the objects
returned are immutable, there is an opportunity for reducing object
creation by sharing.)

The central interface in my API is XmlScanner. (I'm planning a
companion XmlPrinter interface for writing XML.) This corresponds to
your XMLIterator interface.  This interface is similar to
java.util.Iterator but I chose not to derive XmlScanner from Iterator,
for two reasons:

1. the equivalents of the next() and hasNext() methods need to be
able to throw a java.io.IOException

2. it's awkward and inefficient to always have to cast the return
value of next()
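
A rough sketch of what those two points imply for the shape of the
interface.  The names SketchScanner, TextItem, LineScanner and
ScannerDemo are hypothetical, and the line-based source is only a
stand-in for a real parser; the actual XmlScanner signatures are in the
pullax distribution, not here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.NoSuchElementException;

// Hypothetical item type; a real item would carry far more information.
final class TextItem {
    final String value;
    TextItem(String value) { this.value = value; }
}

// Iterator-like, but not java.util.Iterator: both methods may throw
// IOException, and next() has a specific return type, so no cast is needed.
interface SketchScanner {
    boolean hasNext() throws IOException;
    TextItem next() throws IOException;
}

// Stand-in source: each line of the input becomes one item.
final class LineScanner implements SketchScanner {
    private final BufferedReader in;
    private String pending; // line read ahead by hasNext(), or null
    LineScanner(Reader reader) { this.in = new BufferedReader(reader); }
    public boolean hasNext() throws IOException {
        if (pending == null) pending = in.readLine();
        return pending != null;
    }
    public TextItem next() throws IOException {
        if (!hasNext()) throw new NoSuchElementException();
        TextItem item = new TextItem(pending);
        pending = null;
        return item;
    }
}

final class ScannerDemo {
    static String joinLines(String text) {
        try {
            SketchScanner scanner = new LineScanner(new StringReader(text));
            StringBuilder sb = new StringBuilder();
            while (scanner.hasNext()) {
                // No cast: next() already returns TextItem.
                sb.append(scanner.next().value).append('|');
            }
            return sb.toString();
        } catch (IOException e) {
            // Wrapped only so that callers of this demo need no throws clause.
            throw new UncheckedIOException(e);
        }
    }
}
```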

My XmlScanner object returns XmlItem objects.  I call these objects
"items" rather than "nodes" because "node" to me suggests a tree view
where elements have children rather than a flat view with
start-element and end-element objects.

My XmlItem object has similar methods to your XMLNode object to return
the item type, the local name, namespace URI, QName, prefix, value
etc.  The method names are chosen based on the Infoset and XPath.

I toyed with the approach to attributes that you took, that is, having
ATTRIBUTE items following the START_ELEMENT item. This has the
advantage of being simple. However, I found it inconvenient to work
with and felt it would seem rather strange to anybody with exposure to
SAX or DOM.  So instead an XmlItem of type START_ELEMENT has
getAttribute() methods that return an XmlItem for an attribute
identified by name or index.
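
As an illustration of that attribute design, here is a cut-down,
immutable item class.  The names XmlItem, getAttribute() and the item
types come from the description above, but the exact signatures, the
factory methods and the null-on-miss behaviour are assumptions made for
this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: signatures are assumed, not copied from pullax.
final class XmlItem {
    enum Type { START_ELEMENT, ATTRIBUTE }

    private final Type type;
    private final String namespaceUri;
    private final String localName;
    private final String value;             // attribute value; null for elements
    private final List<XmlItem> attributes; // empty for attribute items

    private XmlItem(Type type, String namespaceUri, String localName,
                    String value, List<XmlItem> attributes) {
        this.type = type;
        this.namespaceUri = namespaceUri;
        this.localName = localName;
        this.value = value;
        this.attributes = attributes;
    }

    static XmlItem attribute(String namespaceUri, String localName, String value) {
        return new XmlItem(Type.ATTRIBUTE, namespaceUri, localName, value,
                Collections.<XmlItem>emptyList());
    }

    static XmlItem startElement(String namespaceUri, String localName, XmlItem... attrs) {
        // Defensive copy keeps the item immutable even if the caller's array changes.
        List<XmlItem> copy =
                Collections.unmodifiableList(new ArrayList<>(Arrays.asList(attrs)));
        return new XmlItem(Type.START_ELEMENT, namespaceUri, localName, null, copy);
    }

    Type getType() { return type; }
    String getLocalName() { return localName; }
    String getValue() { return value; }
    int getAttributeCount() { return attributes.size(); }

    // Lookup by index, in document order.
    XmlItem getAttribute(int index) { return attributes.get(index); }

    // Lookup by expanded name; returns null if there is no such attribute
    // (an assumption of this sketch).
    XmlItem getAttribute(String namespaceUri, String localName) {
        for (XmlItem a : attributes) {
            if (a.namespaceUri.equals(namespaceUri) && a.localName.equals(localName)) {
                return a;
            }
        }
        return null;
    }
}
```

An attribute is itself an XmlItem, so code that reads names and values
works the same way on elements and attributes.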

XmlItem has a getContext() method returning an XmlContext object.
This provides information about the context of the item, such as the
in-scope namespaces.  Typically, many XmlItem objects can share the
same XmlContext object.

A major challenge in designing a general-purpose XML API is to deal
with the diversity of XML applications.  At one end of the spectrum
are simple applications that need no more than elements, attributes
and text (the "holy trinity of XML" as I think David Megginson once
called them).  At the other end of the spectrum are applications such
as XML editors that want as much detail about the markup as they can
get including things like comments and entities.  Just as there is a
diversity of XML applications, so is there a diversity of XML
processors/parsers.  There are large, complex parsers like Xerces that
provide a very rich set of information but take a corresponding hit in terms
of size and speed.  There is also a need for simpler parsers that do
less but can be smaller and faster.

The solution I use in pullax is based on the "feature" concept of
SAX2.  An implementation of the pullax API implements the
XmlScannerFactory interface. By default an XmlScanner created by an
XmlScannerFactory returns exactly three types of XmlItem:
START_ELEMENT, END_ELEMENT, TEXT.  Also by default TEXT items are
maximal.  So, for example, the document

  <doc>4<!-- a silly comment -->2</doc>

will be returned as three items: a START_ELEMENT item, a TEXT item
with string value "42", and an END_ELEMENT item. If an application
wishes to see, for example, comments, it must request the SHOW_COMMENT
feature from the XmlScannerFactory before creating the XmlScanner.  If
the parser cannot satisfy the request, it must throw an exception.
XmlScannerFactory objects are designed to be dynamically discoverable
using the service provider mechanism (like JAXP).
XmlScannerFactoryFinder is a utility class that takes a set of
features and dynamically finds an XmlScannerFactory implementation
that supports those features.  This approach ensures that the support
for a rich information set in pullax does not get in the way of simple
applications or simple XML processors.
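
The default view and the effect of requesting a feature can be sketched
as follows.  Only the names SHOW_COMMENT, START_ELEMENT, END_ELEMENT and
TEXT come from the description above.  SketchFactory, its methods, the
choice of UnsupportedOperationException, and the assumption that text is
split around a comment once comments are shown are all invented for this
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

enum Feature { SHOW_COMMENT }

final class Item {
    enum Type { START_ELEMENT, END_ELEMENT, TEXT, COMMENT }
    final Type type;
    final String value;
    Item(Type type, String value) { this.type = type; this.value = value; }
}

// Hypothetical factory: tracks which features it supports and which were requested.
final class SketchFactory {
    private final Set<Feature> supported = EnumSet.noneOf(Feature.class);
    private final Set<Feature> requested = EnumSet.noneOf(Feature.class);

    SketchFactory(Set<Feature> supported) { this.supported.addAll(supported); }

    // Must be called before scanning; fails if the request cannot be satisfied.
    void require(Feature feature) {
        if (!supported.contains(feature)) {
            throw new UnsupportedOperationException("feature not supported: " + feature);
        }
        requested.add(feature);
    }

    // Turns the raw item stream into the view the application asked for.
    List<Item> scan(List<Item> raw) {
        List<Item> out = new ArrayList<>();
        StringBuilder text = null; // text accumulated since the last visible non-text item
        for (Item item : raw) {
            boolean hidden = item.type == Item.Type.COMMENT
                    && !requested.contains(Feature.SHOW_COMMENT);
            if (hidden) continue; // text on both sides keeps accumulating
            if (item.type == Item.Type.TEXT) {
                if (text == null) text = new StringBuilder();
                text.append(item.value);
                continue;
            }
            if (text != null) {
                out.add(new Item(Item.Type.TEXT, text.toString()));
                text = null;
            }
            out.add(item);
        }
        if (text != null) out.add(new Item(Item.Type.TEXT, text.toString()));
        return out;
    }
}

final class FeatureDemo {
    // Raw items for: <doc>4<!-- a silly comment -->2</doc>
    static final List<Item> RAW = Arrays.asList(
            new Item(Item.Type.START_ELEMENT, "doc"),
            new Item(Item.Type.TEXT, "4"),
            new Item(Item.Type.COMMENT, " a silly comment "),
            new Item(Item.Type.TEXT, "2"),
            new Item(Item.Type.END_ELEMENT, "doc"));

    static List<String> describe(boolean showComments) {
        SketchFactory factory = new SketchFactory(EnumSet.of(Feature.SHOW_COMMENT));
        if (showComments) factory.require(Feature.SHOW_COMMENT);
        List<String> seen = new ArrayList<>();
        for (Item item : factory.scan(RAW)) seen.add(item.type + ":" + item.value);
        return seen;
    }
}
```

With no features requested, describe(false) gives the three items
described above, including a single TEXT item "42".  With SHOW_COMMENT
requested, this sketch yields five items.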

The pullax API aims to provide a very rich information set.  As far as
the document instance is concerned, it is intended to support the
union of SAX2, DOM2 core, and the XML infoset and then some.  As far
as the DTD is concerned, pullax currently provides approximately the
same information as the union of the XML Infoset and DOM Level 2 core.
I have opted not to provide the detailed lexical information about the
DTD that SAX2 provides. It seems to me that it is not much use having
lexical information about DTDs if you lose information about parameter
entities within declarations; but dealing with parameter entities
within declarations is just too hard for a general-purpose API,
especially when you consider nested parameter entity references. I believe
DTD editor type applications really require specialized APIs and
parsers (e.g. DTDinst; see http://www.thaiopensource.com/dtdinst).

Another respect in which pullax's approach to DTDs differs from SAX is
that it represents the DOCTYPE declaration as a single item.  There
does not seem much point in breaking it down into multiple items.  Most
of the information is in the XmlDtd object which is available from the
XmlContext.  Note that the XmlDtd object is immutable.  I'm planning
to extend the API to allow straightforward DTD caching: the idea is
that a user-supplied XmlDtdResolver object will map the system id,
public id and internal subset to an XmlDtd object.
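
Since this part of the API is not written yet, the following is only a
guess at how such caching might look.  The resolve() signature, the
CachingDtdResolver wrapper and the XmlDtd stand-in class are all
assumptions made for this sketch.  The point it illustrates is that an
immutable XmlDtd can be handed to any number of documents once it has
been built.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for an immutable DTD object; safe to share between documents.
final class XmlDtd {
    final String description;
    XmlDtd(String description) { this.description = description; }
}

// Assumed shape of the planned interface.
interface XmlDtdResolver {
    XmlDtd resolve(String systemId, String publicId, String internalSubset);
}

// Wraps another resolver and remembers its results.
final class CachingDtdResolver implements XmlDtdResolver {
    private final XmlDtdResolver loader;
    private final Map<List<String>, XmlDtd> cache = new HashMap<>();

    CachingDtdResolver(XmlDtdResolver loader) { this.loader = loader; }

    public XmlDtd resolve(String systemId, String publicId, String internalSubset) {
        // The internal subset is part of the key: two documents with the same
        // external subset but different internal subsets have different DTDs.
        List<String> key = Arrays.asList(systemId, publicId, internalSubset);
        XmlDtd dtd = cache.get(key);
        if (dtd == null) {
            dtd = loader.resolve(systemId, publicId, internalSubset);
            cache.put(key, dtd);
        }
        return dtd;
    }
}

final class DtdCacheDemo {
    // Returns how many times the underlying loader ran for two identical lookups.
    static int loadsAfterTwoLookups() {
        final int[] loads = {0};
        XmlDtdResolver loader = (systemId, publicId, internalSubset) -> {
            loads[0]++;
            return new XmlDtd("parsed from " + systemId);
        };
        XmlDtdResolver cached = new CachingDtdResolver(loader);
        XmlDtd first = cached.resolve("doc.dtd", null, null);
        XmlDtd second = cached.resolve("doc.dtd", null, null);
        if (first != second) throw new IllegalStateException("cache miss on repeat lookup");
        return loads[0];
    }
}
```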

I've written too much already.  I'll be happy to answer any questions
people may have about the design and I'll try to get the API doc into
shape as soon as possible.

James