I've written a bit of Java code that reports XML documents as a series
of text-based events with a context object reflecting the structure of
the document so far. It's not technically an XML parser, but it is
designed to be something on which you could build an XML parser or even
an XML editor.
More information is available at:
http://simonstl.com/projects/gorille/
Ripper, DocProcI, and ContextI are the most relevant bits, and
RipperTest provides a command-line interface. Ripper is designed to
report every character in an XML document, from the XML declaration to
the DOCTYPE (which it doesn't process) to spaces and quote styling
inside tags to entity references to whitespace and comments at the end
of the document. This approach should make it easier to perform minimal
transformations which preserve as much of an original document as
possible, as well as custom entity handlers and character testing.
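
As a rough illustration of the "report every character" idea, here is a
minimal sketch. It is not Ripper's code and does not use the DocProcI or
ContextI interfaces; the class name LosslessScanner, the Event type, and
the kind labels are all hypothetical. It only shows that a scanner can
hand out markup and text as separate events while keeping the exact
source characters, so the events can be concatenated back into the
original document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the Ripper API: splits a document into lossless events. */
class LosslessScanner {

    /** One reported chunk: a kind label plus the exact source characters. */
    static final class Event {
        final String kind;
        final String text;
        Event(String kind, String text) { this.kind = kind; this.text = text; }
    }

    /**
     * Scans left to right; every character lands in exactly one event.
     * Simplifications: the XML declaration is labelled "pi", and CDATA
     * sections, DOCTYPE internal subsets, and '>' inside attribute values
     * are not recognised, so they may be split or mislabelled. Nothing is
     * ever dropped, though.
     */
    static List<Event> scan(String doc) {
        List<Event> events = new ArrayList<>();
        int i = 0;
        while (i < doc.length()) {
            String kind;
            int end;
            if (doc.startsWith("<!--", i)) {
                kind = "comment";
                end = endAfter(doc, "-->", i + 4);
            } else if (doc.startsWith("<?", i)) {
                kind = "pi";
                end = endAfter(doc, "?>", i + 2);
            } else if (doc.charAt(i) == '<') {
                kind = "tag";
                end = endAfter(doc, ">", i + 1);
            } else if (doc.charAt(i) == '&') {
                kind = "entityRef";
                end = endAfter(doc, ";", i + 1);
            } else {
                kind = "text";
                end = i;
                while (end < doc.length()
                        && doc.charAt(end) != '<' && doc.charAt(end) != '&') {
                    end++;
                }
            }
            // Whitespace and quote style inside tags survive because the raw slice is kept.
            events.add(new Event(kind, doc.substring(i, end)));
            i = end;
        }
        return events;
    }

    /** Index just past the terminator, or end of input if the terminator is missing. */
    private static int endAfter(String doc, String terminator, int from) {
        int at = doc.indexOf(terminator, from);
        return at < 0 ? doc.length() : at + terminator.length();
    }

    /** Concatenates the events back into a document. */
    static String rebuild(List<Event> events) {
        StringBuilder sb = new StringBuilder();
        for (Event e : events) {
            sb.append(e.text);
        }
        return sb.toString();
    }
}
```

A minimal transformation under this scheme would replace the text of one
event and pass every other event through untouched before calling
rebuild.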
More details on why I did this and project status follow, if you're
interested. I'll also be presenting on this project at XML Europe in
May.
-----------------------------------
About four years ago I wrote an article called "Toward A Layered Model
for XML" [1]. At the time I was inspired by a variety of problems that
XML 1.0 and Namespaces in XML had created for XML 1.0 [2]. Breaking
down the parsing process into a series of smaller and better-defined
parts seemed like a possible answer to a number of complex problems.
[1] - http://simonstl.com/articles/layering/layered.htm
[2] - http://simonstl.com/articles/interop/
More recently, I've been exploring character entity processing in the
absence of a DTD [3], as well as a new problem that arose with XML 1.1,
the prospect of different rules for the characters in XML components
[4]. Both of these issues are closely tied to the parsing process, and
fixing them is difficult without writing a whole new parser.
[3] - http://simonstl.com/projects/ents/
[4] - http://simonstl.com/projects/gorille/
While SAX2, DOM, and a variety of other APIs provide access to document
information, these APIs are designed rather explicitly around the
expectation that the document will have already been parsed. For a
variety of questionable reasons I took the long way around and created
an API, Markup Object Events (MOE) [5], that was capable of storing
information in a parsed but not completely processed form. Things like
entity boundaries, CDATA sections, and additional metadata can all be
stored in this framework.
[5] - http://simonstl.com/projects/moe/
Unfortunately for MOE, there doesn't seem to be much of an audience for
Java events that can be combined into object models and vice-versa;
various tools for SAX2, DOM, and other frameworks already have that
covered. Just as important, there were no parsers around that could
provide MOE with the level of content it was capable of storing. It's
nice to be able to keep track of entities used in attribute values, but
since parsers squash them into simple strings anyway, there hasn't been
much point.
The next piece of the puzzle was the Tiny API for Markup (TAM) [6],
which included a J2ME MIDP 1.0 parser. It skipped the DOCTYPE
declaration completely, so it wasn't an XML parser, and it turns out I
forgot to implement CDATA sections anyway. In any event, while TAM
provided a simplified SAX-like view of parsed documents, it provided a
foundation on which later parsing work could build.
[6] - http://simonstl.com/projects/tam/
The latest piece of the puzzle is part of the Gorille package but
builds on the TAM work. As a test project building on J2ME code, it
isn't the loveliest programming, but so far it does appear to work. Most
of the information on what "Ripper" produces is presently in two
javadoc files, one covering the DocProcI interface [7] and one covering
the ContextI interface [8]. The parser feeds both interfaces with
information, sending a raw text view to DocProcI and a more Infoset-like
tree view to ContextI.
[7] - http://simonstl.com/projects/gorille/docs/com/simonstl/gorille/DocProcI.html
[8] - http://simonstl.com/projects/gorille/docs/com/simonstl/gorille/ContextI.html
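
To make the split between a raw text view and a structural view more
concrete, here is a small sketch of the second half only. It does not
reproduce ContextI (the javadoc above is the reference for that); the
class ContextSketch and its methods are hypothetical. It assumes some
scanner hands it the raw text of each tag, and it keeps a stack of open
element names so a consumer can ask where in the document a given piece
of text sits.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch, not the ContextI interface: tracks open elements from raw tags. */
class ContextSketch {

    // Root element first, innermost open element last.
    private final Deque<String> open = new ArrayDeque<>();

    /**
     * Updates the stack from one raw tag (for example "<p  class='x'>") and
     * returns the element path that applies to that tag. No well-formedness
     * checking is done: an end tag simply closes the innermost open element.
     */
    String onTag(String rawTag) {
        if (rawTag.startsWith("</")) {
            String path = path();
            if (!open.isEmpty()) {
                open.removeLast();
            }
            return path;
        }
        open.addLast(nameOf(rawTag));
        String path = path();
        if (rawTag.endsWith("/>")) {
            open.removeLast(); // empty-element tag opens and closes at once
        }
        return path;
    }

    /** Current path, such as "/doc/p", or "/" when nothing is open. */
    String path() {
        return "/" + String.join("/", open);
    }

    /** Number of currently open elements. */
    int depth() {
        return open.size();
    }

    /** Reads the element name that follows the opening '<' of a start tag. */
    private static String nameOf(String rawTag) {
        int end = 1;
        while (end < rawTag.length()) {
            char c = rawTag.charAt(end);
            if (Character.isWhitespace(c) || c == '/' || c == '>') {
                break;
            }
            end++;
        }
        return rawTag.substring(1, end);
    }
}
```

Because the tracker only reads the raw tag text and never rewrites it,
the character-level view stays untouched while the structural view is
built alongside it.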
My initial tests with this simple processor have shown that it's
possible to parse a document and preserve every character in it, which
is a rather expensive reinvention of the Unix cat command. Perhaps more
promising is the hope that developers can build tools which combine
textual awareness and an understanding of markup context on top of this
framework. I need to build unit tests for various pieces of the parser
and the context objects, as well as exercise the parser on a greater
variety of cases. Currently the parser only works on UTF-8 documents,
at least without intervention in Java.
Future concrete work will focus on creating layers on top of these
interfaces which integrate with the surrounding Gorille work as well as
Ents. A DOCTYPE processor which can modify both the character and
context objects will hopefully follow, as will a consumer that turns
these events into SAX2 events and MOE events.
There's a lot to do yet, and it'll be a while coming, but hopefully what
I've done might at least make other folks consider what's possible
rather than just what's easy today.
--
Simon St.Laurent
Ring around the content, a pocket full of brackets
Errors, errors, all fall down!
http://simonstl.com -- http://monasticxml.org