
ANN: Regular Fragmentations



Back in April I suggested that regular expressions might be a useful
tool for fragmenting XML 'molecule' content into smaller pieces which
could then be processed as 'atoms':
http://www.xml.com/pub/a/2001/04/25/deviant.html

I've finally found the time to put together an implementation of this
approach, building a SAX2 filter which uses an XML configuration file
and the regular expression support built into the Xerces parser.  As
content passes through the filter, elements identified by the
configuration file are processed and broken down into smaller elements
using rules built on regular expressions.
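
To give a feel for the general shape of such a filter, here is a minimal
sketch. It is not the code in the package: it hard-codes a single rule
instead of reading an XML configuration file, it uses java.util.regex
(Java 1.4 and later) rather than the Xerces regular expression classes,
and the class name and the element names (date, year, month, day) are
made up for illustration. It also assumes the target element contains
only text.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch of a fragmenting SAX2 filter; not the package's API.
class DateFragmenter extends XMLFilterImpl {
    // One hard-coded rule: split the text of <date> into three child elements.
    private static final Pattern RULE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
    private static final String[] PARTS = {"year", "month", "day"};
    private static final Attributes NO_ATTS = new AttributesImpl();

    // Non-null only while inside a <date> element (assumed to hold text only).
    private StringBuilder buffer;

    DateFragmenter(XMLReader parent) { super(parent); }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, local, qName, atts);
        if (uri.isEmpty() && "date".equals(local)) buffer = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (buffer != null) buffer.append(ch, start, len);   // hold text back for the rule
        else super.characters(ch, start, len);               // pass everything else through
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (buffer != null) {
            String text = buffer.toString();
            buffer = null;
            Matcher m = RULE.matcher(text);
            if (m.matches()) {
                // Emit one new element per capture group.
                for (int i = 0; i < PARTS.length; i++) {
                    String value = m.group(i + 1);
                    super.startElement("", PARTS[i], PARTS[i], NO_ATTS);
                    super.characters(value.toCharArray(), 0, value.length());
                    super.endElement("", PARTS[i], PARTS[i]);
                }
            } else {
                // Rule did not match: forward the original text unchanged.
                super.characters(text.toCharArray(), 0, text.length());
            }
        }
        super.endElement(uri, local, qName);
    }

    // Runs the filter over a string and writes the events back out as markup.
    // The writer ignores attributes and does no escaping; it is for demonstration only.
    static String fragment(String xml) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler writer = new DefaultHandler() {
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(local).append('>');
            }
            @Override public void characters(char[] ch, int start, int len) {
                out.append(ch, start, len);
            }
            @Override public void endElement(String uri, String local, String qName) {
                out.append("</").append(local).append('>');
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            DateFragmenter filter = new DateFragmenter(factory.newSAXParser().getXMLReader());
            filter.setContentHandler(writer);
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Calling DateFragmenter.fragment("<doc><date>2001-07-24</date></doc>")
should yield a date element containing year, month, and day children,
while elements without a rule pass through untouched.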

This filter is written in Java (1.3) and requires the Xerces parser.
I've released it under the Mozilla Public License (MPL) and plan to
continue developing it in the directions noted in the documentation.
This release is version 0.02 and I don't make extensive claims for its
stability, though it works quite well on the tests I've fed it.

The regular expression package in the Xerces parser is largely compliant
with the regular expression language defined in Appendix F of XML Schema
Part 2: Datatypes.  (I'm still trying to determine how much this
implementation differs from other regular expression approaches, but my
experiments are only really getting started.)  You can use the recursive
feature built into the processor to perform multiple-level fragmentation
if necessary.
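
One possible reading of multiple-level fragmentation can be sketched as
follows: each fragment produced by a rule is looked up again by its new
element name, and split further if a rule exists for it. This is only
an illustration under my own assumptions. The Rule class, the rule
table, and the element names are hypothetical, the patterns use
java.util.regex rather than the Xerces package, and the output is built
as a plain string without escaping.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of recursive fragmentation; not the package's API.
class RecursiveFragmenter {
    // A rule pairs a pattern with one element name per capture group.
    static final class Rule {
        final Pattern pattern;
        final String[] names;
        Rule(String regex, String... names) {
            this.pattern = Pattern.compile(regex);
            this.names = names;
        }
    }

    // Rules keyed by the name of the element whose text they fragment.
    private final Map<String, Rule> rules = new HashMap<>();

    RecursiveFragmenter() {
        rules.put("timestamp", new Rule("(\\S+) (\\S+)", "date", "time"));
        rules.put("date", new Rule("(\\d{4})-(\\d{2})-(\\d{2})", "year", "month", "day"));
        rules.put("time", new Rule("(\\d{2}):(\\d{2})", "hour", "minute"));
    }

    // Wraps text in <name>...</name>, recursing into each fragment that has
    // a rule of its own. Rules must not form a cycle, or this will not terminate.
    String fragment(String name, String text) {
        StringBuilder out = new StringBuilder("<" + name + ">");
        Rule rule = rules.get(name);
        Matcher m = (rule == null) ? null : rule.pattern.matcher(text);
        if (m != null && m.matches()) {
            for (int i = 0; i < rule.names.length; i++) {
                out.append(fragment(rule.names[i], m.group(i + 1)));
            }
        } else {
            out.append(text); // no rule, or the rule did not match: keep the text as-is
        }
        return out.append("</" + name + ">").toString();
    }
}
```

Here a timestamp such as "2001-07-24 14:30" is first split into date and
time, and each of those is then split again by its own rule.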

The "Regular Fragmentation" package is available from:
http://simonstl.com/projects/fragment

Documentation is still primarily javadoc, though an overview provides
examples and some explanation.  A list of planned improvements is at the
end of the overview; probably the most notable is support for attribute
content and content identification.  Currently only element content is
processed, and rules can identify elements only by name.  (The filter
is namespace-aware.)

Comments, suggestions, and contributions are welcome, either privately
or to the xml-dev mailing list.

Simon St.Laurent
Associate Editor 
O'Reilly & Associates
http://simonstl.com