XML parsing @ 100MB-1000MB/sec/GHz with Parallel Bit Streams

I am pleased to announce the availability of parabix-0.40, a high-performance
XML parsing engine prototype that can parse text-oriented XML documents
on commodity processors at over 200MB/sec per processor GHz and
data-oriented XML documents at speeds approaching that. At this point,
this includes correct parsing of correct documents and dispatch to markup
action routines using an in-line API for XML (ilax). As the parabix stack
is built out to incorporate validation and object creation, I am expecting
overall performance above 100MB/sec/GHz. With linear speed-up on
multicore processors and other improvements, 1000MB/sec/GHz is

By way of comparison, XML Screamer (Kostoulas et al., WWW 2006) performs
parsing, validation and business object creation on commodity processors at
the rate of 23-46 MB/sec per processor GHz (MB/sec/GHz), a substantial
increase over the cited rate of 2.5-6 MB/sec/GHz for traditional validating

This is very good performance for traditional character-at-a-time parsing,
taking advantage of a collection of techniques such as optimization
across layers and schema-based customization.  As a benchmark, 
100 MB/sec/GHz is cited as the limit on throughput achievable for a
simple character-at-a-time scanning loop.
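
To make the MB/sec/GHz unit concrete, here is a minimal sketch of the
normalization, assuming 1 MB = 10^6 bytes (the figures above do not say
which convention is used). The function names and the 3 GHz clock in the
usage lines are hypothetical, chosen only for illustration. The second
function restates a rate as processor cycles spent per input byte.

```python
def mb_per_sec_per_ghz(num_bytes, seconds, clock_ghz):
    """Throughput normalized by clock rate (assumes 1 MB = 10**6 bytes)."""
    return (num_bytes / 1e6) / seconds / clock_ghz


def cycles_per_byte(rate_mb_s_ghz):
    """Convert a MB/sec/GHz rate into processor cycles per input byte.

    1 GHz is 10**9 cycles/sec and 1 MB/sec is 10**6 bytes/sec, so
    cycles per byte = 10**9 / (rate * 10**6) = 1000 / rate.
    """
    return 1000.0 / rate_mb_s_ghz


# Hypothetical run: 600 million bytes parsed in 1 second on a 3 GHz core.
rate = mb_per_sec_per_ghz(600_000_000, 1.0, 3.0)
print(rate, "MB/sec/GHz =", cycles_per_byte(rate), "cycles per byte")
```

Under this assumption, 100 MB/sec/GHz corresponds to 10 cycles per byte
and 1000 MB/sec/GHz to 1 cycle per byte.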

My research is investigating the development of very high-speed text
processing based on a fundamentally new approach:  using parallel bit
streams to represent character data and the SIMD processor capabilities
of commodity CPUs to process these bit streams.
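
As a rough illustration of the idea (not the parabix code itself), the
sketch below transposes a byte sequence into eight parallel bit streams,
one per bit position, and then locates every occurrence of a given
character using only bitwise logic on whole streams. Python integers
stand in for SIMD registers here, and the function names (transpose,
match_byte, positions) are hypothetical. The transposition loop in this
sketch is itself byte-at-a-time; a real implementation would perform it
with SIMD instructions.

```python
def transpose(data):
    """Return 8 bit streams for a byte string.

    Stream k holds bit k of every byte (k = 0 is the most significant
    bit). Bit i of each stream corresponds to data[i].
    """
    streams = [0] * 8
    for i, byte in enumerate(data):
        for k in range(8):
            if (byte >> (7 - k)) & 1:
                streams[k] |= 1 << i
    return streams


def match_byte(streams, value, length):
    """Bit stream with a 1 at every position whose byte equals value."""
    all_ones = (1 << length) - 1
    result = all_ones
    for k in range(8):
        if (value >> (7 - k)) & 1:
            result &= streams[k]               # this bit must be 1
        else:
            result &= ~streams[k] & all_ones   # this bit must be 0
    return result


def positions(stream):
    """List the positions of the 1 bits in a stream."""
    return [i for i in range(stream.bit_length()) if (stream >> i) & 1]


text = b"<a>b</a>"
streams = transpose(text)
lt = match_byte(streams, ord("<"), len(text))
gt = match_byte(streams, ord(">"), len(text))
print(positions(lt), positions(gt))
```

Each call to match_byte performs eight stream-wide logical operations
regardless of how many positions the streams cover, which is the source
of the parallelism: with 128-bit registers, one such operation covers
128 character positions at once.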

I first applied these techniques to the problem of UTF-8 to
UTF-16 transcoding, achieving end-to-end speed-ups of 3X to 25X
compared with standard iconv and similar implementations. The
open source implementation of u8u16 is available at
http://u8u16.costar.sfu.ca/ and the results have just been
presented at ACM PPoPP 2008 in Salt Lake City.

Parabix (parallel bit streams for XML) is a research prototype that is
nevertheless being designed to become the basis for a full XML
processing stack.  The working code repository is now available
as an open source code base under OSL 3.0.   

I am hoping to accelerate development of parabix technology through the
open source model as well as continuing the academic research project
with a team of graduate students who are coming up to speed.    I have
also created a spin-off company to oversee commercial development
of the technology.

However, in the context of discussion of XML performance issues and
the next ten years of development of XML technology, I think that
the work is sufficiently well advanced to support the following advice:

Do not assume that XML processing performance is inherently limited
by the nature of present-day character-at-a-time parsing technology.
Intraregister and intrachip parallelism hold out a realistic promise of
dramatic performance improvement on commodity processors.

Robert D. Cameron, Ph.D.
Professor of Computing Science, Simon Fraser University
President and CTO, International Characters, Inc.
