OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.


Help: OASIS Mailing Lists Help | MarkMail Help



   Re: [xml-dev] Exploiting multi-core CPUs during XML parsing

[ Lists Home | Date Index | Thread Index ]
  • To: xml-dev@lists.xml.org
  • Subject: Re: [xml-dev] Exploiting multi-core CPUs during XML parsing
  • From: Tatu Saloranta <cowtowncoder@yahoo.com>
  • Date: Sat, 1 Apr 2006 14:23:52 -0800 (PST)
  • Domainkey-signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Subject:To:In-Reply-To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=j4PpXQoNO6c/ePCS65EZJQcxYukAAi01isFOqVAWm2OVoimD8qJJ30d6oinDDWN/oSC8eXsJoWaQD3UN0o0Td3o73RqJdIZnwFBdSvq44bUFsdzEWD3gj2i+K8B3qmVzPVIlTmVAZC6DbUe9hHuC1Cvq+MAxnWunEOsSERfrxGk= ;
  • In-reply-to: <442E2A2A.1070003@propylon.com>

--- Sean McGrath <sean.mcgrath@propylon.com> wrote:

> I have sketched out an algorithm for fast XML WF
> parsing utilising two 
> threads that each start at opposite ends of the
> octet-stream and meet in 
> the middle. The algorithm hinges on the fact that
> start- and end-tags 
> are balanced. i.e. as one thread reads forward
> looking for foo 
> start-tag, the other thread is reading backwards
> looking for foo end-tag.
> This also has the nice side effect of giving you
> accurate error messages 
> quickly. i.e. as soon as a mismatched tag is found,
> it can be reported. 

Keep in mind that in general you have lots of
subtrees, so you can not really assume that one thread
only matches start elements, and the other end
elements. So you don't really know which start tag
matches which end tag, before parsing the whole file.

> This is particularly useful with recursive element
> types.

This could work for a limited subset of XML, but there
are a few gotchas. Some problems you may face are:

* Namespace resolution pretty much has to be done in
document order
* Entity expansion is tricky to do; you need to
backtrack (from reverse reader) when hitting a '&'.
 Also, when resolving external entities, you may have
to read the whole external entity in-memory first, to
find the end (from reverse reader).
* CDATA sections need special handling. Fortunately,
]]> is not allowed anywhere in textual content, so you
can match that. However, more serious problem is how
to match the opening delimited, since that is allowed
to be repeated in CDATA, like:
"<![CDATA[<![CDATA[<![CDATA ...]]>"
* Processing instructions (like CDATA) are a pain to
parse, since they can have as many start markers as
they want ("<? proc instr <? <? <? <? ... .?>").
* Comments are quite easy, fortunately, as they can
not contain '--'.
* Handling of internal subset probably has to be done
in forward (non-reverse) order. Not a huge issue
(since it's near the start, but something to keep in

Also, another question is how are you planning to
combine the results? I assume you'd build a DOM
(-like) result tree, since it's not easy to think of a
useful stream abstraction.

Anyway, it may still be a useful exercise in figuring
out how to do things, although chances are there may
not be significant speedup in the end ;-)

-+ Tatu +-

Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 


News | XML in Industry | Calendar | XML Registry
Marketplace | Resources | MyXML.org | Sponsors | Privacy Statement

Copyright 2001 XML.org. This site is hosted by OASIS