OASIS Mailing List Archives


   Re: [xml-dev] Exploiting multi-core CPUs during XML parsing

On Sat, Apr 01, 2006 at 11:00:24AM +0100, Andrew S. Townley wrote:
> On Sat, 2006-04-01 at 10:28, Elliotte Harold wrote:
> > What about memory mapped files? If you can treat the file as an array, 
> > it's just as easy to move backwards through the array as forwards.
> I thought about that (specifically the use of the facilities in the nio
> package), but I figured that the original request was targeted at
> environments other than Java.  Right now, Java's direct support for
> multi-core CPUs is a little lacking (ref the parser thread from last
> week or so).  If it was Java, I agree, you could use the approach you
> suggest.

Actually memory mapped files are supported in many (most?) modern
operating systems, including Linux, *BSD, Microsoft Windows,
MacOS, OS X, Solaris, RSX11 [1], etc.

They are used at the C (or assembly or C++) level.

Writing an efficient XML parser that's as fast as possible on a
given platform generally requires platform-specific techniques,
because you need to know things like file system throughput compared
with CPU speed.  I've used systems where the network was faster
than the local hard drive, too.

But one could target a wide range of systems and still get something
faster than most of today's parsers.  For example, you could have
a namespace manager thread, a read-ahead thread (for memory mapped
files with mmap this involves accessing a byte or word in the next
block), and a main worker thread.
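
To make the read-ahead idea concrete, here is a minimal sketch in Python
(chosen only for brevity; the same structure would apply in C with mmap()
and threads). Everything in it is an assumption for illustration: the names
prefetch and count_tag_opens are hypothetical, counting "<" bytes stands in
for real parsing, the namespace manager thread is left out, and the helper
makes no attempt to stay a fixed distance ahead of the worker. In CPython
the global interpreter lock is likely to limit any real gain, so treat this
as showing the shape of the approach rather than a fast implementation.

```python
import mmap
import threading


def prefetch(mm, block_size, stop):
    """Read-ahead helper: touch one byte per block so the OS pages it in."""
    for offset in range(0, len(mm), block_size):
        if stop.is_set():
            break
        _ = mm[offset]  # the access is the point; the value is discarded


def count_tag_opens(path, block_size=mmap.PAGESIZE):
    """Stand-in for the main worker: count '<' bytes block by block."""
    # Note: mapping an empty file raises ValueError; not handled here.
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        stop = threading.Event()
        helper = threading.Thread(target=prefetch, args=(mm, block_size, stop))
        helper.start()
        try:
            count = 0
            for start in range(0, len(mm), block_size):
                count += mm[start:start + block_size].count(b"<")
            return count
        finally:
            # Stop and join the helper before the mapping is closed.
            stop.set()
            helper.join()
```

Calling count_tag_opens("some.xml") (a hypothetical file name) returns the
number of "<" bytes in the file.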

Reading files backwards is actually reasonably efficient on most
Unix-like systems, by the way -- they have had a block-level file
system cache for the better part of 30 years.
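
For what it's worth, walking backwards through a mapped file looks much
like walking forwards. The following is a small Python sketch under stated
assumptions: the function name angle_offsets_backwards is hypothetical, and
finding "<" bytes is only a stand-in for whatever a real parser would look
for when scanning from the end.

```python
import mmap


def angle_offsets_backwards(path):
    """Yield byte offsets of '<' in the file, from the end to the start."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        end = len(mm)
        while True:
            # rfind searches mm[0:end] and returns the highest match, or -1.
            pos = mm.rfind(b"<", 0, end)
            if pos == -1:
                return
            yield pos
            end = pos  # continue the search strictly before this match
```

The mapping is treated as a plain byte array, so the only difference from
a forward scan is which end the search starts from.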

Multiple cooperative threads reading forwards is probably easier to
write, and since a single page fault is likely to last far longer
than the time to parse a block (e.g. 512 bytes or 4K, depending on
the system) of XML, readahead is more effective.  Some systems will
do readahead for you automatically when you access a file sequentially.
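
Where the system supports it, a program can also state its intent rather
than rely on the kernel noticing the access pattern. The sketch below is an
assumption on my part, not something the discussion above prescribes: it
uses the madvise hint exposed by Python's mmap module (Python 3.8 or later,
and only on platforms that provide it), and the function name
map_for_sequential_read is hypothetical. The hint is advisory; the system
is free to ignore it.

```python
import mmap


def map_for_sequential_read(f):
    """Map an open binary file read-only and hint at sequential access."""
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Both the method and the constant are platform-dependent, so check first.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
        mm.madvise(mmap.MADV_SEQUENTIAL)
    return mm
```

The caller is responsible for closing the returned mapping.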

> The information in this email is confidential and may be legally
> privileged. Access to this email by anyone other than the intended
> addressee is unauthorized. If you are not the intended recipient
> of this message, any review, disclosure, copying, distribution,
> retention, or any action taken or omitted to be taken in reliance on
> it is prohibited and may be unlawful. If you are not the intended
> recipient, please reply to or forward a copy of this message to the
> sender and delete the message, any attachments, and any copies thereof
> from your system.

I usually don't reply to personal messages containing these disclaimers,
since I have in fact no way to know if I am the intended recipient,
or if the sender typed my name but was really thinking of someone else.

But on a public list, either this message should be deleted from the
archives or the terms are meaningless.  I tend towards the latter.

If you think it might contain confidential information, don't post
it to a public list.


Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin/


