   Re: [xml-dev] Processing huge XML files


Hi all,

Thank you very much for your valuable advice. Let me give you more
information about my case. We need high-performance access to the
various parsed data values in the file, though we do know the access
patterns for our specific application in advance (that is, access is
not totally random). So an efficient persistent data structure for the
parsed XML data would be valuable. Ideally, the data structure would be
generic enough (with respect to the XML schema and the access patterns)
to support fast data access. At a minimum, we are looking for a way to
implement a data structure customized for a specific XML schema and a
defined access pattern. I am looking at the different technologies that
some of you have suggested. Other suggestions are most welcome.
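
To make the "customized persistent data structure" idea concrete, here
is a minimal sketch in Python, not a definitive design. It assumes a
hypothetical schema in which every record is a direct child of the root
element and carries a unique key attribute; the names record, id,
build_index and lookup are illustrative only. It streams the file once
and stores each record in an SQLite table keyed for the lookup we expect.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_index(xml_path, db_path, record_tag="record", key_attr="id"):
    """Stream xml_path once and persist one row per record element.

    Assumption: records are direct children of the root element.
    record_tag and key_attr are hypothetical names for illustration.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, xml TEXT)"
    )
    root, depth = None, 0
    for event, elem in ET.iterparse(xml_path, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first start event is the root element
            depth += 1
            continue
        depth -= 1
        if depth == 1:  # a direct child of the root has just closed
            if elem.tag == record_tag:
                conn.execute(
                    "INSERT OR REPLACE INTO records VALUES (?, ?)",
                    (elem.get(key_attr), ET.tostring(elem, encoding="unicode")),
                )
            root.clear()  # discard processed children to bound memory use
    conn.commit()
    conn.close()


def lookup(db_path, key):
    """Return the stored record as an Element, or None if the key is absent."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT xml FROM records WHERE key = ?", (key,)
    ).fetchone()
    conn.close()
    return ET.fromstring(row[0]) if row else None
```

After build_index("data.xml", "index.db") has run once, later runs can
call lookup("index.db", some_key) without re-parsing the large file. A
different access pattern would call for different columns or indexes.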

Thanks again,

Rick Jelliffe wrote:
> From: "Michael Kay" <michael.h.kay@ntlworld.com>
>>But really, when you get above 50Mb or so, you need to start looking at
>>XML databases. 
> Another approach is to use streaming languages such as Perl and OmniMark
> (and, I guess, Python?), especially if you are not updating the data,
> just extracting information.
> Of course, you may need to take several passes.  And you may need to
> have one pass of the data generate a program to be used for the next
> pass, a venerable technique that is often overlooked.  But multiple
> passes with streaming languages are the way that many large-scale
> publishing systems work.  A lot can depend on whether your document
> has an order that is amenable to your application: storing metadata
> and keys before the data in particular. 
> A very typical way of constructing streaming programs on large 
> data sets is to do two passes:
>   1) Run over the data and extract all information that will be needed for 
>     decisions that otherwise require random access or lookahead.
>   2) Run over the data and perform the extractions/analysis, using the
>     decision points. 
> Cheers
> Rick Jelliffe
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
> The list archives are at http://lists.xml.org/archives/xml-dev/
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>
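
The two-pass approach Rick describes above might be sketched in Python
roughly as follows. This is only an illustration under assumptions: the
order and customer elements and their attributes are a hypothetical
schema, with the orders appearing before the customer data needed to
decide which orders to keep, and every such element is assumed to be a
direct child of the root.

```python
import xml.etree.ElementTree as ET


def stream(xml_path, tag):
    """Yield each direct child of the root named tag, freeing memory as we go."""
    root, depth = None, 0
    for event, elem in ET.iterparse(xml_path, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first start event is the root element
            depth += 1
            continue
        depth -= 1
        if depth == 1:  # a direct child of the root has just closed
            if elem.tag == tag:
                yield elem
            root.clear()  # discard processed children to bound memory use


def two_pass(xml_path):
    # Pass 1: collect the information that would otherwise need lookahead.
    active = {
        c.get("id")
        for c in stream(xml_path, "customer")
        if c.get("status") == "active"
    }
    # Pass 2: perform the extraction, consulting the decisions from pass 1.
    return [
        o.get("id")
        for o in stream(xml_path, "order")
        if o.get("customer") in active
    ]
```

Neither pass holds more than one top-level element in memory at a time;
only the set of keys gathered in pass 1 has to fit in memory.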

   Thomas Y.T. LEE
   Chief Technology Officer
   Center for E-Commerce Infrastructure Development (CECID)
   Department of Computer Science and Information Systems
   The University of Hong Kong
   E-mail: ytlee@cecid.hku.hk  URL: http://www.cecid.hku.hk
   Tel: +852 22415388  Fax: +852 25474611
   Room 301, Chow Yei Ching Building
   Pokfulam Road, Hong Kong SAR, China



Copyright 2001 XML.org. This site is hosted by OASIS