   Re: XML tools and big documents

  • From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
  • To: xml-dev@ic.ac.uk
  • Date: 03 Sep 1998 17:07:55 +0100

Nigel Kerr <nigelk@umich.edu> writes:

> 	"what's the most immediate containing element of offset X in
> 	file Y?"
> 
> 	"traverse up the logical structure from offset X until a DIV
> 	element with a HEAD is found, and return me the offsets of
> 	that HEAD"
> 
> Exact expression language is, uh, gee.  These are the kinds of
> questions we could ask with "some XML query language", but if i have a
> gigabyte or so of variously-structured English text marked up this
> way, i really don't want to have to parse the document entity just to
> answer these kinds of simple questions.  This is a weak specification
> of what I'm trying to do, i realize.  (this all largely because i am

Our LT XML tool set and API were designed for precisely this sort of
application (we regularly work with SGML-encoded language corpora of
more than 1GB, such as the BNC).  We get good performance because:

1) Our parser is written in C, and our search and retrieval tools use
   it directly via a stream-based API; only custom UI tends to get
   written in a scripting language which looks at whole trees;

2) We only produce tree fragments when we get to the interesting bits:
   our query processor is optimised to avoid building large amounts of
   tree unnecessarily;

3) For REALLY big datasets, we do produce and use offset-based
   indices.
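
To make the stream-based idea concrete, here is a minimal sketch of
answering "what's the most immediate containing element of offset X?"
without building a tree.  It is not the LT XML API: it uses Python's
standard expat bindings as a stand-in, and the names
innermost_element_at and _Found are hypothetical.  It also assumes a
simple convention that an element extends from the start of its start
tag up to (not including) the start of its end tag.

```python
import io
import xml.parsers.expat


class _Found(Exception):
    """Raised from a handler to stop parsing once the answer is known."""


def innermost_element_at(stream, target, chunk_size=65536):
    """Return (name, start_offset) of the innermost element that is open
    at byte offset `target`, or None if no element is open there.

    Only a stack of open elements is kept; no tree is ever built.
    """
    parser = xml.parsers.expat.ParserCreate()
    stack = []  # (element name, byte offset of its start tag)

    def stop_if_past_target():
        # The first tag that begins beyond the target tells us that the
        # element on top of the stack is the one containing the target.
        if parser.CurrentByteIndex > target:
            raise _Found(stack[-1] if stack else None)

    def on_start(name, attrs):
        stop_if_past_target()
        stack.append((name, parser.CurrentByteIndex))

    def on_end(name):
        stop_if_past_target()
        stack.pop()

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end

    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                parser.Parse(b"", True)  # signal end of input
                break
            parser.Parse(chunk, False)
    except _Found as found:
        return found.args[0]
    return None  # target lies beyond the last tag


doc = b"<doc><div><head>Title</head><p>Some text</p></div></doc>"
print(innermost_element_at(io.BytesIO(doc), 33))
```

Parsing stops as soon as the first tag past the target is seen, so the
cost is proportional to X rather than to the size of the file.  An
offset-based index would avoid even that prefix scan, at the price of
building and storing the index up front.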

For more information, see http://www.ltg.ed.ac.uk/software/xml/.

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/
