   Frontline report from the Desperate Fgrep Hacker

  • From: John Cowan <cowan@locke.ccil.org>
  • To: xml-dev@xml.org
  • Date: Thu, 10 Feb 2000 22:45:04 -0500 (EST)

I was talking to my boss today about processing a bunch of
XML documents to act on the value of a certain element whose
model is #PCDATA.  I had a long list of 'good' values, perhaps
5000 out of 100,000 possible values, and the question is
"Which documents have good values?"

I happened to know that the element always appeared on a single
line of the file: the start tag, the character data, the end tag.
Furthermore, the content was syntactically distinct from the
rest of the file: it had the form X(X)-NNNNN(X), which did not appear
elsewhere.
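
Both assumptions (one line per element, values found nowhere else) can be checked before relying on them. The following is only a sketch: it reads X as a letter and N as a digit, which is a guess at the notation, and "code" in the usage note is a made-up stand-in for the real element name, which is not given here. The -H option is assumed to be available, as it is in GNU grep.

```shell
# stray_values ELEMENT FILE...
# Hypothetical helper: print every line that contains something shaped like
# a value (X(X)-NNNNN(X), with X read as a letter and N as a digit) but that
# does not carry it as <ELEMENT>value</ELEMENT> on that same line.
stray_values() {
    elem=$1
    shift
    value='[A-Za-z]{1,2}-[0-9]{5}[A-Za-z]?'
    # -H always prints the file name, -n the line number.
    grep -E -n -H -- "$value" "$@" | grep -E -v -- "<$elem>$value</$elem>"
}
```

Running `stray_values code *.xml` and getting no output suggests the shortcut is safe for that set of files; any line it prints deserves a look first.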

I therefore proposed preprocessing the 5000 good values into
elements, and using the GNU "fgrep" program to search the
documents for matches.
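
The preprocessing step might look something like the sketch below. It assumes the good values sit one per line in a plain file; the file name goodvalues.txt and the element name "code" used in the usage note are invented for illustration.

```shell
# wrap_values ELEMENT
# Hypothetical helper: read one value per line on standard input and write
# each one wrapped as <ELEMENT>value</ELEMENT>, ready to serve as a
# fixed-string pattern. In the sed replacement, & stands for the matched line.
wrap_values() {
    sed "s|.*|<$1>&</$1>|"
}
```

For example, `wrap_values code < goodvalues.txt > goodvalues.xml`. Wrapping each value in its tags means a match can only occur where the value is the element's entire content. Values of the form X(X)-NNNNN(X) contain no "&" or "<", so no escaping is attempted.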

My boss goggled.  "No XML parser?  Won't they throw you out of the
XML Union for that?"

"Not at all!" said I.  "XML is a data (or document) representation
standard.  It does *not* dictate a particular processing model!
If it's both efficient and (sufficiently) reliable to use
a fast, stupid processing model in a particular case, nothing
in the XML environment prohibits it."

The following shell script (with some decorations) did the trick:

	cp `fgrep -l -f goodvalues.xml *` winners

which copies all files containing any of the values in "goodvalues.xml" to the
subdirectory "winners".  Lightning fast, totally accurate.
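
For anyone wanting to reuse the idea, here is a more defensive sketch of the same one-liner. It is not the decorated script mentioned above, just one way to cope with file names containing spaces and with directories caught by the glob. The function name and the docs/ directory in the usage note are hypothetical; `grep -F` is the POSIX spelling of fgrep.

```shell
# copy_winners PATTERNS DEST FILE...
# Hypothetical helper: copy into DEST every regular FILE that contains at
# least one of the fixed strings listed, one per line, in PATTERNS.
copy_winners() {
    patterns=$1
    dest=$2
    shift 2
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        # Skip directories (such as DEST itself) and anything else odd.
        [ -f "$f" ] || continue
        # -F: fixed strings, -q: stop at the first hit, -f: read patterns.
        if grep -F -q -f "$patterns" -- "$f"; then
            cp -- "$f" "$dest"/ || return 1
        fi
    done
}
```

A call such as `copy_winners goodvalues.xml winners docs/*.xml` keeps the pattern file out of the set being searched; if it sat among the documents it would match itself and be copied too. Note that an empty line in the pattern file matches every file.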

Just another Fgrep hacker,

-- 
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin


Copyright 2001 XML.org. This site is hosted by OASIS