xml-dev - Frontline report from the Desperate Fgrep Hacker

From: John Cowan <cowan@locke.ccil.org>
To: xml-dev@xml.org
Date: Thu, 10 Feb 2000 22:45:04 -0500 (EST)

I was talking to my boss today about processing a bunch of
XML documents to act on the value of a certain element whose
model is #PCDATA.  I had a long list of 'good' values, perhaps
5000 out of 100,000 possible values, and the question is
"Which documents have good values?"

I happened to know that the element always appeared on a single
line of the file: the start tag, the character data, the end tag.
Furthermore, the content was syntactically distinct from the
rest of the file: it had the form X(X)-NNNNN(X), which did not appear
elsewhere.

I therefore proposed preprocessing the 5000 good values into
elements, and using the GNU "fgrep" program to search the
documents for matches.
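
The preprocessing step could look roughly like the sketch below. It
assumes the good values start out one per line in a plain text file,
and that the element is called "code"; both the file name
goodvalues.txt and the element name are hypothetical, as are the two
sample values, which merely follow the X(X)-NNNNN(X) form.

```shell
# Hypothetical element name "code"; the real name is not given in this message.
# Made-up sample values in the X(X)-NNNNN(X) form, one per line.
printf '%s\n' 'AB-12345' 'C-00042X' > goodvalues.txt

# Wrap each value in a start tag and an end tag, so that fgrep later
# matches the whole single-line element rather than the bare value.
sed -e 's|.*|<code>&</code>|' goodvalues.txt > goodvalues.xml

cat goodvalues.xml
```

Matching on the tagged form rather than the raw value is what keeps a
stray occurrence of the same string in some other element from
counting as a hit.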

My boss goggled.  "No XML parser?  Won't they throw you out of the
XML Union for that?"

"Not at all!" said I.  "XML is a data (or document) representation
standard.  It does *not* dictate a particular processing model!
If it's both efficient and (sufficiently) reliable to use
a fast, stupid processing model in a particular case, nothing
in the XML environment prohibits it."

The following shell script (with some decorations) did the trick:

	cp `fgrep -l -f goodvalues.xml *` winners

which copies all files containing any of the values in "goodvalues.xml" to
the subdirectory "winners".  Lightning fast, totally accurate.
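
For what it's worth, here is one way the decorations might go; this
is a sketch under assumptions, not the script as it was actually run.
It assumes the documents live in a separate directory (hypothetically
"docs"), so that neither the pattern file nor "winners" itself gets
searched, and it uses "grep -F", the POSIX spelling of fgrep.  The
sample documents and the element name "code" are made up for the
demonstration.

```shell
# Made-up sample data: two documents and a one-line pattern file.
mkdir -p docs winners
printf '%s\n' '<doc>' '<code>AB-12345</code>' '</doc>' > docs/a.xml
printf '%s\n' '<doc>' '<code>ZZ-99999</code>' '</doc>' > docs/b.xml
printf '%s\n' '<code>AB-12345</code>' > goodvalues.xml

# -F: fixed strings, -l: list matching file names, -f: read patterns from file.
# Reading the names line by line keeps file names with spaces intact
# (names containing newlines would still break this).
grep -F -l -f goodvalues.xml docs/*.xml |
while IFS= read -r f; do
    cp "$f" winners/
done

ls winners
```

This is still a single grep invocation over all the documents, so it
should keep the speed of the one-liner.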

Just another Fgrep hacker,

-- 
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin
