- From: John Cowan <firstname.lastname@example.org>
- To: email@example.com
- Date: Thu, 10 Feb 2000 22:45:04 -0500 (EST)
I was talking to my boss today about processing a bunch of
XML documents to act on the value of a certain element whose
model is #PCDATA. I had a long list of 'good' values, perhaps
5000 out of 100,000 possible values, and the question was
"Which documents have good values?"
I happened to know that the element always appeared on a single
line of the file: the start tag, the character data, the end tag.
Furthermore, the content was syntactically distinct from the
rest of the file: it had the form X(X)-NNNNN(X), which did not appear elsewhere.
I therefore proposed preprocessing the 5000 good values into
elements, and using the GNU "fgrep" program to search the
documents for matches.
My boss goggled. "No XML parser? Won't they throw you out of the
XML Union for that?"
"Not at all!" said I. "XML is a data (or document) representation
standard. It does *not* dictate a particular processing model!
If it's both efficient and (sufficiently) reliable to use
a fast, stupid processing model in a particular case, nothing
in the XML environment prohibits it."
The following shell script (with some decorations) did the trick:
    cp `fgrep -l -f goodvalues.xml *` winners
which copies all files containing any of the values in "goodvalues.xml" to the
subdirectory "winners". Lightning fast, totally accurate.
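
For anyone who wants to try the idea end to end, here is a minimal sketch of the
whole pipeline, run in a throwaway directory. It is not the original script: the
element name "partno", the file "goodvalues.txt", the "docs" directory, and all
sample values are made up for illustration. It uses "grep -F", which is the POSIX
spelling of fgrep, and copies files one at a time so that odd file names do not
break the command substitution.

```shell
#!/bin/sh
# Sketch only: the element name "partno" and all sample data are hypothetical.
set -eu
demo=$(mktemp -d)
cd "$demo"
mkdir docs winners

# Two tiny sample documents; the element sits on a single line, as in the story.
printf '<doc>\n<partno>AB-12345C</partno>\n</doc>\n' > docs/a.xml
printf '<doc>\n<partno>ZZ-99999</partno>\n</doc>\n' > docs/b.xml

# The "good" values, one per line.
printf 'AB-12345C\nQ-00001\n' > goodvalues.txt

# Wrap each value in its start and end tags so only whole elements can match.
sed 's|.*|<partno>&</partno>|' goodvalues.txt > goodvalues.xml

# -F: fixed strings, no regexes; -l: list matching file names;
# -f: read the patterns from a file.
grep -F -l -f goodvalues.xml docs/*.xml | while IFS= read -r f; do
    cp "$f" winners/
done

ls winners
```

Wrapping each value in its tags is what keeps the match honest: a bare value could
in principle turn up in some other element, while the full tagged line cannot. The
shortcut still depends on the element always occupying one line with no stray
whitespace or attributes, which is exactly the assumption stated above.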
Just another Fgrep hacker,
John Cowan firstname.lastname@example.org
I am a member of a civilization. --David Brin