OASIS Mailing List Archives

   Bush Donor Lists in XML

  • From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
  • To: XML-Dev Mailing list <xml-dev@ic.ac.uk>
  • Date: Sun, 12 Sep 1999 17:50:20 -0400

On Friday, Governor George W. Bush of Texas posted complete records
of his campaign contributions on his web site. However, he
deliberately posted them in PDF format so they couldn't be
imported into a database or a spreadsheet, and consequently
reporters and voters couldn't find out just how much of his
money was coming from whom. Or at least that's what he thought. :-)

I am pleased to announce that, after a few hours of intense
hacking, I have succeeded in extracting the crucial information
from the PDF files and have posted it online in XML and tab-delimited
formats for anybody who wants it. Accountants,
start your spreadsheets! You'll find the files at

http://metalab.unc.edu/javafaq/bush/
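
For anyone curious how tab-delimited rows map onto XML, here is a minimal
sketch in Python. It is not the converter I actually used, and the column
names and the donations/donation element names are hypothetical; the real
files and donations.dtd may well use different ones.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column names; the real files and donations.dtd may differ.
FIELDS = ["name", "city", "state", "amount"]

def tsv_to_xml(tsv_text):
    """Convert tab-delimited donor rows into a <donations> XML string."""
    root = ET.Element("donations")
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        donation = ET.SubElement(root, "donation")
        # Pair each value with its (assumed) column name.
        for field, value in zip(FIELDS, row):
            ET.SubElement(donation, field).text = value.strip()
    # ElementTree escapes characters such as & and < when serializing.
    return ET.tostring(root, encoding="unicode")

# Made-up sample rows, not taken from the actual donor lists.
sample = "Jane Doe\tAustin\tTX\t1000\nJohn Roe\tDallas\tTX\t250\n"
xml_text = tsv_to_xml(sample)
```

Building the tree through an XML library rather than by string pasting
means that names containing & or < are escaped for you.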

I've written a very simple DTD for the XML version:
<http://metalab.unc.edu/javafaq/bush/donations.dtd>. Checked against
this DTD, the results do appear to be well-formed and valid
(though I've been burned by misbehaving validators before). The
first two validators I tried gave up on trying to parse such a
large (more than eight megabytes) document. Interestingly, the
initial conversion to XML did turn up some bugs in my
PDF-to-text converter program, but the validation of the XML did
not find any additional problems. I can see how a schema
language would be very useful for this sort of reverse-engineering
work, though.
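
Since document size was what tripped up those validators, a streaming parse
is one way to at least check well-formedness without holding the whole
document in memory. The sketch below is an illustration, not the tool I
used. It assumes each record is a donation element (a hypothetical name),
and it checks well-formedness only: Python's standard library does not
validate against a DTD.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, tag="donation"):
    """Stream through an XML document, counting <tag> elements.

    Raises xml.etree.ElementTree.ParseError if the document is not
    well-formed. This does NOT validate against a DTD.
    """
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the record's children and text once counted
    return count

# Made-up sample document; a real run would pass a file name or file object.
doc = (b"<donations>"
       b"<donation><name>Jane Doe</name></donation>"
       b"<donation><name>John Roe</name></donation>"
       b"</donations>")
print(count_elements(io.BytesIO(doc)))
```

A mismatched or unclosed tag anywhere in the stream raises ParseError at the
point of failure, which is handy for tracking down converter bugs.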

Eventually I may try to cook up a more serious DTD that more closely
matches the FEC's actual required format for filing electronic copies of
donor lists. I'm also going to try to add a simple XSL stylesheet to these
in the near future, but they're so large that they really challenge anyone
trying to browse them directly.




+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|                  The XML Bible (IDG Books, 1999)                   |
|              http://metalab.unc.edu/xml/books/bible/               |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/     |
+----------------------------------+---------------------------------+



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)






 


Copyright 2001 XML.org. This site is hosted by OASIS