>> >1) Writing CSV code is easier than XML code (no DOM or anything, just
>> >something like SAX; I write the CSV parser myself in less code than it
>> >takes to interface with an XML parser)
>> If DOM is too hard (and mostly I agree with you) use SAX or use JDOM.
>> JDOM certainly is much simpler for the sorts of things you're doing.
Ok, if the data model is flat, then SAX gets easier than DOM (and is
probably a lot more appropriate) - but I bet I've seen posts from people
finding SAX difficult outnumber those finding DOM difficult by 2:1 (the
handler calls appear back to front, whereas a tree is just a tree...).
It depends somewhat on your data source - can you specify the format in
which the data is generated?
>> DOM != XML
Well yes, but if you just say XML = http://www.w3.org/TR/REC-xml then you
lose a lot. (BTW, does XML = Infoset yet?)
>The code I have that does XML data import (to the same engine as my CSV
>data import) uses SAX, yes, but it's still bigger since it then needs to
>implement a state machine to pull apart the table structure from a tree.
Surely the CSV parser needs to know when it gets to the end of a row?
Doesn't that have to deal with exactly the same kind of states?
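
As a rough illustration of the state tracking being discussed, here is a
minimal sketch of a SAX handler that pulls flat rows out of a document. The
element names (table, row, and one child element per field) are assumptions
made up for the example, not anyone's actual format. The "state" amounts to
two variables: whether we are inside a row, and which field we are inside.

```python
import xml.sax


class TableHandler(xml.sax.ContentHandler):
    """Collect <row><field>text</field>...</row> data into a list of dicts.

    Assumes a flat, hypothetical layout: every child of <row> is one field.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None    # dict for the row being read, None outside a row
        self.field = None  # name of the field being read, None outside one
        self.text = []     # character chunks for the current field

    def startElement(self, name, attrs):
        if name == "row":
            self.row = {}
        elif self.row is not None:
            self.field = name
            self.text = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "row":
            self.rows.append(self.row)
            self.row = None
        elif name == self.field:
            self.row[name] = "".join(self.text)
            self.field = None


def parse_table(xml_bytes):
    handler = TableHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.rows
```

A CSV reader keeps the equivalent of self.row (end of line closes the row)
and self.field (a delimiter closes the field), which is the point of the
question above.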
>> >2) Data corruption? XML parsers are *fragile*, CSV parsers can cope
>> >with erroneous data in ways that XML parsers mustn't if they are to be
>> >standards compliant!
>> That's a feature, not a bug. If the data is bad, I want to know about
>> it ASAP and get it fixed at the source. Draconian error handling is a
>> very good thing.
>Depends if you're working in a world of potentially dodgy data sources...
Well, that's air traffic control for you...
>I'd rather not *know* if data is bad, I'd rather the system transparently
>fixed it, and only told me if it's too bad to properly process.
XML parsers are not usually fragile - faced with bad data, they let you know.
Horses for courses, of course, as to where you draw the line on 'too bad':
all data sources are potentially dodgy, and it's easy enough to express junk in
well-formed, valid XML. For most practical purposes more draconian measures
definitely make life easier, because you get a clearer signal (which you can
always handle in a pragmatic fashion).
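
To make the "clearer signal" concrete, here is a minimal sketch using
Python's standard xml.sax module. The function name check_well_formed is
made up for the example. The parser stops at the first well-formedness
error, and the caller decides what to do with that information.

```python
import xml.sax


def check_well_formed(xml_bytes):
    """Return (True, None), or (False, 'line:column: message') on an error."""
    try:
        # A bare ContentHandler ignores all events; we only want the errors.
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler())
        return True, None
    except xml.sax.SAXParseException as err:
        where = f"{err.getLineNumber()}:{err.getColumnNumber()}"
        return False, f"{where}: {err.getMessage()}"
```

The pragmatic handling then lives in the caller: log the position and reject
the file, send it back to the source, or queue it for a human to look at.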
>With my CSVs, if one row is missing a field or has an extra field (so the
>CSV is not well formed, eg not all the rows are the same length) or if
>there's a field name that I do not recognise, then I signal that as an
>error and stop.
>But if they've just used a strange date format, as long as it's parsable,
>I'd rather be able to study it and then add support for that date format
>so it's not an error in future than have it be forced as an error by some
Surely the same applies to SAX?
This situation is ok as long as you are in a position to offer human
surveillance, and don't have to justify (or even estimate) the accuracy of
your data. These are exceptional circumstances!
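
For reference, the CSV policy quoted above (stop on a short or long row or an
unknown field name, but keep a growing list of accepted date formats) might
look something like the following sketch. The field names and the two date
formats are assumptions invented for the example.

```python
import csv
import io
from datetime import datetime

KNOWN_FIELDS = {"id", "date"}             # hypothetical field names
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]   # extend as new formats turn up


def parse_date(text):
    """Try each known format in turn; fail only if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def load_csv(text):
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    unknown = set(header) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unknown field names: {sorted(unknown)}")
    rows = []
    for line_no, record in enumerate(reader, start=2):
        # Structural problems are errors: every row must match the header.
        if len(record) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, "
                f"got {len(record)}"
            )
        row = dict(zip(header, record))
        if "date" in row:
            row["date"] = parse_date(row["date"])
        rows.append(row)
    return rows
```

The same split works behind a SAX handler: the XML parser enforces the
structure, and the date handling stays in application code either way.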
Personally I'd be tempted to opt for an XML solution on the grounds of
interoperability (and I'm sure it wouldn't involve much more work than CSV
building/parsing), and adopting a standard would also help in the tribunals.
<stuff> http://www.isacat.net </stuff>