Elliotte Rusty Harold wrote:
> But rigid fixed schemas fail when we're talking about thousands or tens
> of thousands or even millions of disconnected developers who do not have
> prior agreements, who do not know each other, and who are doing very
> different things with the same data. This is the world of the Internet.
> This is the world I work in. This is the world more and more developers
> are working in more and more of the time, and the old practices that
> worked in small, closed systems behind the firewall are failing. It's
> time to learn how to design systems that are flexible and loosely
> coupled enough to work in this new environment. XML is a critical
> component in making this work. Maybe RDF is too, though I'm still not
> convinced (to bring this thread back on topic.) Schemas really aren't.
> At best schemas are a useful diagnostic tool for deciding what kind of
> document you've got so you can dispatch it to the appropriate local
> process. At worst, however, schemas encourage a mindset and assumptions
> that are actively harmful when trying to produce scalable, robust,
> interoperable systems.
What Rusty said.
Here are two vignettes from my own experience to underline his point.
- We will be getting XML messages (via JMS) from a state agency - the
State of California, in fact. Their contractor tells us the messages
conform to such-and-such a schema. The schema happens to be one that we
ourselves wrote; it is a draft version of a standard-to-be.
But the first documents we get do not validate against the schema, and
unfortunately they are not just simple extensions. In a few places new
structures have made their way into the documents. It seems pretty clear
what happened: probably the messages validated originally, but then
the contractor wanted to make some changes and forgot that the changes
might not be schema-valid. Or maybe they never tried validating in the
first place. Either way, no problem - XSLT to the rescue!
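(We used XSLT for the repair; as an illustrative sketch of the same
normalization idea, here it is in Python with the standard-library
ElementTree instead. The element names - <PatientRef>, <PatientId>,
<Extra> - are hypothetical stand-ins, not the actual schema's.)

```python
import xml.etree.ElementTree as ET

def normalize(xml_text):
    """Rewrite an off-schema message into the expected shape.

    Hypothetical repairs: rename <PatientRef> (the sender's invention)
    to the schema's <PatientId>, and unwrap an unexpected <Extra>
    element, hoisting its children up one level.
    """
    root = ET.fromstring(xml_text)

    # Repair 1: rename the invented element to the schema's name.
    for elem in root.iter():
        if elem.tag == "PatientRef":
            elem.tag = "PatientId"

    # Repair 2: remove any <Extra> wrapper but keep its children,
    # inserted at the wrapper's old position to preserve order.
    for parent in root.iter():
        for child in list(parent):
            if child.tag == "Extra":
                idx = list(parent).index(child)
                parent.remove(child)
                for grandchild in reversed(list(child)):
                    parent.insert(idx, grandchild)

    return ET.tostring(root, encoding="unicode")
```

The downstream processes then see only schema-shaped documents, and the
workaround lives in one small, replaceable transform.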
- I need to screen-scrape certain data from a web page that is updated
from time to time. The page is put up by a US government agency, and the
data is critical medically-related information. The results of the
extraction feed the front end of a long and complex automated workflow.
I write the front-end parser (this was before John Cowan's TagSoup
parser came out).
It turns out that the page is hand-authored by someone who is not very
expert in HTML. With every update the internal structure changes. The
page always looks the same in the browser, but certain key internal
parts are actually invalid HTML, and the nature of the invalidity
changes each time. Unfortunately we have to use those parts to extract
the indexes that point to the actual data we want to collect from other
parts of the page.
We cannot outguess all the changes, and so from time to time we get
parse failures. We cannot influence the page design. Finally we give up
and use the text-only version that the agency also hosts. This has no
markup, but the visual structure blocks out the information we need in
a consistent way, and here the visual structure matches the actual text
format. I write a parser that emits SAX-like events to feed the
downstream process. After this change everything works nicely and robustly.
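(A parser over block-structured plain text can be sketched as a
generator yielding SAX-like event tuples. The layout below - a
non-indented line opens a record, indented "key: value" lines are its
fields - is a hypothetical stand-in for the agency's actual text format.)

```python
def sax_events(lines):
    """Yield SAX-like (event, data) tuples from block-structured text.

    Assumed layout (hypothetical): a line with no leading whitespace
    starts a new record; indented "key: value" lines are fields of the
    current record; blank lines are ignored.
    """
    yield ("start_document", None)
    in_record = False
    for line in lines:
        if not line.strip():
            continue
        if not line[0].isspace():
            # Non-indented line: close any open record, open a new one.
            if in_record:
                yield ("end_record", None)
            yield ("start_record", line.strip())
            in_record = True
        else:
            # Indented line: a "key: value" field of the current record.
            key, _, value = line.strip().partition(":")
            yield ("field", (key.strip(), value.strip()))
    if in_record:
        yield ("end_record", None)
    yield ("end_document", None)
```

Because the downstream stages consume events rather than a DOM, the
same pipeline that once sat behind the HTML parser plugs in unchanged.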
As Rusty says, that is the world of the internet.
Thomas B. Passin
Explorer's Guide to the Semantic Web (Manning)