On Monday 28 October 2002 4:12 pm, Paul Prescod wrote:
> Alaric B. Snell wrote:
> >...
> >
> >
> > Because XML has a fragile data model, designed for publishing stuff to a
> > browser rather than transfer between applications?
>
> XML is based on SGML which was invented long before browsers as we know
> them.
Yep, but that's orthogonal to my point; XML is based on ASCII, which is based
on binary, which has been around since (potentially) ancient times in China
(there being some evidence of binary arithmetic in certain ancient Chinese
symbolism).
XML was designed, sometime in the late 90s; quite incidentally they decided
to subset SGML because it seemed a good base for solving the problem at hand
(and scarily enough, I agree with them! XML's great as an SGML-lite :-)
[why XML for data]
> It arises naturally from the observation that structured data (tuple
> structured, hierarchically structured, graph structured, recursive) is
> a subset of the kinds of data you will find in the documents XML was
> designed to handle. A telephone book is tuple-structured. An airplane
> manual is mostly hierarchically structured but with frequent escapes to
> graph structure.
But for complex data interchange the hierarchy of XML can be limiting. XML
deals with lists OK since a list is a subset of a tree - each node in a tree
is a list of nodes.
But for tables, it's clunky because you're having lots of nodes with the same
structure under a parent node. Repeating that structure gets laborious to
type if you're a human and is laborious to process if you're a computer. For
a table of tuples, it's much easier for all parties to deal with:
email,name
alaric@alaric-snell.com,Alaric Snell
paul@prescod.net,Paul Prescod
foo@bar.com,"Comma Containing, Mrs"
...than with the XML, which is at best:
<table>
<field email="alaric@alaric-snell.com" name="Alaric Snell" />
<field email="paul@prescod.net" name="Paul Prescod" />
<field email="foo@bar.com" name="Ampersand containing, Mr &amp; Mrs" />
</table>
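As a rough illustration of the comparison above, here is a minimal Python sketch that reads the same three tuples from both notations using only the standard library. The names CSV_TEXT, XML_TEXT, rows_from_csv and rows_from_xml are made up for this example, and the XML sample reuses the CSV values so the two results can be compared directly.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample data; the XML mirrors the CSV rows (an assumption made for comparison).
CSV_TEXT = '''email,name
alaric@alaric-snell.com,Alaric Snell
paul@prescod.net,Paul Prescod
foo@bar.com,"Comma Containing, Mrs"
'''

XML_TEXT = '''<table>
<field email="alaric@alaric-snell.com" name="Alaric Snell" />
<field email="paul@prescod.net" name="Paul Prescod" />
<field email="foo@bar.com" name="Comma Containing, Mrs" />
</table>'''


def rows_from_csv(text):
    """Return (email, name) tuples; the header row names the columns once."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["email"], row["name"]) for row in reader]


def rows_from_xml(text):
    """Return (email, name) tuples; every element repeats the attribute names."""
    root = ET.fromstring(text)
    return [(f.get("email"), f.get("name")) for f in root.findall("field")]


if __name__ == "__main__":
    print(rows_from_csv(CSV_TEXT) == rows_from_xml(XML_TEXT))
```

Both readers are short, so the difference lies mostly in the data: the CSV states the column names once, while the XML restates them on every row.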
...and of course for graphs you need a system of primary keys and pointers
like id= - and you have to build your graph over a hierarchy which isn't
always optimal; at worst you use the hierarchy to build a list of key-value
pairs, then use the list of key-value pairs with 'reference' nodes in the
values to build a graph :-/
Or how about a multiple-parent hierarchy, hmm? A family tree?
Not so bad in CSV:
name,mother,father
Alaric Snell,Karin Owens,Lionel Snell
Karin Owens,Jean Byrne,James Owens
Lionel Snell,moment of shame as I forget my grandparents' names...
...
or
id,name,motherId,fatherId
1,Alaric Snell,2,3
2,Karin Owens,..,..
3,Lionel Snell,..,..
...
if name clashes are a concern!
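To show how little machinery the id-based table needs, here is a minimal Python sketch that turns it into a linked structure with two parents per person. FAMILY_CSV and load_family are hypothetical names, and the parent ids elided as ".." above are simply left empty in the sample rather than guessed.

```python
import csv
import io

# Sample rows; unknown parents are left as empty fields (an assumption).
FAMILY_CSV = """id,name,motherId,fatherId
1,Alaric Snell,2,3
2,Karin Owens,,
3,Lionel Snell,,
"""


def load_family(text):
    """Return {id: person}, where each person links directly to parent records."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # First pass: create one record per row so that ids can be resolved.
    people = {
        row["id"]: {"name": row["name"], "mother": None, "father": None}
        for row in rows
    }
    # Second pass: replace the id columns with references to the records.
    for row in rows:
        person = people[row["id"]]
        person["mother"] = people.get(row["motherId"])
        person["father"] = people.get(row["fatherId"])
    return people


if __name__ == "__main__":
    family = load_family(FAMILY_CSV)
    print(family["1"]["mother"]["name"])
```

The two-pass approach means a row may refer to an id that appears later in the file, which a multiple-parent hierarchy generally requires.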
Now, of course, the counter argument is that you put up with XML being clunky
at tuples or graphs because it's good at hierarchies and lists and bearable
at tuples and graphs so you overall have something that can kind of manage
everything... XML got in the door by also being useful for a data structure
generally ignored by the 'data' crowd, "text with metadata wrapped around
spans of it".
In a world where most information (by bulk) is table-shaped, the most
interesting information is graph-shaped, and merely the most fashionable
information (I'm talking the WWW in general here, not just XML) is
tree-of-annotated-text shaped, what kind of data transfer system do I
want? One that gets out the way, and just gets my information from A to B
with the minimum of effort (even if it is a pair of superimposed labelled
directed acyclic graphs because I'm comparing the dataflow and control flow
behaviour of a piece of code!).
Many years ago I wrote a document processing system - it produced FAQ lists
in ten or so different formats: PostScript / DVI / PDF / plain HTML / HTML
with navigation gadgets in it / HTML as one big file / info / nicely
formatted plain text - and a few other little ones.
As an input format, I used s-expressions. s-expressions aren't as simple as
XML for attaching styles to text; instead of:
<document>
<title>Hello World</title>
<p>This is a <b>nice</b> document with <i>many</i> styles</p>
</document>
...you did:
(document
title: "Hello World"
(p "This is a " (b "nice") " document with " (i "many") " styles")
)
Yep, those quotes are a bit irritating, aren't they? So I stuck M4 (a macro
processing engine) in front and wrote a macro called <b> that expanded to "
(b " and so on, so I could write:
(document
title: "Hello World"
(p "This is a <b>nice</b> document with <i>many</i> styles")
)
Inspired by HTML, you see. But I still preferred the s-expressions for the
non mixed content parts.
And kapow! I had a slight modification of s-expression syntax being my text
input format. For completeness, here's a plain list in s-expr:
(1 2 3 4)
And here's a dictionary, in the case where you have a fixed set of keys to
choose from:
(name: "Alaric Snell" email: "alaric@alaric-snell.com")
And for arbitrary keys:
(("name" "Alaric Snell")
("email" "alaric@alaric-snell.com"))
And here's a tree:
(html
(head
(title "My Document"))
(body
(p "This is an HTML document, in an alien form")))
And here's a table:
(table ('name 'email)
("Alaric Snell" "alaric@alaric-snell.com")
("Paul Prescod" "paul@prescod.net"))
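For a sense of how small a reader for this notation can be, here is a minimal Python sketch covering lists, quoted strings, integers and bare symbols such as name: or 'email. It is an illustration under stated assumptions, not the parsing library mentioned in this post: Symbol, tokenize, parse and read are made-up names, string escapes are not handled, and the graph syntax is left out.

```python
import re

# One token: "(", ")", a double-quoted string, or a run of other characters.
TOKEN = re.compile(r'\s*(\(|\)|"[^"]*"|[^\s()"]+)')


class Symbol(str):
    """A bare word such as table, name: or 'email (distinct from a string)."""


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError("bad token at offset %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(tokens):
    """Parse one expression from the front of the token list, consuming it."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        items = []
        while tokens and tokens[0] != ")":
            items.append(parse(tokens))
        if not tokens:
            raise SyntaxError("missing closing bracket")
        tokens.pop(0)  # discard the ")"
        return items
    if tok == ")":
        raise SyntaxError("unexpected closing bracket")
    if tok.startswith('"'):
        return tok[1:-1]
    if tok.isdigit():
        return int(tok)
    return Symbol(tok)


def read(text):
    tokens = tokenize(text)
    expr = parse(tokens)
    if tokens:
        raise SyntaxError("trailing input after expression")
    return expr


if __name__ == "__main__":
    print(read('(table (\'name \'email) ("Alaric Snell" "alaric@alaric-snell.com"))'))
```

Lists, dictionaries, trees and tables all come back as nested Python lists, so one reader serves every shape shown above.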
Now s-expressions look pretty hierarchical here, but that's just a shorthand;
they're really full graphs. I can't remember the graph syntax since I rarely
used it but basically you can name any node with a bit of attached metadata
and then say "Insert a reference to node X here" where you need it to create
links. The link is explicit in the syntax, not implicit like with XML. The
names you use for nodes aren't part of the data being transferred, they're
just used for the transfer itself (like the choice of a namespace prefix in
an XML document). It looked something like:
(family-tree
(name: "Alaric" mother: @karin father: @lionel)
@karin:
(name: "Karin" ...)
@lionel:
(name: "Lionel" ...)
...
)
You can create a cyclic structure like so:
@cycle: ("Here is a cycle: " @cycle)
That's a list whose first element is a string and whose second element is
itself.
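The linking step can be sketched in a few lines of Python. This assumes a reader has already produced nested lists in which each "@name" reference is a placeholder object and each "@name:" label has been collected into a dictionary; Ref and link are hypothetical names, not part of any real s-expression library.

```python
class Ref:
    """Placeholder for an '@name' reference that has not been resolved yet."""

    def __init__(self, name):
        self.name = name


def link(labelled, node):
    """Replace every Ref inside the nested list `node`, in place.

    `labelled` maps label names to the lists they were attached to.
    Substituted nodes are not descended into here, so cycles terminate;
    each labelled list is still visited once at its original position.
    """
    for index, item in enumerate(node):
        if isinstance(item, Ref):
            node[index] = labelled[item.name]
        elif isinstance(item, list):
            link(labelled, item)


if __name__ == "__main__":
    # @cycle: ("Here is a cycle: " @cycle)
    cycle = ["Here is a cycle: ", Ref("cycle")]
    link({"cycle": cycle}, cycle)
    print(cycle[1] is cycle)
```

After linking, the label names are gone from the data: only the object identity of the shared node remains, which matches the point that the names belong to the transfer and not to the data.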
Those closing brackets can be awkward since you don't know what they match up
to, like minimized close tags in SGML - I long ago solved that for my
problems although standard s-expressions don't do this; my s-expression
parsing library supports a second syntax:
[foo[
...
]foo]
is syntactic sugar for:
(foo ...)
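One way to support such a second syntax is a preprocessing pass that rewrites the named brackets into plain ones and checks that each close matches its open. The Python sketch below is a guess at the idea, not the actual library: TAG and desugar are made-up names, and it does not skip over tags that happen to appear inside quoted strings.

```python
import re

# Matches an opening "[name[" (group 1) or a closing "]name]" (group 2).
TAG = re.compile(r"\[(\w[\w-]*)\[|\](\w[\w-]*)\]")


def desugar(text):
    """Rewrite [foo[ ... ]foo] as (foo ... ), checking that the names pair up."""
    out, stack, pos = [], [], 0
    for match in TAG.finditer(text):
        out.append(text[pos:match.start()])
        pos = match.end()
        if match.group(1) is not None:
            stack.append(match.group(1))
            out.append("(" + match.group(1))
        else:
            if not stack or stack.pop() != match.group(2):
                raise SyntaxError("unmatched ]%s]" % match.group(2))
            out.append(")")
    if stack:
        raise SyntaxError("unclosed [%s[" % stack[-1])
    out.append(text[pos:])
    return "".join(out)


if __name__ == "__main__":
    print(desugar("[foo[ 1 2 3 ]foo]"))
```

The name check is the whole benefit: a mismatched close is reported where it occurs instead of surfacing as a bracket imbalance at the end of the file.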
Now, if the people who came up with XML as data had wandered across a nest of
Lisp hackers instead of XML, might we not have seen something like my
s-expression variant with the symmetric tags being produced, perhaps
augmented by a syntax to embed nodes in text strings as I hacked together
with a macro system, then namespaces for symbols defined, then a
transformation language and a path language and so on defined?
Perhaps they'd been put off by DSSSL :-)
> There is no boundary between data and documents but of
> course there may be a point on the spectrum where XML produces small
> benefit (e.g. if CSV is all you need).
Yep. I just think we could have come up with a system that provides a better
overall gain for humanity - and has fewer areas where it produces small
benefit. The s-expression corresponding to a CSV file is little more complex
than the CSV would be, so it's already less of a loss than using XML in that
situation.
XML's gained a following, but the hype is waning already. It's lodged in many
niches, but as I see it, it has failed to change The Web, which is what it was
originally designed for. XML with embedded stylesheets served over HTTP and
displayed by browsers would have been a better Web, but the improvement is
marginal compared to the costs to vested interests, so I'm not too
surprised.
If Web services really come to pass in a big way, it will be more despite XML
than because of it. If they'd been based on ONC RPC (remember, kids: it's
extensible, copes quite happily with revisions to the standards in use,
loosely coupled, easy to use, quite happy with lossy networks and so on, not
a problem to debug in practice, an established standard, widely implemented,
and near-optimally efficient in time and space), what would be the problem?
Currently my home machine is mounting Sunsite over the Global Interweb (tm)
with NFS, which uses ONC RPC underneath. This is cool because I can do:
alaric@hate:/net/sunsite$ ls
0-Most-Packages geography lost+found pub recreation usenet
aminet geology media public rfc usr
bin gnu Mirrors README science var
biology IAFA-SITEINFO misc README.ftp special
computing ic.doc packages README.layout sun
core incoming park README.login tmp
etc info politics README.uploads unix
Now I don't think that would be *practical* with SOAP over HTTP! My ls
request ran over UDP packets with sizes of 100-200 bytes - one packet to
request, another packet to respond, no three-way handshakes or teardown and
no TCP overhead wasting time reordering the response packets to make them
come in the order the server sent (and I can submit requests without needing
to wait for the response to the last request, which I don't think HTTP 1.1
allows but I may be wrong).
The ONC RPC data model, XDR, is limiting in some ways, but they could easily
have spent the time they spent making SOAP on rewriting it in terms of the
s-expression data interchange they'd have made up in the time they would
have spent making XML, particularly if they'd spent some of the spare time
making an alternative compact binary syntax for the sexprs (before you ask,
one that's semantically identical to the textual sexprs and can be converted
to and from same with a single, very simple, tool: output (SYNTAX_TEXT, parse
(SYNTAX_BINARY, file)) or vice versa).
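To make the idea of a semantically identical binary syntax concrete, here is a minimal Python sketch of a tagged, length-prefixed encoding for nested lists, strings and integers. The wire format is invented purely for illustration (it is not XDR and not any existing standard), encode and decode are hypothetical names, and symbols are not distinguished from strings.

```python
import struct


def encode(value):
    """Encode ints, strings and nested lists as tagged binary."""
    if isinstance(value, bool):
        raise TypeError("booleans are not part of this sketch")
    if isinstance(value, int):
        return b"i" + struct.pack(">q", value)  # 64-bit signed, big-endian
    if isinstance(value, str):
        data = value.encode("utf-8")
        return b"s" + struct.pack(">I", len(data)) + data
    if isinstance(value, list):
        body = b"".join(encode(item) for item in value)
        return b"l" + struct.pack(">I", len(value)) + body
    raise TypeError("cannot encode %r" % type(value))


def decode(buf, pos=0):
    """Decode one value starting at `pos`; return (value, next position)."""
    tag = buf[pos:pos + 1]
    pos += 1
    if tag == b"i":
        (number,) = struct.unpack_from(">q", buf, pos)
        return number, pos + 8
    if tag == b"s":
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        return buf[pos:pos + length].decode("utf-8"), pos + length
    if tag == b"l":
        (count,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        items = []
        for _ in range(count):
            item, pos = decode(buf, pos)
            items.append(item)
        return items, pos
    raise ValueError("unknown tag %r at offset %d" % (tag, pos - 1))


if __name__ == "__main__":
    table = ["table", ["name", "email"], ["Alaric Snell", "alaric@alaric-snell.com"]]
    blob = encode(table)
    print(decode(blob)[0] == table)
```

Because both directions work on the same in-memory structure, a text-to-binary converter is just a text reader feeding this encoder, and the reverse is this decoder feeding a text printer.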
I guess that's part of what makes my blood boil about XML data and XML for
interprocess communication and all that; the reinvention of wheels, with less
care than the first time round :-(
Whew... what a lot of typing! I should be eating!
> Paul Prescod
ABS
--
Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
Heed the path that led me to that place, Yellow desert screen