   Re: [xml-dev] What is XML For?


On Saturday 26 October 2002 6:19 pm, Paul Prescod wrote:
> > ... It's nice to have a standard way of
> > writing parts of the standard, but you still need to write up a lot of
> > other stuff about the meanings of things and so on.

> So? The fact that there is a formalism for some of it is a clear
> advantage of XML over formats that lack it. XML takes one of the tricky
> parts of the problem of describing and implementing complex,
> hierarchical, recursive data structures and makes it much easier.

In the XML vocabulary developments I've seen close up, though (a document 
format for describing the setups of computers that is partly auto-generated 
and then munged by XSLT to create a giant document describing an entire 
network, and the ill-fated XML format for other companies to supply data to 
where I currently work), the main thrust of development has tended to be in the 
sample code that processes the file, with the schema then updated as an 
afterthought.

I guess one thing that bugs me is that a schema might be used to test a bit 
of code that writes out documents but not one that reads them. Somebody might 
have added a new element and then forgotten to update the schema. At one 
point with the data import format somebody had even allowed arbitrary 
elements in a certain context - data fields for a record were done with 
<fieldname>value</fieldname>, and when we moved away from a fixed data 
structure to an editable one in the database you could have any field name 
cropping up there and the type of the content would have to match a type 
pulled from our database :-/
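A minimal sketch of the kind of check that ends up living in code rather than in the schema, assuming the per-field types have already been fetched from the database into a map. The class name, the three field types and the map are all hypothetical stand-ins, not our actual import code:

```java
import java.math.BigDecimal;
import java.util.Map;

// Hypothetical checker for <fieldname>value</fieldname> pairs whose allowed
// names and types are only known at runtime.
final class DynamicFieldChecker {
    enum FieldType { TEXT, INTEGER, DECIMAL }

    // Assumption: loaded from the database before the import starts.
    private final Map<String, FieldType> declaredTypes;

    DynamicFieldChecker(Map<String, FieldType> declaredTypes) {
        this.declaredTypes = declaredTypes;
    }

    // True if the field name is known and the text parses as its declared type.
    boolean accepts(String fieldName, String value) {
        FieldType type = declaredTypes.get(fieldName);
        if (type == null) {
            return false; // an element name the database has never heard of
        }
        try {
            if (type == FieldType.INTEGER) {
                Long.parseLong(value.trim());
            } else if (type == FieldType.DECIMAL) {
                new BigDecimal(value.trim());
            }
            return true; // TEXT accepts anything
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

A static schema that allows arbitrary elements in that context cannot express this, since the set of legal names changes whenever somebody edits the structure in the database.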

So I guess I've found them a bit too low level, and it's annoying to have to 
develop them separately from my code; with CORBA IDL and friends I'm defining 
my data structure and my interchange format in one fell swoop, so I don't 
need to keep them in synch.

And there was really *no* problem with the data structure being the same in 
memory as on disk in this case :-) It's a list of records each of which is a 
dictionary mapping field names to values.

ImportFile ::= SEQUENCE OF
                 SEQUENCE OF
                   SEQUENCE {
                      name UTF8String,
                      value UTF8String
                   }
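As a rough illustration of "same structure in memory as on disk", here is a sketch in Java where the in-memory type is a list of maps and the byte format is just that structure written out in order. The class name and the length-prefixed encoding are assumptions made for the example; this is not ASN.1 BER, and writeUTF uses Java's modified UTF-8:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical codec: a list of records, each a map from field name to value.
final class ImportFileCodec {
    static byte[] write(List<Map<String, String>> records) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(records.size());            // outer SEQUENCE OF
            for (Map<String, String> record : records) {
                out.writeInt(record.size());         // inner SEQUENCE OF
                for (Map.Entry<String, String> field : record.entrySet()) {
                    out.writeUTF(field.getKey());    // name
                    out.writeUTF(field.getValue());  // value
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<Map<String, String>> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int recordCount = in.readInt();
            List<Map<String, String>> records = new ArrayList<>();
            for (int i = 0; i < recordCount; i++) {
                int fieldCount = in.readInt();
                Map<String, String> record = new LinkedHashMap<>();
                for (int j = 0; j < fieldCount; j++) {
                    String name = in.readUTF();
                    String value = in.readUTF();
                    record.put(name, value);
                }
                records.add(record);
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading back what was written gives an equal list of maps, so nothing has to be translated between the file's view of the data and the program's.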

Hey, there's a point in my position that we don't need harsh separation 
between data interchange format descriptions and in-memory ones; why do we 
need a separate notation for each? It's just a data structure; you still end 
up declaring that certain things appear inside certain things and all that; 
even if you decide you want different formats internally and externally in a 
given situation, it would be nice to have the raw input data coming in as 
something compatible with what you process internally, for the simple reason 
that your transformation probably doesn't want to rewrite EVERYTHING.

Eg, if a vector image file format was described like so:

VectorImage ::= SEQUENCE OF Polygon

Polygon ::= SEQUENCE OF Point

Point ::= SEQUENCE {
            x Integer,
            y Integer
          }

...I'd rather write:

InMemoryImage load (String filename) {
     // Parse the file format. The compiler has represented
     // SEQUENCE OF types as Java Lists, and the Point type
     // as a wee class with methods like getX () and getY ()
     VectorImage v = new VectorImage (filename);

     // Set up an in-memory model. This uses the same structure for the
     // polygons as lists of points but the image itself goes into a
     // 2D spatial index for faster clipping
     InMemoryImage result = new InMemoryImage ();

     Iterator polygonIterator = v.iterator ();
     while (polygonIterator.hasNext ()) {
       Polygon p = (Polygon) polygonIterator.next ();
       result.addPolygon (p);
     }

     return result;
}


...than:

InMemoryImage load (String filename) {
    Document doc = MyParser.parse (filename);
    // 'scuse syntax, not done any DOM in a while
    Element image = doc.getRootElement ();
    InMemoryImage result = new InMemoryImage ();

    // definitely not DOM but you know what I mean
    Iterator pi = image.getChildrenIterator ();
    while (pi.hasNext ()) {
        Element polyElement = (Element) pi.next ();
        InMemoryPolygon polygon = new InMemoryPolygon ();
        Iterator pointI = polyElement.getChildrenIterator ();
        while (pointI.hasNext ()) {
            Element pointElement = (Element) pointI.next ();
            try {
                polygon.addPoint (new InMemoryPoint (
                      Long.parseLong (pointElement.getAttribute ("x")),
                      Long.parseLong (pointElement.getAttribute ("y"))));
            } catch (NumberFormatException e) {
                throw new FileFormatException ("Damned schema violation!");
            }
        }
        result.addPolygon (polygon);
    }

    return result;
}
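For comparison, a sketch of the same loader written against the real W3C DOM classes that ship with Java (javax.xml.parsers and org.w3c.dom) rather than the made-up calls above. The class name, the nested Point class and the element and attribute names (polygon children of the root, point children with x and y attributes) are assumptions for illustration, and it parses a string rather than a file to stay self-contained:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical loader: root element holds polygons, polygons hold points.
final class DomVectorImageLoader {
    static final class Point {
        final long x;
        final long y;
        Point(long x, long y) { this.x = x; this.y = y; }
    }

    static List<List<Point>> load(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        List<List<Point>> image = new ArrayList<>();
        NodeList polygons = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < polygons.getLength(); i++) {
            Node polyNode = polygons.item(i);
            if (polyNode.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes and comments
            }
            List<Point> polygon = new ArrayList<>();
            NodeList points = polyNode.getChildNodes();
            for (int j = 0; j < points.getLength(); j++) {
                Node pointNode = points.item(j);
                if (pointNode.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                Element point = (Element) pointNode;
                try {
                    // getAttribute returns "" for a missing attribute,
                    // which parseLong rejects as well.
                    polygon.add(new Point(Long.parseLong(point.getAttribute("x")),
                                          Long.parseLong(point.getAttribute("y"))));
                } catch (NumberFormatException e) {
                    throw new IllegalArgumentException("point without numeric x/y", e);
                }
            }
            image.add(polygon);
        }
        return image;
    }
}
```

Even with the real API, every structural rule (what is a child of what, which attributes are numbers) is restated by hand in the traversal, which is the duplication I'm complaining about.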

> > No way. Have you ever looked at a spec for a binary file format? Most of
> > the ones I've dealt with have taken a few hours to bang out an
> > implementation of (except TIFF; implementations of TIFF are never
> > finished...)
> For you and me, yes. For the average business programmer? I disagree.
> We're talking about the kind of people who spend most of their day in
> Visual Basic.

But THEY don't even want XML; they probably don't find wandering a DOM tree 
any more friendly than calling whatever passes for Perl's "pack" and "unpack" 
in VB. They are the people who want to just have magic serialisation from 
data structures to strings of bytes.

> > No it's not... I've got quite a few custom protocols I put together
> > lurking around my systems.
> If it is running on YOUR SYSTEMS then it isn't deployed in the sense I
> mean. I'm asking have you ever tried to deploy a protocol that would
> have dozens of independent implementations and thousands of users?
> That's _really hard_ and many good protocols never make the leap.

That's purely a problem of adoption in the protocol marketplace, not 
difficulty of development.

But I'm in a little WG right now developing a protocol to replace IMAP, and I 
helped out a bit in developing the PNG image file format (which has dozens of 
independent implementations and zillions of users). I wasn't there at the 
beginning, but they did pretty much what I'd have done if I was (he says, 

PNG is a nice example of a file format. It's extensible by third parties to 
create their own specialist metadata that can just be ignored by applications 
that don't understand it. Better than XML, those extra chunks (sort of like 
tags) can be marked with metadata about how applications that don't 
understand them should handle them.

1) The chunk might be something like an indication that the image data is 
compressed in some bizarre new way. In which case, it is marked so that 
applications that don't understand it are forced to reject the file.

2) If not, the chunk still might be something like a thumbnail image or a 
histogram of colours used in the image; if a processor changes the image but 
doesn't understand this chunk it should remove it since it won't be up to 
date if the image is changed.

3) Finally, the chunk should just be ignored, and left untouched if the image 
is altered. Stuff like copyright notices and so on.
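In PNG that marking is carried in the four-letter chunk name itself: bit 5 (the ASCII lowercase bit) of the first letter says whether the chunk is ancillary, and bit 5 of the fourth letter says whether it is safe to copy. A minimal sketch of the three-way decision for a chunk the program doesn't recognise; the class, method and enum names are made up for the example, and it simplifies case 2 to "drop whenever the image is changed":

```java
// Hypothetical helper: what to do with a PNG chunk type we don't understand.
final class PngChunkPolicy {
    enum Action { REJECT_FILE, DROP_IF_IMAGE_CHANGED, KEEP }

    private static final int LOWERCASE_BIT = 0x20; // bit 5 of an ASCII letter

    static Action forUnknownChunk(String chunkType) {
        if (chunkType.length() != 4) {
            throw new IllegalArgumentException("chunk types are four letters");
        }
        boolean ancillary = (chunkType.charAt(0) & LOWERCASE_BIT) != 0;
        boolean safeToCopy = (chunkType.charAt(3) & LOWERCASE_BIT) != 0;
        if (!ancillary) {
            return Action.REJECT_FILE;        // case 1: critical chunk
        }
        return safeToCopy
                ? Action.KEEP                 // case 3: e.g. textual notices
                : Action.DROP_IF_IMAGE_CHANGED; // case 2: e.g. a histogram
    }
}
```

So an editor that has never heard of hIST (the histogram chunk) knows to drop it after changing the pixels, while an unknown tEXt-style chunk can be carried through untouched.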

> > ...but I've also produced a few RPC protocols. Let's see if any of them
> > are lying around... hmmm... not handy but take a look in
> > /usr/include/rpcsvc on a Unix box for a few.
> I turn off most of the RPC protocols on any boxes I maintain.

They'll still be in there :-)

> > But it's still a list of objects, perhaps with a semantic tree such as an
> > object grouping / containment hierarchy and maybe with layers. Your in
> > memory structure *has* to have that or else it's discarding information
> > it'll need when it comes to saving the file again (dedicated readers that
> > know they only need a subset of the information are a different matter,
> > though). It may overlay that tree with a lookup index, but that tree will
> > still be there...
> There can be many lossless isomorphic data structures optimized for
> different access modes. You yourself said that Quickbooks could move
> from whatever format it uses today to a SQL/relational representation of
> the same information. Presumably you didn't mean that they should lose
> information.

Nope, because it's the same model still, just implemented differently. From a 
linked list of C structs to the result of an SQL "SELECT * FROM 
PurchaseOrderLines where poNumber = <foo>" isn't a change of data structure, 
just a change of implementation, and indeed in SQL interfaces I've written 
for suitable dynamic languages where I can throw together a 'struct' type at 
runtime from the result of an SQL query, the linked list and the result set 
both support an interface like Java's Iterators since they are the same data 
structure.
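A small sketch of what I mean, in Java: one piece of processing code written against Iterator, fed once from a linked list and once from something row-shaped. The Line class, the method names and the list of Object[] rows standing in for a query result are all hypothetical; a real version would wrap a database cursor instead:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical purchase order lines, consumed through the same interface
// regardless of where they are stored.
final class OrderLines {
    static final class Line {
        final String item;
        final int quantity;
        Line(String item, int quantity) { this.item = item; this.quantity = quantity; }
    }

    // Works on anything that can hand out lines one at a time.
    static int totalQuantity(Iterator<Line> lines) {
        int total = 0;
        while (lines.hasNext()) {
            total += lines.next().quantity;
        }
        return total;
    }

    // Stand-in for a result set: each row is {item, quantity}, and a Line
    // is only built when the consumer asks for the next one.
    static Iterator<Line> fromRows(List<Object[]> rows) {
        final Iterator<Object[]> cursor = rows.iterator();
        return new Iterator<Line>() {
            @Override public boolean hasNext() { return cursor.hasNext(); }
            @Override public Line next() {
                Object[] row = cursor.next();
                return new Line((String) row[0], (Integer) row[1]);
            }
        };
    }
}
```

totalQuantity neither knows nor cares whether it was handed linkedList.iterator() or fromRows(rows).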

> > 4) A table of information, SQL stylee
> >
> > There's a fair few indexing schemes that can be applied here, but it's
> > still a table; an ordered multiset of tuples. (a pure relation in the
> > mathematical sense is an unordered set of tuples).
> "Table" is not a datastructure. There are many implementations of tables.

There are many implementations of Set, List, etc, too... see java.util for 
details :-)

> > Not that much information is hierarchical, certainly by bulk...
> So you're saying that you have information that XML isn't really
> designed for and XML isn't really that helpful for it. Are you arguing
> just for the pleasure of it?

No, I'm arguing against the fact that you're saying that XML is suitable for 
far more than I think it is!

Just to reverse positions, I see XML as useful for marking up text... but 
it's not well fitted for data. As far as I can tell there's been a conceptual 
bleed from:

This is <emphasis>tasty</emphasis> cheese

(taking a string of text then slipping a few tags in here and there to add 
abstract style information)

to:


  <title>Alaric's Cheese Bible</title>

(extending that to expressing metadata, but still expressing the metadata in 
terms of abstract styles; remove all the tags and you get:

Alaric's Cheese Bible


<title> is really just an abstract style with a schema constraint added that 
you can only have one, and as the first thing in a document).

to:


  <price currency="UKP" unit="kg">2.50</price>

...without enough stopping to think if it's a good idea.

I'm not even sure if the latter was what the W3C intended when coming up with 
XML; the introduction at http://www.w3c.org/XML/ states that it was designed 
for publishing, not data transfer, but it's becoming used for data transfer 

Is this just people noticing that something should be used for more than it 
was intended for (which I'm suspicious enough of :-), or is it people 
misapplying something out of foolish over optimism?

Whose idea *was* it to use XML for data interchange? The W3C seems to disavow 
responsibility in the first paragraph of that introduction. But somebody 
somewhere made a mental leap from "styling a human-readable document" to 
"data transfer". There are gray areas between the two, since an invoice might 
well be considered to need to be both a readable document and a piece of 
data, but nobody seems to be putting <?xml-stylesheet?> PIs in their XML 
purchase orders, do they?

> > email,faveFoods
> > "alaric@alaric-snell.com","Cheese"+"Yoghurt"+"Pizza"
> And what about recursive hierarchies? You can keep hacking the CSV format
> but eventually you'll reach a point of diminishing returns.

Some might argue that XML is also being hacked to a point of diminishing 
returns :-)

> Paul Prescod


Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
Heed the path that led me to that place, Yellow desert screen



Copyright 2001 XML.org. This site is hosted by OASIS