On Saturday 26 October 2002 6:19 pm, Paul Prescod wrote:
> > ... It's nice to have a standard way of
> >
> > writing parts of the standard, but you still need to write up a lot of
> > other stuff about the meanings of things and so on.
> So? The fact that there is a formalism for some of it is a clear
> advantage of XML over formats that lack it. XML takes one of the tricky
> parts of the problem of describing and implementing complex,
> hierarchical, recursive data structures and makes it much easier.
In the XML vocabulary developments I've seen close up, though (a document
format for describing the setups of computers that is partly auto-generated
and then munged by XSLT to create a giant document describing an entire
network, and the ill-fated XML format for other companies to supply data to
where I currently work), the main thrust of development has tended to be in the
sample code that processes the file, with the schema updated as an
afterthought!
I guess one thing that bugs me is that a schema might be used to test a bit
of code that writes out documents but not one that reads them. Somebody might
have added a new element and then forgotten to update the schema. At one
point with the data import format somebody had even allowed arbitrary
elements in a certain context - data fields for a record were done with
<fieldname>value</fieldname>, and when we moved away from a fixed data
structure to an editable one in the database, you could have any field name
cropping up there, and the type of the content would have to match a type
pulled from our database :-/
So I guess I've found them a bit too low level, and it's annoying to have to
develop them separately from my code; with CORBA IDL and friends I'm defining
my data structure and my interchange format in one fell swoop, so I don't
need to keep them in sync.
And there was really *no* problem with the data structure being the same in
memory as on disk in this case :-) It's a list of records each of which is a
dictionary mapping field names to values.
ImportFile ::= SEQUENCE OF
    SEQUENCE OF
        SEQUENCE {
            name  UTF8String,
            value UTF8String
        }
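Just to illustrate (this is my own sketch; the class name and sample values are made up), that structure drops straight onto stock Java collections, a list of records each of which is a dictionary mapping field names to values:

```java
import java.util.*;

public class ImportFileDemo {
    // Build one record: a dictionary mapping field names to values.
    // (LinkedHashMap keeps the fields in their on-disk order.)
    static Map<String, String> record (String... pairs) {
        Map<String, String> r = new LinkedHashMap<> ();
        for (int i = 0; i < pairs.length; i += 2)
            r.put (pairs[i], pairs[i + 1]);
        return r;
    }

    public static void main (String[] args) {
        // The whole ImportFile: just a list of records.
        List<Map<String, String>> importFile = new ArrayList<> ();
        importFile.add (record ("name", "Cheddar", "colour", "Yellowish"));
        importFile.add (record ("name", "Stilton", "colour", "Blue-veined"));
        System.out.println (importFile.get (0).get ("name"));
    }
}
```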
Hey, there's a point in favour of my position that we don't need harsh
separation between data interchange format descriptions and in-memory ones:
why do we need a separate notation for each? It's just a data structure; you
still end up declaring that certain things appear inside certain things and
all that. Even if you decide you want different formats internally and
externally in a given situation, it would be nice to have the raw input data
coming in as something compatible with what you process internally, for the
simple reason that your transformation probably doesn't want to rewrite
EVERYTHING.
Eg, if a vector image file format was described like so:
VectorImage ::= SEQUENCE OF Polygon
Polygon ::= SEQUENCE OF Point
Point ::= SEQUENCE {
    x Integer,
    y Integer
}
...I'd rather write:
InMemoryImage load (String filename) {
    // parse the file format. The compiler has represented
    // SEQUENCE OF types as Java Lists, and the Point type
    // as a wee class with methods like getX () and getY ()
    VectorImage v = new VectorImage (filename);

    // Set up an in memory model. This uses the same structure for the
    // polygons as lists of points but the image itself goes into a
    // 2D spatial index for faster clipping
    InMemoryImage result = new InMemoryImage ();
    Iterator polygonIterator = v.iterator ();
    while (polygonIterator.hasNext ()) {
        Polygon p = (Polygon) polygonIterator.next ();
        result.addPolygon (p);
    }
    return result;
}
...than:
InMemoryImage load (String filename) {
    Document doc = MyParser.parse (filename);
    // 'scuse syntax, not done any DOM in a while
    Element image = doc.getRootElement ();
    InMemoryImage result = new InMemoryImage ();
    // definitely not DOM but you know what I mean
    Iterator pi = image.getChildrenIterator ();
    while (pi.hasNext ()) {
        Element polyElement = (Element) pi.next ();
        InMemoryPolygon polygon = new InMemoryPolygon ();
        Iterator pointI = polyElement.getChildrenIterator ();
        while (pointI.hasNext ()) {
            Element pointElement = (Element) pointI.next ();
            try {
                polygon.addPoint (new InMemoryPoint (
                    Long.parseLong (pointElement.getAttribute ("x")),
                    Long.parseLong (pointElement.getAttribute ("y"))
                ));
            }
            catch (NumberFormatException e) {
                throw new FileFormatException ("Damned schema violation!");
            }
        }
        result.addPolygon (polygon);
    }
    return result;
}
> > No way. Have you ever looked at a spec for a binary file format? Most of
> > the ones I've dealt with have taken a few hours to bang out an
> > implementation of (except TIFF; implementations of TIFF are never
> > finished...)
>
> For you and me, yes. For the average business programmer? I disagree.
> We're talking about the kind of people who spend most of their day in
> Visual Basic.
But THEY don't even want XML; they probably don't find wandering a DOM tree
any more friendly than calling whatever passes for Perl's "pack" and "unpack"
in VB. They are the people who want to just have magic serialisation from
data structures to strings of bytes.
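And Java, to pick on it, already ships exactly that magic: tag a class Serializable and the runtime does the byte-level work. A quick sketch (Invoice and the helper methods are mine, not anything from this thread):

```java
import java.io.*;

public class MagicSerialisation {
    // A made-up business record; implementing Serializable is the
    // whole of the work the programmer has to do.
    static class Invoice implements Serializable {
        int number;
        String customer;
        Invoice (int number, String customer) {
            this.number = number;
            this.customer = customer;
        }
    }

    // Any Serializable object goes to a string of bytes...
    static byte[] toBytes (Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream ();
        try (ObjectOutputStream out = new ObjectOutputStream (bytes)) {
            out.writeObject (o);
        }
        return bytes.toByteArray ();
    }

    // ...and comes back out again, no schema or DOM in sight.
    static Object fromBytes (byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream (new ByteArrayInputStream (data))) {
            return in.readObject ();
        }
    }

    public static void main (String[] args) throws Exception {
        Invoice original = new Invoice (42, "Alaric");
        Invoice copy = (Invoice) fromBytes (toBytes (original));
        System.out.println (copy.number + " " + copy.customer);
    }
}
```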
> > No it's not... I've got quite a few custom protocols I put together
> > lurking around my systems.
>
> If it is running on YOUR SYSTEMS then it isn't deployed in the sense I
> mean. I'm asking have you ever tried to deploy a protocol that would
> have dozens of independent implementations and thousands of users?
> That's _really hard_ and many good protocols never make the leap.
That's purely a problem of adoption in the protocol marketplace, not
difficulty of development.
But I'm in a little WG right now developing a protocol to replace IMAP, and I
helped out a bit in developing the PNG image file format (which has dozens of
independent implementations and zillions of users). I wasn't there at the
beginning, but they did pretty much what I'd have done if I had been (he says,
modestly).
PNG is a nice example of a file format. It's extensible by third parties to
create their own specialist metadata that can just be ignored by applications
that don't understand it. Better than XML, those extra chunks (sort of like
tags) can be marked with metadata about how applications that don't
understand them should handle them.
1) The chunk might be something like an indication that the image data is
compressed in some bizarre new way. In which case, it is marked so that
applications that don't understand it are forced to reject the file.
2) If not, the chunk still might be something like a thumbnail image or a
histogram of the colours used in the image; if a processor changes the image
but doesn't understand this chunk, it should remove it, since it won't be up
to date once the image has changed.
3) Finally, the chunk should just be ignored, and left untouched if the image
is altered: stuff like copyright notices and so on.
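PNG actually encodes those behaviours in the chunk type itself, using the case of its four letters. A sketch of reading the two relevant flags (the method names are mine; the bit positions are from the PNG spec):

```java
public class PngChunkFlags {
    // An upper-case first letter marks a critical chunk: case 1 above,
    // reject the file if you don't understand it.
    static boolean isCritical (String chunkType) {
        return Character.isUpperCase (chunkType.charAt (0));
    }

    // For ancillary chunks, a lower-case fourth letter means "safe to
    // copy" even if the image is edited (case 3); upper-case means drop
    // it when the image data changes (case 2).
    static boolean isSafeToCopy (String chunkType) {
        return Character.isLowerCase (chunkType.charAt (3));
    }

    public static void main (String[] args) {
        System.out.println (isCritical ("IHDR"));   // true: must be understood
        System.out.println (isSafeToCopy ("tEXt")); // true: copy even if edited
        System.out.println (isSafeToCopy ("hIST")); // false: histogram goes stale
    }
}
```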
> > ...but I've also produced a few RPC protocols. Let's see if any of them
> > are lying around... hmmm... not handy but take a look in
> > /usr/include/rpcsvc on a Unix box for a few.
>
> I turn off most of the RPC protocols on any boxes I maintain.
They'll still be in there :-)
> > But it's still a list of objects, perhaps with a semantic tree such as an
> > object grouping / containment hierarchy and maybe with layers. Your in
> > memory structure *has* to have that or else it's discarding information
> > it'll need when it comes to saving the file again (dedicated readers that
> > know they only need a subset of the information are a different matter,
> > though). It may overlay that tree with a lookup index, but that tree will
> > still be there...
>
> There can be many lossless isomorphic data structures optimized for
> different access modes. You yourself said that Quickbooks could move
> from whatever format it uses today to a SQL/relational representation of
> the same information. Presumably you didn't mean that they should lose
> information.
Nope, because it's the same model still, just implemented differently. Going
from a linked list of C structs to the result of an SQL "SELECT * FROM
PurchaseOrderLines WHERE poNumber = <foo>" isn't a change of data structure,
just a change of implementation. Indeed, in SQL interfaces I've written for
suitably dynamic languages, where I can throw together a 'struct' type at
runtime from the result of an SQL query, the linked list and the result set
both support an interface like Java's Iterator, since they are the same data
structure.
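To sketch that last point in Java terms (the names are mine, and the JDBC ResultSet wrapper is left out so this runs without a database): both row sources can sit behind one and the same iteration interface.

```java
import java.util.*;

public class RowIteration {
    // Any source of rows - a linked list of structs, or a wrapped SQL
    // result set - can be consumed through the same Iterable interface.
    static int countRows (Iterable<Map<String, String>> rows) {
        int count = 0;
        for (Map<String, String> row : rows)
            count++;
        return count;
    }

    public static void main (String[] args) {
        // The linked-list flavour; a ResultSet-backed Iterable would
        // plug into countRows () identically.
        List<Map<String, String>> poLines = new LinkedList<> ();
        poLines.add (Map.of ("poNumber", "1", "item", "Cheese"));
        poLines.add (Map.of ("poNumber", "1", "item", "Yoghurt"));
        System.out.println (countRows (poLines));
    }
}
```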
> > 4) A table of information, SQL stylee
> >
> > There's a fair few indexing schemes that can be applied here, but it's
> > still a table; an ordered multiset of tuples. (a pure relation in the
> > mathematical sense is an unordered set of tuples).
>
> "Table" is not a datastructure. There are many implementations of tables.
There are many implementations of Set, List, etc, too... see java.util for
details :-)
> > Not that much information is hierarchical, certainly by bulk...
>
> So you're saying that you have information that XML isn't really
> designed for and XML isn't really that helpful for it. Are you arguing
> just for the pleasure of it?
No, I'm arguing against your claim that XML is suitable for far more than I
think it is!
Just to reverse positions, I see XML as useful for marking up text... but
it's not well fitted for data. As far as I can tell there's been a conceptual
bleed from:
This is <emphasis>tasty</emphasis> cheese
(taking a string of text then slipping a few tags in here and there to add
abstract style information)
to:
<document>
<title>Alaric's Cheese Bible</title>
<para>...</para>
</document>
(extending that to expressing metadata, but still expressing the metadata in
terms of abstract styles; remove all the tags and you get:
Alaric's Cheese Bible
...paragraphs...
<title> is really just an abstract style with a schema constraint added that
you can only have one, and that it must come first in the document).
to:
<cheese>
<name>Cheddar</name>
<colour>Yellowish</colour>
<price currency="UKP" unit="kg">2.50</price>
</cheese>
...without enough stopping to think if it's a good idea.
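(And indeed the cheese markup is just a record; a Java-ish sketch, with the field names lifted from the elements and attributes above and the class itself being my invention:)

```java
public class Cheese {
    // Fields taken from the element and attribute names in the XML above.
    String name = "Cheddar";
    String colour = "Yellowish";
    String priceCurrency = "UKP";
    String priceUnit = "kg";
    double price = 2.50;

    public static void main (String[] args) {
        Cheese c = new Cheese ();
        System.out.println (c.name + " is " + c.colour);
    }
}
```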
I'm not even sure if the latter was what the W3C intended when coming up with
XML; the introduction at http://www.w3c.org/XML/ states that it was designed
for publishing, not data transfer, but it's coming to be used for data
transfer anyway.
Is this just people noticing that something should be used for more than it
was intended for (which I'm suspicious enough of :-), or is it people
misapplying something out of foolish over-optimism?
Whose idea *was* it to use XML for data interchange? The W3C seems to disavow
responsibility in the first paragraph of that introduction. But somebody
somewhere made a mental leap from "styling a human-readable document" to
"data transfer". There are gray areas between the two, since an invoice might
well be considered to need to be both a readable document and a piece of
data, but nobody seems to be putting <?xml-stylesheet?> PIs in their XML
purchase orders, do they?
> > email,faveFoods
> > "alaric@alaric-snell.com","Cheese"+"Yoghurt"+"Pizza"
>
> And what about recursive hiearchies? You can keep hacking the CSV format
> but eventually you'll reach a point of diminishing returns.
Some might argue that XML is also being hacked to a point of diminishing
returns :-)
> Paul Prescod
ABS
--
Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
Heed the path that led me to that place, Yellow desert screen