On Fri, 25 Oct 2002, Paul Prescod wrote:
> >><!ELEMENT purchaseOrder (buyer, seller, ...)>
> > That agrees nothing!
> It agrees that the buyer precedes the seller and both go within the
It *states* that, doesn't mean anyone *agrees* :-)
But it doesn't say whether the buyer or seller are denoted as URIs,
numeric ids, or string names, or whatever... the point I'm getting at is
that you need more than just a DTD. It's nice to have a standard way of
writing parts of the standard, but you still need to write up a lot of
other stuff about the meanings of things and so on.
> > Do you deny that groups of people get together to produce things like XHTML,
> > and vertical industry message formats? Because they do. The presence of DTDs
> > and schemas doesn't remove this requirement.
> DTDs and schemas give you a structure for _expressing_ your agreement in
> a human and machine-readable manner.
Yep! And that's all there is to it.
> The difference between
> a) installing a schema and reading its surrounding semantic
> documentation and
> b) reading a spec for a binary file format is massive.
> So massive, in fact, that the XML project is within the means of the
> average business programmer and the binary project is not.
No way. Have you ever looked at a spec for a binary file format? Most of
the ones I've dealt with have taken a few hours to bang out an
implementation of (except TIFF; implementations of TIFF are never
> > The Internet protocols don't have a formal schema notation, they're just
> > defined in English in RFCs. And they're more widespread than XML, I reckon;
> > it hasn't harmed them, has it?
> Have you ever tried to deploy a new Internet protocol?
Actually, yes :-)
> It is near impossible. That's why there are so few widely deployed ones.
No it's not... I've got quite a few custom protocols I put together
lurking around my systems. The implementation of the latest one is quite
SERVER - run from a cron job at x pm:
  pg_dump <details of database> | mcrypt -e <key> | nc -l -p xxxx
CLIENT - run from a cron job at x:05 pm (to allow for clock skew):
  nc server xxxx | mcrypt -d <key> > database.dump
...but I've also produced a few RPC protocols. Let's see if any of them
are lying around... hmmm... not handy but take a look in
/usr/include/rpcsvc on a Unix box for a few. I've also put together a
replacement for RMI that's a bit less tightly bound (the default RMI
implementation is somewhat fragile!).
> Now compare that effort to deploying a new XML vocabulary. Sure, non-XML
> formats can become popular, as informally specified protocols can become
> popular. The question is how much effort it takes. This effort greatly
> impacts the _likelihood_ of the format/protocol gaining popularity.
It's not that hard :-) Try it!
> > Stop and think about that. What differentiates these products, hmm? Do I buy
> > Corel because it uses a different in-memory data structure to xfig?
> Insofar as people buy products in part for their performance, the answer
> is definitely YES. In particular, I find it absurd that you would argue
> that SAP and Quickbooks should use the same data structures despite the
> fact that one runs relational-backed enterprises and the other
> Windows-hosted small businesses. SAP _could not_ get away with using the
> same datastructures that QuickBooks does.
Why not? QB could embed a small SQL server - you can get in-memory SQL
servers - and use an identical table layout... if there are enough
differences between the small business and large business *models* then
you're comparing apples and oranges anyway.
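To make that concrete, here's a sketch using sqlite3's in-memory mode as a stand-in for whatever small embeddable SQL engine QB might ship; the table and column names are invented for illustration, and the point is only that the same layout works unchanged whether the engine behind it is tiny or enterprise-sized:

```python
import sqlite3

# Sketch: the same table layout in an embedded in-memory engine
# (sqlite3 here, standing in for any small embeddable SQL server)
# as it would be in a big relational back end. Schema is made up.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE purchase_order (
        id     INTEGER PRIMARY KEY,
        buyer  TEXT NOT NULL,
        seller TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO purchase_order (buyer, seller) VALUES (?, ?)",
    ("Smallbiz Ltd", "Acme Corp"),
)
rows = conn.execute("SELECT buyer, seller FROM purchase_order").fetchall()
print(rows)  # [('Smallbiz Ltd', 'Acme Corp')]
```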
> > Yep, just because I know more about bitmap file formats - overall, we are
> > discussing data interchange in general; you brought up vector files as an
> > example, I bring up bitmap files.
> You well know that almost nobody proposes to use XML for bitmaps.
No, but that's not the point! It's just an area of file formats that I
happen to know lots about, having implemented most of the common ones.
> Furthermore, vector graphics provide many more opportunities for
> optimization based on intelligent choice of data structures. I have a
> friend who built a commercially successful graphics program around a
> _single_ proprietary vector graphics algorithm/datastructure pair. It
> allowed certain kinds of scaling that were impossible with the more
> traditional algorithms. And in fact this is a very common case in the 3D
> graphics world.
But it's still a list of objects, perhaps with a semantic tree such as an
object grouping / containment hierarchy, and maybe with layers. Your
in-memory structure *has* to have that or else it's discarding information
it'll need when it comes to saving the file again (dedicated readers that
know they only need a subset of the information are a different matter,
though). It may overlay that tree with a lookup index, but that tree will
still be there...
> > "...into memory as a character array..."
> > Not a *byte* array!
> So the on-disk representation is _different than_ the in-memory
Depends what level you look at - sure, most machines use magnetic disks as
opposed to electronic memories these days :-) But in Java it's all just
characters to the programmer's level. This is drifting off the original
point, though; I was not aiming at *bit* equivalence but *structure*
equivalence. You were arguing against things that automatically map from
your XML or binary data to, say, Java objects, since you thought you'd want
a different structure in memory than on disk; I maintain that you rarely if
ever want to do more than add extra indexing.
Let's look at some examples.
1) Vector graphics, although not my speciality
What is the data model? Usually some kind of space partition tree with the
leaf nodes being one of a set of primitives. The tree can either be
semantically based - the grouping of objects into larger objects - or a
more rigid thing like a BSP tree or an octree that is used to optimise
certain lookup operations. In the former case you'll need that structure
to exist both on disk and in memory, since otherwise semantically
meaningful information will be lost. In memory you might add a lookup
table from object ID to the actual object, to avoid having to walk the
tree to find arbitrary objects.
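A minimal sketch of that case (class and field names invented): the grouping tree is the part that must survive a save/load round trip, while the ID index is derived, in-memory-only overlay data:

```python
# Vector-graphics sketch: a grouping/containment tree (saved to disk)
# plus an in-memory-only index from object ID to node, so lookups
# needn't walk the tree. Names are hypothetical.

class Node:
    def __init__(self, obj_id, children=None):
        self.obj_id = obj_id
        self.children = children or []   # semantic grouping, kept on disk

def build_index(root):
    """Overlay: map object ID -> node. Derived data, never saved."""
    index = {}
    stack = [root]
    while stack:
        node = stack.pop()
        index[node.obj_id] = node
        stack.extend(node.children)
    return index

doc = Node("group1", [Node("circle7"), Node("rect3")])
index = build_index(doc)
print(index["rect3"].obj_id)  # rect3
```

Throw the index away on save, rebuild it on load; the on-disk and in-memory *structures* stay identical.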
2) Plain text
The only model for this other than a list of characters that I've seen is
a list of lines, each of which is a list of characters - and a slightly
funny one involving two lists of characters which is used in some editor
implementations. Either way you still have the same underlying sequence;
you just break it up in various ways into easier-to-handle chunks.
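The two-list editor representation mentioned above - essentially what's usually called a gap or split buffer, if I've read it right - can be sketched like this: characters before the cursor in one list, characters after it in another (stored reversed so both ends are cheap to push and pop):

```python
# Sketch of the "two lists of characters" editor model: the underlying
# character sequence is unchanged, only the chunking differs.

class SplitBuffer:
    def __init__(self, text=""):
        self.before = list(text)  # chars left of the cursor
        self.after = []           # chars right of the cursor, reversed

    def insert(self, ch):
        self.before.append(ch)    # insertion at the cursor is O(1)

    def move_left(self):
        if self.before:
            self.after.append(self.before.pop())

    def text(self):
        return "".join(self.before) + "".join(reversed(self.after))

buf = SplitBuffer("helo")
buf.move_left()      # cursor now between 'hel' and 'o'
buf.insert("l")
print(buf.text())    # hello
```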
3) Bitmapped images
These always come down to 'some metadata' and 'a 2D array of pixel
values'. The most variation is in the metadata; applications will tend to
have a native model which is identical to the metadata system of their
favourite format and map other formats in and out of that.
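That mapping step is about as simple as it sounds; a sketch, with all the key names invented for illustration:

```python
# Bitmap sketch: 'some metadata' plus a 2D array of pixel values, with
# a foreign format's metadata keys mapped onto the application's native
# model. All names here are hypothetical.

NATIVE_KEYS = {"width": "w", "height": "h", "dpi": "resolution"}

def to_native(foreign_meta):
    """Map a foreign format's metadata into the native model,
    dropping anything the native model has no slot for."""
    return {NATIVE_KEYS[k]: v for k, v in foreign_meta.items()
            if k in NATIVE_KEYS}

foreign = {"width": 2, "height": 2, "dpi": 72, "thumbnail": b"..."}
meta = to_native(foreign)
pixels = [[0, 255], [255, 0]]   # the 2D array of pixel values
print(meta)   # {'w': 2, 'h': 2, 'resolution': 72}
```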
4) A table of information, SQL stylee
There's a fair few indexing schemes that can be applied here, but it's
still a table; an ordered multiset of tuples. (A pure relation in the
mathematical sense is an unordered set of tuples.)
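And again the indexing schemes are just overlays on the tuples - the "add extra indexing" step and nothing more. A sketch, with an invented schema:

```python
# Table sketch: an ordered multiset of tuples, with an in-memory-only
# index overlaid. The tuples themselves are what goes on disk.

rows = [
    (1, "Smallbiz Ltd", "Acme Corp"),
    (2, "Bigcorp plc",  "Acme Corp"),
    (3, "Smallbiz Ltd", "Widgets Inc"),
]

# Derived index: seller -> row positions. Rebuilt on load, never saved.
by_seller = {}
for pos, row in enumerate(rows):
    by_seller.setdefault(row[2], []).append(pos)

buyers = [rows[i][1] for i in by_seller["Acme Corp"]]
print(buyers)   # ['Smallbiz Ltd', 'Bigcorp plc']
```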
> In my experience it is painful and obfuscatory to use CSV for
> hierarchical or linked information. But if you and your customers enjoy
> it, then I'm glad you're using what works for you.
Not that much information is hierarchical, certainly by bulk... it's fine
for links, though, since it's pretty much the SQL data model and it's easy
to have foreign keys.
We're going to add a more hierarchical structure in future (to allow some
fields to contain lists and tables); the jury's still out on the details
of that for interchange, but XML probably still won't be a great contender
since we'd ideally not have to change EVERYTHING about the file format
for a little thing like that. For now we'll probably go for something
...and just add another transition into the parser's state machine for the
'+' symbol after a closing quote, leading back to the state that comes
after a ',', with appropriate actions on the data buffer.
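As a guess at what that extra transition might look like (the reading of '+' as joining quoted values into an in-field list is my assumption, purely to illustrate the state machine - the real scheme is undecided above):

```python
# Hypothetical sketch: quoted CSV-ish fields, where a '+' after a
# closing quote loops back to the start-of-value state, so a field
# like "a"+"b" accumulates a list. Only the extra transition matters.

def parse_line(line):
    fields, buf, current = [], [], []
    state = "START"                # START, IN_QUOTE, AFTER_QUOTE
    for ch in line:
        if state == "START":
            if ch == '"':
                state = "IN_QUOTE"
        elif state == "IN_QUOTE":
            if ch == '"':
                current.append("".join(buf))
                buf = []
                state = "AFTER_QUOTE"
            else:
                buf.append(ch)
        elif state == "AFTER_QUOTE":
            if ch == "+":          # the new transition: back to START
                state = "START"
            elif ch == ",":        # field done; collapse singletons
                fields.append(current if len(current) > 1 else current[0])
                current = []
                state = "START"
    if current:
        fields.append(current if len(current) > 1 else current[0])
    return fields

print(parse_line('"x","a"+"b","y"'))   # ['x', ['a', 'b'], 'y']
```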
> Paul Prescod
Alaric B. Snell
http://www.alaric-snell.com/ http://RFC.net/ http://www.warhead.org.uk/
Any sufficiently advanced technology can be emulated in software