The pointer concept you say XML lacks is covered at the local (i.e. document) level by IDs and at the global (multi-document) level by XPointer (if it ever ships).
-----Original Message-----
From: Alaric Snell [mailto:alaric@alaric-snell.com]
Sent: Sat 1/18/2003 10:03 AM
To: Sean McGrath; xml-dev@lists.xml.org
Cc:
Subject: Re: [xml-dev] Some thoughts on 'direct access' to XML (long)
On Saturday 18 January 2003 10:54 am, Sean McGrath wrote:
> If you do that you have - perhaps without realizing it - made your XML
> significantly
> less useful. Your XML has become process specific. Change your object
> structure (because of requirements changes or bugs) and all your
> serializations instantly turn into legacy. Interop with other systems
> is no better than it would have been with Java serialized objects,
> Python pickles, marshalled CORBA objects etc.
I disagree here on one level.
There is nothing really that different between XML, Java serialisation,
Python pickles, and especially marshalled CORBA.
In all cases, you have some defined data model - XML's data model (like it or
not!) being either 'bits of text wrapped in elements and attributes etc etc'
or the whole PSVI thing, depending. Java serialisation's data model is based
around a few basic types - chars, ints, longs, booleans, etc - plus a single
constructed type, the object, which is nowt more than a load of named
fields with values. It has a concept of pointer which XML lacks. Java
serialisation has nothing whatsoever to do with *methods* or *code*, however;
it stands up without that. The only implementation widely in use happens to
be written in Java and happens to parse the serialised stuff by constructing
Java objects, mind, and the de facto 'schema' language for Java serialisation
happens to be Java classes (although the serialisation stuff ignores all the
methods and just looks at class names, inheritance, and the definitions of
fields).
You can put stuff in the Java methods to override serialisation behaviour,
true, but that's just part of the particular serialisation algorithm the
aforementioned Java implementation uses.
There's nothing stopping you writing a C library to read and write Java
serialised object files. You'd use it something like this:
/* Hypothetical API: no such C library is assumed to exist. */
FILE *fp = fopen ("test.ser", "rb");   /* binary mode: the stream is not text */
Java *j = parseJavaObject (fp);
Java *j2 = getJavaObjectField (j, "firstName");
JavaCharArray *name = getJavaCharArrayField (j2, "_data");
printf ("First name: %s\n", convertJavaCharArrayToCString (name));
destroyJavaObject (j);
fclose (fp);
I'm guessing off of the top of my head that the serialised form of a Java
String object has an internal character array called _data; this is defined
in some spec somewhere but I don't have it to hand.
Anyway - it's not really any different to talking PSVI, say.
CORBA marshalling is even more so; whereas Java serialisation is designed to
be able to easily freeze the state of Java objects, and as such has a type
system suspiciously related to Java's, CORBA was designed around a new data
model from scratch, like ASN.1, that's intended to just be good for modelling
information.
The difference between XML/CORBA/BER/PER/DER/etc on one hand, as
language-independent data modelling systems, and Java serialisation / Python
pickling / etc is really only down to two things:
1) The data model being designed to map seamlessly into a particular
language's model
2) Nobody spending the effort writing other implementations
Now, where I work, we were storing serialised Java in a database (for various
reasons, mainly to do with getting around inflexibilities in SQL's type
system). However, we wanted to offload certain operations to the database
server as a stored procedure written in C.
We were already using a modified serialisation mechanism to avoid space
inefficiencies in Java's serialisation, but I believe it would have been just
as easy using straight Java (although I'd have had to hunt down the
serialisation specs first rather than just having our specs for our own
format lying around); but we wrote a C interface to our serialisation format,
and we are now performing processing on these Java objects from C, and we're
happy! We can't call the Java methods from C, but we can access all the
object's fields...
> 7. Programming languages can and should move past SAX/DOM for
> accessing XML. For pure document processing, they both have
> their place but for Objects and Records (as the terms are used
> in mainstream programming), they are sorely lacking. I believe
> it is entirely possible to make the programmer's life easy
> WITHOUT turning BOXED XML into a basket of object-serialization
> technologies.
I'd agree with you there, though, but only because I don't think that
particular serialisation techniques really impinge that much on the formats
of the underlying bit streams. Put it this way, Java serialisation, my own
custom compact Java serialisation, and XML all have, to a significant extent,
the following structure for a 'compound thing':
- Some kind of thing-type identifier
- Maybe a length count here
- A list of things that are nested inside the current thing
- If we didn't have a length count, an end-thing marker
Then a notion of one or more 'non-compound things' which are marked as such
in their type identifier and/or by context, and have the same basic format
but something else instead of the 'list of things'.
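That shared shape can be sketched in a few lines of C. This is a made-up toy encoding, not Java serialisation and not the custom format described in this post: every thing is one tag byte, one length byte, then a payload, and a compound thing's payload is simply a run of nested things. The names TAG_COMPOUND, TAG_TEXT, put_text, begin_compound, end_compound and count_leaves are all invented for illustration, and the reader assumes a well-formed buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy encoding: [tag byte][length byte][payload].
   A compound thing's payload is a run of nested things. */
enum { TAG_COMPOUND = 1, TAG_TEXT = 2 };

/* Append a non-compound thing; returns the new write offset. */
size_t put_text(unsigned char *buf, size_t off, const char *s)
{
    size_t n = strlen(s);
    assert(n <= 255);                 /* one-byte length count */
    buf[off] = TAG_TEXT;
    buf[off + 1] = (unsigned char)n;
    memcpy(buf + off + 2, s, n);
    return off + 2 + n;
}

/* Begin a compound thing at 'off'; its length is patched in later
   by end_compound(). Returns the offset where children start. */
size_t begin_compound(unsigned char *buf, size_t off)
{
    buf[off] = TAG_COMPOUND;
    buf[off + 1] = 0;
    return off + 2;
}

/* Close the compound that starts at 'start'; 'off' is the current
   write offset, i.e. one past the last nested thing. */
void end_compound(unsigned char *buf, size_t start, size_t off)
{
    size_t n = off - (start + 2);
    assert(n <= 255);
    buf[start + 1] = (unsigned char)n;
}

/* Count the non-compound things nested anywhere inside the thing
   at 'buf', walking children by their length counts. */
size_t count_leaves(const unsigned char *buf)
{
    size_t len = buf[1], pos = 0, leaves = 0;
    if (buf[0] != TAG_COMPOUND)
        return 1;
    while (pos < len) {
        const unsigned char *child = buf + 2 + pos;
        leaves += count_leaves(child);
        pos += 2 + (size_t)child[1];  /* header plus payload */
    }
    return leaves;
}
```

Swapping the length byte for an end marker would give the other variant in the list above; the reader would then have to scan each child to find where it stops.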
In my serialisation format, the thing-type identifier is a single byte with a
specified list of values. There can never be more than 256 possible types,
you might think, but type zero is 'user class' (followed by the class name)
while all the others are things like backpointers to previously serialised
objects, end markers, all the basic Java types (int, long, boolean, etc) plus
a host of the java.util.* collection classes which I special-case the
encoding of to make them more compact.
In XML, the thing-type is a QName; there's a slight caveat in that the nested
things might be attributes or child elements, and non-compound things are
PIs, comments, CDATAs, etc.
In Java serialisation, from vague memory, the thing-type ID is a single byte
for basic types and back-pointers, or L followed by a class name for
everything else, and there's a special marker that goes after the last field
of an object to mark the end rather than a length code. I used length codes
in my serialisation thingy because the C code does some fast xpath-esque
stuff where it pulls out a single field or two from an object without
constructing the entire thing as a tree in memory.
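To illustrate why length codes help here, the following is a minimal sketch rather than the actual format or C code described above. It assumes a hypothetical flat layout in which each field is stored as a name length, the name, a value length and the value; put_field and find_field are invented names, and the lookup assumes a well-formed buffer (no truncation checks).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout, one field after another:
   [name length][name bytes][value length][value bytes] */

/* Append one field; returns the new write offset. */
size_t put_field(unsigned char *buf, size_t off,
                 const char *name, const char *value)
{
    size_t nlen = strlen(name), vlen = strlen(value);
    assert(nlen <= 255 && vlen <= 255);   /* one-byte length counts */
    buf[off++] = (unsigned char)nlen;
    memcpy(buf + off, name, nlen);
    off += nlen;
    buf[off++] = (unsigned char)vlen;
    memcpy(buf + off, value, vlen);
    return off + vlen;
}

/* Find a field by name, stepping over every other value unread.
   Returns a pointer to the value bytes and sets *vlen,
   or returns NULL if no field has that name. */
const unsigned char *find_field(const unsigned char *buf, size_t size,
                                const char *name, size_t *vlen)
{
    size_t want = strlen(name), pos = 0;
    while (pos < size) {
        size_t nlen = buf[pos];
        const unsigned char *fname = buf + pos + 1;
        size_t len = buf[pos + 1 + nlen];
        const unsigned char *value = fname + nlen + 1;
        if (nlen == want && memcmp(fname, name, want) == 0) {
            *vlen = len;
            return value;
        }
        pos += 1 + nlen + 1 + len;   /* length codes let us skip the value */
    }
    return NULL;
}
```

With an end marker instead of a length code, find_field would have to parse each unwanted value just to find where the next field begins.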
If Java serialisation were a bit more compact and they had libraries for
reading it from C, we might have used it as is. If it was still large but had
C libraries I'd have been sorely tempted, and if it was compact but lacked C
libraries I'd have just written the C libraries...
ABS
--
Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
Heed the path that led me to that place, Yellow desert screen