xml-dev - Re: [xml-dev] Some thoughts on 'direct access' to XML (long)

To: Sean McGrath <sean.mcgrath@propylon.com>, xml-dev@lists.xml.org
Subject: Re: [xml-dev] Some thoughts on 'direct access' to XML (long)
From: Alaric Snell <alaric@alaric-snell.com>
Date: Sat, 18 Jan 2003 18:03:32 +0000
In-reply-to: <5.1.0.14.0.20030118104220.02392eb0@mail.propylon.com>
References: <5.1.0.14.0.20030118104220.02392eb0@mail.propylon.com>

On Saturday 18 January 2003 10:54 am, Sean McGrath wrote:

> If you do that you have - perhaps without realizing it - made your XML
> significantly
> less useful. Your XML has become process specific. Change your object
> structure (because of requirements changes or bugs) and all your
> serializations instantly turn into legacy. Interop with other systems
> is no better that it would have been with Java serialized objects,
> Python pickles, marshalled CORBA objects etc.

I disagree here on one level.

There is nothing really that different between XML, Java serialisation, 
Python pickles, and especially marshalled CORBA.

In all cases, you have some defined data model. XML's data model (like it or 
not!) is either 'bits of text wrapped in elements and attributes etc etc' 
or the whole PSVI thing, depending. Java serialisation's data model is based 
around a few basic types - chars, ints, longs, booleans, etc - plus a single 
constructed type, the object, which is nowt more than a load of named 
fields with values. It has a concept of pointer which XML lacks. Java 
serialisation has nothing whatsoever to do with *methods* or *code*, however; 
it stands up without that. The only implementation widely in use happens to 
be written in Java and happens to parse the serialised stuff by constructing 
Java objects, mind, and the de facto 'schema' language for Java serialisation 
happens to be Java classes (although the serialisation stuff ignores all the 
methods and just looks at class names, inheritance, and the definitions of 
fields).

You can put stuff in the Java methods to override serialisation behaviour, 
true, but that's just part of the particular serialisation algorithm the 
aforementioned Java implementation uses.

There's nothing stopping you writing a C library to read and write Java 
serialised object files. You'd use it something like this:

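/* Hypothetical API: none of these functions exist yet. */
FILE *fp = fopen ("test.ser", "rb");  /* binary mode: the stream is not text */
Java *j = parseJavaObject (fp);

/* firstName is itself an object (a String), so fetch it as one */
Java *j2 = getJavaObjectField (j, "firstName");
/* _data is a char array, so use the char-array getter, not the int one */
JavaCharArray *name = getJavaCharArrayField (j2, "_data");

printf ("First name: %s\n", convertJavaCharArrayToCString (name));

destroyJavaObject (j);
fclose (fp);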

I'm guessing off of the top of my head that the serialised form of a Java 
String object has an internal character array called _data; this is defined 
in some spec somewhere but I don't have it to hand.
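
As a small concrete starting point for such a library: a standard Java object 
stream opens with a two-byte magic number, 0xACED, followed by a two-byte 
version, 0x0005, both big-endian. A minimal sketch of that first check in C 
follows. The function name is_java_serial_header is made up for illustration, 
and it looks at nothing beyond those four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if buf starts with the Java object
 * stream header (magic 0xACED, version 0x0005, big-endian), else 0. */
int is_java_serial_header(const uint8_t *buf, size_t len)
{
    if (buf == NULL || len < 4)
        return 0;
    uint16_t magic   = (uint16_t)((buf[0] << 8) | buf[1]);
    uint16_t version = (uint16_t)((buf[2] << 8) | buf[3]);
    return magic == 0xACED && version == 0x0005;
}
```

A real parser would carry on from byte 4 into the stream contents; this only 
shows that the first step needs nothing Java-specific.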

Anyway - it's not really any different to talking PSVI, say.

CORBA marshalling is even more of a case in point: whereas Java serialisation 
is designed to make it easy to freeze the state of Java objects, and as such 
has a type system suspiciously related to Java's, CORBA was designed around a 
new data model from scratch, like ASN.1, that's intended to just be good for 
modelling information. 

The difference between XML/CORBA/BER/PER/DER/etc on one hand, as 
language-independent data modelling systems, and Java serialisation / Python 
pickling / etc on the other is really only down to two things:

1) The data model being designed to map seamlessly into a particular 
language's model

2) Nobody spending the effort writing other implementations

Now, where I work, we were storing serialised Java in a database (for various 
reasons, mainly to do with getting around inflexibilities in SQL's type 
system). However, we wanted to offload certain operations to the database 
server as a stored procedure written in C.

We were already using a modified serialisation mechanism to avoid space 
inefficiencies in Java's serialisation, but I believe it would have been just 
as easy using straight Java (although I'd have had to hunt down the 
serialisation specs first rather than just having our specs for our own 
format lying around); but we wrote a C interface to our serialisation format, 
and we are now performing processing on these Java objects from C, and we're 
happy! We can't call the Java methods from C, but we can access all the 
object's fields...

> 7. Programming languages can and should move past SAX/DOM for
> accessing XML. For pure document processing, they both have
> their place but for Objects and Records (as the terms are used
> in mainstream programming), they are sorely lacking. I believe
> it is entirely possible to make the programmers life easy
> WITHOUT turning BOXED XML into basket of object-serialization
> technologies.

I'd agree with you there, though, but only because I don't think that 
particular serialisation techniques really impinge that much on the formats 
of the underlying bit streams. Put it this way, Java serialisation, my own 
custom compact Java serialisation, and XML all have, to a significant extent, 
the following structure for a 'compound thing':

 - Some kind of thing-type identifier 
 - Maybe a length count here
 - A list of things that are nested inside the current thing
 - If we didn't have a length count, an end-thing marker

Then a notion of one or more 'non-compound things' which are marked as such 
in their type identifier and/or by context, and have the same basic format 
but something else instead of the 'list of things'.
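
The shape described above can be sketched as a tiny recursive walker in C. 
Everything concrete here is an assumption made for the sketch: the three tag 
values, the one-byte leaf length, and the name count_leaves. It uses the 
end-marker variant for compound things and a length count for leaves, so both 
options from the list appear once.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag values, chosen only for this sketch. */
enum { TAG_END = 0x00, TAG_COMPOUND = 0x01, TAG_LEAF = 0x02 };

/* Walks one "thing" starting at *pos and counts the leaves inside it.
 * Compound things use an end marker rather than a length count;
 * leaves carry a one-byte length followed by that many payload bytes.
 * Returns the leaf count, or -1 on malformed or truncated input. */
int count_leaves(const uint8_t *buf, size_t len, size_t *pos)
{
    if (*pos >= len)
        return -1;
    uint8_t tag = buf[(*pos)++];

    if (tag == TAG_LEAF) {
        if (*pos >= len)
            return -1;
        size_t n = buf[(*pos)++];
        if (len - *pos < n)
            return -1;
        *pos += n;                 /* skip the payload */
        return 1;
    }
    if (tag == TAG_COMPOUND) {
        int total = 0;
        for (;;) {
            if (*pos >= len)
                return -1;         /* ran out before the end marker */
            if (buf[*pos] == TAG_END) {
                (*pos)++;          /* consume the end marker */
                return total;
            }
            int n = count_leaves(buf, len, pos);
            if (n < 0)
                return -1;
            total += n;
        }
    }
    return -1;                     /* unknown tag or stray end marker */
}
```

Swap the tag byte for a QName and the end marker for a closing tag and the 
same loop describes an XML element.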

In my serialisation format, the thing-type identifier is a single byte with a 
specified list of values. You might think that limits it to 256 possible 
types, but type zero is 'user class' (followed by the class name), while all 
the others are things like backpointers to previously serialised objects, end 
markers, all the basic Java types (int, long, boolean, etc) plus a host of 
the java.util.* collection classes which I special-case the encoding of to 
make them more compact.

In XML, the thing-type is a QName. There's a slight caveat in that the nested 
things might be attributes or child elements, and the non-compound things are 
PIs, comments, CDATA sections, etc.
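
A rough sketch in C of reading such a type identifier. Only "tag zero means 
user class, followed by the class name" comes from the description above; the 
other tag values, the one-byte length prefix on the class name, and the name 
read_type_id are placeholders invented for this example, not the real format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tag 0 meaning "user class" is from the description above; every
 * other value here is an invented placeholder. */
enum {
    TAG_USER_CLASS  = 0,  /* followed by the class name */
    TAG_BACKPOINTER = 1,  /* assumed value */
    TAG_END_MARKER  = 2,  /* assumed value */
    TAG_INT         = 3,  /* assumed value */
    TAG_BOOLEAN     = 4   /* assumed value */
};

/* Reads a type identifier at buf[*pos]. For a user class, copies the
 * class name (assumed encoding: one length byte, then that many bytes)
 * into name as a C string. For built-in tags, name becomes "".
 * Returns the tag, or -1 if the input is truncated or name is too small. */
int read_type_id(const uint8_t *buf, size_t len, size_t *pos,
                 char *name, size_t name_size)
{
    if (*pos >= len || name_size == 0)
        return -1;
    int tag = buf[(*pos)++];
    name[0] = '\0';

    if (tag == TAG_USER_CLASS) {
        if (*pos >= len)
            return -1;
        size_t n = buf[(*pos)++];
        if (len - *pos < n || n >= name_size)
            return -1;
        memcpy(name, buf + *pos, n);
        name[n] = '\0';
        *pos += n;
    }
    return tag;
}
```

The escape through tag zero is what keeps the common cases at one byte while 
still allowing an unbounded number of user-defined types.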

In Java serialisation, from vague memory, the thing-type ID is a single byte 
for basic types and back-pointers, or L followed by a class name for 
everything else, and there's a special marker that goes after the last field 
of an object to mark the end rather than a length code. I used length codes 
in my serialisation thingy because the C code does some fast xpath-esque 
stuff where it pulls out a single field or two from an object without 
constructing the entire thing as a tree in memory.
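
To show why the length codes matter for that kind of access, here is a 
minimal C sketch of pulling one field out of a flat run of fields. The record 
layout (one-byte name length, name, two-byte big-endian value length, value) 
and the name find_field are assumptions for illustration and not the actual 
format described here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout for this sketch, one record per field:
 *   name length (1 byte) | name | value length (2 bytes, big-endian) | value
 * Scans buf for the named field, using the length codes to jump over
 * every other value without decoding it. On success returns a pointer
 * to the value bytes and stores their count in *value_len; returns
 * NULL if the field is absent or the input is truncated. */
const uint8_t *find_field(const uint8_t *buf, size_t len,
                          const char *wanted, size_t *value_len)
{
    size_t wanted_len = strlen(wanted);
    size_t pos = 0;

    while (pos < len) {
        size_t name_len = buf[pos++];
        if (len - pos < name_len + 2)
            return NULL;
        const uint8_t *name = buf + pos;
        pos += name_len;

        size_t vlen = ((size_t)buf[pos] << 8) | buf[pos + 1];
        pos += 2;
        if (len - pos < vlen)
            return NULL;

        if (name_len == wanted_len && memcmp(name, wanted, name_len) == 0) {
            *value_len = vlen;
            return buf + pos;
        }
        pos += vlen;   /* not the one we want: skip the whole value */
    }
    return NULL;
}
```

With end markers instead of lengths, the skip on the last line would have to 
parse its way through the value to find where it stops.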

If Java serialisation were a bit more compact and they had libraries for 
reading it from C, we might have used it as is. If it was still large but had 
C libraries I'd have been sorely tempted, and if it was compact but lacked C 
libraries I'd have just written the C libraries...

ABS

-- 
Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
Heed the path that led me to that place, Yellow desert screen

References:
- Some thoughts on 'direct access' to XML (long)
  - From: Sean McGrath <sean.mcgrath@propylon.com>