xml-dev - RE: [xml-dev] Some thoughts on 'direct access' to XML (long)


To: "Alaric Snell" <alaric@alaric-snell.com>,"Sean McGrath" <sean.mcgrath@propylon.com>,<xml-dev@lists.xml.org>
Subject: RE: [xml-dev] Some thoughts on 'direct access' to XML (long)
From: "Dare Obasanjo" <dareo@microsoft.com>
Date: Sat, 18 Jan 2003 10:37:16 -0800

The concept of a pointer you say XML lacks is solved at the local (i.e. document) level by IDs and at the global (multi-document) level by XPointer (if it ever ships). 

	-----Original Message----- 
	From: Alaric Snell [mailto:alaric@alaric-snell.com] 
	Sent: Sat 1/18/2003 10:03 AM 
	To: Sean McGrath; xml-dev@lists.xml.org 
	Cc: 
	Subject: Re: [xml-dev] Some thoughts on 'direct access' to XML (long)
	
	

	On Saturday 18 January 2003 10:54 am, Sean McGrath wrote:
	
	> If you do that you have - perhaps without realizing it - made your XML
	> significantly
	> less useful. Your XML has become process specific. Change your object
	> structure (because of requirements changes or bugs) and all your
	> serializations instantly turn into legacy. Interop with other systems
	> is no better than it would have been with Java serialized objects,
	> Python pickles, marshalled CORBA objects etc.
	
	I disagree here on one level.
	
	There is nothing really that different between XML, Java serialisation,
	Python pickles, and especially marshalled CORBA.
	
	In all cases, you have some defined data model. XML's data model (like it or
	not!) is either 'bits of text wrapped in elements and attributes etc etc'
	or the whole PSVI thing, depending. Java serialisation's data model is based
	around a few basic types - chars, ints, longs, booleans, etc - plus a single
	constructed type, the object, which is nowt more than a load of named
	fields with values. It has a concept of pointer which XML lacks. Java
	serialisation has nothing whatsoever to do with *methods* or *code*, however;
	it stands up without that. The only implementation widely in use happens to
	be written in Java and happens to parse the serialised stuff by constructing
	Java objects, mind, and the de facto 'schema' language for Java serialisation
	happens to be Java classes (although the serialisation stuff ignores all the
	methods and just looks at class names, inheritance, and the definitions of
	fields).
	
	You can put stuff in the Java methods to override serialisation behaviour,
	true, but that's just part of the particular serialisation algorithm the
	aforementioned Java implementation uses.
	
	There's nothing stopping you writing a C library to read and write Java
	serialised object files. You'd use it something like this:
	
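	FILE *fp = fopen ("test.ser", "rb");  /* binary mode: the stream is not text */
	Java *j = parseJavaObject (fp);
	
	/* The String field is itself an object holding a character array */
	Java *j2 = getJavaObjectField (j, "firstName");
	JavaCharArray *name = getJavaCharArrayField (j2, "_data");
	
	printf ("First name: %s\n", convertJavaCharArrayToCString (name));
	
	destroyJavaObject (j);
	fclose (fp);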
	
	I'm guessing off of the top of my head that the serialised form of a Java
	String object has an internal character array called _data; this is defined
	in some spec somewhere but I don't have it to hand.
	
	Anyway - it's not really any different to talking PSVI, say.
	
	CORBA marshalling is even more so; whereas Java serialisation is designed to
	be able to easily freeze the state of Java objects, and as such has a type
	system suspiciously related to Java's, CORBA was designed around a new data
	model from scratch, like ASN.1, that's intended to just be good for modelling
	information.
	
	The difference between XML/CORBA/BER/PER/DER/etc on one hand, as
	language-independent data modelling systems, and Java serialisation / Python
	pickling / etc on the other is really only down to two things:
	
	1) The data model being designed to map seamlessly into a particular
	language's model
	
	2) Nobody spending the effort writing other implementations
	
	Now, where I work, we were storing serialised Java in a database (for various
	reasons, mainly to do with getting around inflexibilities in SQL's type
	system). However, we wanted to offload certain operations to the database
	server as a stored procedure written in C.
	
	We were already using a modified serialisation mechanism to avoid space
	inefficiencies in Java's serialisation, but I believe it would have been just
	as easy using straight Java (although I'd have had to hunt down the
	serialisation specs first rather than just having our specs for our own
	format lying around); but we wrote a C interface to our serialisation format,
	and we are now performing processing on these Java objects from C, and we're
	happy! We can't call the Java methods from C, but we can access all the
	object's fields...
	
	> 7. Programming languages can and should move past SAX/DOM for
	> accessing XML. For pure document processing, they both have
	> their place but for Objects and Records (as the terms are used
	> in mainstream programming), they are sorely lacking. I believe
	> it is entirely possible to make the programmer's life easy
	> WITHOUT turning BOXED XML into a basket of object-serialization
	> technologies.
	
	I'd agree with you there, though, but only because I don't think that
	particular serialisation techniques really impinge that much on the formats
	of the underlying bit streams. Put it this way, Java serialisation, my own
	custom compact Java serialisation, and XML all have, to a significant extent,
	the following structure for a 'compound thing':
	
	 - Some kind of thing-type identifier
	 - Maybe a length count here
	 - A list of things that are nested inside the current thing
	 - If we didn't have a length count, an end-thing marker
	
	Then a notion of one or more 'non-compound things' which are marked as such
	in their type identifier and/or by context, and have the same basic format
	but something else instead of the 'list of things'.
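
The shared shape described above can be sketched as a C data type. This is only an illustration under stated assumptions, not the definition used by any of the three formats: the names `Thing`, `ThingKind`, and `thing_count` are hypothetical, and the strict two-way split between compound and leaf things is a simplification.

```c
#include <stddef.h>

/* Hypothetical in-memory form of the 'thing' structure described above. */
typedef enum { THING_COMPOUND, THING_LEAF } ThingKind;

typedef struct Thing {
    const char *type_id;                  /* thing-type identifier (class name, QName, ...) */
    ThingKind kind;                       /* compound or non-compound */
    const struct Thing *const *children;  /* compound only: the nested things */
    size_t child_count;                   /* compound only: how many nested things */
    const char *text;                     /* leaf only: "something else" instead of a list */
} Thing;

/* Count a thing plus everything nested inside it. */
size_t thing_count(const Thing *t)
{
    size_t n = 1;
    if (t->kind == THING_COMPOUND) {
        for (size_t i = 0; i < t->child_count; i++)
            n += thing_count(t->children[i]);
    }
    return n;
}
```

Whether the nested list is delimited on the wire by a length count or by an end marker does not change this in-memory shape; that choice only affects how a reader finds the end of the list.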
	
	In my serialisation format, the thing-type identifier is a single byte with a
	specified list of values. There can never be more than 256 possible types,
	you might think, but type zero is 'user class' (followed by the class name)
	while all the others are things like backpointers to previously serialised
	objects, end markers, all the basic Java types (int, long, boolean, etc) plus
	a host of the java.util.* collection classes which I special-case the
	encoding of to make them more compact.
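
A one-byte tag scheme of this kind might be written out as in the C sketch below. Only "type zero means user class, followed by the class name" comes from the description above. Everything else is an assumption made for illustration: the other tag values, the one-byte length prefix on the class name, the big-endian integer layout, and the function names `write_user_class` and `write_int`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Only TAG_USER_CLASS == 0 is taken from the text; the rest are made up. */
enum {
    TAG_USER_CLASS = 0,  /* followed by the class name */
    TAG_END        = 1,  /* hypothetical value for an end marker */
    TAG_INT        = 2   /* hypothetical value for a 32-bit int */
};

/* Write a user-class header: tag byte 0, then the class name with an
   assumed one-byte length prefix. Returns bytes written, or 0 on failure. */
size_t write_user_class(uint8_t *buf, size_t cap, const char *class_name)
{
    size_t len = strlen(class_name);
    if (len > 255 || cap < len + 2)
        return 0;
    buf[0] = TAG_USER_CLASS;
    buf[1] = (uint8_t)len;
    memcpy(buf + 2, class_name, len);
    return len + 2;
}

/* Write a 32-bit int: tag byte, then four bytes in (assumed) big-endian order. */
size_t write_int(uint8_t *buf, size_t cap, int32_t value)
{
    uint32_t u = (uint32_t)value;
    if (cap < 5)
        return 0;
    buf[0] = TAG_INT;
    buf[1] = (uint8_t)(u >> 24);
    buf[2] = (uint8_t)(u >> 16);
    buf[3] = (uint8_t)(u >> 8);
    buf[4] = (uint8_t)u;
    return 5;
}
```

Because tag zero escapes to a named class, the 256-value tag space limits only the number of built-in encodings, not the number of types that can be serialised.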
	
	In XML, the thing-type is a QName; there's a slight caveat in that the nested
	things might be attributes or child elements, and non-compound things are
	PIs, comments, CDATAs, etc.
	
	In Java serialisation, from vague memory, the thing-type ID is a single byte
	for basic types and back-pointers, or L followed by a class name for
	everything else, and there's a special marker that goes after the last field
	of an object to mark the end rather than a length code. I used length codes
	in my serialisation thingy because the C code does some fast xpath-esque
	stuff where it pulls out a single field or two from an object without
	constructing the entire thing as a tree in memory.
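
The benefit of length codes can be sketched in C as follows. The record layout here is an assumption invented for the example, not the actual format described above: each field is taken to be a one-byte name length, the name bytes, a two-byte big-endian value length, and then the value bytes. The function name `find_field` is likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed record layout (hypothetical, for illustration only):
     [1-byte name length][name bytes][2-byte big-endian value length][value bytes]
   Returns a pointer to the named field's value inside buf and stores its
   length in *value_len, or returns NULL if the field is absent or the
   buffer is truncated. */
const uint8_t *find_field(const uint8_t *buf, size_t size,
                          const char *name, size_t *value_len)
{
    size_t want = strlen(name);
    size_t pos = 0;

    while (pos < size) {
        size_t name_len = buf[pos];
        size_t header_end = pos + 1 + name_len + 2;
        size_t len;

        if (header_end > size)
            return NULL;                       /* truncated header */
        len = ((size_t)buf[header_end - 2] << 8) | buf[header_end - 1];
        if (len > size - header_end)
            return NULL;                       /* truncated value */

        if (name_len == want && memcmp(buf + pos + 1, name, want) == 0) {
            *value_len = len;
            return buf + header_end;
        }
        pos = header_end + len;                /* skip the value without decoding it */
    }
    return NULL;
}
```

With an end marker instead of a length, the skip step would have to decode every nested value just to find where it stops; with a length it is a single addition, and nothing is allocated.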
	
	If Java serialisation were a bit more compact and they had libraries for
	reading it from C, we might have used it as is. If it was still large but had
	C libraries I'd have been sorely tempted, and if it was compact but lacked C
	libraries I'd have just written the C libraries...
	
	ABS
	
	--
	Oh, pilot of the storm who leaves no trace, Like thoughts inside a dream
	Heed the path that led me to that place, Yellow desert screen
	

Follow-Ups:
- Re: [xml-dev] Some thoughts on 'direct access' to XML (long)
  - From: "Alaric B. Snell" <alaric@alaric-snell.com>
