RE: [xml-dev] RFC for XML Object Parsing

Michael,

Good stuff you gave me here.

You said, So actually the recipient can do what it likes with it, and hashing it to a memory address is just one implementation possibility?

Yes

You said, But uniqueness demands context. What scope is the OID unique within? All documents that the recipient has processed, ever? All documents from the same sender? All elements within the same document? (If it's from all senders, then how can a sender make an invoice number unique?).

<Example oid='1'/>
<OtherExample oid='1'/>

The OID would be unique per XML element name, so the two elements above are different objects even though both carry oid='1'. There are several use cases for this. First, consider data in your own DBMS being used in your own application layer. This is ideal because 'we' control 'our own' data. Let us also assume, for this ideal implementation, that the data in the DBMS is properly normalized. In this case, you will likely use your DBMS index values directly as the OID. In the same context that your DBMS keys are guaranteed to be unique, they serve as OIDs.
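A minimal sketch of that scoping, assuming the recipient keeps its objects in a map keyed by element name plus OID. ObjectKey, CachedObject and ObjectCache are names invented for this illustration; they are not part of the RFC or of any existing parser.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical key: an OID is only unique together with its element name,
// so <Example oid='1'/> and <OtherExample oid='1'/> map to different objects.
using ObjectKey = std::pair<std::string, std::string>;  // (element name, oid)

// Hypothetical cached object: field name -> value.
struct CachedObject {
    std::map<std::string, std::string> fields;
};

class ObjectCache {
public:
    // Returns the object stored for (element, oid), creating an empty one
    // the first time this key is seen.
    CachedObject& lookupOrCreate(const std::string& element,
                                 const std::string& oid) {
        return objects_[ObjectKey(element, oid)];
    }

    std::size_t size() const { return objects_.size(); }

private:
    std::map<ObjectKey, CachedObject> objects_;
};
```

With this layout, looking up ("Example", "1") and ("OtherExample", "1") yields two separate objects, while a second lookup of ("Example", "1") returns the one already stored. A DBMS key, an SSN or a GUID would all fit in the oid string.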

Other people do not define what our index data is, so my EDI example was poor. An SSN or a Vendor ID might be a more realistic OID for an inter-business kind of XML document. In some cases, a GUID is the best OID.

You said, My understanding is now that the OID means "you might already know this element. If you do, you can ignore its content. If you don't, process it in the normal way, and make a note of the OID in case you ever see it again". Is that right?

The content is not "ignored"; it is "updated", UNLESS the UpdateTime tells us not to update. The OID is the key to "the element we might already know", and yes, you can give me a new OID that I was unaware of and then reference it later.
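The rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the RFC's definition: UpdateTime is treated as a plain number where larger means newer, an incoming element is skipped when its UpdateTime is not newer than the stored one, and StoredObject, ApplyResult and applyElement are hypothetical names. The store here holds the objects of a single element name.

```cpp
#include <map>
#include <string>

using Fields = std::map<std::string, std::string>;

// Hypothetical stored object for one element type.
struct StoredObject {
    long updateTime = 0;  // assumption: larger value means newer data
    Fields fields;
};

enum class ApplyResult { Created, Updated, Skipped };

// Applies one parsed element (oid, UpdateTime, fields) to the store.
ApplyResult applyElement(std::map<std::string, StoredObject>& store,
                         const std::string& oid,
                         long updateTime,
                         const Fields& fields) {
    auto it = store.find(oid);
    if (it == store.end()) {
        // Unknown OID: remember it so later documents can reference it.
        StoredObject obj;
        obj.updateTime = updateTime;
        obj.fields = fields;
        store.emplace(oid, obj);
        return ApplyResult::Created;
    }
    if (updateTime <= it->second.updateTime) {
        // Assumed meaning of "UpdateTime tells us not to update".
        return ApplyResult::Skipped;
    }
    // Known OID: overwrite only the fields that were sent, which also
    // covers a partial update.
    for (const auto& kv : fields) {
        it->second.fields[kv.first] = kv.second;
    }
    it->second.updateTime = updateTime;
    return ApplyResult::Updated;
}
```

Sending only a changed field for a known OID leaves the other stored fields untouched, and a stale UpdateTime leaves the object exactly as it was.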

You said, Next question: where does the claim of 3-fold performance improvement come from? Isn't that entirely dependent on the number of cache hits, i.e. highly variable depending on the workload?

It can be quite a bit more than 3 times faster. Start with this: if there is no OID, a temporary data structure has to be set up. If we keep this memory allocated then we can copy to it quickly; suppose it takes 3 time units because we were smart and pre-allocated this temporary memory. It will take 3 more time units to move the data to its final destination, 6 units in total. With an OID the parser can write straight into the final object, which costs 3 units and needs no temporary copy. So the base assumption is a 2X performance improvement and a 50% memory reduction from an OID design.

Next, assume SAX is the fastest way available to you. This design can be implemented with SAX; however, as Amy pointed out, there are some SAX implementations where this is apparently not the case. The weakness of SAX is the number of arguments in the function calls. When a C++ compiler turns that into machine code, it has to push each argument, make the call, then pop each argument. That is a non-trivial amount of time consumed during tokenization. In fact it was the single largest consumer of CPU cycles according to the profiler I was using. I have a non-SAX way to parse XML. It uses getToken( token *p ) to retrieve ALL token types. The machine code is tremendously simplified, to the tune of a nearly 2X performance improvement over SAX. Therefore the claim of a 3X performance improvement is founded in performance stats.
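To show the shape of such an interface, here is a minimal pull-tokenizer sketch with a single getToken( token *p ) entry point. Only that signature comes from the text above; the token fields, the TokenType values and the Tokenizer class are assumptions made for the illustration. It handles start tags, quoted attributes, end tags and text only (no comments, CDATA, entities or processing instructions), and it makes no claim about measured speed.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

enum class TokenType { StartTag, AttrName, AttrValue, EndTag, Text, End };

// Hypothetical token layout: a type plus a slice of the input buffer.
struct token {
    TokenType   type   = TokenType::End;
    const char* start  = nullptr;  // points into the buffer, nothing is copied
    std::size_t length = 0;
};

class Tokenizer {
public:
    explicit Tokenizer(std::string xml) : buf_(std::move(xml)) {}

    // One call with one pointer argument, whatever the token type.
    // Returns false once the End token has been produced.
    bool getToken(token* p) {
        assert(p != nullptr);
        const std::size_t n = buf_.size();
        if (inTag_) {
            while (pos_ < n && (isSpace(buf_[pos_]) || buf_[pos_] == '=')) ++pos_;
            if (pos_ >= n) {
                inTag_ = false;                    // unterminated tag
            } else if (buf_[pos_] == '>') {
                ++pos_;
                inTag_ = false;                    // go on with element content
            } else if (buf_[pos_] == '/') {        // "/>" closes an empty element
                ++pos_;
                if (pos_ < n && buf_[pos_] == '>') ++pos_;
                inTag_ = false;
                return emit(p, TokenType::EndTag, pos_, 0);
            } else if (buf_[pos_] == '\'' || buf_[pos_] == '"') {
                const char quote = buf_[pos_++];
                const std::size_t begin = pos_;
                while (pos_ < n && buf_[pos_] != quote) ++pos_;
                const std::size_t len = pos_ - begin;
                if (pos_ < n) ++pos_;              // skip the closing quote
                return emit(p, TokenType::AttrValue, begin, len);
            } else {
                const std::size_t begin = pos_;
                return emit(p, TokenType::AttrName, begin, scanName());
            }
        }
        if (pos_ >= n) return emit(p, TokenType::End, n, 0);
        if (buf_[pos_] == '<') {
            const bool closing = pos_ + 1 < n && buf_[pos_ + 1] == '/';
            pos_ += closing ? 2 : 1;
            const std::size_t begin = pos_;
            const std::size_t len = scanName();
            if (closing) {
                while (pos_ < n && buf_[pos_] != '>') ++pos_;
                if (pos_ < n) ++pos_;
                return emit(p, TokenType::EndTag, begin, len);
            }
            inTag_ = true;
            return emit(p, TokenType::StartTag, begin, len);
        }
        const std::size_t begin = pos_;            // text up to the next '<'
        while (pos_ < n && buf_[pos_] != '<') ++pos_;
        return emit(p, TokenType::Text, begin, pos_ - begin);
    }

private:
    static bool isSpace(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }
    static bool isDelim(char c) {
        return isSpace(c) || c == '=' || c == '/' || c == '>';
    }
    // Advances past a name and returns its length.
    std::size_t scanName() {
        const std::size_t begin = pos_;
        while (pos_ < buf_.size() && !isDelim(buf_[pos_])) ++pos_;
        return pos_ - begin;
    }
    bool emit(token* p, TokenType type, std::size_t begin, std::size_t len) {
        p->type = type;
        p->start = buf_.data() + begin;
        p->length = len;
        return type != TokenType::End;
    }

    std::string buf_;
    std::size_t pos_ = 0;
    bool inTag_ = false;
};
```

A caller loops with while (t.getToken(&tok)) and switches on tok.type. An empty element such as <Example oid='1'/> yields StartTag, AttrName, AttrValue and then an EndTag of length 0. Compare this with a SAX-style callback that receives several name and attribute arguments on every call.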

Finally, look at the new use cases we now have for XML. You can give me a partial update via an OID implementation. That might be many times faster than a complete update.

You said, The parsing optimization presumably requires that if the OID is recognized, then we can skip to the end of the element with minimal parsing cost.

I think I caused this confusion with my reference to the HTTP ETag (that is basically how ETag works). Yes, you are correct, but ONLY if the UpdateTime tells us to ignore this particular element in the XML. That is NOT the typical use. Most typically, the OID tells us to retrieve our cached storage for this element, and the data that follows will then update that object if it already existed.

Brian