RE: AW: [xml-dev] RFC for XML Object Parsing

I need to add one more fact, just to help everyone keep a clear picture of how this threefold leap in parsing speed is accomplished, using half the memory of any approach that parses to temporary objects. "oid" becomes markup itself. It is optional 'bonus' markup for an element that describes a unique key for the element - if such a thing exists for that element.

Suppose you have a list of "foos". Suppose that list is very, very long: 1 terabyte. Suppose that you get an updated XML source of that list every 15 minutes. In the huge list, some "foos" are new and some have been updated. With an "oid" we can parse this entire update and make no new memory allocations except for the new "foos". The XML parser never needs to allocate any storage for an element if it can connect that element's "oid" to a cached instance.
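
The update scenario above can be sketched roughly as follows. This is only an illustration of the lookup-and-update-in-place logic, not the XMLFoundation implementation: the class Foo, the function apply_update, and the <name> child element are hypothetical names made up for the example. Python's ElementTree still builds temporary element objects and does not enforce the first-attribute rule, so the sketch shows which objects get reused, not the memory savings.

```python
import io
import xml.etree.ElementTree as ET


class Foo:
    """Hypothetical application object; it knows its key but nothing about XML."""

    def __init__(self, oid):
        self.oid = oid
        self.name = None


def apply_update(xml_text, cache):
    """Merge an XML update into cache (a dict mapping oid -> Foo).

    Existing objects are updated in place; a new Foo is created only for
    an oid that is not cached yet. Returns the list of newly created oids.
    """
    created = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "foo":
            continue
        oid = elem.get("oid")
        obj = cache.get(oid)
        if obj is None:
            # The only allocation of an application object: a genuinely new foo.
            obj = Foo(oid)
            cache[oid] = obj
            created.append(oid)
        obj.name = elem.findtext("name")
        elem.clear()  # drop the temporary element content once it is applied
    return created


cache = {"1": Foo("1"), "2": Foo("2")}
original_one = cache["1"]
update = (
    '<foos>'
    '<foo oid="1"><name>alpha</name></foo>'
    '<foo oid="3"><name>gamma</name></foo>'
    '</foos>'
)
created = apply_update(update, cache)
```

After the call, cache["1"] is still the same object as before with its name updated, cache["2"] is untouched, and only oid "3" produced a new Foo.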

The object knows its "oid"; what it does not know is anything about "oid" being in the XML. Normally we would call an attribute "data" - in this case it is not. It's a positional keyword in the XML. For example, if an attribute named "oid" contains any upper case letters, or the attribute is not in the first position, then it IS data - otherwise it is a key.
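
A minimal sketch of that key-versus-data rule, assuming a start tag is available as text. The function name split_key_and_data and the regular expression are made up for this illustration, and the pattern only handles double-quoted attribute values; a real tokenizer would apply the same test while scanning.

```python
import re

# Matches name="value" pairs in document order (double-quoted values only).
ATTR_RE = re.compile(r'([^\s=<>/]+)\s*=\s*"([^"]*)"')


def split_key_and_data(start_tag):
    """Return (key, data_attributes) for one start tag given as text.

    The first attribute is treated as the key only if its name is exactly
    the lowercase string "oid". In every other case (different position,
    any upper case letter) the attribute is ordinary data.
    """
    attrs = ATTR_RE.findall(start_tag)
    if attrs and attrs[0][0] == "oid":
        return attrs[0][1], attrs[1:]
    return None, attrs


# First position, all lowercase: a key.
key_first = split_key_and_data('<foo oid="123" color="red"/>')
# Not in first position: data.
key_second = split_key_and_data('<foo color="red" oid="123"/>')
# Upper case in the name: data.
key_upper = split_key_and_data('<foo OID="123"/>')
```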

This will help you refine the main question.

Brian

From: xmlboss@live.com
To: hrennau@yahoo.de
CC: xml-dev@lists.xml.org
Date: Sun, 23 Mar 2014 07:31:41 -0600
Subject: RE: AW: [xml-dev] RFC for XML Object Parsing

Hans-Juergen Rennau,

> I wonder if I understood the gist of what you are saying correctly. My understanding of the operation "parsing XML" is the transformation of a string (usually serialized XML) into information content, modelled as an info set, which is a set of items defined in terms of their properties. Practically speaking, the result of parsing is an internal representation of the info set which provides some kind of interface to the information content. In short, parsing provides access to the info set. Would you accept this view?

I accept that 100%. This defines "parsing" as the entire act of converting the marked-up data from its linear, contiguous form to a 3D form (as you say, "some kind of interface to the information content").

> The "oid" attribute provides a reference to some (perhaps binary) representation of an XML fragment. If an element foo has the @oid attribute, like so <foo @oid="123"/> the information content thus encoded is an info set with a "foo" element item whose children are provided by the internal representation referenced. Yes?

It seems not. I see it as "an element ETag". In HTTP, an ETag is a "caching key". "foo" doesn't know anything about an "oid" any more than it knows about "<xml". An HTML page does not know its ETag. "foo"'s base class is even unaware of the tokenization process; this is why "foo" can be updated EITHER by using the "oid" OR without it, and even "foo" itself would have no way of knowing by which mode of access it was updated - whether directly from the source data or directly from a copy of it. "foo" is always updated directly according to "foo".

> So my main question is how you define the *result* of parsing XML using @oid. Is it an info set, or do you bypass that and are interested in a different end result, e.g. program objects loaded into application memory and not necessarily enabling the construction of an unambiguously defined infoset?

Your main question needs to be refined based on the conceptual difference between "an infoset" and "a key or index". You had excellent comments, and since we are discussing ideas beyond the common terminology, we just need to speak in terms of concepts rather than in terms of terms.

Thanks for the comments - they help draw out the definition of this thing.

Brian Aberle

> Date: Sun, 23 Mar 2014 12:06:30 +0000
> From: hrennau@yahoo.de
> To: xml-dev@lists.xml.org; xmlboss@live.com
> Subject: AW: [xml-dev] RFC for XML Object Parsing
>
> Hello Brian,
>
> I wonder if I understood the gist of what you are saying correctly. My understanding of the operation "parsing XML" is the transformation of a string (usually serialized XML) into information content, modelled as an info set, which is a set of items defined in terms of their properties. Practically speaking, the result of parsing is an internal representation of the info set which provides some kind of interface to the information content. In short, parsing provides access to the info set. Would you accept this view?
>
> The "oid" attribute provides a reference to some (perhaps binary) representation of an XML fragment. If an element foo has the @oid attribute, like so
> <foo @oid="123"/>
>
> the information content thus encoded is an info set with a "foo" element item whose children are provided by the internal representation referenced. Yes?
>
> So my main question is how you define the *result* of parsing XML using @oid. Is it an info set, or do you bypass that and are interested in a different end result, e.g. program objects loaded into application memory and not necessarily enabling the construction of an unambiguously defined infoset?
>
> Kind regards,
> Hans-Juergen Rennau
>
>
>
> _______________________________________________________________________

From: xmlboss@live.com
To: xml-dev@lists.xml.org
Date: Sat, 22 Mar 2014 23:40:50 -0600
Subject: [xml-dev] RFC for XML Object Parsing

Hello World,

I need an XML expert to correct me if I have any terminology wrong here. I wrote my first two XML parsers before W3C finalized XML 1.0 and I wrote my own XSLT - but I don't claim to know it all about XML even though folks with lesser study than me claim to know all about XML. Maybe someone here can intelligently comment on this:

Let's start with getting the terminology right. "A Protocol" is a set of communication rules. When two parties agree on the specific use of a generic markup language like XML, they have agreed on a protocol. Is everyone with me so far? With this 'definition' of a protocol, your XML parser should be 'unaware' of any specific protocol, as it deals only with the general aspects of XML.

I propose adding a new keyword to XML, and I would like community feedback about it. It would work like this:

The tokenizer recognizes a special keyword attribute "oid" ONLY if it appears as the first attribute (because that is the only token we have parsed out of that element so far). Now the "Object ID" can be used to obtain the memory location (or application-layer object instance) that the XML will parse directly into, with no temporary memory copy into a tree or DOM structure. It's OVER twice as fast as the more traditional "memory copy design", naturally, because the iterations over the temporary structure are eliminated. It goes beyond 2 times as fast because the tokenizer uses neither SAX nor DOM, but a more efficient alternative to SAX that avoids pushing a variable number of arguments, depending on the token type, through the SAX calls. The non-SAX design only makes calls to getToken(token *p) to pull the data over a 1-argument call stack - data that SAX would push via too many arguments that compile down to needless pushes and pops. This implementation is about 3 times faster than the very best anyone can do with SAX, which makes it the ideal solution for the massive data sets used in a native BigData XML integration.
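
To make the pull design concrete, here is a rough sketch of a tokenizer in the getToken(token *p) style: the caller owns one token record and the tokenizer overwrites it on each call, instead of pushing a different set of arguments per event as SAX callbacks do. The Token and Tokenizer classes are hypothetical and deliberately tiny (no attributes, no self-closing tags, no error handling), and being Python, the sketch shows only the call shape, not the speed difference.

```python
class Token:
    """Reusable token record; the caller owns it, the tokenizer overwrites it."""

    __slots__ = ("kind", "text")

    def __init__(self):
        self.kind = None
        self.text = ""


class Tokenizer:
    """Toy pull tokenizer for simple well-formed input such as <a>text</a>."""

    def __init__(self, source):
        self.source = source
        self.pos = 0

    def get_token(self, token):
        """Fill token with the next token; return False when input is exhausted."""
        src = self.source
        if self.pos >= len(src):
            return False
        if src[self.pos] == "<":
            end = src.index(">", self.pos)
            body = src[self.pos + 1:end]
            self.pos = end + 1
            if body.startswith("/"):
                token.kind, token.text = "end", body[1:]
            else:
                token.kind, token.text = "start", body
        else:
            end = src.find("<", self.pos)
            if end == -1:
                end = len(src)
            token.kind, token.text = "text", src[self.pos:end]
            self.pos = end
        return True


# One token object is reused for the whole document.
tok = Token()
tokenizer = Tokenizer("<foo>hi</foo>")
seen = []
while tokenizer.get_token(tok):
    seen.append((tok.kind, tok.text))
```

The consumer loop decides what to do with each token, which is where a lookup by "oid" could redirect the following tokens straight into a cached object.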

Since this thing (XML 1.2 or a new protocol) has a requirement of an attribute named "oid", it could equally, conceptually, be a protocol (a protocol that the XML tokenizer is aware of?). There is no other way to implement "the protocol". I have gone to much effort to try to communicate this clearly, and I developed a simple little example that breaks it all down into numbers that you can see and understand. The examples build on Linux and Windows. Please give me some feedback about standardizing this. I want to know what some smart, internet-savvy people think about this. Am I in the right place?

As explained in the introduction of the article linked below, oid is to XML what ETag is to HTTP. HTTP 1.0 had no entity-tag mechanism for validating cached pages; HTTP 1.1 added ETag. That same concept of caching allows XML to enter a whole new dimension of usage. Am I wrong? Look at the two example programs, "TheOIDProtocol" and "ExIndexObjects". The numbers will have the final word.


Polished Source:
https://onedrive.live.com/redir?resid=D7EC275E76D295CF!923&authkey=!AAnvh0CKDY4nuho&ithint=file%2c.zip
A Rough (and Rogue) Draft article about this (open source) technology:
http://www.codeproject.com/Articles/37850/XMLFoundation


Brian Aberle