XML.orgXML.org
FOCUS AREAS |XML-DEV |XML.org DAILY NEWSLINK |REGISTRY |RESOURCES |ABOUT
OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]
Re: [xml-dev] Why does XML call them "attributes" and not name-valuepairs?


Here are the results:

an XmlDocument: 3176 bytes

an XmlElement: 656 bytes

an XmlAttribute: 152 bytes.

This is likely to be true for pretty much any XML tree model
implementation. I'd be interested to know how one might avoid it.

Saxon's TinyTree model doesn't allocate a Java object per node. There are minimally 6 arrays:

// nodeKind indicates the kind of node, e.g. element, text, or comment
public byte[] nodeKind;

// depth is the depth of the node in the hierarchy, i.e. the number of ancestors
short[] depth;

// next is the node number of the next sibling
// - unless it points backwards, in which case it is the node number of the parent
int[] next;

// alpha holds a value that depends on the node kind. For text nodes, it is the offset
// into the text buffer. For comments and processing instructions, it is the offset into
// the comment buffer. For elements, it is the index of the first attribute node, or -1
// if this element has no attributes.
int[] alpha;

// beta holds a value that depends on the node kind. For text nodes, it is the length
// of the text. For comments and processing instructions, it is the length of the text.
// For elements, it is the index of the first namespace node, or -1
// if this element has no namespaces.
int[] beta;

// nameCode holds the name of the node, as an identifier resolved using the name pool
int[] nameCode;
each holding one entry per node; which accounts precisely for the measured space occupancy of 19 bytes for an empty element node.

Incidentally this structure not only has a very small memory footprint, it also (and perhaps more importantly) gives very good CPU caching behaviour. A search of the descendant axis for nodes with a particular name is a tight loop scanning the namecode array testing each integer against a constant.

Additional arrays are allocated (a) to hold type annotations, if the document is schema-validated, and (b) to hold "preceding" pointers, lazily constructed the first time the application attempts backwards navigation. There are some additional refinements for example to give faster parent access in a tree with very high fan-out (normally, finding the parent involves scanning the "next" pointers until one is found that points backwards.

Michael Kay
Saxonica

For an attribute, minimum number of pointers: name, value. Almost
always implemented: container (parent element). Rarely implemented:
sibling pointers (it's in some implementations, even though order is
not significant. Also, here we treat name as a pointer, not as its
three components or component pointers.

For an element: name, namespace collection pointer, attribute
collection pointer, parent pointer, and child collection start pointer.
In implementations optimized for usability rather than for maximal
memory compactness: last child pointer, previous and next sibling
pointers. One might maximally compact this by turning it into a
JSON-like model lacking many of the pointers, but it's still XML, so
name, pointer to three (two if namespaces are stuffed into the
attributes collection) collections instead of: name, pointer to one
value. You can program this as an XML tree, but the tree navigation API
has to actually fill in all the missing pointers, which is kind of
achieving a poor programmer's streaming-lite implementation.

This is based on experience with DOM, Axiom, JDOM, DOM4J models (plus a
couple of proprietary models). It's quite easy to add more pointers
(because making it easier to move to things is valuable, and that's
what most of the pointers do), but reducing below named container
contains value versus named container contains multiple kinds of
collections seems like a non-starter. Hmm, I guess you *could* treat
namespaces, attributes, and children as a single collection?

On second thought, if someone is doing this sort of trick to avoid it,
I don't need to know (unless you want to warn me of a project to avoid).

Also: yes, in performance optimization of tree models
built-for-the-purpose, minimizing the size of base (no content/empty)
nodes remains a consideration. Overall, though, implementing in a
fashion that permits streaming (freeing memory after 'consuming' parts
of the tree (typically with the whole thing in persistent storage
somewhere)) is preferred to speed-impacting reduction of internal
pointers.

Amy!
--
Amelia A. Lewis                    amyzing {at} talsever.com
The less I seek my source for some definitive, the closer I am to fine.
               -- Indigo Girls

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php




[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]


News | XML in Industry | Calendar | XML Registry
Marketplace | Resources | MyXML.org | Sponsors | Privacy Statement

Copyright 1993-2007 XML.org. This site is hosted by OASIS