Re: [xml-dev] DOM versus XDM: Differences in handling CDATA sections, entities, and concurrency

From: Amelia A Lewis <amyzing@talsever.com>
To: "Costello, Roger L." <costello@mitre.org>
Date: Fri, 12 Nov 2010 12:36:02 -0500

Errr.  This has so many mistaken assumptions that one is almost at a 
loss to determine where to start.

On Fri, 12 Nov 2010 11:38:56 -0500, Costello, Roger L. wrote:
> My understanding is that an XML document is first processed by an XML 
> parser, which creates an in-memory tree representation of the XML 
> document.

Not necessarily true.  See, for instance: SAX, StAX, JAXB, XMLBeans.  
There is no requirement for an in-memory representation of a tree, if 
you don't need one (and know how to avoid it).  The resultant tree may 
be simple, or it may have already been validated.  Validation is 
arguably either a part of the parsing layer, or a part of the 
processing layer.
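
A minimal sketch of the event-based alternative, assuming Python's 
standard xml.sax module as a stand-in for SAX (the ElementCounter class 
and the sample document are made up for illustration).  The handler 
sees each start tag as the parser reports it, and no tree is ever built:

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler: tallies start tags, keeps no tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per start tag, in document order.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(xml_bytes):
    handler = ElementCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts


SAMPLE = b"<root><item/><item/><note>hi</note></root>"
counts = count_elements(SAMPLE)
print(counts)
```

The only state that survives the parse is the dictionary of counts.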

> Then, an application such as an XML Schema validator or an 
> XSLT processor operates on the in-memory tree representation. Here is 
> a simple graphic I created to show this:
> 
> http://www.xfront.com/DOM-versus-XDM/How-an-XML-document-is-processed.gif

Nothing particularly wrong with that graphic, except that:

a) validation may happen during parse; and
b) no in-memory representation of a tree may be created (alternatives: 
events, partial trees, objects)

> It is my understanding that the in-memory model created by different 
> XML parsers may be different, depending on whether the XML parser 
> creates a DOM or XDM in-memory model. 

I do not know of a parser that "creates an XDM in-memory model."  I 
don't know all languages, either.  Languages for which DOM is defined 
(the classic three are C++, JavaScript, and Java) do not, to the best 
of my knowledge and belief, have a public "XDM API" which includes a 
native in-memory representation.

I'm working on a project called GenXDM, which can expose an XDM API to 
processors (or "applications" in your graphic), but it runs over pretty 
much any XML tree model (a bridge must be created for that model).  I 
won't mention it further here, but for Java, I think it might behoove 
you to investigate.

For in-memory tree models, in Java, the following are reasonably 
well-known: DOM, AxiOM, JDOM, DOM4J, XOM.  In addition, SAX provides a 
message-callback style of interface.  StAX provides "pull parsing" 
suitable for projection of partial trees (it can be hooked to any 
in-memory representation, or processed as-you-go).  XMLBeans provides a 
sort of hybrid tree model/object binding.  JAXB is all about the 
binding.
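
To make the "pull parsing" idea concrete, here is a rough sketch using 
Python's xml.etree.ElementTree.XMLPullParser as an analogue of StAX 
(it is not StAX itself; the sum_prices function and the <price> 
vocabulary are invented for the example).  The caller feeds input in 
chunks and pulls events when it wants them:

```python
import xml.etree.ElementTree as ET


def sum_prices(chunks):
    """Pull 'end' events as input arrives and total the <price> values."""
    parser = ET.XMLPullParser(events=("end",))
    total = 0
    for chunk in chunks:
        parser.feed(chunk)
        # The caller decides when to ask for events (pull, not push).
        for _event, elem in parser.read_events():
            if elem.tag == "price":
                total += int(elem.text)
                # Drop content already consumed; a real partial-tree
                # projection would prune more aggressively than this.
                elem.clear()
    parser.close()
    return total


# Chunk boundaries deliberately fall in the middle of a tag.
CHUNKS = [
    "<orders><order><pri",
    "ce>3</price></order>",
    "<order><price>4</price></order></orders>",
]
total = sum_prices(CHUNKS)
print(total)
```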

In other news: my initial investigations of scala.xml tend to suggest 
that the authors were hooked into the XDM thing.  It's nice.  Python 
has several possible in-memory representations; but I'm not sure that 
any of them necessarily qualify as "DOM".

When you're speaking of "DOM", do you mean level one, or level two or 
three (and if one of the latter, which bits of those levels)?

> Here are two places where differences arise:
> 
>    - CDATA sections
>    - Entities

This is so boneheaded that I'm kind of at a loss.

DOM defines twelve Node types.  The infoset defines eleven.  The XQuery 
Data Model defines seven.  Noting two of the five differences is 
somewhat less than stellar analysis, wouldn't you agree?
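
One way to check the DOM figure rather than trust memory: Python's 
xml.dom.Node class carries the nodeType constants, and I am assuming 
here that it mirrors the DOM Core numbering.  The XDM node kinds are 
written out by hand in the second list, since the standard library has 
no XDM API to ask:

```python
from xml.dom import Node

# Every nodeType constant defined on the DOM Node interface,
# sorted by numeric value.
dom_node_types = sorted(
    (value, name)
    for name, value in vars(Node).items()
    if name.endswith("_NODE")
)

# The XDM node kinds, listed by hand.
xdm_node_kinds = [
    "document", "element", "attribute", "text",
    "namespace", "processing-instruction", "comment",
]

for value, name in dom_node_types:
    print(value, name)
print(len(dom_node_types), "DOM node types;",
      len(xdm_node_kinds), "XDM node kinds")
```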

> Also, there are differences with respect to:
> 
>    - Concurrent access

Huh?  Isn't that going to depend upon your XDM implementation?

> As mentioned, there are two ways to model XML documents:

There are two ways to characterize the above sentence:

- it is mistaken
- it is deliberately false

>      Here is a graphic I created to show the DOM tree for the XML document:
> 
>      http://www.xfront.com/DOM-versus-XDM/DOM-implementation-of-CDATA.gif  

You've left out the (significant?) whitespace in the "hello" and 
"world" text nodes.

> Notice that in the DOM tree there are three nodes under the Element 

Well, that's not necessarily true.  The following is legal DOM.  Each 
node is represented in square brackets:

[<root>]
  [\n ]
  [h]
  [e]
  [ll]
  [o ]
  [<![CDATA[if A < B then ...]]>]
  [ w]
  [o]
  [rld ]
  [\n]

So, you see, my DOM tree has ten nodes under the root.  But that's more 
sensible, after all.  Vowels should *never* be in a text node except as 
the first character.  This is a very important tenet of 
DomVoweltarianism.
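
For the skeptical, a small sketch of that tree built by hand with 
Python's xml.dom.minidom (my choice of DOM implementation for the 
example; the PIECES list is just the bracketed nodes above, with None 
marking where the CDATA section goes).  Nothing in the API objects to 
adjacent text nodes, and normalize() merges them without touching the 
CDATA section:

```python
from xml.dom import minidom

# The ten children from the listing above; None marks the CDATA section.
PIECES = ["\n ", "h", "e", "ll", "o ", None, " w", "o", "rld ", "\n"]
CDATA = "if A < B then ..."


def build_fragmented():
    doc = minidom.Document()
    root = doc.createElement("root")
    doc.appendChild(root)
    for piece in PIECES:
        if piece is None:
            root.appendChild(doc.createCDATASection(CDATA))
        else:
            root.appendChild(doc.createTextNode(piece))
    return doc


doc = build_fragmented()
root = doc.documentElement
before = len(root.childNodes)
text_before = "".join(node.data for node in root.childNodes)

# normalize() merges adjacent Text nodes; the CDATA section stays as is.
root.normalize()
after = [node.nodeType for node in root.childNodes]
text_after = "".join(node.data for node in root.childNodes)

print(before, after, text_before == text_after)
```

The character content is identical before and after; only the number 
of nodes carrying it changes.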

> DOM and XDM represent entities differently: 
> 
>    - A DOM tree will have a node for the entity, as evidenced by 
>      the fact that the DOM API has a method for accessing entities [4]. 
>      Here is a graphic I created to show the DOM tree of the XML document:

Have you ever considered actually looking at a DOM tree?

Hint: this is wrong.

If, on the other hand, you had the entity &Mydefinitionisnotavailable; 
then you might see a node ... for the *unresolved* entity.  Once it's 
resolved, it's replaced.
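
A quick way to look for yourself, sketched with Python's 
xml.dom.minidom (one implementation among many; the document and the 
"greeting" entity are invented for the example).  With its default 
settings the internal entity is expanded during the parse, so the 
element's children are plain text:

```python
from xml.dom import Node, minidom

DOC = """<!DOCTYPE root [
  <!ENTITY greeting "hello world">
]>
<root>&greeting;</root>"""

doc = minidom.parseString(DOC)
children = doc.documentElement.childNodes

# Collect the node types actually present under <root>.
child_types = [node.nodeType for node in children]
text = "".join(node.data for node in children)

print(child_types, repr(text))
print(Node.ENTITY_REFERENCE_NODE in child_types)
```

Other DOM implementations can be configured differently (Java's 
DocumentBuilderFactory has setExpandEntityReferences, for instance), so 
treat this as a description of one parser's defaults, not of DOM as a 
whole.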

> There are occasions where multiple applications (processes) need to 
> operate on the same in-memory tree. Recently, Hans-Juergen Rennau 
> reported [5] problems with concurrent access to DOM trees. He found 
> no problems with concurrent access to XDM trees. 

Thanks for the reference.  Looks like something worth reading.

> 1. Is the above description and graphic of how XML documents are 
> processed correct?

No.

> 2. Is the above description and graphics of the differences in how 
> CDATA sections and entities are represented in DOM and XDM correct?

No.

> 3. Is the above description of the differences in thread-safety of 
> DOM and XDM correct?

Depends, really, upon what the implementation of the XDM is.  It *is* 
true that DOM makes thread-safety into a dance with live hand-grenades.
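
As an illustration of why the implementation matters, here is a sketch 
of one design an XDM implementation might choose: nodes that cannot be 
modified after construction.  ImmutableNode and count_nodes are 
hypothetical names, not part of any real XDM API.  Because no reader 
can change the tree, any number of threads can walk it without locks:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ImmutableNode:
    """Hypothetical read-only node; not taken from any real XDM API."""
    name: str
    text: str = ""
    children: Tuple["ImmutableNode", ...] = ()


def count_nodes(node):
    # Pure read: safe to run from many threads at once.
    return 1 + sum(count_nodes(child) for child in node.children)


TREE = ImmutableNode("root", children=(
    ImmutableNode("a", "hello"),
    ImmutableNode("b", children=(ImmutableNode("c", "world"),)),
))

# Eight concurrent traversals of the same shared tree.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(count_nodes, [TREE] * 8))

print(results)
```

A mutable DOM gives no such guarantee: a writer can relink children 
while a reader is halfway through the same list.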

> 4. Will applications behave differently depending on whether the XML 
> parser it uses generates DOM or XDM? If so, isn't that really bad?

The XML parser will generate what it's asked to generate.  I don't 
really think that's particularly bad, somehow.

> 5. Do XML Schema validators use DOM or XDM to represent the XML 
> Schema and the XML instance document?

No.

> 6. If I were to create my own XML Schema validator, do I have the 
> option of choosing to use DOM or XDM?

No.  You're not so restricted.

> Or, does the XML Schema 
> specification require me to use one of them? If so, which one?

No.

> 7. Do XSLT processors use DOM or XDM to represent the XSLT document 
> and the XML instance document?

No.  (Reservation: Saxon probably bases its internal representation on 
XDM, but I don't think anyone who has had a performance analysis run on 
their processor is foolish enough to keep DOM as its XML tree model.)

> 8. If I were to create my own XSLT processor, do I have the option of 
> choosing to use DOM or XDM?

Sure.  Or SAX or StAX or JDOM or DOM4J or XOM or bloody JSON if you 
want.  EXI as an in-memory representation, anyone?

> Or, does the XSLT specification require 
> me to use one of them? If so, which one?

No.

> 9. For each of the following products, does it use DOM or XDM?

Are you for real?  The answer to almost all of these is "none of the 
above."

Amy!
-- 
Amelia A. Lewis                    amyzing {at} talsever.com
Simplicity is prerequisite for reliability.
                -- Edsger Dijkstra

Follow-Ups:
- Re: [xml-dev] DOM versus XDM: Differences in handling CDATA sections, entities, and concurrency
  - From: Amelia A Lewis <amyzing@talsever.com>

References:
- DOM versus XDM: Differences in handling CDATA sections, entities, and concurrency
  - From: "Costello, Roger L." <costello@mitre.org>
