OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.


Help: OASIS Mailing Lists Help | MarkMail Help



   Re: [xml-dev] Question for updating existing XML file

[ Lists Home | Date Index | Thread Index ]

Discussion on how to "validate an update" / "revalidate after update".

Brian Candler wrote:

>On Wed, Jul 28, 2004 at 10:12:20AM -0700, Jeff Greif wrote:
>>Unfortunately, the 'minimum set of checks' depends upon what you're
>>validating against.
>Precisely - I'm trying to establish what would be a good thing to validate
>against for this purpose :-)
My guess is Structural constraints as given by DTDs and W3C XML Schemas 
are slightly easier to revalidate.

For Relax NG, it depends on the kind of schema - it can be easy, but it 
can also be hard.

For crazy scientists, the former two are top-down deterministically 
recognizable, the latter not.
The simplest example is maybe this:
<f><a/>...<a/></f><b/> is valid
<f><b/>...<b/></f><a/> is valid
everything else is not.

This description (some might call it co-occurrence constraint) is not 
top-down deterministically recognizable: when you check the children of 
f, you have to remember that you need to check a b later, something 
which deterministic top-down tree automata cannot. instead 
nondeterministic ones (like the ones that correspond to Relax, they just 
"guess" what they need to expect. The implementation must then be 
bottom-up, i.e. from leaves to root)

Whether inserting an <a/> under f is ok, depends on the node to the 
right of the f. These dependencies can get arbitrarily nested, making 
revalidation of the whole tree necessary.

>>A feasible conceptual mechanism (leaving aside performance problems on large
>>files) would be to update a PSVI-decorated DOM, and revalidate (using DOM
>>Level 3 APIs) the nodes rooting the altered subtrees.
>I had not come across DOM validation until now. The spec mentions W3C Schema
>and DTD almost in passing; is it intended to work with any validation
>mechanism, or just these two? For example, with Schematron, it seems quite
>difficult to determine what are the valid children you could add at this
>point in the tree.
As I see it, having a DOM tree is conceptually equivalent to having 
whatever other sort of tree (though DOM is quite bloated).

Here's the idea from above, a bit elaborated: Assume your DOM is valid 
before the update.
Your validator should then have visited every node, carrying some 
information down there.
Modify it so that it stores this information "in" every node that it visits.

Now, when you do the update, you replace the "validation information" in 
every node from the path to the root by "unknown".

When you revalidate the whole tree, you will not need to descend in 
every node.

The rest depends on the validation mechanism:

DTDs are based on regular expression (even deterministic), so all you 
need to store is the state number of a (deterministic) finite state 

XML Schema is a bit trickier, but essentially the same.

Relax NG is based on tree regular expressions, so you will store the 
configuration of a finite state tree automaton of some sort.

Assuming the thing you inserted was valid, in the former two cases, it 
is enough to run the dfa on the "rest" of the sequence.
For Relax NG, you might (depending on your schema and your validation 
implementation) in the worst case need to traverse great portions of the 
entire tree, because of the "non-local" constraints you need to check. 
If you use somehing like pushdown hedge automata, you might get away 
with only traversing the right-upper half, but you need to store a pair 
of a state and a list of states in every node, which can be expensive 
for deeply nested XML trees.



News | XML in Industry | Calendar | XML Registry
Marketplace | Resources | MyXML.org | Sponsors | Privacy Statement

Copyright 2001 XML.org. This site is hosted by OASIS