OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Schema ambiguity detection algorithm for RELAX (1/4)




While XML Schema gains more and more weight, we have two relatively
light-weight schema language. One is RELAX, and the other is TREX.
And I think many people is curious about the difference of these two
languages.


This post proposes an algorithm to detect the ambiguity of schema
written by RELAX.

This is a very long post, but this is only the first quarter of the
whole story(!)

One of the purpose is, I really want other people's eyes to verify this
algorithm.
Another purpose is to prove that such an algorithm is actually possible
for RELAX, whereas it's probably much harder for TREX.

I'd appreciate any comments from you about this.


If you are not familiar with RELAX, go to http://www.xml.gr.jp/relax/
If you don't know TREX, go to http://www.thaiopensource.com/trex/






Ambiguity Detection Algorithm for RELAX
=======================================

This paper proposes an algorithm to detect ambiguity for grammars(schema)
written in RELAX.


In this paper, I assumes that every grammar must satisfies the following
constraints.

* Each element rule has an unique label name
* hedge rules are not present.

It is trivial to convert any valid RELAX grammar into this "normalized"
form.



1. What do I mean by "ambiguous grammar".

Say G is a grammar written in RELAX. Let us think about "ambiguous
grammar" first.

Let X be an XML instance that is valid to G. Since X is valid, verifier
can assign label and role for every element in X (this step is called
"type-assignment"). If G is a well-written grammar, the assignment of
label/role can be done uniquely. That is, only one assignment satisfies
all the constraints imposed by the grammar G.

However, if G has some flaw, sometimes it is possible to assign
label/role in more than one way. I'm talking about these grammars when I
say "grammar is ambiguous", 

Formally, ambiguity of the grammar G is defined as follows:

----------------------------
There exists a valid document X, and there exists at least two
label/role assignment to X
----------------------------

For example, the following grammar is ambiguous.

<elementRule label="start">
  <tag name="start" />
  <choice>
    <ref label="foo.1" />
    <ref label="foo.2" />
  </choice>
</elementRule>

<elementRule label="foo.1">
  <tag name="foo" />
  <ref label="bar" />
  </sequence>
</elementRule>

<elementRule label="foo.2">
  <tag name="foo" />
  <ref label="bar" occurs="*" />
</elementRule>

<elementRule label="bar" type="string">
  <tag name="bar" />
</elementRule>

Because there are two different assignment for the following instance.

<start>        : start
  <foo>        : foo.1? foo.2?
    <bar />    : bar
  </foo>
</start>

However, of course, for some documents that conforms this grammar, we
can assign label/role uniquely.

<start>        : start
  <foo>        : foo.2
    <bar />    : bar
    <bar />    : bar
  </foo>
</start>




2. Interpretation

Verifier can assign label/role for every elements in an XML instance.
The assigned label/role information is called "interpretation".

For example, when I say "interpretation of X by G", it means the entire
label/role information assigned to all elements of X. Or, when I say
"interpretation of element E by G", it means the entire label/role
information assigned to element E and its descendants.

I'll abuse :-) the word "interpretation" further. When I say
"interpretation of element E by the label L", it means the label/role
assignment in which label L is assigned to element E.

By using this new word "interpretation", we can define ambiguity of the
grammar in much concise way.

--------------------------
There exists a valid instance X and at least two different
interpretation of X by G.
--------------------------




3. Ambiguity of label

I've just defined the ambiguity of grammar. In this section, I discuss
the ambiguous relation between two labels and defines it.

What do I mean by "label L and R are in ambiguous relation"? Intuitively,
there exists such a nice XML element E that E can be interpreted as L,
as well as R. I call such an indistinguishable pair "ambiguous relation".

We can extend this definition to define "label L is ambiguous".
Informally, when there exists a nice XML element E such that it can be
interpreted by L in more than one way, we call such a label ambiguous.

Instead of defining "label L is ambiguous" in another way, let us define
"label L is ambiguous" by "label L and L are in ambiguous relation".

Formal definition follows.

--------------------------
Label L and R are in ambiguous relation if and only if there exists an
XML element E, and two different interpretations (one by L and one by R).
--------------------------

Ambiguity of labels is important because actually this is the
fundamental building block of grammar ambiguity.




4. Link between label ambiguity and grammar ambiguity

Let us go back to think about an ambiguous grammar G again. Since G is
an ambiguous grammar, we have X, which is an XML instance that can be
interpreted by G in more than one way.
We name one interpretation I, and another interpretation I'. By
definition of grammar ambiguity, we can actually have such two different
interpretations (I and I').

What does it mean that I and I' are different? The answer is, there
exists at least one XML element E in X such that I and I' give a
different label for this element. Say interpretation I gives label L to
E, and I' gives R to E.

If we look at this situation from another point of view, we can say that
L and R are in ambiguous relation, because E can be interpreted by both
L and R, and this is exactly the definition of label ambiguity!


Therefore, from the above reasoning, it is proved that when the grammar
is ambiguous, there always exists at least one pair of labels that are
in ambiguous relation. (Note that this one pair may be a pair of the same
label.)

And, this proposition also holds (because this is a contraposition of
the previous proposition).

if no pair of labels in the grammar is in ambiguous relation, the
grammar is NOT ambiguous.




5. Catalyst

The bad news is, the above propositions are not the end of the story.
Because the above propositions does NOT guarantee the following
statement.

--------------------------
If there exists an ambiguous pair of labels in the grammar, the grammar
is ambiguous.
--------------------------

And by considering the following example, we know that the above
statement is actually not correct.


<interface>
  <export label="root" />
</interface>

<elementRule label="root">
  <tag name="e" />
  <sequence>
    <ref label="foo" />
    <ref label="bar" />
  </sequence>
</elementRule>

<elementRule label="foo">
  <tag name="e" />
  <empty />
</elementRule>

<elementRule label="bar">
  <tag name="e" />
  <empty />
</elementRule>

Think about XML fragment "<e/>". As you can see, this fragment can be
interpreted by both "foo" and "bar". So "foo" and "bar" are in ambiguous
relation.

However, if we think in terms of grammar, there is only one instance
that can be interpreted by this grammar. That is,

<e>
  <e/>
  <e/>
</e>

And the interpretation for this instance is unique, as follows.

<e>    :root
  <e/> :foo
  <e/> :bar
</e>

As you can see, no other interpretation is possible. Thus the grammar is
unambiguous.


Let us think about a bit more about this example. When we thought in
terms of labels, "foo" and "bar" are actually in the ambiguous relation.
However, when we assigned label by starting from top-level label, we
didn't have any chance to choose between "foo" and "bar". If we had a
chance to choose, we can make this grammar ambiguous. But this grammar
didn't give us such a chance.

And that's why this grammar is NOT ambiguous, although it has ambiguous
labels.

If I modify the grammar so that we have a chance to choose, the modified
grammar becomes ambiguous. See the following modified version.

<elementRule label="root2">
  <tag name="e" />
  <choice>
    <ref label="foo" />
    <ref label="bar" />
  </choice>
</elementRule>

<e>    : root2
  <e/> : foo? bar?
</e>


So, what did we learn? We learned that for a grammar to be ambiguous, we
need the third label, which acts as catalyst. This is the key factor
that makes a grammar ambiguous.
From now on, I call this third label as "catalyst".




6. Properties of catalyst

If label L is a catalyst, label L is ambiguous. The proof can be easily
given by the definition of ambiguous labels.

And this property is perhaps the most important one. By using this
property, I'd like to prove the following statement always holds.

--------------------------
the grammar is ambiguous if and only if it has at least one catalyst.
--------------------------

Effectively, this statement (and the above property) says that the
following three bullets are equivalent.

* the grammar is ambiguous
* the grammar has one (or more) catalyst
* the grammar has one (or more) self-ambigous label

The proof is straight-forward.

(1) G has self-ambiguous label L -> G is ambiguous

Since L is ambiguous, we have a nice element E such that it can be
interpreted in more than one way by L. So by using this E and adding
some more elements, we can create a valid XML instance X, which is
interpreted by more than one ways.

(2) G is ambiguous -> G has self-ambigous label L

Since G is ambiguous, the top-level label of G is by definition
self-ambiguous.

(3) G has a catalyst <-> G has a self-ambiguous label

The proof is given by the previous property of catalyst.








7. Strategy of Algorithm

By the above consideration, we now have a core strategy of the algorithm.
That is,

1. compute the ambiguity between two labels.
2. compute the self-ambiguity of every label
3. if no label is judged self-ambiguous, then the grammar is NOT
   ambiguous.
4. if any label is judged self-ambiguous, then the grammar is ambiguous.
5. otherwise, the grammar is unambiguous.
   this proof is not trivial, so let me explain this in another mail.


Consider the step-1 further by studying the following example.

<elementRule label="foo" role="X">
  <ref label="alpha" />
</elementRule>

<elementRule label="bar" role="X">
  <ref label="bravo" />
</elementRule>

We have two labels and we'd like to compute if these two are in ambiguous
relation. Since we have the same role for both labels, the ambiguity of
these two solely depends on the ambiguity between "alpha" and "bravo".

If "alpha" and "bravo" are in ambiguous relation, we can say that "foo"
and "bar" is also in ambiguous relation.  If they are not, then "foo"
and "bar" is also NOT in ambiguous relation.

So as you see, sometimes ambiguity relation between labels depends on
the ambiguity relation between other labels.


With this in mind, I'd like to explain the core strategy.

(1) First of all, create a "score matrix". Say we have five labels
(A,B,C,D, and E). The score matrix is as follows.

| B| C| D| E|
+--+--+--+--+--
|  |  |  |  |A
+--+--+--+--+--
   |  |  |  |B
   +--+--+--+--
      |  |  |C
      +--+--+--
         |  |D
         +--+--

Note that there are no A vs A or no B vs B. At this moment, we don't
want to test self-ambiguity of the label.

If the box is left blank, it means we don't know if those two are in
ambiguous relation or not.
When the box is marked by #, it means we DO know that those two are in
ambiguous relation.
And we'll mark it by '.' when we DO know that those two are NOT in
ambiguous relation.


(2) Compute label ambiguity.

For every box that is left blank, compute the ambiguity between the two
again. And mark the result as I described earlier. This algorithm is
also non-trivial. So let me explain this in the separate mail.

If we get any new information from this, do step (2) again. As we
considered earlier, some depends on the other, so new information may
help bringing an end to another pair.

If no blank left, go to the next step.

If some blank are left but none of them can be decided, then we are in
trouble. We'll think about this situation in another mail.

To cut the long story short, in this case, we can safely assume that
these labels are not ambiguous. So mark all the blanks by '.' and go to
step 3.


(3) Find a catalyst.

For every label in the grammar, check whether that label can be a
catalyst or not. This algorithm will be also explained in the separate
mail.

If such a label is found, the entire grammar is ambiguous. If no such
label is found, then the entire grammar is NOT ambiguous.




regards,
----------------------
K.Kawaguchi
E-Mail: k-kawa@bigfoot.com