[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Schema ambiguity detection algorithm for RELAX (2/4)
- From: "K.Kawaguchi" <k-kawa@bigfoot.com>
- To: xml-dev@lists.xml.org
- Date: Fri, 26 Jan 2001 11:43:59 -0800
In this post, We will make a closer look to blanks in the score matrix,
which I proposed in the previous post.
In that post, I said when blanks are left, they can safely be assumed as
unambiguous. But why? this post answers this question.
To explain the proof clearly, I need a help of maneki-neko, our
well-known mascot cat. (BTW, say hello to him at http://www.xml.gr.jp/relax/)
Let G be an ambiguous grammar. More precisely, we don't know that G is
ambiguous. But since our cat is almighty, he knows it is.
And he knows the nice XML instance X, which can be interpreted in two
ways I and I', whereas we only know the existence of X.
Let me ask him a favor. Since he knows X and I and I'. I ask him to
pick up all elements that are given the different label by I and I'.
Since we have no idea about X, we cannot know these nodes.
However, we do know that such elements are disjoint, and therefore can
be considered as a set of interconnected parts. And each interconnected
part forms a tree of its own.
Ask him another favor. I want him to select one interconnected part(we
know it's a tree) from nodes he just picked up.
We have absolutely no idea of the tree he selected, but we know that
that tree is finite in terms of size.
One more favor. I want him to look at labels of the root node of the
tree he selected. Since we first ask him to pick up only those elements
that have two different labels, this root node has two different labels, too.
Name them Lr1 and Lr2, respectively.
We don't know what Lr1 and Lr2 actually are (only he knows), but we do
know that those labels are in the grammar G, and Lr1!=Lr2.
What I'd like to prove is, for these Lr1 and Lr2, our score matrix
always have "#" mark, which indicates they are in the ambiguous
relation, even when we don't know Lr1 and Lr2. (**)
We never know what Lr1 and Lr2 are, but for every possible such Lr1 and
Lr2, we have "#" mark on our score matrix. That's what I will prove.
Let's denote the tree, which is selected by our cat, by T. All we know
about T is its size is finite. And this the only thing we need to prove.
When the size of tree is finite, we always have at least one "leaf",
which is a node without any children (it may have children that do not
have different labels). Consider the set of all such nodes and name it
N0.
Again we don't know what nodes are in N0 at all, but we do know that
such a node actually exists. So name it n. n has two different labels,
so name it I(n) and I'(n) respectively. We do know that both I(n) and
I'(n) are members of the grammar, but that's the only thing we know.
n may have some children. Even so, All of its children have only one
label assigned to it, since n is a leaf of T.
Although we don't know I(n) and I'(n), since our algorithm tests every
possible pair of two labels, we test I(n) and I'(n) in this process,
without knowing so.
And our algorithm that tests the ambiguity of two labels
(detectTwoLabelsAmbiguity, dTLA in short ) is capable of detecting
ambiguity of these two even if all the score matrix are blank.
dTLA algorithm will be explained in the next post in better detail.
In short, the fact that each children of n has only one label, while n
has two different labels, are so decisive that dTLA cannot miss this
situation.
In summary, although only the cat knows the exact members of N0, we know
that for every such node n in N0 and for the two different labels I(n)
and I'(n), our score matrix always have "#" mark for them.
So if the height of T is 1, (**) is proved.
Do you think he may select a tree of height 2 or more? No problem.
If the height of T is 2 or more, we always have at least one node that
satisfies the following requirement.
* every children of n is either
(1) a member of N0, or (2) has only one label.
There may be more than one nodes that satisfy this requirement. But
there always at least one such node. Consider a set of all such nodes
and name it N1.
Again, we don't know anything about N1, except its size is not zero. So
let him pick one and name it n.
And again we name n's two different labels I(n) and I'(n) respectively.
We only know that these are the members of the grammar G. However, again
we tests every possible pair in the algorithm, we tests I(n) and I'(n)
without knowing.
And (again), I'll prove that our dTLA algorithm is wise enough to decide
these two labels are in the ambiguous relation if only one condition is
met. That condition is, we have # mark for every label pair that can be
assigned to nodes in N0.
As I mentioned earlier, those label pairs are so apparently ambiguous
that dTLA cannot miss it. So after the first round of testing, we have #
mark for every such possible label pairs.
So, in the second round of testing, this condition is always met.
What happens to dTLA if some information are missing? I'll prove in the
next post that dTLA never makes mistake. So the possible result from
dTLA under insufficient information is either
(1) it cannot decide ambiguity
(2) it detects these two are in the ambiguous relation.
And as you see, both are fine.
Some of you may think why dTLA can decide those two labels are in the
ambiguous relation if this condition is met. But to cut the long story
short, I'll skip this proof. (It's trivial, isn't it?)
As a whole, if the tree selected by him has 2 or less height, we know
that our score matrix always have "#" mark for every possible Lr1 and
Lr2, after certain iterative testing.
Now I can say that I proved (**) for height 2 or less.
In general, when the height of T is h or more, we can predict the
existence of n(h-1) such that
* every child of n(h-1) is either
(1) a member of N0, or
(2) a member of N1, or
(3) ...
(4) a member of N(h-2), or
(5) has only one label
And by the induction on the height of tree, I prove that for every tree
of finite height, (**) holds.
We don't know what he selects as T, but for whatever T he selects, (**)
is guaranteed.
Next, consider the parent of T. Because we've "normalized" the grammar,
the top level label is always S, the "virtual" root of XML instance may
not be a member of T. Therefore, T has always a parent node. So name it
p.
Again (oh, again!) we have absolutely no idea about what p is like. Its
existence is the only knowledge that we have.
p may have children. Name it c1,c2, ...
We don't know if it exists, nor how many. The cat knows everything, but
we don't. However, we can say that c1 has
* assigned the same label by both I and I'
* assigned two different label by I and I'
If the second is the case, such a child can be a root of T, so our score matrix
already has "#".
And since by the definition of tree T, p has only one label. Name it Lp.
We know that Lp is by definition a catalyst. The algorithm that checks
if a label is a catalyst (name it detectCatalyst "dC" which I'll explain
in the fourth post) is capable of telling that such Lp is a catalyst.
Intuitively this is because our score matrix has enough information (for
every possible label pairs that can be assigned to children of p, our
score matrix has a "#".)
As you see, the cat didn't tell me anything about X or T or p from
beginning to end. But the algorithm nonetheless detects that the grammar
is ambiguous.
So, in short I proved "if the grammar is in fact ambiguous, the
algorithm always detects it". Furthermore, it is trivial to prove that
the grammar is ambiguous if the algorithm detects the ambiguity.
By these two proof, I can say that this algorithm is a perfect (I mean,
complete as well as sound) algorithm to detect the ambiguity.
regards,
----------------------
K.Kawaguchi
E-Mail: k-kawa@bigfoot.com