Schema ambiguity detection algorithm for RELAX (3/4)
- From: "K.Kawaguchi" <k-kawa@bigfoot.com>
- To: xml-dev@lists.xml.org
- Date: Fri, 26 Jan 2001 11:44:05 -0800
Checking ambiguity of two labels
================================
This post proposes an algorithm that detects the ambiguity of two labels,
as mentioned in the previous post.
I named this algorithm "detectTwoLabelsAmbiguity" (dTLA), and it is the
only topic of this post.
dTLA receives the following items as its input.
* The first label L1
(all the information written in one elementRule)
* The second label L2
(all the information written in one elementRule)
* "Score matrix"
as defined in the first post. This matrix is only partially complete.
So some fields may be blank.
As we saw in the first post, sometimes information of score matrix is
necessary to determine the ambiguous relation of L1 and L2, and that's
why dTLA receives it as a parameter.
dTLA produces one of the following outputs, and it always stops (it
never hangs). A small interface sketch in Python follows this list.
* L1 and L2 are in the ambiguous relation
* L1 and L2 are NOT in the ambiguous relation
* Currently, the relation is undecidable.
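Here is a minimal sketch of that interface in Python. All identifiers
are mine and purely illustrative; the post only fixes the name dTLA,
the three inputs, and the three outcomes.

    from enum import Enum

    class Relation(Enum):
        """The three possible outputs of dTLA."""
        AMBIGUOUS     = 1   # L1 and L2 are in the ambiguous relation
        NOT_AMBIGUOUS = 2   # L1 and L2 are NOT in the ambiguous relation
        UNDECIDABLE   = 3   # the relation cannot be decided yet

    def detect_two_labels_ambiguity(label1, label2, score_matrix) -> Relation:
        """dTLA: decide the relation of two labels (elementRules),
        given the partially filled score matrix.  Always terminates."""
        raise NotImplementedError  # filled in by the sketch after the
                                   # formal definition below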
--- note for people who majored in computer science ------------------
Intuitively, this algorithm computes the "common language" of the
automaton of L1 and the automaton of L2. This is a very classic problem
of computer science. If the common language is the empty set, we know
that these two labels are not in the ambiguous relation. If the common
language contains some alphabet sequences, then we know that these two
are in the ambiguous relation.
While computing the common language, we have to equate ambiguous
labels. Furthermore, this ambiguity relation cannot be treated as an
equivalence relation because it lacks transitivity.
----------------------------------------------------------------------
Let me first formulate the algorithm in mathematical terms. It's
somewhat difficult to understand, but the mathematical terms give you a
precise definition.
After that, I'll try to explain it in English.
Let the content model automaton of L1 be M as follows:
M = ( Q , S , T , q0 , F )
T is the transition function of this automaton.
And let the automaton of L2 be M' as follows:
M'= ( Q', S , T', q0', F')
Note that the two automata share the same alphabet S (S is the set of
all labels in this grammar).
This algorithm derives the following automaton from M and M':
M* = ( QxQ', S*, T*, (q0,q0'), FxF' )
where S* is a subset of SxS and defined as follows.
S* = { (s,s') | s and s' are members of S
and
( s=s' or s and s' are in the ambiguous relation) }
T* is defined as follows:
T*( (q,q'), (s,s') ) = ( T(q,s) , T'(q',s') )
If the language accepted by M* is not empty, then the algorithm says "L1
and L2 are in the ambiguous relation" and stops.
Otherwise, the algorithm derives the following automaton:
M+ = ( QxQ', S+, T*, (q0,q0'), FxF' )
where S+ is a subset of SxS and defined as follows.
S+ = { (s,s') | s and s' are members of S
and
( s=s' or s and s' are in the ambiguous relation,
or the relation of s and s' is not decided yet.) }
If the language accepted by M+ is empty, then the algorithm says
"L1 and L2 are not in the ambiguous relation" and stops.
Otherwise, it says "currently, the relation is undecidable" and stops.
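To make the two emptiness tests concrete, here is a minimal Python
sketch of the core of dTLA. It assumes (my simplification, not stated
in the post) that the content models of L1 and L2 have already been
compiled into deterministic automata m1 and m2 with partial transition
tables, and that the score matrix is wrapped in a callback
relation(s, s') that returns a Relation value, or None for a blank
field. The Relation values from the earlier interface sketch are
redeclared here so the snippet runs on its own.

    from collections import deque, namedtuple
    from enum import Enum

    class Relation(Enum):        # same three outcomes as in the sketch above
        AMBIGUOUS = 1
        NOT_AMBIGUOUS = 2
        UNDECIDABLE = 3

    # A content-model automaton: delta is a partial transition table,
    # given as a dict {(state, label): next_state}.
    Automaton = namedtuple("Automaton", "alphabet delta start finals")

    def product_accepts_something(m1, m2, pair_allowed):
        """Emptiness test for the product automaton whose alphabet is the
        set of label pairs (s,s') with pair_allowed(s,s'), and whose
        transition function is T*((q,q'),(s,s')) = (T(q,s), T'(q',s')).
        Its language is non-empty iff an accepting pair state is
        reachable from (q0, q0')."""
        start = (m1.start, m2.start)
        seen, queue = {start}, deque([start])
        while queue:
            q1, q2 = queue.popleft()
            if q1 in m1.finals and q2 in m2.finals:
                return True                   # the product accepts something
            for s1 in m1.alphabet:
                for s2 in m2.alphabet:
                    if not pair_allowed(s1, s2):
                        continue              # (s1,s2) is not in the alphabet
                    n1 = m1.delta.get((q1, s1))
                    n2 = m2.delta.get((q2, s2))
                    if n1 is not None and n2 is not None and (n1, n2) not in seen:
                        seen.add((n1, n2))
                        queue.append((n1, n2))
        return False                          # no accepting pair state reachable

    def detect_two_labels_ambiguity(m1, m2, relation):
        """dTLA core: m1, m2 are the content-model automata of L1 and L2;
        relation(s, s') reads the score matrix."""
        # S*: pairs of labels that are identical or already known ambiguous.
        in_s_star = lambda s1, s2: s1 == s2 or relation(s1, s2) == Relation.AMBIGUOUS
        if product_accepts_something(m1, m2, in_s_star):      # L(M*) is not empty
            return Relation.AMBIGUOUS
        # S+: additionally allow pairs whose relation is not decided yet.
        in_s_plus = lambda s1, s2: (in_s_star(s1, s2)
                                    or relation(s1, s2) in (None, Relation.UNDECIDABLE))
        if not product_accepts_something(m1, m2, in_s_plus):  # L(M+) is empty
            return Relation.NOT_AMBIGUOUS
        return Relation.UNDECIDABLE

Because the product has at most |Q| x |Q'| states and each state is
enqueued at most once, both reachability searches terminate, which is
why dTLA never hangs.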
OK, the definition part is done. Let me explain.
Before the actual explanation of this algorithm, let us think about the
classic problem of computing the intersection of two automata M and M'.
The automaton that accepts this intersection, call it N, is defined as
follows:
N = ( QxQ' , S, Tn, (q0,q0'), FxF' )
where
Tn( (q,q'), s ) = ( T(q,s), T'(q',s) )
Consider an example of transitions over this N:
(q0,q0') --- s1 ---> (q1,q1') --- s2 ---> (q2,q2')
When N receives {s1,s2}, its state will be (q2,q2'). By definition,
this means that when M receives {s1,s2}, its state will be q2, and
when M' receives {s1,s2}, its state will be q2'.
Now you see why this definition works.
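Here is a tiny Python sketch of that classical construction. The two
automata and their transition tables are made up for illustration;
only the product step mirrors the definition of Tn above.

    # Two toy automata over the shared alphabet {"s1", "s2"}:
    #   M  accepts the single sequence s1 s2  (q0 -s1-> q1 -s2-> q2)
    #   M' accepts the single sequence s1 s2  (q0'-s1-> q1'-s2-> q2')
    T  = {("q0",  "s1"): "q1",  ("q1",  "s2"): "q2"}
    T_ = {("q0'", "s1"): "q1'", ("q1'", "s2"): "q2'"}

    def Tn(pair_state, symbol):
        """Product transition Tn((q,q'), s) = (T(q,s), T'(q',s))."""
        q, q_ = pair_state
        return (T[(q, symbol)], T_[(q_, symbol)])

    state = ("q0", "q0'")
    for s in ["s1", "s2"]:
        state = Tn(state, s)
        print(s, "->", state)
    # After s1 the product is in (q1,q1'); after s2 it is in (q2,q2').
    # Since q2 and q2' are the accepting states of M and M', the pair
    # (q2,q2') is accepting in N, so s1 s2 is in the intersection.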
Now let's go back to the algorithm. In this algorithm, we have to
identify some labels as identical. Say we have two labels X and Y,
which are in the ambiguous relation.
For the automata M and M', X and Y are alphabet symbols. By the
definition of label ambiguity, there exists an XML fragment that can be
interpreted by both X and Y. If we read such a fragment now, we can
perform a transition by X in M, and by Y in M' (or vice versa).
Therefore, our M* has pairs of labels (a subset of SxS) as its
alphabet. In the above example, M* receives (X,Y) as one symbol and
performs a transition with it.
Consider the following transition sequence performed by M* (an easy
example). Note that s1,s1',s2,s2' are labels.
(q0,q0') ---(s1,s1)---> (q1,q1') ---(s2,s2)---> (q2,q2')
For this transition sequence to exist, there must be an XML hedge that
can be interpreted by the label sequence {s1,s2}. Trivially, such a
hedge exists.
And this transition sequence says that, by reading the sequence
{s1,s2}, M will be in q2, and M' will be in q2'.
Let's look at another example.
(q0,q0') ---(s1,s1')---> (q1,q1') ---(s2,s2')---> (q2,q2')
This time, the assumption implied by this transition sequence is not
trivial. It assumes that there exists a hedge that can be interpreted by
{s1,s2} as well as {s1',s2'}.
However, by taking a closer look at the definition of S*, the existence
of such a hedge can be proved. If (s1,s1') is a member of S*, then s1
and s1' must be identical or in the ambiguous relation. The same holds
for (s2,s2').
By the definition of label ambiguity, there exists an XML element that
can be interpreted by both s1 and s1'. Things are the same for s2 and s2'.
As a whole, we now know that there actually exists a hedge that can be
interpreted by both {s1,s2} and {s1',s2'}. In fact, it can even be
interpreted by {s1,s2'} and {s1',s2}.
And this transition sequence says that by reading this nice hedge, M
will be in q2 and M' will be in q2'.
Say L1 and L2 are in fact in the ambiguous relation (this is
a hypothesis). Then, from the definition of label ambiguity, there
exists a hedge (let it be <c1>...</c1><c2>...</c2> ... )
Its XML representation:
<c1>
...
</c1>
<c2>
...
</c2>
...
And there exist two interpretations I and I' such that the label
sequence
{ I (c1), I (c2), I (c3), ... } is accepted by M and
{ I'(c1), I'(c2), I'(c3), ... } is accepted by M'.
When we "zip" the above two sequences, we have
{ (I(c1),I'(c1)) , (I(c2),I'(c2)) , .... }
Each pair (I(ci),I'(ci)) is a member of S+: both labels interpret the
same element ci, so they are identical or actually in the ambiguous
relation, and the partially filled score matrix can therefore mark them
at worst as "not decided yet". Hence this sequence is accepted by M+
(as a reminder, M+ treats all undecided relations as if they were
ambiguous).
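As a tiny illustration of this zipping step (the label names below are
made up):

    # Hypothetical label sequences produced by the two interpretations:
    I_labels  = ["s1",  "s2",  "s3" ]   # { I(c1),  I(c2),  I(c3)  }, accepted by M
    I_labels_ = ["s1'", "s2'", "s3'"]   # { I'(c1), I'(c2), I'(c3) }, accepted by M'

    # Zipping the two sequences gives the pair sequence fed to M+:
    pairs = list(zip(I_labels, I_labels_))
    # pairs is [("s1","s1'"), ("s2","s2'"), ("s3","s3'")]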
So, in short, I have now proved that if L1 and L2 are in fact in the
ambiguous relation, then M+ always accepts something.
The contrapositive of this proposition is:
"if M+ doesn't accept anything, then L1 and L2 are in fact NOT in the
ambiguous relation".
So the first half of the proof is done. Next, I'll prove the second half.
Say M* accepts something (this is the hypothesis). Say it accepts the
following sequence.
{ (s1,s1') , (s2,s2') , ... }
By the definition of M*, if M* accepts this sequence, M accepts the
following sequence
{ s1, s2, ... }
and M' accepts the following sequence.
{ s1', s2', ... }
And from the definition of S*, we know that s1 and s1' are identical or
in the ambiguous relation (and the same holds for every other pair in
the sequence).
Therefore, we know that there exists a hedge that can be interpreted by
both {s1,s2,...} and {s1',s2',...}, and an element that has this hedge
as its children can be interpreted by both L1 and L2 (since their
content-model automata accept these label sequences).
Therefore, L1 and L2 are in fact in the ambiguous relation, because we
know there actually exists a tree that can be interpreted by both L1
and L2.
So, as a whole, I have proved the following:
* If M* accepts something, then L1 and L2 are in fact in the ambiguous
relation
* If M+ accepts nothing, then L1 and L2 are in fact NOT in the ambiguous
relation
Therefore, this algorithm is proved to be correct.
regards,
----------------------
K.Kawaguchi
E-Mail: k-kawa@bigfoot.com