[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Schema ambiguity detection algorithm for RELAX (4/4)

From: "K.Kawaguchi" <k-kawa@bigfoot.com>
To: xml-dev@lists.xml.org
Date: Fri, 26 Jan 2001 11:44:09 -0800

This is the final leg of our long journey. And fortunately, this is the
easiest of the four.


In this post, the algorithm that detects if a label is a catalyst is
proposed. This algorithm is named "detectCatalyst" and "dC" in short.


In the first mail, the definition of catalyst is somewhat abstract.

(for your reminder)
> So, what did we learn? We learned that for a grammar to be ambiguous, we
> need the third label, which acts as catalyst. This is the key factor
> that makes a grammar ambiguous.
> From now on, I call this third label as "catalyst".


First, I'll define catalyst in more formal way. Then, I'll prove that
this definition complies the following requirement of catalyst.

> the grammar is ambiguous if and only if it has at least one catalyst.

After that, I'll propose an algorithm to detect a catalyst. Finally,
I'll prove that the algorithm is correct.



1. Formal definition of catalyst

OK. The first part will be the formal definition of catalysts. Say we
have two DIFFERENT sequences of labels, and say their lengths are the
same.

{ s1 , s2, ... , sn }
{ s1', s2', ... , sn' }

Say for every pair (si,si'), either si=si' or si and si' are in the
ambiguous relation. Since they are two different sequences, we have at
least one pair such that si!=si' (but they are in the ambiguous relation).

The formal definition of catalysts follows:

The label L is a catalyst if and only if 
(1) there exists two such sequences, and
(2) label L can accepts both of them




2. Consistency of the formal definition with respect to the requirement

The second part is to prove that this definition satisfies the following
requirement.

> the grammar is ambiguous if and only if it has at least one catalyst.

Say the label L is a catalyst. From the definition, we know that there
actually exists two such sequences.

Every pair (si,si') of two sequences, there exists a tree that can be
interpreted by both si and si' (if si=si' then this tree is trivial. if
si!=si' then the existence is guaranteed by the ambiguous relation).

By horizontally combining those trees, we can say that there exists a
hedge that can be interpreted by both sequences. And by definition of
catalyst, label L can accept both interpretation.
Therefore, the entire grammar is ambiguous, by the definition of grammar
ambiguity.

So, I just proved that if there exists a catalyst in a grammar, it is an
ambiguous grammar.

The rest is to prove that if a grammar is ambiguous, there exists a
catalyst in it.

Say the grammar G is ambiguous. In the first post, I proved that if G is
ambiguous, there exists at least one self-ambiguous label. Name it L.
By definition of self-ambiguous label, we can say that there exists a
hedge X such that X can be interpreted in more than one way (I and I'),
and both I and I' are acceptable by L.

These I and I' satisfies the two requirements of being "two sequences"
that I described in the first part of this post. Therefor, L is a
catalyst.

As a whole, I proved that the formal definition of catalyst is in fact
well-defined one, and consistent with the requirement of the first post.




3. "detectCatalyst" algorithm.

The third part proposes actual algorithm to detect if a label is a
catalyst or not. I name it "detectCatalyst" or "dC" in short.

dC receives the followings as input.

(1) Label L, which is going to be inspected.
(2) Score matrix A, which is (maybe not-fully) completed by iterative
execution of dTLA.

We know that some fields of the score matrix can be still blank.

dC produces one of the following output and always stops.

(1) Label L is a catalyst
(2) Label L is not a catalyst


Intuitively, this algorithm finds the "two sequences" that I described
in the previous section.

Let the content model automaton of L be M as follows:

M = ( Q, S, q0, F )

This algorithm constructs the following automaton M* from M.

M* = ( QxQ, S*, T*, (q0,q0), FxF )

S* = { (s,s') | s,s' are members of S, and
                       ( s=s' or s and s' are in the ambiguous relation) }

T*( (q,q'), (s,s') ) =  (  T(q,s),  T(q',s')  )

And let us write this M* down to some big whiteboard. When writing a
edge of (s,s), use a black pen please. When writing a edge of the form
of (s,s'), please use a red pen.

This algorithm says "L is a catalyst" if there exists a path from (q0,q0)
to a final state that uses at least one red edge.

If no such path exists, the algorithm says "L is not a catalyst" and
stops.

Many algorithm can be possible to detect the existence of such a path.
One way to do is minimize M* and see if we have any red edge in the
result. If such an edge exists, such a path exists. If the edge doesn't
exist, neither does the path.




4. Why does this algorithm works?

Say dC said "L is a catalyst" and stopped. This only occurs if there
exists a sequence or label pair

{  (s1,s1')  ,  (s2,s2')  ,  (s3,s3')  , ... }

And this sequence is accepted by M*. And we know that at least one of
the pair is a red edge. That means si!=si' for a certain i but si and si'
are in the ambiguous relation.

By "unzipping" this sequence, we obtain two sequences of the same length.

{ s1, s2, ... }
{ s1', s2', ... }

From the way I defined T*, we can say that both sequences can be
accepted by M. So these two sequences are the "two sequences" that
defines L being in fact a catalyst.

In summary, I proved that if dC says "it's a catalyst", then it actually
is.


Next, I'll prove that if it is actually a catalyst, the algorithm always
says "it's a catalyst".

If L is actually a catalyst, there exists the two sequences

{ s1, s2, ... }
{ s1', s2', ... }

Every pair (si,si') is either equal or in the ambiguous relation. And at
least one pair is not equal.

Furthermore, both of the sequence can be accepted by M by the following
transition sequences.

 q0 --s1---> q1  --s2---> q2  ....   qf
 q0 --s1'--> q1' --s2'--> q2' ....   qf'

Since both are accepted by M, both qf and qf' are members of F.

Let's "zip" those two sequences into one sequence.

{  (s1,s1'),  (s2,s2'),  ... }

What happens to M* if it receives this sequence? The transition sequence
of M* will be as follows:


(q0,q0) --(s1,s1')-->  (q1,q1')  --(s2,s2')-->  (q2,q2')

              ...  --> (qf,qf')

As I mentioned earlier, both qf and qf' are members of F. So (qf,qf') is
the final state of M*, which means this sequence is accepted by M*.

And the fact that at least one pair (si,si') is not equal guarantees
that this transition uses a red edge for such pair.
From the second post, we know that our score matrix is never blank for
those (si,si').

So, in cases like this, the algorithm always stops by saying "L is a
catalyst".

This proves that if it is actually a catalyst, the algorithm always
says "it's a catalyst".



As a whole, I proved to thing

* dC says it's a catalyst  -> It actually is.
* It actually is a catalyst -> dC says so.

And these two guarantees that dC is a correct (sound as well as complete)
algorithm.


regards,
----------------------
K.Kawaguchi
E-Mail: k-kawa@bigfoot.com
Prev by Date: Schema ambiguity detection algorithm for RELAX (3/4)
Next by Date: Schema ambiguity detection algorithm for RELAX (2/4)
Previous by thread: Schema ambiguity detection algorithm for RELAX (3/4)
Next by thread: Schema ambiguity detection algorithm for RELAX (2/4)
Index(es):
- Date
- Thread