[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

*From*:**"K.Kawaguchi" <k-kawa@bigfoot.com>***To*: xml-dev@lists.xml.org*Date*: Fri, 26 Jan 2001 11:43:54 -0800

While XML Schema gains more and more weight, we have two relatively light-weight schema language. One is RELAX, and the other is TREX. And I think many people is curious about the difference of these two languages. This post proposes an algorithm to detect the ambiguity of schema written by RELAX. This is a very long post, but this is only the first quarter of the whole story(!) One of the purpose is, I really want other people's eyes to verify this algorithm. Another purpose is to prove that such an algorithm is actually possible for RELAX, whereas it's probably much harder for TREX. I'd appreciate any comments from you about this. If you are not familiar with RELAX, go to http://www.xml.gr.jp/relax/ If you don't know TREX, go to http://www.thaiopensource.com/trex/ Ambiguity Detection Algorithm for RELAX ======================================= This paper proposes an algorithm to detect ambiguity for grammars(schema) written in RELAX. In this paper, I assumes that every grammar must satisfies the following constraints. * Each element rule has an unique label name * hedge rules are not present. It is trivial to convert any valid RELAX grammar into this "normalized" form. 1. What do I mean by "ambiguous grammar". Say G is a grammar written in RELAX. Let us think about "ambiguous grammar" first. Let X be an XML instance that is valid to G. Since X is valid, verifier can assign label and role for every element in X (this step is called "type-assignment"). If G is a well-written grammar, the assignment of label/role can be done uniquely. That is, only one assignment satisfies all the constraints imposed by the grammar G. However, if G has some flaw, sometimes it is possible to assign label/role in more than one way. I'm talking about these grammars when I say "grammar is ambiguous", Formally, ambiguity of the grammar G is defined as follows: ---------------------------- There exists a valid document X, and there exists at least two label/role assignment to X ---------------------------- For example, the following grammar is ambiguous. <elementRule label="start"> <tag name="start" /> <choice> <ref label="foo.1" /> <ref label="foo.2" /> </choice> </elementRule> <elementRule label="foo.1"> <tag name="foo" /> <ref label="bar" /> </sequence> </elementRule> <elementRule label="foo.2"> <tag name="foo" /> <ref label="bar" occurs="*" /> </elementRule> <elementRule label="bar" type="string"> <tag name="bar" /> </elementRule> Because there are two different assignment for the following instance. <start> : start <foo> : foo.1? foo.2? <bar /> : bar </foo> </start> However, of course, for some documents that conforms this grammar, we can assign label/role uniquely. <start> : start <foo> : foo.2 <bar /> : bar <bar /> : bar </foo> </start> 2. Interpretation Verifier can assign label/role for every elements in an XML instance. The assigned label/role information is called "interpretation". For example, when I say "interpretation of X by G", it means the entire label/role information assigned to all elements of X. Or, when I say "interpretation of element E by G", it means the entire label/role information assigned to element E and its descendants. I'll abuse :-) the word "interpretation" further. When I say "interpretation of element E by the label L", it means the label/role assignment in which label L is assigned to element E. By using this new word "interpretation", we can define ambiguity of the grammar in much concise way. -------------------------- There exists a valid instance X and at least two different interpretation of X by G. -------------------------- 3. Ambiguity of label I've just defined the ambiguity of grammar. In this section, I discuss the ambiguous relation between two labels and defines it. What do I mean by "label L and R are in ambiguous relation"? Intuitively, there exists such a nice XML element E that E can be interpreted as L, as well as R. I call such an indistinguishable pair "ambiguous relation". We can extend this definition to define "label L is ambiguous". Informally, when there exists a nice XML element E such that it can be interpreted by L in more than one way, we call such a label ambiguous. Instead of defining "label L is ambiguous" in another way, let us define "label L is ambiguous" by "label L and L are in ambiguous relation". Formal definition follows. -------------------------- Label L and R are in ambiguous relation if and only if there exists an XML element E, and two different interpretations (one by L and one by R). -------------------------- Ambiguity of labels is important because actually this is the fundamental building block of grammar ambiguity. 4. Link between label ambiguity and grammar ambiguity Let us go back to think about an ambiguous grammar G again. Since G is an ambiguous grammar, we have X, which is an XML instance that can be interpreted by G in more than one way. We name one interpretation I, and another interpretation I'. By definition of grammar ambiguity, we can actually have such two different interpretations (I and I'). What does it mean that I and I' are different? The answer is, there exists at least one XML element E in X such that I and I' give a different label for this element. Say interpretation I gives label L to E, and I' gives R to E. If we look at this situation from another point of view, we can say that L and R are in ambiguous relation, because E can be interpreted by both L and R, and this is exactly the definition of label ambiguity! Therefore, from the above reasoning, it is proved that when the grammar is ambiguous, there always exists at least one pair of labels that are in ambiguous relation. (Note that this one pair may be a pair of the same label.) And, this proposition also holds (because this is a contraposition of the previous proposition). if no pair of labels in the grammar is in ambiguous relation, the grammar is NOT ambiguous. 5. Catalyst The bad news is, the above propositions are not the end of the story. Because the above propositions does NOT guarantee the following statement. -------------------------- If there exists an ambiguous pair of labels in the grammar, the grammar is ambiguous. -------------------------- And by considering the following example, we know that the above statement is actually not correct. <interface> <export label="root" /> </interface> <elementRule label="root"> <tag name="e" /> <sequence> <ref label="foo" /> <ref label="bar" /> </sequence> </elementRule> <elementRule label="foo"> <tag name="e" /> <empty /> </elementRule> <elementRule label="bar"> <tag name="e" /> <empty /> </elementRule> Think about XML fragment "<e/>". As you can see, this fragment can be interpreted by both "foo" and "bar". So "foo" and "bar" are in ambiguous relation. However, if we think in terms of grammar, there is only one instance that can be interpreted by this grammar. That is, <e> <e/> <e/> </e> And the interpretation for this instance is unique, as follows. <e> :root <e/> :foo <e/> :bar </e> As you can see, no other interpretation is possible. Thus the grammar is unambiguous. Let us think about a bit more about this example. When we thought in terms of labels, "foo" and "bar" are actually in the ambiguous relation. However, when we assigned label by starting from top-level label, we didn't have any chance to choose between "foo" and "bar". If we had a chance to choose, we can make this grammar ambiguous. But this grammar didn't give us such a chance. And that's why this grammar is NOT ambiguous, although it has ambiguous labels. If I modify the grammar so that we have a chance to choose, the modified grammar becomes ambiguous. See the following modified version. <elementRule label="root2"> <tag name="e" /> <choice> <ref label="foo" /> <ref label="bar" /> </choice> </elementRule> <e> : root2 <e/> : foo? bar? </e> So, what did we learn? We learned that for a grammar to be ambiguous, we need the third label, which acts as catalyst. This is the key factor that makes a grammar ambiguous. From now on, I call this third label as "catalyst". 6. Properties of catalyst If label L is a catalyst, label L is ambiguous. The proof can be easily given by the definition of ambiguous labels. And this property is perhaps the most important one. By using this property, I'd like to prove the following statement always holds. -------------------------- the grammar is ambiguous if and only if it has at least one catalyst. -------------------------- Effectively, this statement (and the above property) says that the following three bullets are equivalent. * the grammar is ambiguous * the grammar has one (or more) catalyst * the grammar has one (or more) self-ambigous label The proof is straight-forward. (1) G has self-ambiguous label L -> G is ambiguous Since L is ambiguous, we have a nice element E such that it can be interpreted in more than one way by L. So by using this E and adding some more elements, we can create a valid XML instance X, which is interpreted by more than one ways. (2) G is ambiguous -> G has self-ambigous label L Since G is ambiguous, the top-level label of G is by definition self-ambiguous. (3) G has a catalyst <-> G has a self-ambiguous label The proof is given by the previous property of catalyst. 7. Strategy of Algorithm By the above consideration, we now have a core strategy of the algorithm. That is, 1. compute the ambiguity between two labels. 2. compute the self-ambiguity of every label 3. if no label is judged self-ambiguous, then the grammar is NOT ambiguous. 4. if any label is judged self-ambiguous, then the grammar is ambiguous. 5. otherwise, the grammar is unambiguous. this proof is not trivial, so let me explain this in another mail. Consider the step-1 further by studying the following example. <elementRule label="foo" role="X"> <ref label="alpha" /> </elementRule> <elementRule label="bar" role="X"> <ref label="bravo" /> </elementRule> We have two labels and we'd like to compute if these two are in ambiguous relation. Since we have the same role for both labels, the ambiguity of these two solely depends on the ambiguity between "alpha" and "bravo". If "alpha" and "bravo" are in ambiguous relation, we can say that "foo" and "bar" is also in ambiguous relation. If they are not, then "foo" and "bar" is also NOT in ambiguous relation. So as you see, sometimes ambiguity relation between labels depends on the ambiguity relation between other labels. With this in mind, I'd like to explain the core strategy. (1) First of all, create a "score matrix". Say we have five labels (A,B,C,D, and E). The score matrix is as follows. | B| C| D| E| +--+--+--+--+-- | | | | |A +--+--+--+--+-- | | | |B +--+--+--+-- | | |C +--+--+-- | |D +--+-- Note that there are no A vs A or no B vs B. At this moment, we don't want to test self-ambiguity of the label. If the box is left blank, it means we don't know if those two are in ambiguous relation or not. When the box is marked by #, it means we DO know that those two are in ambiguous relation. And we'll mark it by '.' when we DO know that those two are NOT in ambiguous relation. (2) Compute label ambiguity. For every box that is left blank, compute the ambiguity between the two again. And mark the result as I described earlier. This algorithm is also non-trivial. So let me explain this in the separate mail. If we get any new information from this, do step (2) again. As we considered earlier, some depends on the other, so new information may help bringing an end to another pair. If no blank left, go to the next step. If some blank are left but none of them can be decided, then we are in trouble. We'll think about this situation in another mail. To cut the long story short, in this case, we can safely assume that these labels are not ambiguous. So mark all the blanks by '.' and go to step 3. (3) Find a catalyst. For every label in the grammar, check whether that label can be a catalyst or not. This algorithm will be also explained in the separate mail. If such a label is found, the entire grammar is ambiguous. If no such label is found, then the entire grammar is NOT ambiguous. regards, ---------------------- K.Kawaguchi E-Mail: k-kawa@bigfoot.com

- Prev by Date:
**Re: What is the nature of HTML 4.0? was RE: Proposal for newRDDLnatures and purpose** - Next by Date:
**Schema ambiguity detection algorithm for RELAX (3/4)** - Previous by thread:
**Re: Attribute-Value Normalization problem** - Next by thread:
**Schema ambiguity detection algorithm for RELAX (3/4)** - Index(es):