
Re: Are we losing out because of grammars?

From: "K.Kawaguchi" <k-kawa@bigfoot.com>
To: James Clark <jjc@jclark.com>
Date: Thu, 01 Feb 2001 13:05:51 -0800


> <element name="x">
>   <zeroOrMore>
>     <element name="y">
>       <attribute name="z">
>         <data type="xsd:string"/>
>       </attribute>
>     </element>
>   </zeroOrMore>
>   <element name="y">
>       <attribute name="z">
>         <data type="xsd:integer"/>
>       </attribute>
>   </element>
> </element>

The example has a typo. I guess the one above is what you had in mind,
right?

> unless I lookahead and see whether it's the last element "y" element in
> the "x".   The TREX implementation works on a stream of SAX events, so
> this is a big complication.

Right. But it's not so big a complication.

> It's not in general easy, unless you restrict the grammar.

Without restricting TREX's expressiveness, you can report types if you
can parse the document twice. No random access capability is necessary.

After the first scan, the whole result of type-assignment is available,
even if the grammar is ambiguous. You need the second scan only to feed
the application with SAX events and type information.
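
As a rough illustration of the two-scan idea, here is a minimal Python
sketch that hard-codes the quoted "x"/"y" grammar. It is not the TREX or
RELAX implementation: the class names, the assign_types function, and
the (name, type) event tuples are all made up for this example, and the
first scan does no real validation. It only shows that the type of each
"y" is known once the first scan ends, so the second scan can hand
events and types to the application together.

```python
import xml.sax


class FirstScan(xml.sax.ContentHandler):
    """Scan 1: record one type per "y" element, in document order."""

    def __init__(self):
        super().__init__()
        self.types = []

    def startElement(self, name, attrs):
        if name == "y":
            # Provisionally assume this "y" is one of the repeated ones.
            self.types.append("xsd:string")

    def endElement(self, name):
        if name == "x" and self.types:
            # Only at the end of "x" do we know which "y" was the last.
            self.types[-1] = "xsd:integer"


class SecondScan(xml.sax.ContentHandler):
    """Scan 2: replay the SAX events, attaching the recorded types."""

    def __init__(self, types):
        super().__init__()
        self.types = iter(types)
        self.events = []  # hypothetical stand-in for application callbacks

    def startElement(self, name, attrs):
        assigned = next(self.types) if name == "y" else None
        self.events.append((name, assigned))


def assign_types(document):
    data = document.encode("utf-8")
    first = FirstScan()
    xml.sax.parseString(data, first)
    second = SecondScan(first.types)
    xml.sax.parseString(data, second)
    return second.events
```

Both scans read the input strictly front to back, so nothing beyond the
ability to re-read the document is assumed.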


> Type assignment may require quite different implementation
> techniques from validation.

No. I understand you may want to see it before you believe it. But
believe me, I did it once myself :-)





> - You seem to think type-assignment is very important.  Why?

Other people might have better reasons. Mine is:

(1) Type-assignment makes it simple to automatically generate an object
    model that in turn automatically parses the document.
    Without type-assignment, the "automatic parsing" process is much
    more complicated.

(2) As an application programmer, I don't want to check the ancestors'
    information when deciding what to do with the current element.

> dispatching on the "FQGI" (ie on the name of the element and the names
> of its ancestor elements) is sufficient for many applications.  Type

In other words, I think it's sufficient too, but I don't even want to
see the ancestor information. If I can receive type, all I need to see is
the type.

For example, in RELAX, the "tag" element has two different definitions,
depending on where it appears. If you see it under an "elementRule" tag,
then it has

<!ATTLIST tag    name   CDATA #REQUIRED>

whereas if you see it elsewhere, then it has

<!ATTLIST tag    role   CDATA #REQUIRED
                 name   CDATA #IMPLIED>
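
To make the difference concrete, here is a small Python sketch of the
two dispatch styles for the "tag" case. Everything in it is an
assumption for illustration: the function names, the "embeddedTag" and
"standaloneTag" type labels, and the result dictionaries are invented,
and a real validator would supply the type label itself.

```python
def handle_by_ancestors(path, attrs):
    """FQGI style: look at the element name and its parent's name.

    `path` is the list of element names from the root down to the
    current element.
    """
    if path[-1] != "tag":
        return None
    if len(path) >= 2 and path[-2] == "elementRule":
        return {"kind": "embedded", "name": attrs["name"]}
    return {"kind": "standalone",
            "role": attrs["role"],
            "name": attrs.get("name")}


# Type style: one handler per (hypothetical) type label, no ancestors.
TYPE_HANDLERS = {
    "embeddedTag": lambda attrs: {"kind": "embedded",
                                  "name": attrs["name"]},
    "standaloneTag": lambda attrs: {"kind": "standalone",
                                    "role": attrs["role"],
                                    "name": attrs.get("name")},
}


def handle_by_type(type_label, attrs):
    return TYPE_HANDLERS[type_label](attrs)
```

Both give the same result, but the second never inspects the path.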




> - Your ambiguity detection algorithm for RELAX detects whether it is
> possible to assign labels to elements in more than one way. I would find
> it more interesting to know whether it is possible to assign datatypes
> (as specified by the RELAX "type" attribute) to leaf elements and
> attributes in more than one way.  Is it possible/easy to detect this
> kind of ambiguity?

I understand that you are interested in an algorithm that answers the
following question with Yes or No.

"Is there any lexical value that is accepted by both of two given
datatypes?"

My answer depends on why you need this algorithm.


Actually, label ambiguity depends on datatype ambiguity. But I
intentionally left that unexplained in my post, so that the core idea of
the algorithm is easily understood.

Computation of the intersection of two lexical spaces may or may not be
decidable (in the sense of computer science). Even if it is decidable,
the algorithm has to depend heavily on the datatype spec.

However, some sound algorithms (again in the sense of CS) can be easily
implemented, and they are reasonably practical, I think.

For example, such an algorithm can always answer the question if

- they are both derived from the decimal type, without a pattern facet.
or
- they are both derived from the string type
  (some restrictions apply)
or
- one of the types has a finite lexical space (enumeration facet)

And I think these simplified algorithms cover 80% of use cases.
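
A minimal Python sketch of such a partial check might look like the
following. It is an illustration under loud assumptions, not the
algorithm from my earlier post: the DecimalRange, Enumeration and
Opaque classes are invented here, only the decimal and enumeration
cases are covered, and Python's Decimal parser is used as a rough
stand-in for the real lexical rules. The function returns True or
False only when it is sure, and None ("don't know") otherwise, in
which case a caller would conservatively treat the pair as possibly
overlapping.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass(frozen=True)
class DecimalRange:
    """A decimal-derived type with inclusive bounds and no pattern facet."""
    lo: Decimal
    hi: Decimal

    def accepts(self, lexical: str) -> bool:
        try:
            value = Decimal(lexical)  # approximation of the lexical rules
        except InvalidOperation:
            return False
        return value.is_finite() and self.lo <= value <= self.hi


@dataclass(frozen=True)
class Enumeration:
    """A type whose lexical space is a finite set of strings."""
    values: frozenset

    def accepts(self, lexical: str) -> bool:
        return lexical in self.values


@dataclass(frozen=True)
class Opaque:
    """Any datatype this sketch cannot reason about."""
    name: str


def lexical_spaces_intersect(a, b) -> Optional[bool]:
    # Finite side: simply test each enumerated value against the other.
    if isinstance(a, Enumeration) and not isinstance(b, Opaque):
        return any(b.accepts(v) for v in a.values)
    if isinstance(b, Enumeration) and not isinstance(a, Opaque):
        return any(a.accepts(v) for v in b.values)
    # Two decimal ranges overlap exactly when neither lies past the other.
    if isinstance(a, DecimalRange) and isinstance(b, DecimalRange):
        return a.lo <= b.hi and b.lo <= a.hi
    return None  # no definite answer
```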


If the grammar is unambiguous, you never have to worry about the
intersection of two datatypes. The following example is always
unambiguous regardless of type1 & type2, and as a result you don't have
to compute the intersection at all.

<element name="y">
  <choice>
    <group>
      <attribute name="z" type="type1" />
      <element name="a" />
    </group>
    <group>
      <attribute name="z" type="type2" />
      <element name="b" />
    </group>
  </choice>
</element>
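
The reason no intersection is needed can be seen in a few lines of
Python. This is only a sketch for this one pattern: the type_of_z
function is hypothetical, and it checks nothing except which child
element is present. The point is that the branch, and therefore the
type of "z", is decided by the child's name alone, so type1 and type2
are never compared with each other.

```python
import xml.etree.ElementTree as ET


def type_of_z(y_xml: str) -> str:
    """Pick the type of attribute "z" from the child element alone."""
    y = ET.fromstring(y_xml)
    children = [child.tag for child in y]
    if children == ["a"]:
        return "type1"  # first <group> of the <choice>
    if children == ["b"]:
        return "type2"  # second <group> of the <choice>
    raise ValueError("content of 'y' matches neither branch")
```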

As a result, the intersection algorithm is rarely needed, and thus far
less important.

regards,
----------------------
K.Kawaguchi
E-Mail: k-kawa@bigfoot.com
