
Re: Are we losing out because of grammars?

From: James Clark <jjc@jclark.com>
To: "K.Kawaguchi" <k-kawa@bigfoot.com>
Date: Fri, 02 Feb 2001 11:10:02 +0700

"K.Kawaguchi" wrote:

> > <element name="x">
> >   <zeroOrMore>
> >     <element name="y">
> >       <attribute name="z">
> >         <data type="xsd:string"/>
> >       </attribute>
> >     </element>
> >   </zeroOrMore>
> >   <element name="y">
> >       <attribute name="z">
> >         <data type="xsd:integer"/>
> >       </attribute>
> >   </element>
> > </element>
> 
> The example has a typo. I guess the above one is what you are thinking
> about, right?

Right.

> Without restricting TREX's expressiveness, you can report type if you
> can parse documents twice. No random access capability is necessary.
> 
> After the first scan, the whole result of type-assignment is available,
> even if the grammar is ambiguous. You need the second scan only to feed
> the application with SAX event and type information.

A second scan is quite a big price to pay.  How do you store the result
of the type assignment? If you keep it in memory, that would mean using
memory proportional to the size of the input document, wouldn't it?
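
To make the memory concern concrete, here is a minimal sketch in Python,
assuming the first scan records one entry per element in a side table and
the second scan replays the document against that table. The class names
and the "type-of-" labels are hypothetical placeholders, not part of TREX
or of any validator.

```python
import xml.sax


class FirstPass(xml.sax.ContentHandler):
    """First scan: record one (placeholder) type label per element."""

    def __init__(self):
        super().__init__()
        self.types = []  # grows by one entry per start-element event

    def startElement(self, name, attrs):
        # Placeholder only: a real validator would compute the type here,
        # or fill it in later once any ambiguity has been resolved.
        self.types.append("type-of-" + name)


class SecondPass(xml.sax.ContentHandler):
    """Second scan: pair each element with the type stored for it."""

    def __init__(self, types):
        super().__init__()
        self.types = iter(types)
        self.reported = []

    def startElement(self, name, attrs):
        self.reported.append((name, next(self.types)))


def two_pass(xml_text):
    data = xml_text.encode("utf-8")
    first = FirstPass()
    xml.sax.parseString(data, first)
    second = SecondPass(first.types)
    xml.sax.parseString(data, second)
    return first.types, second.reported
```

The side table holds one entry per element, so under these assumptions
its size grows linearly with the document.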

> > - You seem to think type-assignment is very important.  Why?
> 
> Other people might have better reasons. Mine is:
> 
> (1) Type-assignment makes it simple to automatically generate object
>     model that in turn automatically parse the document.
>     Without type-assignment, "automatic parsing" process is much more
>     complicated.

I can believe that, and I accept that this is an important application.
But it is just one application.  I also wonder whether, for such an
application, it wouldn't be acceptable simply to prohibit ambiguity so
that type assignment becomes trivial.

> (2) As an application programmer, I don't want to check the ancestor's
>     information when deciding what to do with current element.
> 
> > dispatching on the "FQGI" (ie on the name of the element and the names
> > of its ancestor elements) is sufficient for many applications.  Type
> 
> In other words, I think it's sufficient too, but I don't even want to
> see the ancestor information. If I can receive type, all I need to see is
> the type.

Here I don't agree.  In general I think I may need to look at ancestor
information even if I know the type. Imagine, for example, a title
element that can occur as a child of a div1, div2, or div3. The title
element might have the same content model (and hence the same type) in
all cases, yet I may well want different processing (e.g. a different
font size) in each case.
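
A minimal sketch of that situation in Python, assuming a document in
which title appears under div1, div2 and div3. The font sizes and the
function name are made up for illustration; the point is only that the
dispatch key is the parent's name, which the type of title alone would
not supply.

```python
import xml.etree.ElementTree as ET

# Hypothetical styling table: same element, different treatment per parent.
FONT_SIZE_BY_PARENT = {"div1": 24, "div2": 18, "div3": 14}


def title_font_sizes(xml_text):
    """Return (title text, font size) pairs, choosing the size by parent."""
    root = ET.fromstring(xml_text)
    sizes = []

    def walk(elem, parent_name):
        if elem.tag == "title":
            # The title has the same content model everywhere; only the
            # ancestor information tells the cases apart.
            sizes.append((elem.text, FONT_SIZE_BY_PARENT.get(parent_name)))
        for child in elem:
            walk(child, elem.tag)

    walk(root, None)
    return sizes
```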

> > - Your ambiguity detection algorithm for RELAX detects whether it is
> > possible to assign labels to elements in more than one way. I would find
> > it more interesting to know whether it is possible to assign datatypes
> > (as specified by the RELAX "type" attribute) to leaf elements and
> > attributes in more than one way.  Is it possible/easy to detect this
> > kind of ambiguity?
> 
> I understand that you are interested in the algorithm that answers the
> following question by Yes/No.
> 
> "Is there any lexical value that can be accepted by given two datatypes?"

That's not quite what I'm after.  I'm more interested in reporting
datatypes than I am in reporting labels, so I want to know whether I can
report datatypes unambiguously regardless of whether I can report labels
unambiguously.  For example, given the following TREX pattern

<choice>
  <element name="e">
    <attribute name="a">
      <data type="xsd:integer"/>
    </attribute>
  </element>
  <element name="e">
    <attribute name="a">
      <data type="xsd:string"/>
    </attribute>
  </element>
</choice>

and the following instance

<e a="7"/>

it would be ambiguous whether the value of "a" should be reported as an
integer or a string.  On the other hand:

<choice>
  <element name="e">
    <attribute name="a">
      <data type="xsd:integer"/>
    </attribute>
  </element>
  <element name="e">
    <attribute name="a">
      <data type="xsd:integer"/>
    </attribute>
    <optional>
      <attribute name="b"/>
    </optional>
  </element>
</choice>

is not ambiguous in the sense I'm interested in, even though I think it
would be ambiguous in the sense you were discussing.  The following
would also not be ambiguous in the sense I'm talking about:

<element name="x">
  <element name="e">
    <attribute name="a">
      <data type="xsd:integer"/>
    </attribute>
  </element>
  <element name="e">
    <attribute name="a">
      <data type="xsd:string"/>
    </attribute>
  </element>
</element>

Given an instance:

<x><e a="7"/><e a="7"/></x>

the first "a" attribute can be unambiguously reported as xsd:integer,
the second "a" attribute as xsd:string.
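
The distinction can be sketched in Python as follows. This is only an
illustration under simplifying assumptions: the lexical checks are rough
stand-ins for the real xsd:integer and xsd:string rules, and the two
helper functions are hypothetical, not a TREX algorithm.

```python
import re

# Simplified lexical checks; the real XML Schema datatypes have more rules.
CHECKS = {
    "xsd:integer": lambda v: re.fullmatch(r"\s*[+-]?[0-9]+\s*", v) is not None,
    "xsd:string": lambda v: True,
}


def types_for_choice(alternatives, value):
    """Datatypes the value could be reported as when any alternative
    of a <choice> may match it."""
    return {t for t in alternatives if CHECKS[t](value)}


def types_for_sequence(sequence, values):
    """Datatype per position when elements are matched in order.
    Returns None if the values do not fit the sequence."""
    if len(sequence) != len(values):
        return None
    if not all(CHECKS[t](v) for t, v in zip(sequence, values)):
        return None
    return list(sequence)
```

With these helpers, the first pattern yields two candidate datatypes for
a="7", the second yields exactly one, and the third yields one datatype
per position. Datatype reporting is ambiguous only when more than one
candidate remains.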

James
