So you didn't truncate your explanation in the original post but
actually meant it... then I shall delve further :)
Basically it all stems from which strategy you take: whether simplicity
comes first or the right result comes first.
At one extreme, the simplest form is a schema that accepts everything,
(.)*, which is pretty useless. The other extreme is a schema which
rightfully accepts only what that single instance contains, (PQR|RQP),
which is probably useless most of the time as well. Other strategies fall
somewhere in between, pleasing some users and driving others crazy. But as you
said, working with only one instance is just not the best way to
extrapolate or make assumptions about the "kinds" of instance that
instance's author wants in general. Still, if one instance is all
that we have to work with, a user receiving the generated schema might
need to hand-make another schema to prune out PRQ, RPQ, QPR & QRP,
which is twice the work compared with constructing a schema for PQR
and RQP by hand from the start. (As you might note, the number of
possible orderings grows factorially with more siblings.)
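To make the over-acceptance concrete, here is a minimal Python sketch. It is my own illustration, not output from any DTD generator, and it assumes each content model can be written as a regular expression over the one-letter child names P, Q and R. The names MODELS and accepted are hypothetical.

```python
import re
from itertools import permutations
from math import factorial

# Content models encoded as regular expressions over one-letter child names.
# This encoding is an assumption made purely for illustration.
MODELS = {
    "anything": r".*",
    "exact": r"PQR|RQP",
    "choice": r"(P|Q|R)*",
}

def accepted(model, names="PQR"):
    """Return the orderings of `names` that the given model accepts."""
    pattern = re.compile(MODELS[model])
    orderings = ("".join(p) for p in permutations(names))
    return [s for s in orderings if pattern.fullmatch(s)]

observed = {"PQR", "RQP"}
for model in MODELS:
    ok = accepted(model)
    unwanted = sorted(set(ok) - observed)
    print(model, "accepts", ok, "unwanted:", unwanted)

# The number of possible orderings of n distinct siblings is n!.
print([factorial(n) for n in range(3, 7)])
```

With three siblings there are only four unwanted orderings to prune by hand; with four siblings there would be 24 orderings in total, and with five, 120.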
On the streaming mechanism, I find it rather bold that when the DTD
generator sees PQR, it assumes (P|Q|R)* right away and forgets about
PQR (as it needs to conserve memory), hoping in future to find
RQP, PRQ, RPQ, QPR & QRP. In the instance under discussion, the
DTD generator finds RQP and happily lives with the decision of (P|Q|R)*
ever after. I can't say it is right or wrong, as it is a means of helping
the user identify a potential pattern, which might just be the right
pattern after all. Still, I find it rather bold...
regards,
Chin Chee-Kai
Michael Kay wrote:
In the case of the Saxon
DTDGenerator, if it finds one instance where the children are PQR and
another where they are RQP, then it generates the content model
(P|Q|R)*.
Wouldn't (P|Q|R)* accept PQR and RQP along with the
not-necessarily-acceptable PRQ, RPQ, QPR and QRP?
I suppose you're just giving a quick description in the above?
Granted, it is difficult to fathom the intent of the creator from just
one instance; the most a heuristic can conclude without risking
over-accepting potentially unwanted patterns would just be
((P|R)Q(P|R))*.
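As a quick check of that narrower candidate, the sketch below enumerates every three-child sequence over P, Q and R that it accepts. It is an illustration only: the pattern is transcribed as a Python regular expression over one-letter names, and the helper name three_child_matches is hypothetical.

```python
import re
from itertools import product

# The candidate content model, transcribed as a regular expression.
CANDIDATE = re.compile(r"((P|R)Q(P|R))*")

def three_child_matches(pattern):
    """Return all length-3 sequences over P, Q, R that the pattern accepts."""
    sequences = ("".join(t) for t in product("PQR", repeat=3))
    return [s for s in sequences if pattern.fullmatch(s)]

matches = three_child_matches(CANDIDATE)
print(matches)
```

Both observed orderings pass and the four unseen permutations are rejected, although sequences with a repeated child, such as PQP, pass as well.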
Well, of course, the aim of a tool like this is
to find the "best" pattern that matches all the instances available;
and that's a completely open-ended task. If you only have a small
number of instances (two in the example above) then guessing the
"right" pattern is almost impossible, and on the whole I learnt from
doing this that it's probably better to produce a pattern that is as
simple as possible in preference to one that is the closest possible
fit to the available instances. But of course there is no single right
answer: you're working with incomplete information.
The Saxon tool works in streaming mode (which is
important to this user) and that imposes additional constraints; it
means that you can't remember all the instances that you have
encountered. The strategy is to guess a content model from the first
instance and then refine it as further instances are found, and because
you haven't remembered details of all the instances, the only way you
can refine it is to replace it by a pattern that subsumes the previous
pattern. But as I said before, for most inputs the results are
surprisingly close to the content model that a human author would have
written.
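The refine-by-subsumption strategy described above can be sketched roughly as follows. This is only an illustration under my own assumptions, not the actual Saxon DTDGenerator code: the class StreamingModel and its methods are hypothetical, child names are single letters, and the only generalisation step modelled is from a fixed sequence to a repeated choice.

```python
class StreamingModel:
    """Hypothetical streaming content-model guesser (illustration only)."""

    def __init__(self):
        self.sequence = None    # exact child sequence, kept while it still fits
        self.names = []         # child names seen so far, in first-seen order
        self.is_choice = False  # True once the model has been widened

    def observe(self, children):
        """Update the model from one instance; the instance is not stored."""
        for name in children:
            if name not in self.names:
                self.names.append(name)
        if self.is_choice:
            return
        if self.sequence is None:
            # First guess: exactly the sequence that was seen.
            self.sequence = list(children)
        elif self.sequence != list(children):
            # Earlier instances cannot be recalled, so widen to a model
            # that subsumes the previous guess.
            self.is_choice = True
            self.sequence = None

    def content_model(self):
        if self.is_choice:
            return "(" + "|".join(self.names) + ")*"
        if self.sequence is None:
            return "EMPTY"  # placeholder: no instance observed yet
        return "(" + ",".join(self.sequence) + ")"

model = StreamingModel()
model.observe("PQR")
first = model.content_model()
model.observe("RQP")
second = model.content_model()
print(first, "->", second)
```

Because only the current guess and the set of names are kept, memory use does not grow with the number of instances, but the model can only ever get wider, never narrower.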
Michael Kay