Michael Kay wrote:
In the case of the Saxon DTDGenerator, if it finds one
instance where the children are PQR and another where they are RQP, then it
generates the content model (P|Q|R)*.
Wouldn't (P|Q|R)* accept PQR and RQP, along with the
not-necessarily-acceptable PRQ, RPQ, QPR and QRP?
I suppose you're just
giving a quick description in the above?
Granted, it is
difficult to fathom the intent of the creator from just one instance, but the most
a heuristic can conclude without risking over-accepting potentially unwanted
patterns would just be ((P|R)Q(P|R))*.
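To make the comparison concrete, here is a minimal sketch that checks both content models against all six orderings of P, Q and R. It assumes each child element can be stood in for by a single letter, so that an ordinary regular expression behaves like the content model; the names LOOSE, TIGHT and accepted are illustrative only and are not part of any Saxon API.

```python
import re
from itertools import permutations

# Each child element is represented by one letter, so a regular
# expression can stand in for the DTD content model (an assumption
# made purely for illustration).
LOOSE = re.compile(r"(P|Q|R)*")          # the generated model
TIGHT = re.compile(r"((P|R)Q(P|R))*")    # the tighter model proposed above


def accepted(pattern, children):
    """Return True if the whole child sequence matches the model."""
    return pattern.fullmatch(children) is not None


# All six orderings of the three children.
orders = ["".join(p) for p in permutations("PQR")]

loose_ok = [s for s in orders if accepted(LOOSE, s)]
tight_ok = [s for s in orders if accepted(TIGHT, s)]

print("loose:", loose_ok)
print("tight:", tight_ok)
```

The loose model accepts all six orderings, while the tighter one accepts only PQR and RQP among them. The tighter model is still a guess, though: it also accepts PQP and RQR, which neither instance showed.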
Well, of course, the aim of a tool like this is to find the "best"
pattern that matches all the instances available; and that's a completely
open-ended task. If you only have a small number of instances (two in the
example above) then guessing the "right" pattern is almost impossible, and on
the whole I learnt from doing this that it's probably better to produce a
pattern that is as simple as possible in preference to one that is the closest
possible fit to the available instances. But of course there is no single
right answer: you're working with incomplete information.
The
Saxon tool works in streaming mode (which is important to this user) and that
imposes additional constraints; it means that you can't remember all the
instances that you have encountered. The strategy is to guess a content model
from the first instance and then refine it as further instances are found, and
because you haven't remembered details of all the instances, the only way you
can refine it is to replace it by a pattern that subsumes the previous
pattern. But as I said before, for most inputs the results are surprisingly
close to the content model that a human author would have
written.
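
The refine-by-subsumption strategy described above can be sketched roughly as follows. This is not the Saxon DTDGenerator code; it is a simplified illustration under stated assumptions: the class ContentModel and its methods are hypothetical, the only state kept per element is the current guess plus the set of child names seen, and the only generalisation step is from an exact sequence to a repeated choice.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ContentModel:
    """Hypothetical running guess for one element's content model."""
    # Exact child sequence, while every instance so far has agreed on it.
    sequence: Optional[tuple] = None
    # Every child name seen so far (small, bounded by the vocabulary).
    names: set = field(default_factory=set)
    seen_any: bool = False

    def observe(self, children):
        """Refine the guess with one instance; the instance is not stored."""
        kids = tuple(children)
        if not self.seen_any:
            # First instance: guess that its exact sequence is the model.
            self.sequence = kids
            self.seen_any = True
        elif self.sequence != kids:
            # Disagreement: fall back to a repeated choice, which
            # subsumes any sequence over the names seen so far.
            self.sequence = None
        self.names.update(kids)

    def to_dtd(self):
        """Render the current guess in DTD content-model syntax."""
        if not self.names:
            return "EMPTY"
        if self.sequence is not None:
            return "(" + ", ".join(self.sequence) + ")"
        return "(" + " | ".join(sorted(self.names)) + ")*"


model = ContentModel()
model.observe(["P", "Q", "R"])
print(model.to_dtd())   # guess after the first instance
model.observe(["R", "Q", "P"])
print(model.to_dtd())   # generalised guess after the second instance
```

Because earlier instances are discarded, the sketch can never tighten a guess once it has been widened; each refinement can only move to a pattern that accepts everything the previous one did.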
Michael Kay