OASIS Mailing List Archives


RE: [xml-dev] converting 1-20 GB xml to xsd, visualizing on webpage

Michael Kay wrote:
In the case of the Saxon DTDGenerator, if it finds one instance where the children are PQR and another where they are RQP, then it generates the content model (P|Q|R)*.
Wouldn't (P|Q|R)* accept PQR and RQP, along with the not-necessarily-acceptable PRQ, RPQ, QPR and QRP?
I suppose you're just giving a quick description in the above? 

Granted, it is difficult to fathom the intent of the creator from so few instances. Still, the most a heuristic can conclude without risking over-accepting potentially unwanted patterns would be ((P|R)Q(P|R))*.
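
To make the comparison concrete, here is a small sketch that checks both content models against every ordering of P, Q and R. It assumes each child element name is a single letter, so the models can be written as ordinary regular expressions; the names LOOSE, TIGHT and accepted are made up for this illustration.

```python
import re
from itertools import permutations

# The two content models from the thread, written as Python regular
# expressions over one-letter child names (an assumption of this sketch).
LOOSE = re.compile(r"(?:P|Q|R)*")
TIGHT = re.compile(r"(?:(?:P|R)Q(?:P|R))*")


def accepted(pattern, sequences):
    """Return the sequences that the pattern matches in full."""
    return [s for s in sequences if pattern.fullmatch(s)]


# All six orderings of the three children.
orderings = sorted("".join(p) for p in permutations("PQR"))

loose_ok = accepted(LOOSE, orderings)  # every ordering passes
tight_ok = accepted(TIGHT, orderings)  # only the two observed orderings pass
```

The tighter pattern is not an exact fit either: it also admits PQP and RQR, which neither instance showed.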

Well, of course, the aim of a tool like this is to find the "best" pattern that matches all the instances available; and that's a completely open-ended task. If you only have a small number of instances (two in the example above) then guessing the "right" pattern is almost impossible, and on the whole I learnt from doing this that it's probably better to produce a pattern that is as simple as possible in preference to one that is the closest possible fit to the available instances. But of course there is no single right answer: you're working with incomplete information.
The Saxon tool works in streaming mode (which is important to this user) and that imposes additional constraints; it means that you can't remember all the instances that you have encountered. The strategy is to guess a content model from the first instance and then refine it as further instances are found, and because you haven't remembered details of all the instances, the only way you can refine it is to replace it by a pattern that subsumes the previous pattern. But as I said before, for most inputs the results are surprisingly close to the content model that a human author would have written.
Michael Kay
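
The streaming strategy described above can be sketched roughly as follows. This is not the Saxon DTDGenerator code, only an illustration of the idea under simple assumptions: the class ContentModel and its methods are hypothetical, the first instance is taken as an exact sequence, and any later mismatch widens the model to a repeated choice over all child names seen. Because the widened model subsumes the earlier one, no earlier instance needs to be kept in memory.

```python
class ContentModel:
    """Running guess at one element's content model (hypothetical sketch)."""

    def __init__(self):
        self.sequence = None  # exact child order seen so far, or None
        self.choice = None    # set of child names once widened to (a|b|...)*

    def observe(self, children):
        """Refine the model with the child names of one more instance."""
        children = tuple(children)
        if self.choice is not None:
            # Already a repeated choice: just admit any new names.
            self.choice.update(children)
        elif self.sequence is None:
            # First instance: guess that this exact order is the model.
            self.sequence = children
        elif children != self.sequence:
            # Mismatch: replace the guess with a pattern that subsumes it.
            self.choice = set(self.sequence) | set(children)
            self.sequence = None

    def render(self):
        """Return the current guess in DTD-like notation."""
        if self.choice is not None:
            return "(" + "|".join(sorted(self.choice)) + ")*"
        if not self.sequence:
            return "EMPTY"
        return "(" + ",".join(self.sequence) + ")"


model = ContentModel()
model.observe(["P", "Q", "R"])
first_guess = model.render()   # exact sequence from the first instance
model.observe(["R", "Q", "P"])
refined = model.render()       # widened after the conflicting instance
```

Memory use depends only on the number of distinct child names per element, not on the number of instances, which is what makes this kind of approach workable on multi-gigabyte input.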

