OASIS Mailing List Archives


Re: [xml-dev] converting 1-20 GB xml to xsd, visualizing on webpage

So you didn't truncate your explanation in the original post but actually meant it... then I shall delve further :)

Basically it all stems from which strategy you take: whether simplicity comes first or the right result comes first.

At one extreme, the simplest form is a schema that accepts everything, (.)*, which is pretty useless. At the other extreme is a schema that accepts only the instances actually seen, (PQR|RQP), which is probably useless most of the time as well. Other strategies fall in between, pleasing some users and driving others crazy. But as you said, working from only one instance is just not the best way to extrapolate or make assumptions about the "kinds" of instance that the instance's author wants in general. Still, if one instance is all that we have to work with, a user receiving the generated schema might need to hand-make another schema to prune out PRQ, RPQ, QPR & QRP, which is twice the amount of work compared with constructing a schema for PQR and RQP from the start. (As you might note, the number of orderings to prune grows factorially with more siblings.)
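To make the over-acceptance concrete, here is a minimal sketch that treats the two content models as ordinary regular expressions over one-letter element names. That encoding is an assumption made for illustration only; real DTD content models are not written this way.

```python
import re
from itertools import permutations
from math import factorial

# Assumption: each child element is a single letter, so a sequence of
# children is a plain string and a content model is a regular expression.
loose = re.compile(r"(?:P|Q|R)*")   # the generated model (P|Q|R)*
tight = re.compile(r"PQR|RQP")      # only the two orderings actually seen

orderings = ["".join(p) for p in permutations("PQR")]

loose_ok = [s for s in orderings if loose.fullmatch(s)]
tight_ok = [s for s in orderings if tight.fullmatch(s)]

# Orderings the loose model lets through that were never observed.
unwanted = [s for s in loose_ok if s not in tight_ok]
print(unwanted)

# n distinct siblings can be ordered in n! ways, so the list to prune
# by hand grows quickly as siblings are added.
for n in range(3, 7):
    print(n, factorial(n))
```

Note that (P|Q|R)* also accepts the empty sequence and repeats such as PP, which the permutation check above does not even consider.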

On the streaming mechanism, I find it rather bold that when the DTD generator sees PQR, it assumes (P|Q|R)* right away and forgets about PQR (as it needs to conserve memory), hoping to find RQP, PRQ, RPQ, QPR & QRP in future. In the instance under discussion, the DTD generator finds RQP and happily lives with the decision of (P|Q|R)* ever after. I can't say it is right or wrong, as it is a means of helping the user identify a potential pattern, which might just be the right pattern after all. Still, I find it rather bold...

Chin Chee-Kai

Michael Kay wrote:
In the case of the Saxon DTDGenerator, if it finds one instance where the children are PQR and another where they are RQP, then it generates the content model (P|Q|R)*.
Wouldn't (P|Q|R)* accept PQR and RQP, along with the not-necessarily-acceptable PRQ, RPQ, QPR and QRP?
I suppose you're just giving a quick description in the above? 

Granted, it is difficult to fathom the creator's intent from just one instance; the most a heuristic can conclude without risking over-accepting potentially unwanted patterns would just be ((P|R)Q(P|R))*.
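As a quick check of what the narrower pattern admits, the sketch below enumerates every three-child sequence over P, Q and R and keeps those that ((P|R)Q(P|R))* matches. The one-letter encoding of element names as a Python regular expression is an assumption made for illustration.

```python
import re
from itertools import product

# Assumption: one letter per child element, content model as a regex.
pattern = re.compile(r"(?:(?:P|R)Q(?:P|R))*")

# Try all 27 three-child sequences over {P, Q, R}.
accepted = [
    "".join(t)
    for t in product("PQR", repeat=3)
    if pattern.fullmatch("".join(t))
]
print(accepted)
```

Under this encoding the pattern rejects PRQ, RPQ, QPR and QRP, but it still lets through PQP and RQR, neither of which was observed, so it narrows the over-acceptance rather than eliminating it.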

Well, of course, the aim of a tool like this is to find the "best" pattern that matches all the instances available; and that's a completely open-ended task. If you only have a small number of instances (two in the example above) then guessing the "right" pattern is almost impossible, and on the whole I learnt from doing this that it's probably better to produce a pattern that is as simple as possible in preference to one that is the closest possible fit to the available instances. But of course there is no single right answer: you're working with incomplete information.
The Saxon tool works in streaming mode (which is important to this user) and that imposes additional constraints; it means that you can't remember all the instances that you have encountered. The strategy is to guess a content model from the first instance and then refine it as further instances are found, and because you haven't remembered details of all the instances, the only way you can refine it is to replace it by a pattern that subsumes the previous pattern. But as I said before, for most inputs the results are surprisingly close to the content model that a human author would have written.
Michael Kay
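The refine-by-subsumption strategy described above can be sketched roughly as follows. This is not the Saxon DTDGenerator code: the class `ContentModel`, its methods and the function `infer` are hypothetical names invented for this sketch, and the only widening step it knows is "exact sequence" to "repeated choice", which is an assumption based on the PQR/RQP example in this thread.

```python
from dataclasses import dataclass


@dataclass
class ContentModel:
    """Hypothetical running guess for one element's children."""
    names: list           # child names, in first-seen order
    ordered: bool = True  # True: fixed sequence; False: (a|b|...)*

    def observe(self, children):
        children = list(children)
        if self.ordered and children == self.names:
            return  # the current guess already covers this instance
        # Widen to a model that subsumes the old guess and the new
        # instance. Only the set of names is kept, so memory stays
        # bounded by the number of distinct child names.
        self.names = list(dict.fromkeys(self.names + children))
        self.ordered = False

    def to_dtd(self):
        if not self.names:
            return "EMPTY"
        if self.ordered:
            return "(" + ",".join(self.names) + ")"
        return "(" + "|".join(self.names) + ")*"


def infer(instances):
    """Consume child sequences one at a time, as a streaming pass would."""
    model = None
    for children in instances:
        if model is None:
            model = ContentModel(list(children))  # guess from first instance
        else:
            model.observe(children)               # refine by widening only
    return model


print(infer([["P", "Q", "R"]]).to_dtd())
print(infer([["P", "Q", "R"], ["R", "Q", "P"]]).to_dtd())
```

Because individual instances are discarded once seen, the sketch can never narrow a guess again: after the second ordering arrives, it has no way to tell whether PRQ or QPR ever occurred.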

