OASIS Mailing List Archives

Re: [xml-dev] generate common xml shema from multiple xml instances

I think a weighted likelihood algorithm of some sort would be usable
(although I'm unsure what the weights should be). For example,
something like:

The user says they would like to generate enumerations. This could
actually be asked as "how would you like to generate enumerations?",
with the options: never generate enumerations, only generate
enumerations if highly certain, generate enumerations where likely,
always generate enumerations.

Check whether a node, across the set of documents, is likely to be an
enumeration. Variations to check for would be:

No whitespace - most enumeration values contain no whitespace, so give
the user the opportunity to allow whitespace in enumerations. If
whitespace is found and whitespace is not allowed in enumerations,
then the type is just string.

If whitespace is allowed, I would also note that the node would
probably still have some regularity that could be used to determine
whether it is likely to be an enumeration.
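
The four user options and the whitespace rule above could be sketched
roughly as follows. This is only an illustration: the names
EnumPolicy, THRESHOLDS and should_generate_enum are hypothetical, the
cut-off values 0.9 and 0.5 are arbitrary assumptions (the message
leaves the weights open), and "score" stands for whatever likelihood
the tool has computed for the node.

```python
from enum import Enum


class EnumPolicy(Enum):
    """The four user choices described in the message."""
    NEVER = "never"
    IF_HIGHLY_CERTAIN = "if_highly_certain"
    WHERE_LIKELY = "where_likely"
    ALWAYS = "always"


# Hypothetical cut-offs; the original message does not specify any weights.
THRESHOLDS = {
    EnumPolicy.IF_HIGHLY_CERTAIN: 0.9,
    EnumPolicy.WHERE_LIKELY: 0.5,
}


def has_whitespace(value):
    """True if the value contains whitespace after trimming both ends."""
    return any(ch.isspace() for ch in value.strip())


def should_generate_enum(values, score, policy, allow_whitespace=False):
    """Decide whether a node's observed values become an xs:enumeration.

    score is an assumed likelihood in [0, 1] computed elsewhere.
    """
    if policy is EnumPolicy.NEVER:
        return False
    # Whitespace found but not allowed: the type is just string.
    if not allow_whitespace and any(has_whitespace(v) for v in values):
        return False
    if policy is EnumPolicy.ALWAYS:
        return True
    return score >= THRESHOLDS[policy]
```

For example, should_generate_enum(["RED", "GREEN"], 0.6,
EnumPolicy.WHERE_LIKELY) accepts the node, while the same score under
EnumPolicy.IF_HIGHLY_CERTAIN does not.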


Do values repeat in any of these documents?
If, for example, we have a set of 100 nodes with all different values,
we could have an enumeration with 100 values, but if the nodes
sometimes repeat values, that would increase the chance of it being an
enumeration. I think the question as first phrased was about being
able to generate enumerations from being given something like

RED
GREEN
ORANGE
BLUE

This gives us pretty good clues to guess it is an enumeration: the
values are all pretty close to each other in length, they are all
letters in a particular alphabet, and they are words (WordNet turns
them up).
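
Two of these clues are easy to compute; a minimal sketch follows. The
function names and the 0.5 spread limit are assumptions made for
illustration. The "same alphabet" test is a rough heuristic that takes
the first word of each character's Unicode name (for example LATIN or
GREEK) as its script. The dictionary lookup (WordNet) is left out
because it needs an external resource.

```python
import unicodedata
from statistics import mean, pstdev


def similar_length(values, max_relative_spread=0.5):
    """True if value lengths are close: std. deviation / mean is small.

    The 0.5 limit is an arbitrary assumption, not a recommended setting.
    """
    lengths = [len(v) for v in values]
    if not lengths or mean(lengths) == 0:
        return False
    return pstdev(lengths) / mean(lengths) <= max_relative_spread


def single_alphabet(values):
    """True if every character is a letter and all share one script.

    Heuristic: the first word of the Unicode character name is treated
    as the script name.
    """
    chars = [ch for v in values for ch in v]
    if not chars or not all(ch.isalpha() for ch in chars):
        return False
    scripts = {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in chars}
    return len(scripts) == 1


def clue_report(values):
    """Collect the individual clues so they can be weighted later."""
    return {
        "similar_length": similar_length(values),
        "single_alphabet": single_alphabet(values),
    }
```

Keeping each clue as a separate boolean leaves the question of how to
weight them open, which is the part the message is unsure about.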

I think an algorithm could be written to make this an enumeration
pretty easily. However, in situations like this it helps if your
inputs can guide you. If the rule of the application was that

RED
GREEN
ORANGE
BLUE
RED

has a 20% higher chance of being an enumeration than
RED
GREEN
ORANGE
BLUE
that would be useful.
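
One way to read the "20% higher chance" rule is as a relative boost
applied to a base score whenever any value repeats. The sketch below
assumes that reading; base_score is a hypothetical likelihood computed
from other clues, and the function names are made up for the example.

```python
from collections import Counter


def has_repeats(values):
    """True if at least one value occurs more than once."""
    return any(count > 1 for count in Counter(values).values())


def enum_likelihood(values, base_score, repeat_bonus=0.20):
    """Raise base_score by repeat_bonus (relative) when values repeat.

    The result is capped at 1.0 so it stays a valid likelihood.
    """
    if has_repeats(values):
        return min(1.0, base_score * (1.0 + repeat_bonus))
    return base_score
```

With a base score of 0.5, the list with the repeated RED comes out at
about 0.6, and the list without repeats stays at 0.5.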






On Thu, Jun 18, 2009 at 8:30 AM, Paul Spencer <xml-dev-list@boynings.co.uk> wrote:
> XML to schema tools tend to allow the user to set various options. For example, XML Spy asks if you want to create enumerations or not. If you had a single instance, you might just get
>
> <xs:simpleType name="Color">
>  <xs:restriction base="xs:string">
>    <xs:enumeration value="RED" />
>  </xs:restriction>
> </xs:simpleType>
>
> As Mukul says, it is then up to the schema author to fix this. I saw the original message as trying to improve this "first cut" by taking account of several instance documents. The complexity of the tool goes up markedly with more than one instance. For example, how would you handle this:
>
> Instance 1
> <a>
>  <b/>
>  <c/>
> </a>
>
> Instance 2
> <a>
>  <b/>
>  <d/>
> </a>
>
> Is there a choice between c and d? Or are both optional? If optional, which order do they go in?

I think this part actually comes back to what I said above: a
weighted likelihood algorithm combined with the choice of user inputs.

When you say "The complexity of the tool goes up markedly with more
than one instance", you mean not just the complexity of programming
but also the complexity of making the right choice, and these should
probably be separated.

The complexity of programming goes up markedly with more than one
instance, but the complexity of making the right choice goes down
after some point if there is some guiding repetition:


 Instance 1
 <a>
  <b/>
  <c/>
 </a>

 Instance 2
 <a>
  <b/>
  <d/>
 </a>

 Instance 3
 <a>
  <b/>
  <c/>
  <d/>
 </a>
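
To show how the third instance settles the choice question, here is a
small sketch that lists pairs of child elements that never appear
together in any instance. Such pairs are candidates for an xs:choice.
The function names are hypothetical, and only the direct children of
the root element are examined.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def child_sets(instances):
    """Set of direct child element names for each instance document."""
    return [{child.tag for child in ET.fromstring(doc)} for doc in instances]


def exclusive_pairs(instances):
    """Pairs of child names that never co-occur in any instance."""
    sets = child_sets(instances)
    tags = sorted(set().union(*sets))
    return [
        (x, y)
        for x, y in combinations(tags, 2)
        if not any(x in s and y in s for s in sets)
    ]


two = ["<a><b/><c/></a>", "<a><b/><d/></a>"]
three = two + ["<a><b/><c/><d/></a>"]

print(exclusive_pairs(two))    # c and d never appear together
print(exclusive_pairs(three))  # instance 3 rules the choice out
```

With only the first two instances, c and d look mutually exclusive.
Adding the third instance removes that pair, which leaves "both
optional" as the remaining reading.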

Furthermore, just as you can ask the user "do you want to make
enumerations?", you can ask "do you want to generate choices?"

Choice is generally used less than cardinality games, so I suppose
the default would be to go to minOccurs="0". The problem then is, as
you noted, how to get things in the right order. The right-order
problem decreases if we have multiple instances that adumbrate the
order.
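
A rough sketch of that default: mark a child as minOccurs 0 unless it
appears in every instance, and derive the order from the precedence
seen in the instances using a topological sort. The function name is
hypothetical, repeated children (maxOccurs) are ignored, and
graphlib requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET
from graphlib import CycleError, TopologicalSorter


def infer_sequence(instances):
    """Return [(child_name, min_occurs), ...] in an order consistent
    with every instance, or None if the instances disagree on order."""
    docs = [[child.tag for child in ET.fromstring(doc)] for doc in instances]

    # For each child name, record every name seen before it in any instance.
    predecessors = {}
    for tags in docs:
        for i, tag in enumerate(tags):
            earlier = {t for t in tags[:i] if t != tag}
            predecessors.setdefault(tag, set()).update(earlier)

    try:
        order = list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None  # conflicting orders; a sequence cannot express this

    return [
        (tag, 1 if all(tag in tags for tags in docs) else 0)
        for tag in order
    ]
```

For the three instances above this gives b as required, followed by c
and d as optional in that order. With only the first two instances
the relative order of c and d is not determined by the data, which is
the ordering problem Paul raised.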

The problems with this are of course:
1. If it takes 100 instances to get something good out, you would
probably just write a schema :)
2. The likely UI for a tool that allowed you to do this would be crap,
or it would be a command line tool, which would be pretty sweet I
guess.

Finally, I think it would actually be easier to write something that
generated Schematron schemas based on multiple outputs, and that
handled a weighted likelihood algorithm in the generation, than
something that generated XSD. Or maybe I just mean it would be more
enjoyable.

Cheers,
Bryan Rasmussen

