XML.org

RE: [xml-dev] generate common xml shema from multiple xml instances

> I further see the following issues with the usefulness of XML to 
> XSD conversion tools.
> 
> 1) Suppose the following element exists in the XML document.
> 
> <color>RED</color>
> 
> How would the "XML to Schema" conversion tool guess that the 
> element "color" represents a "visual attribute of things" and 
> generate a simple type declaration like below:
> 
> <xs:simpleType name="Color">
>   <xs:restriction base="xs:string">
>     <xs:enumeration value="RED" />
>     <xs:enumeration value="GREEN" />
>     <xs:enumeration value="YELLOW" />
>   </xs:restriction>
> </xs:simpleType>
> 
> Which the Schema author may want to do.
> 
> In the absence of this semantic intelligence, the Schema 
> generation tool may generate a Schema declaration like the following:
> 
> <xs:element name="color" type="xs:string" />

Of course the tool can't have any semantic intelligence, but it's very easy
to implement a heuristic that will generate an enumeration in most cases
where it is appropriate. Saxon's DTDGenerator does it if the number of
distinct values of an attribute is less than 20, and the number of instances
of the attribute is more than 3 times the number of distinct values and more
than 10. No heuristic like this will get the right answer every time, but
this isn't an exercise in getting the right answer; it's an exercise in
getting a schema that is sufficiently useful as a starting point for
hand-tuning.
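
To make the thresholds above concrete, here is a minimal Python sketch of
that kind of heuristic. It is not Saxon's DTDGenerator code: the function
names and the generated XSD layout are illustrative assumptions, and it
reads "more than 3 times the number of distinct values and more than 10"
as two separate conditions on the number of instances.

```python
from xml.sax.saxutils import quoteattr

# Thresholds taken from the description above; the parameter names are
# hypothetical and not taken from any real tool.
def looks_like_enumeration(values, max_distinct=20, min_ratio=3, min_instances=10):
    """Return True if the observed values look like an enumeration."""
    distinct = len(set(values))
    total = len(values)
    return (
        distinct < max_distinct
        and total > min_ratio * distinct
        and total > min_instances
    )

def enumeration_xsd(type_name, values):
    """Render an xs:simpleType enumerating the distinct observed values."""
    lines = [
        "<xs:simpleType name=%s>" % quoteattr(type_name),
        '  <xs:restriction base="xs:string">',
    ]
    for value in sorted(set(values)):
        lines.append("    <xs:enumeration value=%s />" % quoteattr(value))
    lines.append("  </xs:restriction>")
    lines.append("</xs:simpleType>")
    return "\n".join(lines)

if __name__ == "__main__":
    observed = ["RED"] * 6 + ["GREEN"] * 4 + ["YELLOW"] * 2
    if looks_like_enumeration(observed):
        print(enumeration_xsd("Color", observed))
```

With only a single <color>RED</color> in the input the test fails, and the
generator would fall back to xs:string, which is exactly the case where
hand-tuning is needed.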

> 
> 2) It may be difficult for the tool to reuse type 
> definitions. In case of structural similarities in a large 
> XML document, or a set of XML documents, the tool may 
> generate a lot of Schema types, which the Schema author may 
> like to refactor.

Yes, with a DTD generator I didn't have to tackle that one, but it's true
enough that this is another challenge. However, it's again true that it
should be possible to define a simple similarity metric over two sets of
values to decide whether they are sufficiently similar to justify using the
same type, or indeed two types one of which is a subtype of the other. 
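
As one possible illustration of such a metric, the sketch below compares
two sets of observed values. The choice of Jaccard similarity, the 0.8
threshold and the function names are all assumptions made for the example;
nothing here describes an existing tool.

```python
def jaccard(a, b):
    """Jaccard similarity of two value sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

def relate_types(a, b, threshold=0.8):
    """Classify two value sets as 'same', a subtype relation, or 'distinct'.

    The threshold is a hypothetical tuning knob, not a recommended value.
    """
    a, b = set(a), set(b)
    if jaccard(a, b) >= threshold:
        return "same"
    if a < b:  # every value of a also occurs in b
        return "a-subtype-of-b"
    if b < a:
        return "b-subtype-of-a"
    return "distinct"

if __name__ == "__main__":
    print(relate_types({"RED", "GREEN"}, {"RED", "GREEN"}))
    print(relate_types({"RED"}, {"RED", "GREEN", "YELLOW"}))
    print(relate_types({"RED"}, {"BLUE"}))
```

A strict subset is reported as a subtype because a type derived by
restriction has a value space contained in that of its base type.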

Incidentally, it's quite possible to use attribute and element names as
another heuristic. If an attribute name starts or ends in "date" then
there's a fairly good chance it holds a date.
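
A rough Python sketch of the name heuristic follows. Checking the observed
values as well as the name is an added assumption (the name alone gives
false positives such as "update"), and the function names are hypothetical.

```python
import re
from datetime import date

_DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def name_suggests_date(name):
    """True if the element or attribute name starts or ends in 'date'."""
    lowered = name.lower()
    return lowered.startswith("date") or lowered.endswith("date")

def is_simple_xs_date(text):
    """True for a valid YYYY-MM-DD calendar date (timezones not handled)."""
    match = _DATE_PATTERN.fullmatch(text.strip())
    if match is None:
        return False
    try:
        date(int(match.group(1)), int(match.group(2)), int(match.group(3)))
    except ValueError:
        return False
    return True

def guess_type(name, values):
    """Suggest xs:date only when both the name and every value agree."""
    if name_suggests_date(name) and values and all(is_simple_xs_date(v) for v in values):
        return "xs:date"
    return "xs:string"

if __name__ == "__main__":
    print(guess_type("birthDate", ["2009-03-01", "2010-12-31"]))
    print(guess_type("update", ["yes"]))
```

Using the name only to decide which value checks to try keeps the guess
cheap while still letting the data veto a misleading name.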

> 
> Though I believe the XML to Schema conversion tools may be 
> useful to quickly generate a Schema, which could be further 
> enhanced and refactored by the Schema author.
> 

Yes, a schema generated from an instance - even from a large collection of
instances - is never going to be perfect. But it can be surprisingly good.

Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay 




Copyright 1993-2007 XML.org. This site is hosted by OASIS