There are infinitely many schemas that will match any given set of data,
so there is no single schema to extract.
If you just want *some* schema under which the document(s) are valid,
you could just extract all the element types that occur, say with something
like
    grep -o '<[A-Za-z_][A-Za-z0-9._:-]*' documentname.xml | sort | uniq
and then create a declaration for each one (say, with some global changes
in an editor) that allows each one unrestricted content. You'd need to do
something similar for attributes, but then it should all validate.
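The whole thing can be done in one pipeline rather than by hand-editing;
here's a sketch ("documentname.xml" is a placeholder, and the name pattern
is a rough approximation of XML's Name production, not an exact one):

```shell
# Pull out every start-tag name (closing tags don't match, since the
# character after '<' must be a letter or underscore), deduplicate,
# and wrap each name in a fully permissive element declaration.
grep -o '<[A-Za-z_][A-Za-z0-9._:-]*' documentname.xml \
  | sed 's/^<//' \
  | sort -u \
  | sed 's/.*/<!ELEMENT & ANY>/'
```

For each element type that occurs, this emits a line like
`<!ELEMENT p ANY>`; attribute declarations would still need a similar
pass of their own.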
If you want a little more information so you can build a more detailed
schema, my xmlstats utility (in Perl) at
http://derose.net/steve/utilities/xmlstats has options to tell you what
element types occur within what other ones, and from that you could derive a
more restrictive schema. The most obvious approach would be to declare each
element to permit the OR of all the element types that ever occur in it;
that misses useful restrictions (for example, that TITLE must occur only
once in each DIV, and be the first child element), but it's better than
just ANY for everything.
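As a concrete sketch of those three levels of strictness (the element names
here are invented for illustration; the alternative DIV declarations are
mutually exclusive choices, not one DTD):

```dtd
<!-- Level 1, fully permissive: every observed element allows anything. -->
<!ELEMENT DIV ANY>

<!-- Level 2, from containment statistics: DIV may contain any mix,
     in any order, of the child types ever observed inside a DIV. -->
<!ELEMENT DIV (TITLE | P | DIV)*>

<!-- Level 3, with human knowledge added: exactly one TITLE, first. -->
<!ELEMENT DIV (TITLE, (P | DIV)*)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT P (#PCDATA)>
```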
There used to be some nice utilities for extracting a reasonable DTD from
SGML documents; perhaps someone has one handy for XML?
Steve
At 7:24 AM +0200 10/20/08, Farkas, Illes wrote:
Dear List Members,
Do you happen to know of a linux tool (or
tools) that can extract the schema from a 1-20 GB XML file and visualize it
for users, similarly to this page: http://psidev.sourceforge.net/mi/rel25/doc/
Thanks in advance,
Illes Farkas, Ph.D.
http://angel.elte.hu/fij
--
Steve DeRose -- http://www.derose.net, email
sderose@acm.org