Re: Icebergs - XML file metrics



See the paper "XML and Object-Relational Databases - Enhancing
Structural Mappings Based on Statistics" at
http://wwwdb.informatik.uni-rostock.de/xml/. They've done some work on
characterizing DTDs to figure out how best to store the data in XML
documents in a database. It might not be an exact match, but I suspect
it's related.

-- Ron

Robin LaFontaine wrote:
> 
> Can anyone help with this: Is there a way of 'profiling' an XML file
> to indicate its characteristics?
> 
> We test our XML comparators on large files, but a 5 MB XML file could
> have 20 XML tags or 20,000, and it could be deeply nested or flat.
> So, are there any metrics to help in this characterization?
> 
> Seems sensible to use ratios as far as possible, so that they are
> comparable for different file sizes, perhaps:
> 
> 1. File size (not a ratio)
> 
> 2. No. of elements / file size in KB = no. of elements/KB (or MB perhaps?)
> 
> 3. No. of attributes / no. of elements = no. of attributes/element
> 
> 4. No. of text nodes / no. of elements = no. of text nodes/element
> 
> 5. No. of text nodes / no. of unique text nodes = text re-use index
> 
> 6. No. of attribute values / no. of unique attr. values = attribute
> value re-use index
> 
> 7. (sum for each element of no. of ancestors for the element) / no.
> of elements = Average depth (iceberg factor).
> 
> The last one indicates nesting depth, e.g.
> <a> <b/><b/><b/><b/></a> = (0+1+1+1+1)/5 = 4/5 = 0.8
> 
> <a> <b><b><b><b></b></b></b></b> </a> = (0+1+2+3+4)/5 = 10/5 = 2
> 
> <a> <b><b><b><b> <b><b><b><b> </b></b></b></b> </b></b></b></b> </a>
> = (0+1+2+3+4+5+6+7+8)/9 = 36/9 = 4
> 
> Perhaps someone has already developed a different set of metrics.
> 
> Robin
> -- -----------------------------------------------------------------
> Robin La Fontaine, Monsell EDM Ltd
> (XML file comparison, Engineering data exchange and management using
> XML, R&D Project Management)
> Tel: +44 1684 592 144 Fax: +44 1684 594 504
> Email: robin@monsell.co.uk      http://www.deltaxml.com
> 
> ------------------------------------------------------------------
> The xml-dev list is sponsored by XML.org, an initiative of OASIS
> <http://www.oasis-open.org>
> 
> The list archives are at http://lists.xml.org/archives/xml-dev/
> 
> To unsubscribe from this elist send a message with the single word
> "unsubscribe" in the body to: xml-dev-request@lists.xml.org
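
For what it's worth, the seven ratios in the quoted message are easy to
compute in one pass over the tree. The following is only a rough sketch in
Python using the standard xml.etree.ElementTree module, not an established
tool. The function name profile_xml and the dictionary keys are made up for
illustration, and two choices are assumptions rather than anything stated
above: whitespace-only text is not counted as a text node, and file size is
taken as the UTF-8 byte length of the input string.

```python
import xml.etree.ElementTree as ET


def profile_xml(xml_text):
    """Return the seven proposed metrics for an XML document held in a string.

    profile_xml and its result keys are hypothetical names for this sketch.
    """
    root = ET.fromstring(xml_text)
    # Assumption: "file size" is the UTF-8 byte length, expressed in KB.
    size_kb = len(xml_text.encode("utf-8")) / 1024

    elements = 0
    depth_sum = 0      # sum over all elements of their ancestor count
    attr_values = []   # every attribute value, duplicates included
    texts = []         # every non-blank text chunk, duplicates included

    # Iterative depth-first walk; each entry is (element, number of ancestors).
    stack = [(root, 0)]
    while stack:
        elem, depth = stack.pop()
        elements += 1
        depth_sum += depth
        attr_values.extend(elem.attrib.values())
        # ElementTree stores text in .text (before the first child) and
        # .tail (after the element's end tag). Assumption: whitespace-only
        # chunks are not counted as text nodes.
        for chunk in (elem.text, elem.tail):
            if chunk and chunk.strip():
                texts.append(chunk.strip())
        stack.extend((child, depth + 1) for child in elem)

    def ratio(numerator, denominator):
        # Avoid division by zero, e.g. for a document with no text at all.
        return numerator / denominator if denominator else 0.0

    return {
        "size_kb": size_kb,
        "elements_per_kb": ratio(elements, size_kb),
        "attributes_per_element": ratio(len(attr_values), elements),
        "text_nodes_per_element": ratio(len(texts), elements),
        "text_reuse_index": ratio(len(texts), len(set(texts))),
        "attr_value_reuse_index": ratio(len(attr_values), len(set(attr_values))),
        "average_depth": ratio(depth_sum, elements),
    }


print(profile_xml("<a><b/><b/><b/><b/></a>")["average_depth"])  # 0.8
```

This reads the whole document into memory, which should be fine at the 5 MB
scale mentioned above. For much larger files a streaming parse (for example
ET.iterparse) would be the more likely choice, at the cost of tracking depth
by hand.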

-- 
Ronald Bourret
Programming, Writing, and Training
XML, Databases, and Schemas
http://www.rpbourret.com