It's "trivial", yes, but it's not "right" IMHO :) Nor is it necessarily efficient. I wouldn't bet a case of beer that, for a large value of attribute x,
    points = fn:tokenize( $x , "[ ,]")
is more efficient than, for a node x with point children,
    points = $x/point
I can imagine that in some processors, for some size of $x, one or the other is more efficient. But is that a reason to make the design decision for a (potentially) widely used standard schema? This is a serious question, not rhetorical.
----------------------------------------
David A. Lee
dlee@calldei.com
http://www.xmlsh.org

From: Kurt Cagle [mailto:kurt.cagle@gmail.com]
Sent: Friday, June 03, 2011 11:45 AM
To: David Lee
Cc: Michael Sokolov; Andrew Welch; John Cowan; Pete Cordell; Mukul Gandhi; stephengreenubl@gmail.com; Jesper Tverskov; xml-dev@lists.xml.org
Subject: Re: [xml-dev] HTML5 and almost no namespaces

David,
I brought up the very question of point set optimization with the SVG working group when the SVG 1.0 spec was still in development. Adobe was essentially calling the shots at that point with the only real working implementation, and they found that for their processing, parsing lists of points was preferable to querying an XML document with sets of nodes. In retrospect, they were probably right - even in XQuery, retrieving point lists is relatively trivial.
Managing Editor, XMLToday.org
On Fri, Jun 3, 2011 at 9:09 AM, David Lee <dlee@calldei.com> wrote:
Agree 50%. Certainly you can optimize a tagset for a particular processor.
But does that mean you *should* ?
Once you go down the route of optimizing your XML for a particular processor all sorts of tricks become useful. For example MarkLogic works best on lots of small documents instead of very large ones, so for optimization I split up my 500MB XML file into about a million small ones. Other processors have other tricks needed to get them to work optimally.
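A minimal sketch of that kind of splitting step, in Python with only the standard library. The function name `split_records`, the `rec` element name, and the assumption that each record is a direct child of the root are all hypothetical; nothing here is MarkLogic-specific, and the original file layout is not described in this thread.

```python
import io
import xml.etree.ElementTree as ET

def split_records(source, record_tag):
    """Yield each direct child of the root named record_tag as its own XML string."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # depth == 1 after the decrement means elem was a direct child of the root
        if depth == 1 and elem.tag == record_tag:
            elem.tail = None  # drop inter-record whitespace from the output
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # free the record's subtree once it has been emitted

# Hypothetical input: one large document holding many small records.
big = b"<root><rec id='1'><v>a</v></rec>\n<rec id='2'/><other/></root>"
small_docs = list(split_records(io.BytesIO(big), "rec"))
```

Each string in `small_docs` could then be loaded as a separate document. The point is that this runs at load time and leaves the source schema alone.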
My personal opinion is that this shouldn't dictate the source schema design; rather, it should be a post-processing phase optimized for a particular processor. Micro-designing an XML schema for optimization on one processor can eventually bite you... say when you change processors or they come out with new performance characteristics in V(n+1).
A good non-processor-specific example is SVG. I just started using SVG this month as an experiment and am 'horrified' that it 'abuses' attributes to represent lists of points. A single graph might have a hundred thousand points stored in a single attribute value! While I wasn't there when it was invented, I can guess that this was done with an eye to compactness/optimization, on the assumption that small is better, i.e.
<svg:polyline points="1 0,2 120.46,3 97.95,4 104.97,5 124.5,6 97.81,7 97.94,8 92.37,9 100.15,10 99.2,11 .... 1000000 bytes later ... "/>
This is certainly more *compact* than
<svg:polyline> <p x="1" y="0"/> .... 1000000 bytes later </svg:polyline>
But is it *better*? I actually found an article about EXI discussing this exact issue:
http://www.svgopen.org/2010/papers/3-Compressing_SVG_with_EXI/index.html
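To make the two encodings above concrete, here is a rough Python sketch that reads the same three points from each form. The `p` element with `x`/`y` attributes is the hypothetical alternative from this message, not real SVG; the `svg:` prefix is dropped so the fragments parse without a namespace declaration; and the attribute parser only handles the simple space/comma-separated form shown here, not the full SVG `points` grammar.

```python
import re
import xml.etree.ElementTree as ET

def points_from_attribute(polyline):
    """Tokenize the points attribute on whitespace/commas and pair up the numbers."""
    raw = polyline.get("points", "").strip()
    numbers = [float(t) for t in re.split(r"[\s,]+", raw) if t]
    return list(zip(numbers[0::2], numbers[1::2]))

def points_from_children(polyline):
    """Read one (x, y) pair from each hypothetical <p> child element."""
    return [(float(p.get("x")), float(p.get("y"))) for p in polyline.findall("p")]

compact = ET.fromstring('<polyline points="1 0,2 120.46,3 97.95"/>')
verbose = ET.fromstring(
    '<polyline><p x="1" y="0"/><p x="2" y="120.46"/><p x="3" y="97.95"/></polyline>'
)

same = points_from_attribute(compact) == points_from_children(verbose)
```

Both readers are a few lines long, which is roughly the "trivial either way" observation made in this thread. Which one is faster for a hundred thousand points would have to be measured per processor.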
I find this a good example to demonstrate the woes of prematurely optimizing source data formats based on assumptions about performance.
Consequently, I propose that in general one should not do that, but rather design an XML schema for clarity, not for performance on a particular version of a particular processor (or an imagined one, in the case above).
You can *usually* post-process data to be optimized for your current processor at the point of ingest rather than make the world suffer with predictive optimization.
(by "usually" I mean there are always exceptions. No statement is always right, even this one)
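As a sketch of what such an ingest-time step might look like for the polyline case, the following Python function folds the hypothetical `<p x="..." y="..."/>` children into a single `points` attribute. The function name `compact_polyline` and the unprefixed element names are illustrative assumptions, not part of SVG or of any particular processor.

```python
import xml.etree.ElementTree as ET

def compact_polyline(polyline):
    """Return a new element whose <p> children are folded into a points attribute."""
    pairs = [f'{p.get("x")} {p.get("y")}' for p in polyline.findall("p")]
    out = ET.Element(polyline.tag, dict(polyline.attrib))  # keep other attributes
    out.set("points", ",".join(pairs))
    return out

source = ET.fromstring(
    '<polyline stroke="red"><p x="1" y="0"/><p x="2" y="120.46"/></polyline>'
)
ingested = compact_polyline(source)
# ingested.get("points") is "1 0,2 120.46"
```

The clear form stays the source of truth, and the compact form is derived only for the processor that benefits from it.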
-David

On 6/2/2011 10:22 PM, David Lee wrote:
> I do ( use MarkLogic )
> And it appears to work perfectly fine using context sensitive duplicate names
> It's true that if you want to fine tune fragmentation or create special range indexes it bites you but overall I've had no problems
>
> Sent from my iPad (excuse the terseness)

That's ok David - after all, brevity is the soul of wit, as the bard put it. Still, it is the case that MarkLogic's built-in term indexes (not the range ones) are based on element (and attribute) names, and although there are also contextual (parent/child) indexes, you will not get the best performance there if you rely on context sensitivity; e.g. queries for //name can be resolved straight out of the indexes accurately and don't require additional filtering, whereas //person/name and //place/name require (some) extra processing. For example, to get an accurate count there, ML has to filter every possible result returned by the indexes. ML is spiffy and does this really fast, so you usually don't notice, but if you have 1M docs and want to know exactly how many have a person name "Lee", you really will notice the difference.
I'm not trying to run down MarkLogic - it's a great system for XML work; merely pointing out that in some cases practical considerations that have little to do with semantic correctness may inform the design of your tag set.
-Mike
_______________________________________________________________________
XML-DEV is a publicly archived, unmoderated list hosted by OASIS to support XML implementation and development. To minimize spam in the archives, you must subscribe before posting.
[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/ Or unsubscribe: xml-dev-unsubscribe@lists.xml.org subscribe: xml-dev-subscribe@lists.xml.org List archive: http://lists.xml.org/archives/xml-dev/ List Guidelines: http://www.oasis-open.org/maillists/guidelines.php