xml-dev - Re: Streaming XML (WAS: More on taming SAX (was Re: [xml-dev] ANN

To: xml-dev@lists.xml.org
Subject: Re: Streaming XML (WAS: More on taming SAX (was Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0))
From: "Dimitre Novatchev" <dnovatchev@yahoo.com>
Date: Wed, 29 Dec 2004 11:56:25 +1100
References: <830178CE7378FC40BC6F1DDADCFDD1D10276723C@RED-MSG-31.redmond.corp.microsoft.com> <30291DBF-590E-11D9-A33A-000393DC762C@mac.com>
Sender: news <news@sea.gmane.org>

Why do I think Daniela Florescu is right?

Please bear with my style, which has nothing to do with SAX or any of the
APIs mentioned in this thread. Just read on; I promise you'll agree that my
message is relevant.

This is the code of the f:foldl-tree() function, which is part of the FXSL 
library:

<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:f="http://fxsl.sf.net/"
xmlns:int="http://fxsl.sf.net/int/folfl-tree"
exclude-result-prefixes="f int"
>
    <xsl:import href="func-apply.xsl"/>

    <xsl:function name="f:foldl-tree">
      <xsl:param name="pFuncNode" as="element()"/>
      <xsl:param name="pFuncSubtrees" as="element()"/>
      <xsl:param name="pA0"/>
      <xsl:param name="pNode" as="element()"/>

      <xsl:choose>
         <xsl:when test="not($pNode)">
            <xsl:copy-of select="$pA0"/>
         </xsl:when>
         <xsl:otherwise>
            <xsl:variable name="vSubtrees" select="$pNode/*"/>

            <xsl:sequence select=
             "f:apply($pFuncNode,
                      $pNode/@tree-nodeLabel,
                      int:foldl-tree_($pFuncNode, $pFuncSubtrees, $pA0,
                                      $vSubtrees)
                      )"
            />
         </xsl:otherwise>
      </xsl:choose>
    </xsl:function>

    <xsl:function name="int:foldl-tree_">
      <xsl:param name="pFuncNode" as="element()"/>
      <xsl:param name="pFuncSubtrees" as="element()"/>
      <xsl:param name="pA0"/>
      <xsl:param name="pSubTrees" as="element()*"/>

      <xsl:sequence select=
       "if(empty($pSubTrees))
         then $pA0
         else
           f:apply($pFuncSubtrees,
               f:foldl-tree($pFuncNode, $pFuncSubtrees, $pA0,
                            $pSubTrees[1]),
               int:foldl-tree_($pFuncNode, $pFuncSubtrees, $pA0,
                               $pSubTrees[position() > 1])
                   )"
       />
    </xsl:function>
</xsl:stylesheet>

In a few words, this is a generic fold() but over a tree (not just over a 
list). As such, it needs two functions to be provided as parameters -- one 
for processing the current node and one for processing all subtrees of the 
current node.

When one writes:

    f:foldl-tree(f:add(), f:add(), 0, /*)

the result of evaluating this is the sum of the values of all 
@tree-nodeLabel attributes of all nodes in the tree.
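
For readers who do not use XSLT, the same shape can be sketched in Python.
This is only an analogue of the stylesheet above, not part of FXSL: the Node
class and the function names are hypothetical, and plain Python callables
stand in for the FXSL function nodes.

```python
from dataclasses import dataclass, field
from operator import add
from typing import Any, Callable, List


@dataclass
class Node:
    """Hypothetical stand-in for an element carrying a tree-nodeLabel."""
    label: Any
    children: List["Node"] = field(default_factory=list)


def foldl_tree(f_node: Callable, f_subtrees: Callable, a0: Any, node: Node) -> Any:
    # Mirrors f:foldl-tree: combine the node's label with the folded subtrees.
    return f_node(node.label, _foldl_subtrees(f_node, f_subtrees, a0, node.children))


def _foldl_subtrees(f_node: Callable, f_subtrees: Callable, a0: Any,
                    subtrees: List[Node]) -> Any:
    # Mirrors int:foldl-tree_: an empty list yields a0; otherwise combine the
    # fold of the first subtree with the fold of the remaining subtrees.
    if not subtrees:
        return a0
    return f_subtrees(
        foldl_tree(f_node, f_subtrees, a0, subtrees[0]),
        _foldl_subtrees(f_node, f_subtrees, a0, subtrees[1:]),
    )


# A small tree with labels 1, 2, 4 and 3.
tree = Node(1, [Node(2, [Node(4)]), Node(3)])
total = foldl_tree(add, add, 0, tree)  # sum of all labels
```

Passing add for both function parameters and 0 as the initial value
corresponds to the call shown above.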

If I pass other functions as parameters, I can perform other kinds of
processing on a (any!) tree.
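
As a rough illustration in Python (again an analogue, not FXSL code), the
sketch below reuses one tree fold with three different pairs of functions.
The tuple representation (label, [children]) and all names are assumptions
made for the example.

```python
from operator import add
from typing import Any, Callable, List, Tuple

Tree = Tuple[Any, List["Tree"]]  # hypothetical (label, children) encoding


def foldl_tree(f_node: Callable, f_subtrees: Callable, a0: Any, tree: Tree) -> Any:
    label, children = tree
    return f_node(label, _fold_subtrees(f_node, f_subtrees, a0, children))


def _fold_subtrees(f_node: Callable, f_subtrees: Callable, a0: Any,
                   subtrees: List[Tree]) -> Any:
    # Empty list yields a0; otherwise combine first subtree with the rest.
    if not subtrees:
        return a0
    return f_subtrees(
        foldl_tree(f_node, f_subtrees, a0, subtrees[0]),
        _fold_subtrees(f_node, f_subtrees, a0, subtrees[1:]),
    )


tree = ("a", [("b", [("d", [])]), ("c", [])])

# Same fold, different function parameters, different processing.
node_count = foldl_tree(lambda label, acc: 1 + acc, add, 0, tree)
depth = foldl_tree(lambda label, acc: 1 + acc, max, 0, tree)
preorder = foldl_tree(lambda label, acc: [label] + acc, add, [], tree)
```

Only the two function arguments and the initial value change between the
three calls; the traversal itself is written once.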

So, in the case of XSLT/XQuery processing, we pass the necessary two
functions as parameters to f:foldl-tree() and we have implemented an
XSLT/XQuery processor.

Why is this all relevant to the current discussion?

Because fold() processing of any kind is essentially streaming.

Therefore, let's just provide the required two functions and not worry about
how the function engine does streaming -- there could be reasonably efficient
implementations. The most obvious example is a lazy implementation -- no
subtrees are ever processed unless ultimately required.
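
A minimal sketch of that laziness in Python, under stated assumptions: the
tree is a hypothetical (label, [children]) tuple, and folded subtrees are
handed to the two functions as zero-argument callables (thunks), so a
function that does not call its thunk never causes that subtree to be
visited. The visited list exists only to make this observable.

```python
from typing import Any, Callable, List, Tuple

Tree = Tuple[Any, List["Tree"]]  # hypothetical (label, children) encoding
visited: List[Any] = []          # labels of the nodes actually inspected


def lazy_foldl_tree(f_node: Callable, f_subtrees: Callable, a0: Any, tree: Tree) -> Any:
    label, children = tree
    visited.append(label)
    # The folded subtrees are passed as a thunk; f_node decides whether to force it.
    return f_node(label, lambda: _lazy_subtrees(f_node, f_subtrees, a0, children))


def _lazy_subtrees(f_node: Callable, f_subtrees: Callable, a0: Any,
                   subtrees: List[Tree]) -> Any:
    if not subtrees:
        return a0
    # Both arguments are thunks, so f_subtrees may skip either of them.
    return f_subtrees(
        lambda: lazy_foldl_tree(f_node, f_subtrees, a0, subtrees[0]),
        lambda: _lazy_subtrees(f_node, f_subtrees, a0, subtrees[1:]),
    )


def contains(target: Any):
    # "Is there a node with this label?" expressed as the two fold functions.
    f_node = lambda label, rest: label == target or rest()
    f_subtrees = lambda first, rest: first() or rest()
    return f_node, f_subtrees


tree = ("a", [("b", [("d", [])]), ("c", [("e", [])])])
f_node, f_subtrees = contains("b")
found = lazy_foldl_tree(f_node, f_subtrees, False, tree)
```

After the call, visited holds only "a" and "b": the subtrees below "b" and
the whole "c" branch were never processed, because the answer did not
require them.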

What is more, in a lazy implementation the source tree can itself be
evaluated lazily -- only those nodes/subtrees that are ultimately required
need to be parsed.
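
One way to approximate this with an ordinary parser is incremental parsing
driven by the consumer. The Python sketch below uses the standard library's
xml.etree.ElementTree.iterparse inside a generator; the sample document and
the function name are assumptions for the example. Note that iterparse reads
its input in chunks, so this is consumer-driven rather than strictly
node-by-node lazy.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator

# Hypothetical sample document; every element carries a tree-nodeLabel.
xml_bytes = (
    b'<n tree-nodeLabel="1">'
    b'<n tree-nodeLabel="2"><n tree-nodeLabel="4"/></n>'
    b'<n tree-nodeLabel="3"/>'
    b'</n>'
)


def labels(source) -> Iterator[int]:
    """Yield labels in document order while the input is being parsed."""
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are already available when the start tag is seen.
            yield int(elem.get("tree-nodeLabel", "0"))
        else:
            # The element is complete: release its content.
            elem.clear()


# Folding over the stream: the whole document is consumed.
total = sum(labels(io.BytesIO(xml_bytes)))

# Stopping early: the generator is simply not asked for more events.
first_over_one = next(x for x in labels(io.BytesIO(xml_bytes)) if x > 1)
```

The consumer pulls labels one at a time, so a fold that reaches its answer
early stops requesting further parse events.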

Just as a side note -- streaming a tree implies linearization. This may work
against efficiency compared with parallelization (e.g. using a DVC (divide
and conquer) approach), which is the ultimate strength of functional
languages and will start to matter more and more, as explained in the paper
"The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software"
(http://www.gotw.ca/publications/concurrency-ddj.htm) by Herb Sutter.

Parallelization may require that different threads share the same data,
which delays the point at which this data can be discarded from memory.
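
The structure of such a divide-and-conquer fold can be sketched in Python
with concurrent.futures. The sketch assumes that the subtree-combining
function is associative with a0 as its identity, so that partial results can
be combined in any grouping; the tuple tree encoding and the names are
hypothetical. It shows the shape only: CPython threads will not speed up
CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add
from typing import Any, Callable, List, Tuple

Tree = Tuple[Any, List["Tree"]]  # hypothetical (label, children) encoding


def fold_tree(f_node: Callable, f_subtrees: Callable, a0: Any, tree: Tree) -> Any:
    # Sequential fold, used inside each task.
    label, children = tree
    folded = (fold_tree(f_node, f_subtrees, a0, c) for c in children)
    return f_node(label, reduce(f_subtrees, folded, a0))


def parallel_fold_tree(f_node: Callable, f_subtrees: Callable, a0: Any,
                       tree: Tree, pool: ThreadPoolExecutor) -> Any:
    label, children = tree
    # Divide: each top-level subtree is folded in its own task.
    futures = [pool.submit(fold_tree, f_node, f_subtrees, a0, c) for c in children]
    # Conquer: combine the partial results, then the root label.
    return f_node(label, reduce(f_subtrees, (f.result() for f in futures), a0))


tree = (1, [(2, [(4, [])]), (3, [])])
with ThreadPoolExecutor(max_workers=2) as pool:
    total = parallel_fold_tree(add, add, 0, tree, pool)
```

All tasks hold references into the same tree, so none of it can be released
until the last task has finished, which is the memory cost mentioned above.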


Cheers,

Dimitre Novatchev.


"Daniela Florescu" <dflorescu@mac.com> wrote in message
news:30291DBF-590E-11D9-A33A-000393DC762C@mac.com...
>> As someone who was until very recently "one of those implementers" I
>> completely disagree with you. We had customers who want to process XML
>> documents that are hundreds of megabytes to gigabytes in size who can't
>> afford to materialize even a fraction of these documents in certain
>> cases.
>
>
> Dare,
>
> what exactly are you disagreeing with ?
>
> This discussion is going in zig-zag. Did you read my postings? Did I ever
> tell you that XQuery was the solution for **everything**!? I don't remember
> saying that.
>
> I was just reading this SAX/streaming/memory consumption discussion, and
> being a person who actually designed and implemented such a streaming XML
> query processor, I had a terrible sensation of deja vu. There are solid
> solutions in the published and implemented state of the art already.
>
> I was just curious to know if there are deep technical issues why people
> have to reinvent such techniques. I learned that there are cases where
> indeed there is no point in using preexisting XML processors, simply
> because they don't apply, and people have to do it by hand.
>
> But I also learned that a lot of reinventing the wheel is also for fun.
> I'm not going to comment on that. Next time I take a plane I can only cross
> my fingers that the people who designed the air traffic control system
> optimized for something different than their programmers' fun.
>
> So I reiterate my point: there are well known techniques to maximize
> streaming and minimize memory consumption. Many of them are already
> implemented in existing systems, and many will show up in the next versions
> of various industrial strength products.
>
> In a big majority of the cases, people who need to process XML don't need
> to understand the gory details of buffer management. And they shouldn't.
> They should concentrate only on the logic of their application, and rely on
> good XSLT/XQuery compilers and runtimes to do the right job concerning the
> implementation strategy.
>
> As for the well known techniques for minimizing memory consumption, I am
> afraid that I cannot point to any specific technique on this mailing list,
> for the following reasons:
>
> (a) it's too much literature to be discussed in such a forum
> (b) a lot of it is folklore
> (c) a lot of it is simply inherited from streaming and lazy evaluation of
>     SQL query processors, using the iterator model. (Goetz Graefe can tell
>     you much more about that than me, and he's closer to you), and you can
>     imagine how much folklore is there too after 30 years
>
> The best idea that comes to my mind is to encourage somebody to write a
> survey of such techniques, that might be helpful.
>
> My conclusion: please rely on good compilers, good optimizers and good
> runtimes instead of writing XML processors by hand if you don't *really*
> have to (and few people really have to). And trust the vendors/open source
> implementors that they will produce such good compilers, optimizers and
> runtimes when time comes.
>
> As far as I am concerned, the horse is dead, I don't have much else to add.
>
> Best regards, have a wonderful holiday season,
> Dana

Follow-Ups:
- Re: [xml-dev] Re: Streaming XML (WAS: More on taming SAX (was Re:[xml-dev] ANN: Amara XML Toolkit 0.9.0))
  - From: Uche Ogbuji <Uche.Ogbuji@fourthought.com>
- Re: Streaming XML (WAS: More on taming SAX (was Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0))
  - From: "Dimitre Novatchev" <dnovatchev@yahoo.com>

References:
- RE: [xml-dev] Streaming XML (WAS: More on taming SAX (was Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0))
  - From: "Dare Obasanjo" <dareo@microsoft.com>
- Re: [xml-dev] Streaming XML (WAS: More on taming SAX (was Re: [xml-dev] ANN: Amara XML Toolkit 0.9.0))
  - From: Daniela Florescu <dflorescu@mac.com>
