I processed a 3GB XML file ... using XSLT streaming

Hi Folks,

I processed this 3GB Open Street Map (OSM) XML file for the state of Massachusetts:

http://downloads.cloudmade.com/americas/northern_america/united_states/massachusetts/massachusetts.osm.bz2

An OSM file consists of <node> elements (a node represents a point), <way> elements (a way represents a line or an area), and <relation> elements (a relation represents a relationship between other elements). This page describes the elements:

http://wiki.openstreetmap.org/wiki/Tags

Here is a snippet of the XML file; it shows the data for one of the schools in Massachusetts (a school is identified by a <tag> child with k="amenity" and v="school"):

<osm version="0.6" generator="osm-extract.pl"> 
    ...
    <node id="358264143" version="1" timestamp="2009-03-10T04:54:34Z"
          uid="4732" user="iandees" changeset="774950" lat="42.2017681"
          lon="-70.7561527">
        <tag k="gnis:created" v="08/27/2002"/>
        <tag k="gnis:county_id" v="023"/>
        <tag k="name" v="Scituate Center Central School"/>
        <tag k="amenity" v="school"/>
        <tag k="gnis:feature_id" v="602607"/>
        <tag k="gnis:state_id" v="25"/>
        <tag k="ele" v="34"/>
    </node>
    ... 
</osm>

Problem: What are all the schools in the state of Massachusetts?

I was able to answer that question using the new streaming capability in XSLT 3.0.

Below are two versions of an XSLT program. The first (non-streaming) version is how one might try to solve the problem without XSLT streaming. I ran it on the XML file and it quickly halted with an "Out of memory" error. The second version uses XSLT streaming, and it solved the problem. It found 6009 nodes tagged as schools in Massachusetts! At the bottom of this message I show some of them.

How long did it take for the streaming XSLT program to process the 3GB XML file? Answer: 134 seconds.

I ran my streaming XSLT program from oXygen XML, which invoked Saxon, on my laptop running Windows 7.

--------------------------------------------------
    Non-streaming approach
--------------------------------------------------
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">
    
    <xsl:output method="xml" />
    
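    <!-- doc() builds the entire 3GB document as an in-memory tree; this is the step that runs out of memory -->
    <xsl:variable name="MA" select="doc('../huge-file/massachusetts.xml')" />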
    
    <xsl:template match="/">
        <xsl:apply-templates select="$MA/osm" />
    </xsl:template>
    
    <xsl:template match="osm">
        <Schools>
            <xsl:for-each select="node">
                <xsl:if test="tag[(@k eq 'amenity') and (@v eq 'school')]">
                    <school>
                        <xsl:value-of select="tag[@k eq 'name']/@v" />
                    </school>
                </xsl:if>
            </xsl:for-each>
        </Schools>
    </xsl:template>
    
</xsl:stylesheet>

--------------------------------------------------
    Streaming approach
--------------------------------------------------
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="#all"
                version="3.0">
    
    <xsl:output method="xml" />
    
    <xsl:template match="/">
        <xsl:stream href="../huge-file/massachusetts.xml">
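            <count>
                <xsl:for-each select="osm">
                    <!-- Visit each node element in document order, carrying a running count from one iteration to the next -->
                    <xsl:iterate select="node">
                        <xsl:param name="count" select="0" as="xs:decimal"/>
                        <xsl:next-iteration>
                            <xsl:with-param name="count" select="$count+1"/>
                        </xsl:next-iteration>
                        <!-- After the last node element, output the total -->
                        <xsl:on-completion>
                            <xsl:value-of select="$count"/>
                        </xsl:on-completion>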
                    </xsl:iterate>
                </xsl:for-each>
            </count>
        </xsl:stream>
    </xsl:template>
    
</xsl:stylesheet>
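
The streaming listing above keeps a running count of the <node> elements. Below is a minimal sketch of how the same xsl:stream / xsl:iterate structure might emit the numbered list of schools instead. It is an illustration under stated assumptions, not necessarily the stylesheet that produced the output shown in this message. The assumptions: the copy-of(.) "burst mode" idiom is my choice, used because a streamed node can be read only once while each <node> needs two looks at its <tag> children; the parameter type xs:integer is also my choice; and a streaming-capable processor such as Saxon-EE is available. Note that xsl:stream is the working-draft instruction; the final XSLT 3.0 Recommendation renamed it xsl:source-document (with streamable="yes").

--------------------------------------------------
    Streaming approach, listing the schools (sketch)
--------------------------------------------------
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="#all"
                version="3.0">

    <xsl:output method="xml" />

    <xsl:template match="/">
        <xsl:stream href="../huge-file/massachusetts.xml">
            <Schools>
                <!-- Assumption: copy-of(.) turns each streamed node element into a
                     small in-memory tree, so its tag children can be read twice -->
                <xsl:iterate select="osm/node/copy-of(.)">
                    <!-- Number of schools written so far -->
                    <xsl:param name="count" select="0" as="xs:integer"/>
                    <xsl:if test="tag[@k eq 'amenity' and @v eq 'school']">
                        <school>
                            <xsl:value-of select="$count + 1"/>
                            <xsl:text>. </xsl:text>
                            <xsl:value-of select="tag[@k eq 'name']/@v"/>
                        </school>
                        <!-- Only a school advances the counter; for any other node
                             the parameter keeps its value into the next iteration -->
                        <xsl:next-iteration>
                            <xsl:with-param name="count" select="$count + 1"/>
                        </xsl:next-iteration>
                    </xsl:if>
                </xsl:iterate>
            </Schools>
        </xsl:stream>
    </xsl:template>

</xsl:stylesheet>

Only one <node> subtree is held in memory at a time, so memory use should stay roughly flat however large the input file is.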

--------------------------------------------------
    Schools in Massachusetts
--------------------------------------------------
<Schools>
    <school>1. Wayland Middle School</school>
    <school>2. Walnut Hill School for the Arts</school>
    <school>3. Library</school>
    <school>4. Resource Center</school>
    <school>5. Deerfield Academy</school>
    <school>6. Arthur W Coolidge Middle</school>
    <school>7. Lilliput School</school>
    <school>8. L H Coffin</school>
    <school>9. St Joseph Central High</school>
    <school>10. Stoneham High School</school>
    <school>11. St Louis</school>
    <school>12. Franklin</school>
    <school>13. East Falmouth Elem</school>
    <school>14. May Institute (Woburn)</school>
    <school>15. Frolio Jr Hs</school>
    <school>16. Barbieri Elem</school>
    <school>17. Bishop Stang High</school>
    <school>18. Jackson Mann</school>
    <school>19. Douglas</school>
    <school>20. Brookfield Elementary</school>

    ...

    <school>5996. Hawley Grammar School (was here)</school>
    <school>5997. Shady Hill School</school>
    <school>5998. Cotting School</school>
    <school>5999. Morril 2</school>
    <school>6000. Morril 4 North</school>
    <school>6001. Morril 3</school>
    <school>6002. John W Wynn Middle School</school>
    <school>6003. John W. Wynn Middle</school>
    <school>6004. Boston Architectural College</school>
    <school>6005. High School</school>
    <school>6006. Sargent College</school>
    <school>6007. Stop -n- Go Driving Academy</school>
    <school>6008. Ashland High School</school>
    <school>6009. Center Stage Dance Academy</school>
</Schools>

