Re: [xml-dev] combining XMLEvent lists

  On 28/09/2010 6:24 PM, David wrote:
>  My guess would be that "XMLEvent" is referring to StAX events.
>
> http://woodstox.codehaus.org/javadoc/stax-api/1.0/javax/xml/stream/events/XMLEvent.html

Ah yes, you're probably right. I forgot that's what they were called...

If that's the case it looks fairly easy to present a List<XMLEvent> via 
an XMLEventReader, which can be wrapped in a StaxSource and supplied to 
any Saxon interface that expects a Source, for example a DocumentBuilder.
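
As a rough illustration of the idea above, here is a minimal sketch of an
XMLEventReader backed by a List<XMLEvent>. The class name ListEventReader is
hypothetical (it is not part of StAX or Saxon), and the sketch assumes that
"StaxSource" refers to the standard JAXP class
javax.xml.transform.stax.StAXSource.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.stax.StAXSource;

/** Hypothetical adapter: replays an in-memory List<XMLEvent> as an XMLEventReader. */
class ListEventReader implements XMLEventReader {
    private final List<XMLEvent> events;
    private int pos = 0; // index of the next event to hand out

    ListEventReader(List<XMLEvent> events) { this.events = events; }

    @Override public boolean hasNext() { return pos < events.size(); }

    @Override public XMLEvent nextEvent() {
        if (!hasNext()) throw new NoSuchElementException();
        return events.get(pos++);
    }

    @Override public Object next() { return nextEvent(); }

    // Look at the next event without consuming it; null at end of list.
    @Override public XMLEvent peek() { return hasNext() ? events.get(pos) : null; }

    // Collects text up to the matching end tag; assumes a START_ELEMENT was just read.
    @Override public String getElementText() throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        while (true) {
            XMLEvent e = nextEvent();
            if (e.isEndElement()) return text.toString();
            if (e.isCharacters()) text.append(e.asCharacters().getData());
            else if (e.isStartElement())
                throw new XMLStreamException("element content is not text-only");
        }
    }

    // Skips ignorable events until the next start or end tag.
    @Override public XMLEvent nextTag() throws XMLStreamException {
        while (true) {
            XMLEvent e = nextEvent();
            if (e.isStartElement() || e.isEndElement()) return e;
            if (e.isCharacters() && !e.asCharacters().isWhiteSpace())
                throw new XMLStreamException("unexpected text before tag");
        }
    }

    @Override public Object getProperty(String name) {
        throw new IllegalArgumentException("unsupported property: " + name);
    }

    @Override public void close() { /* nothing to release */ }

    @Override public void remove() { throw new UnsupportedOperationException(); }

    public static void main(String[] args) throws XMLStreamException {
        XMLEventFactory f = XMLEventFactory.newInstance();
        List<XMLEvent> events = Arrays.asList(
            f.createStartDocument(),
            f.createStartElement("", "", "page"),
            f.createCharacters("text"),
            f.createEndElement("", "", "page"),
            f.createEndDocument());
        XMLEventReader reader = new ListEventReader(events);
        // StAXSource requires the reader to be positioned at START_DOCUMENT or START_ELEMENT.
        StAXSource source = new StAXSource(reader);
        System.out.println("wrapped reader: " + (source.getXMLEventReader() == reader));
    }
}
```

The resulting StAXSource could then be handed to whatever expects a
javax.xml.transform.Source. Note that the list has to start with a
START_DOCUMENT or START_ELEMENT event, otherwise the StAXSource constructor
rejects the reader.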

Michael Kay
Saxonica

>
> which is a parsed XML event (startDocument, startElement, characters, ...)
>
>
> David A. Lee
> dlee@calldei.com
> http://www.xmlsh.org
>
>
> On 9/28/2010 1:17 PM, Michael Kay wrote:
>>
>>  On 28/09/2010 4:13 PM, Johannes.Lichtenberger wrote:
>>> On 09/28/2010 04:33 PM, Michael Kay wrote:
>>>> Sounds fascinating, and I wish I had time to get involved. It would
>>>> certainly be elegant if you could have both the productivity of
>>>> writing this declaratively in XSLT and the performance of running it
>>>> on Hadoop MapReduce. Intrinsically, the two seem to fit together hand
>>>> in glove, but I suspect some engineering effort is needed to make it
>>>> work.
>>> Hello Michael,
>>>
>>> I think it would be too complicated to achieve the desired grouping
>>> with Java. Do you think it makes sense to first serialize the results
>>> and then use Saxon's XSLT 2.0 processor to do the grouping? Or is there
>>> a direct way to feed a List of XMLEvents into Saxon's XSLT processor?
>>> I assume it reads XML data from an InputSource or some kind of a
>>> stream.
>>
>> I'm not sure whether "XMLEvent" is something I'm expected to know
>> about. You said earlier:
>>
>>     "I've got an Iterator with Lists (Java) out of XMLEvents, which are
>>     serialized fragments"
>>
>> so I assume they are just strings containing unparsed XML. That's not
>> going to be a particularly efficient representation for processing,
>> so the first step will be to parse each one to a tree (for example, a
>> Saxon TinyTree).
>>
>> You then said:
>>
>>     "I want to combine Lists which have the same page id and the same
>>     revision timestamp"
>>
>> but you left out the critical information as to whether this would
>> always combine elements that were adjacent in the list. If the groups
>> are adjacent then you could potentially devise a strategy that avoids
>> holding all the trees in memory at the same time.
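
To make the adjacent-groups case concrete, here is a minimal sketch in plain
Java of grouping consecutive items that share a key, keeping only one group
in memory at a time. The class name AdjacentGroups and the string-encoded
"pageId@timestamp" keys are assumptions made for illustration; they are not
taken from the thread or from Saxon.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical helper: lazily groups consecutive items that share a key. */
final class AdjacentGroups<T, K> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final Function<? super T, ? extends K> keyOf;
    private T pending;          // first item of the next group, already read from source
    private boolean hasPending; // false once the source is exhausted

    AdjacentGroups(Iterator<T> source, Function<? super T, ? extends K> keyOf) {
        this.source = source;
        this.keyOf = keyOf;
        advance();
    }

    // Pull one item ahead so we can tell where the current group ends.
    private void advance() {
        hasPending = source.hasNext();
        pending = hasPending ? source.next() : null;
    }

    @Override public boolean hasNext() { return hasPending; }

    @Override public List<T> next() {
        if (!hasPending) throw new NoSuchElementException();
        K key = keyOf.apply(pending);
        List<T> group = new ArrayList<>();
        do {
            group.add(pending);
            advance();
        } while (hasPending && Objects.equals(key, keyOf.apply(pending)));
        return group; // only this one group is materialized
    }

    public static void main(String[] args) {
        // Illustrative data: "pageId@timestamp:payload" (made-up values).
        List<String> revisions = Arrays.asList(
            "p1@t1:a", "p1@t1:b", "p2@t1:c", "p2@t2:d");
        Iterator<List<String>> groups = new AdjacentGroups<String, String>(
            revisions.iterator(), s -> s.substring(0, s.indexOf(':')));
        while (groups.hasNext()) {
            System.out.println(groups.next());
        }
    }
}
```

This only merges neighbours: if two items with the same page id and
timestamp are separated by a different item, they end up in separate groups,
so the approach relies on the input already being ordered by the grouping
key.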
>>
>> Supplying a sequence of trees as input to Saxon grouping is not a
>> problem. Using the s9api interface, you can use a DocumentBuilder to
>> build each tree as an XdmNode, then a sequence can be constructed using
>> the constructor public XdmValue(Iterable<XdmItem> items), and then this
>> XdmValue can be passed as a parameter to an XsltTransformer, and a
>> reference to the parameter can be used in
>> <xsl:for-each-group select="$param">. Using this approach the whole
>> structure will be held in memory, but there are ways of avoiding that
>> by going to lower-level interfaces.
>>
>> Michael Kay
>> Saxonica
>>
>>
>>> It's a special case, where two or more revisions of one article are
>>> made at the same time (in the same second). I would have to look
>>> through the XML file with BaseX or Saxon, but I'm pretty sure such
>>> cases exist somewhere in the huge file (as of now I've only extracted
>>> a small subset of articles and replaced WikiText inside text-elements
>>> with XML).
>>>
>>> The whole task is to sort the revisions in order to shred them (the
>>> deltas of the revisions) into our XML data storage system, which has
>>> the capability to store and retrieve revisions compactly and
>>> efficiently. In parallel I'm currently writing the import of a sorted
>>> XML file.
>>>
>>> My main task (master project and thesis) is or will be the
>>> visualization of temporal tree-structured data to gain further
>>> insights into the evolution of the data, which are otherwise very
>>> difficult to obtain.
>>>
>>> regards,
>>> Johannes
>>>
>>
>>
>> _______________________________________________________________________
>>
>> XML-DEV is a publicly archived, unmoderated list hosted by OASIS
>> to support XML implementation and development. To minimize
>> spam in the archives, you must subscribe before posting.
>>
>> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
>> Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
>> subscribe: xml-dev-subscribe@lists.xml.org
>> List archive: http://lists.xml.org/archives/xml-dev/
>> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
Copyright 1993-2007 XML.org. This site is hosted by OASIS