Re: [xml-dev] combining XMLEvent lists
- From: David <dlee@calldei.com>
- To: xml-dev@lists.xml.org
- Date: Tue, 28 Sep 2010 13:55:15 -0400
I have such a beast if you're interested, or pieces of it.
In xmlsh I've experimented with StAX pipelines, which are queues of StAX
events.
These are pretty much exactly as Michael describes below, except that in my
case they are multi-threaded, with a reader on one end and a writer on
the other. The underlying techniques work well.
The source is available at SourceForge (follow the link from www.xmlsh.org).
I can help you locate the relevant files if interested.
Interestingly though, I have found that using StAX in a pipeline has
*more overhead* than using text serialization/deserialization. Your case
may differ, but it is something to consider. After about a month (pt) of
hard work to get this magic binary StAX pipeline to work, imagining it
would be something like 10x faster than text ... I was disheartened to
discover it was about 20% *slower*.
David A. Lee
dlee@calldei.com
http://www.xmlsh.org
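
For readers who want a concrete picture of the multi-threaded event queue described above, here is a minimal sketch using only the JDK's StAX and java.util.concurrent classes. It is not the xmlsh code: the class name `EventPipeDemo`, the method `pipe`, the queue capacity, and the 5-second timeout are all assumptions made for illustration. One thread parses and enqueues events; the calling thread dequeues and serializes them.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical illustration only; not taken from xmlsh.
final class EventPipeDemo {

    /** Parses xml on a producer thread and re-serializes it on the calling thread. */
    static String pipe(String xml) {
        BlockingQueue<XMLEvent> queue = new ArrayBlockingQueue<>(64); // assumed capacity
        AtomicReference<Exception> failure = new AtomicReference<>();

        // Reader end of the pipe: parse and enqueue events until the input is exhausted.
        Thread producer = new Thread(() -> {
            try {
                XMLEventReader reader = XMLInputFactory.newInstance()
                        .createXMLEventReader(new StringReader(xml));
                while (reader.hasNext()) {
                    queue.put(reader.nextEvent());
                }
            } catch (Exception e) {
                failure.set(e); // reported by the consumer when its poll times out
            }
        });
        producer.setDaemon(true);
        producer.start();

        // Writer end of the pipe: dequeue events and serialize them.
        StringWriter out = new StringWriter();
        try {
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            while (true) {
                XMLEvent event = queue.poll(5, TimeUnit.SECONDS);
                if (event == null) {
                    throw new IllegalStateException("producer failed or stalled", failure.get());
                }
                writer.add(event);
                if (event.isEndDocument()) {
                    break; // END_DOCUMENT doubles as the end-of-stream marker
                }
            }
            writer.close();
            return out.toString();
        } catch (XMLStreamException | InterruptedException e) {
            throw new IllegalStateException("pipeline failed", e);
        } finally {
            producer.interrupt(); // unblock the producer if the consumer stopped early
        }
    }

    public static void main(String[] args) {
        System.out.println(pipe("<a><b>x</b></a>"));
    }
}
```

Every event crosses the queue as a separate object, so there is per-event allocation and synchronization cost. That may be one reason such a pipe does not automatically beat plain text, though the sketch makes no attempt to measure it.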
On 9/28/2010 1:46 PM, Michael Kay wrote:
>
> On 28/09/2010 6:24 PM, David wrote:
>> My guess would be "XMLEvent" is referring to StAX Events.
>>
>> http://woodstox.codehaus.org/javadoc/stax-api/1.0/javax/xml/stream/events/XMLEvent.html
>>
>
> Ah yes, you're probably right. I forgot that's what they were called...
>
> If that's the case it looks fairly easy to present a List<XMLEvent>
> via an XMLEventReader, which can be wrapped in a StaxSource and
> supplied to any Saxon interface that expects a Source, for example a
> DocumentBuilder.
>
> Michael Kay
> Saxonica
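
The idea of presenting a List<XMLEvent> via an XMLEventReader can be sketched as follows. This is an illustrative assumption, not Saxon code: `ListEventReader` and `ListEventReaderDemo` are hypothetical names, and the JDK's identity `Transformer` stands in for the Saxon interface that would consume the `StAXSource`. The `nextTag` and `getElementText` implementations are simplified.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

/** Hypothetical adapter: replays an in-memory event list as an XMLEventReader. */
final class ListEventReader implements XMLEventReader {
    private final List<XMLEvent> events;
    private int pos = 0;

    ListEventReader(List<XMLEvent> events) { this.events = events; }

    @Override public boolean hasNext() { return pos < events.size(); }

    @Override public XMLEvent nextEvent() {
        if (!hasNext()) throw new NoSuchElementException();
        return events.get(pos++);
    }

    @Override public Object next() { return nextEvent(); }

    @Override public XMLEvent peek() { return hasNext() ? events.get(pos) : null; }

    @Override public void remove() { throw new UnsupportedOperationException(); }

    @Override public Object getProperty(String name) {
        throw new IllegalArgumentException("unsupported property: " + name);
    }

    @Override public void close() { }

    // Simplified: concatenates character data up to the next end tag.
    @Override public String getElementText() throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        while (hasNext()) {
            XMLEvent e = nextEvent();
            if (e.isEndElement()) return text.toString();
            if (e.isStartElement()) throw new XMLStreamException("element is not text-only");
            if (e.isCharacters()) text.append(e.asCharacters().getData());
        }
        throw new XMLStreamException("unexpected end of events");
    }

    // Simplified: skips whitespace, comments, and the like until a start or end tag.
    @Override public XMLEvent nextTag() throws XMLStreamException {
        while (hasNext()) {
            XMLEvent e = nextEvent();
            if (e.isStartElement() || e.isEndElement()) return e;
            if (e.isCharacters() && !e.asCharacters().isWhiteSpace()) {
                throw new XMLStreamException("non-whitespace text before tag");
            }
        }
        throw new XMLStreamException("no more tags");
    }
}

final class ListEventReaderDemo {
    /** Parses xml into a list of events, then feeds the list to a transformer. */
    static String roundTrip(String xml) {
        try {
            XMLEventReader parsed = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            List<XMLEvent> events = new ArrayList<>();
            while (parsed.hasNext()) events.add(parsed.nextEvent());

            // Identity transform as a stand-in for any consumer that accepts a Source.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new StAXSource(new ListEventReader(events)),
                               new StreamResult(out));
            return out.toString();
        } catch (XMLStreamException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<a><b>x</b></a>"));
    }
}
```

Note that `StAXSource` requires the reader to be positioned at a start-document or start-element event, so a list holding a bare fragment would need to begin with its start element.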
>
>>
>> which is a parsed XML event (startDocument, startElement,
>> characters ...)
>>
>>
>> David A. Lee
>> dlee@calldei.com
>> http://www.xmlsh.org
>>
>>
>> On 9/28/2010 1:17 PM, Michael Kay wrote:
>>>
>>> On 28/09/2010 4:13 PM, Johannes.Lichtenberger wrote:
>>>> On 09/28/2010 04:33 PM, Michael Kay wrote:
>>>>> Sounds fascinating, and I wish I had time to get involved. It would
>>>>> certainly be elegant if you could have both the productivity of
>>>>> writing
>>>>> this declaratively in XSLT and the performance of running it on
>>>>> Hadoop
>>>>> MapReduce. Intrinsically, the two seem to fit together hand in glove,
>>>>> but I suspect some engineering effort is needed to make it work.
>>>> Hello Michael,
>>>>
>>>> I think it would be too complicated to achieve the desired grouping
>>>> with
>>>> Java. Do you think it makes sense to first serialize the results and
>>>> then use Saxon's XSLT 2.0 processor to achieve the results? Or do you
>>>> have any direct input from a List of XMLEvents to Saxon's XSLT
>>>> processor? I assume it reads XML-data from an InputSource or some kind
>>>> of a stream.
>>>
>>> I'm not sure whether "XMLEvent" is something I'm expected to know
>>> about: you said earlier "
>>>
>>> I've got an Iterator with Lists (Java) out of XMLEvents, which are
>>> serialized fragments
>>>
>>> so I assume they are just strings containing unparsed XML. That's
>>> not going to be a particularly efficient representation for
>>> processing, so the first step will be to parse each one to a tree
>>> (for example, a Saxon TinyTree).
>>>
>>> You then said,
>>>
>>> I want to find and combine Lists which have the same page id and the same
>>> revision timestamp
>>>
>>> but you left out the critical information as to whether this would
>>> always combine elements
>>> that were adjacent in the list. If the groups are adjacent then you
>>> could potentially devise
>>> a strategy that avoids holding all the trees in memory at the same time.
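
To illustrate why adjacency matters, here is a minimal sketch of grouping adjacent items in plain Java. When equal keys always arrive next to each other, only the current group needs to be held in memory. The class `AdjacentGrouping`, the method `groupAdjacent`, and the "pageId@timestamp:payload" strings are hypothetical stand-ins; in the real case the items would be parsed revisions and the key would combine page id and revision timestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical illustration of adjacent grouping; not Saxon's implementation.
final class AdjacentGrouping {

    /**
     * Streams over items and hands each run of consecutive items with equal
     * keys to the sink. Only one group is buffered at a time.
     */
    static <T, K> void groupAdjacent(Iterator<T> items,
                                     Function<T, K> keyOf,
                                     Consumer<List<T>> sink) {
        List<T> group = new ArrayList<>();
        K currentKey = null;
        while (items.hasNext()) {
            T item = items.next();
            K key = keyOf.apply(item);
            // A key change closes the current group and starts a new one.
            if (!group.isEmpty() && !Objects.equals(key, currentKey)) {
                sink.accept(group);
                group = new ArrayList<>();
            }
            currentKey = key;
            group.add(item);
        }
        if (!group.isEmpty()) {
            sink.accept(group); // flush the final group
        }
    }

    public static void main(String[] args) {
        // Assumed sample data: "pageId@timestamp:payload".
        List<String> revisions = Arrays.asList("1@10:a", "1@10:b", "2@10:c", "2@11:d");
        groupAdjacent(revisions.iterator(),
                      r -> r.substring(0, r.indexOf(':')),
                      group -> System.out.println(group));
    }
}
```

If equal keys can be scattered through the input, this approach would split them into separate groups, and a sort or a full in-memory map would be needed instead.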
>>>
>>> Supplying a sequence of trees as input to Saxon grouping is not a
>>> problem. Using the s9api interface,
>>> you can use a DocumentBuilder to build each tree as an XdmNode, then
>>> a sequence can be constructed using
>>> the constructor public XdmValue(Iterable<XdmItem> items), and then
>>> this XdmValue can be passed as a parameter
>>> to an XsltTransformer, and a reference to the parameter can be used
>>> in <xsl:for-each-group select="$param">.
>>> Using this approach the whole structure will be held in memory, but
>>> there are ways of avoiding that by going
>>> to lower-level interfaces.
>>>
>>> Michael Kay
>>> Saxonica
>>>
>>>
>>>> It's a special case, where two or more revisions of one article are
>>>> made
>>>> at the same time (in the same second). I would have to look through
>>>> the
>>>> XML file with BaseX or Saxon, but I'm pretty sure such cases exist
>>>> somewhere in the huge file (as of now I've only extracted a small
>>>> subset
>>>> of articles and replaced WikiText inside text-elements with XML).
>>>>
>>>> The whole task is to sort the revisions in order to shred them into our XML
>>>> data storage system (the deltas of the revisions), which has the
>>>> capability to store and retrieve revisions compactly and
>>>> efficiently. In
>>>> parallel I'm currently writing the import of a sorted XML file.
>>>>
>>>> My main task (master project and thesis) is or will be the
>>>> visualization
>>>> of temporal tree structured data to gain further insights into the
>>>> evolution of the data, which are otherwise very difficult to realize.
>>>>
>>>> regards,
>>>> Johannes
>>>>
>>>
>>>
>>> _______________________________________________________________________
>>>
>>> XML-DEV is a publicly archived, unmoderated list hosted by OASIS
>>> to support XML implementation and development. To minimize
>>> spam in the archives, you must subscribe before posting.
>>>
>>> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
>>> Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
>>> subscribe: xml-dev-subscribe@lists.xml.org
>>> List archive: http://lists.xml.org/archives/xml-dev/
>>> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
>>
>
>