Re: [xml-dev] combining XMLEvent lists
- From: Michael Kay <mike@saxonica.com>
- To: xml-dev@lists.xml.org
- Date: Tue, 28 Sep 2010 18:46:15 +0100
On 28/09/2010 6:24 PM, David wrote:
> My guess would be "XMLEvent" is referring to StAX Events.
>
> http://woodstox.codehaus.org/javadoc/stax-api/1.0/javax/xml/stream/events/XMLEvent.html
Ah yes, you're probably right. I forgot that's what they were called...
If that's the case it looks fairly easy to present a List<XMLEvent> via
an XMLEventReader, which can be wrapped in a StaxSource and supplied to
any Saxon interface that expects a Source, for example a DocumentBuilder.
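
As a rough illustration of the first step only, here is a minimal sketch of an XMLEventReader that replays an in-memory List<XMLEvent>. The class name ListEventReader is hypothetical, and the sketch uses only the standard javax.xml.stream API. Handing the reader on (for example to the JAXP class javax.xml.transform.stax.StAXSource, or to whatever Source wrapper Saxon expects) is not shown and would need checking against the versions in use.

```java
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical adapter: replays an in-memory list of StAX events. */
final class ListEventReader implements XMLEventReader {
    private final List<XMLEvent> events;
    private int pos = 0; // index of the next event to deliver

    ListEventReader(List<XMLEvent> events) {
        this.events = events;
    }

    @Override public boolean hasNext() {
        return pos < events.size();
    }

    @Override public XMLEvent nextEvent() throws XMLStreamException {
        if (!hasNext()) throw new NoSuchElementException("no more events");
        return events.get(pos++);
    }

    @Override public Object next() {
        if (!hasNext()) throw new NoSuchElementException("no more events");
        return events.get(pos++);
    }

    /** Returns the next event without consuming it, or null at the end. */
    @Override public XMLEvent peek() {
        return hasNext() ? events.get(pos) : null;
    }

    /** Assumes the START_ELEMENT has just been read; consumes up to its END_ELEMENT. */
    @Override public String getElementText() throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        while (hasNext()) {
            XMLEvent e = nextEvent();
            if (e.isEndElement()) return text.toString();
            if (e.isStartElement()) throw new XMLStreamException("element is not text-only");
            if (e.isCharacters()) text.append(e.asCharacters().getData());
        }
        throw new XMLStreamException("unexpected end of event list");
    }

    /** Skips whitespace, comments and similar events until the next start or end tag. */
    @Override public XMLEvent nextTag() throws XMLStreamException {
        while (hasNext()) {
            XMLEvent e = nextEvent();
            if (e.isStartElement() || e.isEndElement()) return e;
            if (e.isCharacters() && !e.asCharacters().isWhiteSpace()) {
                throw new XMLStreamException("non-whitespace text before tag");
            }
        }
        throw new XMLStreamException("no more tags");
    }

    @Override public Object getProperty(String name) {
        throw new IllegalArgumentException("unsupported property: " + name);
    }

    @Override public void close() {
        // Nothing to release: the events live in memory.
    }

    @Override public void remove() {
        throw new UnsupportedOperationException();
    }
}
```

Note that a consumer will normally expect the list to begin with a StartDocument or StartElement event and to be well balanced; the sketch does not check that.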
Michael Kay
Saxonica
>
> which is a parsed XML event (startDocument, startElement , characters
> ... )
>
>
> David A. Lee
> dlee@calldei.com
> http://www.xmlsh.org
>
>
> On 9/28/2010 1:17 PM, Michael Kay wrote:
>>
>> On 28/09/2010 4:13 PM, Johannes.Lichtenberger wrote:
>>> On 09/28/2010 04:33 PM, Michael Kay wrote:
>>>> Sounds fascinating, and I wish I had time to get involved. It would
>>>> certainly be elegant if you could have both the productivity of
>>>> writing
>>>> this declaratively in XSLT and the performance of running it on Hadoop
>>>> MapReduce. Intrinsically, the two seem to fit together hand in glove,
>>>> but I suspect some engineering effort is needed to make it work.
>>> Hello Michael,
>>>
>>> I think it would be too complicated to achieve the desired grouping
>>> with
>>> Java. Do you think it makes sense to first serialize the results and
>>> then use Saxon's XSLT 2.0 processor to achieve the results? Or do you
>>> have any direct input from a List of XMLEvents to Saxon's XSLT
>>> processor? I assume it reads XML-data from an InputSource or some kind
>>> of a stream.
>>
>> I'm not sure whether "XMLEvent" is something I'm expected to know
>> about: you said earlier
>>
>>    "I've got an Iterator with Lists (Java) out of XMLEvents, which are
>>    serialized fragments"
>>
>> so I assume they are just strings containing unparsed XML. That's not
>> going to be a particularly efficient representation for processing,
>> so the first step will be to parse each one to a tree (for example, a
>> Saxon TinyTree).
>>
>> You then said,
>>
>>    "I want to combine Lists which have the same page id and the same
>>    revision timestamp"
>>
>> but you left out the critical information as to whether this would
>> always combine elements
>> that were adjacent in the list. If the groups are adjacent then you
>> could potentially devise
>> a strategy that avoids holding all the trees in memory at the same time.
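
To make the adjacency point concrete, here is a minimal sketch under the assumption that items with the same key really are consecutive in the input. The class name AdjacentGroups and the key function are hypothetical and the code uses only the Java standard library, not Saxon. It yields one group at a time, so only the current group plus one look-ahead item is held in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical helper: lazily yields runs of consecutive items sharing a key. */
final class AdjacentGroups<T, K> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final Function<? super T, ? extends K> keyOf;
    private T pending;          // first item of the next group, already read
    private boolean hasPending; // false once the source is exhausted

    AdjacentGroups(Iterator<T> source, Function<? super T, ? extends K> keyOf) {
        this.source = source;
        this.keyOf = keyOf;
        advance();
    }

    /** Reads one look-ahead item from the source, if any. */
    private void advance() {
        hasPending = source.hasNext();
        pending = hasPending ? source.next() : null;
    }

    @Override public boolean hasNext() {
        return hasPending;
    }

    @Override public List<T> next() {
        if (!hasPending) throw new NoSuchElementException();
        K key = keyOf.apply(pending);
        List<T> group = new ArrayList<>();
        // Collect items until the key changes; earlier groups are not retained.
        do {
            group.add(pending);
            advance();
        } while (hasPending && Objects.equals(key, keyOf.apply(pending)));
        return group;
    }
}
```

If equal keys can occur in non-adjacent positions, this splits them into separate groups, so the input would first have to be sorted or grouped in memory instead.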
>>
>> Supplying a sequence of trees as input to Saxon grouping is not a
>> problem. Using the s9api interface,
>> you can use a DocumentBuilder to build each tree as an XdmNode, then
>> a sequence can be constructed using
>> the constructor public XdmValue(Iterable<XdmItem> items), and then
>> this XdmValue can be passed as a parameter
>> to an XsltTransformer, and a reference to the parameter can be used
>> in <xsl:for-each-group select="$param">.
>> Using this approach the whole structure will be held in memory, but
>> there are ways of avoiding that by going
>> to lower-level interfaces.
>>
>> Michael Kay
>> Saxonica
>>
>>
>>> It's a special case, where two or more revisions of one article are
>>> made
>>> at the same time (in the same second). I would have to look through the
>>> XML file with BaseX or Saxon, but I'm pretty sure such cases exist
>>> somewhere in the huge file (as of now I've only extracted a small
>>> subset
>>> of articles and replaced WikiText inside text-elements with XML).
>>>
>>> The whole task is to sort the revisions so that they can be shredded
>>> into our XML data storage system (the deltas of the revisions), which
>>> has the capability to store and retrieve revisions compactly and
>>> efficiently. In
>>> parallel I'm currently writing the import of a sorted XML file.
>>>
>>> My main task (master project and thesis) is or will be the
>>> visualization
>>> of temporal tree structured data to gain further insights into the
>>> evolution of the data, which are otherwise very difficult to realize.
>>>
>>> regards,
>>> Johannes
>>>
>>
>>
>> _______________________________________________________________________
>>
>> XML-DEV is a publicly archived, unmoderated list hosted by OASIS
>> to support XML implementation and development. To minimize
>> spam in the archives, you must subscribe before posting.
>>
>> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
>> Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
>> subscribe: xml-dev-subscribe@lists.xml.org
>> List archive: http://lists.xml.org/archives/xml-dev/
>> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
>