Re: [xml-dev] combining XMLEvent lists
- From: Michael Kay <mike@saxonica.com>
- To: "Johannes.Lichtenberger" <Johannes.Lichtenberger@uni-konstanz.de>
- Date: Tue, 28 Sep 2010 18:17:14 +0100
On 28/09/2010 4:13 PM, Johannes.Lichtenberger wrote:
> On 09/28/2010 04:33 PM, Michael Kay wrote:
>> Sounds fascinating, and I wish I had time to get involved. It would
>> certainly be elegant if you could have both the productivity of writing
>> this declaratively in XSLT and the performance of running it on Hadoop
>> MapReduce. Intrinsically, the two seem to fit together hand in glove,
>> but I suspect some engineering effort is needed to make it work.
> Hello Michael,
>
> I think it would be too complicated to achieve the desired grouping with
> Java. Do you think it makes sense to first serialize the results and
> then use Saxon's XSLT 2.0 processor to achieve the results? Or do you
> have any direct input from a List of XMLEvents to Saxon's XSLT
> processor? I assume it reads XML-data from an InputSource or some kind
> of a stream.
I'm not sure whether "XMLEvent" is something I'm expected to know about.
You said earlier:

  "I've got an Iterator with Lists (Java) out of XMLEvents, which are
  serialized fragments"

so I assume they are just strings containing unparsed XML. That's not
going to be a particularly efficient representation for processing, so
the first step will be to parse each one to a tree (for example, a Saxon
TinyTree).
You then said:

  "I want to find combine Lists which have the same page id and the same
  revision timestamp"

but you left out the critical information as to whether this would
always combine elements that were adjacent in the list. If the groups
are adjacent, then you could potentially devise a strategy that avoids
holding all the trees in memory at the same time.
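
To illustrate the adjacent case, here is a minimal sketch in plain Java. It is
only an assumption about how such a strategy might look, not something taken
from Saxon: the Revision class, its fields, and the groupAdjacent method are
hypothetical names standing in for your parsed fragments. The input is read
once, and only the current run of matching items is buffered before it is
handed to a callback.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class AdjacentGrouping {

    /** Hypothetical stand-in for one parsed revision fragment. */
    public static final class Revision {
        public final String pageId;
        public final String timestamp;
        public final String content;

        public Revision(String pageId, String timestamp, String content) {
            this.pageId = pageId;
            this.timestamp = timestamp;
            this.content = content;
        }

        /** Two revisions belong together if page id and timestamp both match. */
        public boolean sameGroupAs(Revision other) {
            return pageId.equals(other.pageId) && timestamp.equals(other.timestamp);
        }
    }

    /**
     * Walks the input once and passes each run of adjacent revisions with the
     * same key to the sink. Only the current run is held in memory, so this
     * works only if matching revisions are always adjacent in the input.
     */
    public static void groupAdjacent(Iterator<Revision> input,
                                     Consumer<List<Revision>> sink) {
        List<Revision> current = new ArrayList<>();
        while (input.hasNext()) {
            Revision next = input.next();
            if (!current.isEmpty() && !current.get(0).sameGroupAs(next)) {
                sink.accept(current);          // the run has ended, hand it over
                current = new ArrayList<>();   // and start a fresh one
            }
            current.add(next);
        }
        if (!current.isEmpty()) {
            sink.accept(current);              // flush the final run
        }
    }

    public static void main(String[] args) {
        // Made-up sample data, purely for illustration.
        List<Revision> revisions = Arrays.asList(
                new Revision("1", "2010-09-28T16:13:00Z", "<rev>a</rev>"),
                new Revision("1", "2010-09-28T16:13:00Z", "<rev>b</rev>"),
                new Revision("2", "2010-09-28T16:13:00Z", "<rev>c</rev>"));
        groupAdjacent(revisions.iterator(), group ->
                System.out.println(group.get(0).pageId + " @ "
                        + group.get(0).timestamp + ": "
                        + group.size() + " fragment(s)"));
    }
}
```

If the matching items are not guaranteed to be adjacent, this would split one
logical group into several, and everything would need to be held (or sorted)
first.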
Supplying a sequence of trees as input to Saxon grouping is not a
problem. Using the s9api interface, you can use a DocumentBuilder to
build each tree as an XdmNode. A sequence can then be constructed using
the constructor public XdmValue(Iterable<XdmItem> items), and this
XdmValue can be passed as a parameter to an XsltTransformer. A reference
to the parameter can then be used in
<xsl:for-each-group select="$param">.

Using this approach the whole structure will be held in memory, but
there are ways of avoiding that by going to lower-level interfaces.
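
The s9api steps described above might look roughly like the following sketch.
It assumes Saxon 9.3 or later is on the classpath (for
Processor.newSerializer). The class name GroupFragments, the page-id and
timestamp attributes, the stylesheet, and the sample fragments are all
invented for illustration; only the DocumentBuilder / XdmValue /
XsltTransformer calls correspond to what is described here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.DocumentBuilder;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.QName;
import net.sf.saxon.s9api.SaxonApiException;
import net.sf.saxon.s9api.Serializer;
import net.sf.saxon.s9api.XdmItem;
import net.sf.saxon.s9api.XdmNode;
import net.sf.saxon.s9api.XdmValue;
import net.sf.saxon.s9api.XsltExecutable;
import net.sf.saxon.s9api.XsltTransformer;

public class GroupFragments {

    // Illustrative stylesheet: groups the supplied documents by the
    // (assumed) page-id and timestamp attributes of their root elements.
    static final String STYLESHEET =
          "<xsl:stylesheet version='2.0'"
        + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "  <xsl:param name='param' as='document-node()*'/>"
        + "  <xsl:template name='main'>"
        + "    <grouped>"
        + "      <xsl:for-each-group select='$param'"
        + "          group-by=\"concat(*/@page-id, '|', */@timestamp)\">"
        + "        <group key='{current-grouping-key()}'>"
        + "          <xsl:copy-of select='current-group()/*'/>"
        + "        </group>"
        + "      </xsl:for-each-group>"
        + "    </grouped>"
        + "  </xsl:template>"
        + "</xsl:stylesheet>";

    /** Parses each fragment to a tree and runs the grouping stylesheet. */
    public static String group(List<String> fragments) throws SaxonApiException {
        Processor proc = new Processor(false);
        DocumentBuilder builder = proc.newDocumentBuilder();

        // Step 1: parse every serialized fragment into an XdmNode.
        List<XdmItem> trees = new ArrayList<>();
        for (String xml : fragments) {
            XdmNode doc = builder.build(new StreamSource(new StringReader(xml)));
            trees.add(doc);
        }

        // Step 2: compile the stylesheet and pass the sequence as $param.
        XsltExecutable exec = proc.newXsltCompiler()
                .compile(new StreamSource(new StringReader(STYLESHEET)));
        XsltTransformer transformer = exec.load();
        transformer.setInitialTemplate(new QName("main"));
        transformer.setParameter(new QName("param"), new XdmValue(trees));

        // Step 3: serialize the result to a string.
        StringWriter result = new StringWriter();
        Serializer out = proc.newSerializer(result);
        out.setOutputProperty(Serializer.Property.OMIT_XML_DECLARATION, "yes");
        transformer.setDestination(out);
        transformer.transform();
        return result.toString();
    }

    public static void main(String[] args) throws SaxonApiException {
        // Made-up fragments, purely for illustration.
        System.out.println(group(Arrays.asList(
                "<revision page-id='1' timestamp='2010-09-28T16:13:00Z'>a</revision>",
                "<revision page-id='2' timestamp='2010-09-28T16:13:00Z'>b</revision>",
                "<revision page-id='1' timestamp='2010-09-28T16:13:00Z'>c</revision>")));
    }
}
```

Because group-by is used rather than group-adjacent, the matching fragments do
not need to be next to each other, at the cost of keeping every tree in memory.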
Michael Kay
Saxonica
> It's a special case, where two or more revisions of one article are made
> at the same time (in the same second). I would have to look through the
> XML file with BaseX or Saxon, but I'm pretty sure such cases exist
> somewhere in the huge file (as of now I've only extracted a small subset
> of articles and replaced WikiText inside text-elements with XML).
>
> The whole task is to sort the revisions to shredder it into our XML
> datastorage system (the deltas of the revisions), which has the
> capability to store and retrieve revisions compactly and efficiently. In
> parallel I'm currently writing the import of a sorted XML file.
>
> My main task (master project and thesis) is or will be the visualization
> of temporal tree structured data to gain further insights into the
> evolution of the data, which are otherwise very difficult to realize.
>
> regards,
> Johannes
>