Re: [xml-dev] I used XSLT streaming to generate a training corpus for English-Korean language translation
- From: Hermann Stamm-Wilbrandt <STAMMW@de.ibm.com>
- To: "Costello, Roger L." <costello@mitre.org>
- Date: Mon, 23 Sep 2013 18:35:09 +0200
Hi Roger,
that is an interesting use case.
I had to modify your (XSLT 3.0) stylesheet again so that DataPower could
process it in streaming mode:
$ curl --data-binary @south_korea.osm http://firestar:2111 > out.xml
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  455M    0 1985k  100  453M   180k  41.2M  0:00:10  0:00:10 --:--:-- 46.9M
$
$ xpath++ "count(/English-Korean/translation)" out.xml
15568
$
How long did the transformation take on your system?
> It generated over 30,000 English-Korean pairs.
>
As can be seen above, I got only 15568 translation nodes. What could
explain the difference?
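One way to narrow this down would be to count the qualifying nodes
independently of both XSLT processors. Below is a minimal sketch in Python,
assuming the input has the <osm>/<node>/<tag> layout you describe. The
function name extract_pairs and the inline sample are made up for
illustration, and I have not run this against the full south_korea.osm:

```python
import io
import xml.etree.ElementTree as ET


def extract_pairs(source):
    """Yield (english, korean) for every <node> that carries both name tags.

    `source` is a file name or a binary file object (hypothetical helper,
    not part of either stylesheet).
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "node":
            continue
        # Collect the k/v attributes of the <tag> children of this node.
        tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
        if "name:en" in tags and "name:ko" in tags:
            yield tags["name:en"], tags["name:ko"]
        # Drop the children of the finished node. The emptied <node>
        # elements stay attached to the root, which is tolerable for a sketch.
        elem.clear()


# Tiny made-up sample: one node with both names, one with English only.
SAMPLE = """<osm>
  <node lat="37.5582" lon="126.7906">
    <tag k="name:en" v="Gimpo International Airport"/>
    <tag k="name:ko" v="김포국제공항"/>
  </node>
  <node lat="0" lon="0">
    <tag k="name:en" v="English only"/>
  </node>
</osm>"""

pairs = list(extract_pairs(io.BytesIO(SAMPLE.encode("utf-8"))))
print(len(pairs))
```

For the real file, passing its path to extract_pairs and counting the
results should give a third number to compare against 15568 and your count
of over 30,000.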
These are the first and last entries from my run:
$ xpath++ "/English-Korean/translation[position() <= 3]" out.xml
-------------------------------------------------------------------------------
<translation><English>Gimpo International Airport</English><Korean>김포국제공항</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>Incheon International Airport</English><Korean>인천국제공항</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>South Korea</English><Korean>대한민국</Korean></translation>
$
$ xpath++ "/English-Korean/translation[position() >= last()-2]" out.xml
-------------------------------------------------------------------------------
<translation><English>Jung-ang-dong Rotary</English><Korean>중앙동로터리</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>Odong Islet</English><Korean>오동도</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>To Sinwon, Hapcheon, Chunjeon</English><Korean>신원, 합천, 춘전방면</Korean></translation>
$
Best wishes,
Hermann Stamm-Wilbrandt
Level 3 support for XML Compiler team and Fixpack team lead
WebSphere DataPower SOA Appliances
https://www.ibm.com/developerworks/mydeveloperworks/blogs/HermannSW/
https://twitter.com/HermannSW/ http://www.stamm-wilbrandt.de/ce/
From: "Costello, Roger L." <costello@mitre.org>
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date: 09/21/2013 05:36 PM
Subject: [xml-dev] I used XSLT streaming to generate a training corpus for English-Korean language translation
Hi Folks,
The Open Street Map XML file for South Korea
http://downloads.cloudmade.com/asia/eastern_asia/south_korea/south_korea.osm.bz2
is quite interesting. Each <node> element contains data about a thing
(airport, university, office, bus stop, etc.) in South Korea. Within each
<node> element is a <tag> element that shows the name of the thing in
English and another <tag> element that shows its name in Korean. For
example, this <node> element contains the name of an airport in English and
Korean:
<node lat="37.5582" lon="126.7906">
<tag k="name:en" v="Gimpo International Airport"/>
<tag k="name:ko" v="김포국제공항"/>
</node>
The English name is identified by @k="name:en" and the Korean name is
identified by @k="name:ko" (@k means 'key' and @v means 'value').
These pairs of values may be collected and then used to train an
English-Korean language translator tool.
The Open Street Map XML file is quite large -- 464 MB -- so I elected to
extract all the English-Korean pairs using XSLT streaming. I wrote an XSLT
streaming program (see below) and ran it. It generated over 30,000
English-Korean pairs. Here is a sample of the output:
<English-Korean>
    <translation>
        <English>Gimpo International Airport</English>
        <Korean>김포국제공항</Korean>
    </translation>
    <translation>
        <English>Incheon International Airport</English>
        <Korean>인천국제공항</Korean>
    </translation>
    <translation>
        <English>South Korea</English>
        <Korean>대한민국</Korean>
    </translation>
    <translation>
        <English>Jeju-si</English>
        <Korean>제주시</Korean>
    </translation>
    <translation>
        <English>Munui</English>
        <Korean>문의</Korean>
    </translation>
    <translation>
        <English>Bukcheon Junction</English>
        <Korean>북천교차로</Korean>
    </translation>
    ...
    <translation>
        <English>Odong Islet</English>
        <Korean>오동도</Korean>
    </translation>
    <translation>
        <English>To Sinwon, Hapcheon, Chunjeon</English>
        <Korean>신원, 합천, 춘전방면</Korean>
    </translation>
</English-Korean>
Here is the streaming XSLT program:
-------------------------------------------------------
generate-training-corpus.xsl
-------------------------------------------------------
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="#all"
                version="3.0">
    <xsl:output method="xml" />
    <xsl:template match="/">
        <xsl:stream href="../huge-file-Korea/south_korea.xml">
            <English-Korean>
                <xsl:for-each select="osm">
                    <xsl:iterate select="node">
                        <xsl:variable name="thisNode" select="copy-of(.)"/>
                        <xsl:if test="$thisNode[tag[@k eq 'name:en'] and
                                      tag[@k eq 'name:ko']]">
                            <translation>
                                <English><xsl:value-of select="$thisNode/tag[@k eq 'name:en']/@v" /></English>
                                <Korean><xsl:value-of select="$thisNode/tag[@k eq 'name:ko']/@v" /></Korean>
                            </translation>
                            <xsl:next-iteration />
                        </xsl:if>
                    </xsl:iterate>
                </xsl:for-each>
            </English-Korean>
        </xsl:stream>
    </xsl:template>
</xsl:stylesheet>
/Roger