Re: [xml-dev] I used XSLT streaming to generate a training corpus for English-Korean language translation

Hi Roger,

That is an interesting use case you worked on.

I had to modify your (XSLT 3.0) stylesheet again so that DataPower could
process it in streaming mode:
$ curl --data-binary @south_korea.osm http://firestar:2111 > out.xml
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  455M    0 1985k  100  453M   180k  41.2M  0:00:10  0:00:10 --:--:-- 46.9M
$
$ xpath++ "count(/English-Korean/translation)" out.xml
15568
$

How long did the transformation take on your system?

> It generated over 30,000 English-Korean pairs.
>
As can be seen above, I got only 15568 translation nodes. What could explain
the difference?
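
One way to narrow this down would be to count the qualifying elements independently of either XSLT processor. The following is only a minimal sketch, assuming Python 3 with its standard library and the usual OSM layout in which <node>, <way>, and <relation> elements are direct children of <osm> and carry <tag k="..." v="..."/> children. The function name count_bilingual and the SAMPLE data are made up for illustration. The sketch only reports, per element type, how many elements carry both name tags; it does not by itself explain the discrepancy.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny illustrative input (hypothetical data, not taken from the real file).
SAMPLE = b"""<osm>
<node lat="1" lon="2"><tag k="name:en" v="A"/><tag k="name:ko" v="B"/></node>
<node lat="3" lon="4"><tag k="name:en" v="C"/></node>
<way><nd ref="1"/><tag k="name:en" v="D"/><tag k="name:ko" v="E"/></way>
</osm>"""

def count_bilingual(source):
    """Count direct children of the root that have both name:en and name:ko tags.

    `source` is a file name or a binary file object. The input is read
    incrementally, so memory use stays bounded for very large files.
    """
    counts = Counter()
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)      # first event: start of the root element
    depth = 0                    # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:           # a direct child of the root has just closed
            keys = {t.get("k") for t in elem.findall("tag")}
            if {"name:en", "name:ko"} <= keys:
                counts[elem.tag] += 1
            root.clear()         # discard children that were already processed
    return counts

print(count_bilingual(io.BytesIO(SAMPLE)))
```

For the real data the call would be count_bilingual("south_korea.osm"); comparing the per-type totals with the two translation counts may show whether the stylesheets select different sets of elements.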

These are the first and last entries from my run:

$ xpath++ "/English-Korean/translation[position() <= 3]" out.xml

-------------------------------------------------------------------------------
<translation><English>Gimpo International Airport</English><Korean>&#44608;&#54252;&#44397;&#51228;&#44277;&#54637;</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>Incheon International Airport</English><Korean>&#51064;&#52380;&#44397;&#51228;&#44277;&#54637;</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>South Korea</English><Korean>&#45824;&#54620;&#48124;&#44397;</Korean></translation>
$
$ xpath++ "/English-Korean/translation[position() >= last()-2]" out.xml

-------------------------------------------------------------------------------
<translation><English>Jung-ang-dong Rotary</English><Korean>&#51473;&#50521;&#46041;&#47196;&#53552;&#47532;</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>Odong Islet</English><Korean>&#50724;&#46041;&#46020;</Korean></translation>
-------------------------------------------------------------------------------
<translation><English>To Sinwon, Hapcheon, Chunjeon</English><Korean>&#49888;&#50896;, &#54633;&#52380;, &#52632;&#51204;&#48169;&#47732;</Korean></translation>
$


Best wishes,

Hermann Stamm-Wilbrandt
Level 3 support for XML Compiler team and Fixpack team lead
WebSphere DataPower SOA Appliances
https://www.ibm.com/developerworks/mydeveloperworks/blogs/HermannSW/
https://twitter.com/HermannSW/     http://www.stamm-wilbrandt.de/ce/
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chairwoman of the Supervisory Board: Martina Koederitz
Management: Dirk Wittkopp
Registered office: Boeblingen
Court of registration: Amtsgericht Stuttgart, HRB 243294


From:    "Costello, Roger L." <costello@mitre.org>
To:      "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date:    09/21/2013 05:36 PM
Subject: [xml-dev] I used XSLT streaming to generate a training corpus for English-Korean language translation





Hi Folks,

The Open Street Map XML file for South Korea
(http://downloads.cloudmade.com/asia/eastern_asia/south_korea/south_korea.osm.bz2)
is quite interesting. Each <node> element contains data about a thing
(airport, university, office, bus stop, etc.) in South Korea. Within each
<node> element is a <tag> element that shows the name of the thing in
English and another <tag> element that shows its name in Korean. For
example, this <node> element contains the name of an airport in English and
Korean:

<node lat="37.5582" lon="126.7906">
    <tag k="name:en" v="Gimpo International Airport"/>
    <tag k="name:ko" v="김포국제공항"/>
</node>

The English name is identified by @k="name:en" and the Korean name is
identified by @k="name:ko" (@k means ‘key’ and @v means ‘value’).

These pairs of values may be collected and then used to train an
English-Korean language translator tool.
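
To make the key/value convention concrete, here is a minimal sketch (not part of the original program) that pulls the pair out of the sample <node> above. It assumes Python 3 and its standard xml.etree.ElementTree module; the helper name name_pair is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample <node> element shown above.
NODE = """<node lat="37.5582" lon="126.7906">
    <tag k="name:en" v="Gimpo International Airport"/>
    <tag k="name:ko" v="김포국제공항"/>
</node>"""

def name_pair(node):
    """Return (english, korean) if the element has both name tags, else None."""
    # Map each tag's key (@k) to its value (@v).
    names = {t.get("k"): t.get("v") for t in node.findall("tag")}
    if "name:en" in names and "name:ko" in names:
        return names["name:en"], names["name:ko"]
    return None

print(name_pair(ET.fromstring(NODE)))
# prints ('Gimpo International Airport', '김포국제공항')
```

A node that lacks either tag yields None, so it would contribute nothing to the corpus.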

The Open Street Map XML file is quite large -- 464 MB -- so I elected to
extract all the English-Korean pairs using XSLT streaming. I wrote an XSLT
streaming program (see below) and ran it. It generated over 30,000
English-Korean pairs. Here is a sample of the output:

<English-Korean>
    <translation>
        <English>Gimpo International Airport</English>
        <Korean>김포국제공항</Korean>
    </translation>
    <translation>
        <English>Incheon International Airport</English>
        <Korean>인천국제공항</Korean>
    </translation>
    <translation>
        <English>South Korea</English>
        <Korean>대한민국</Korean>
    </translation>
    <translation>
        <English>Jeju-si</English>
        <Korean>제주시</Korean>
    </translation>
    <translation>
        <English>Munui</English>
        <Korean>문의</Korean>
    </translation>
    <translation>
        <English>Bukcheon Junction</English>
        <Korean>북천교차로</Korean>
    </translation>
    …
    <translation>
        <English>Odong Islet</English>
        <Korean>오동도</Korean>
    </translation>
    <translation>
        <English>To Sinwon, Hapcheon, Chunjeon</English>
        <Korean>신원, 합천, 춘전방면</Korean>
    </translation>
</English-Korean>

Here is the streaming XSLT program:
-------------------------------------------------------
   generate-training-corpus.xsl
-------------------------------------------------------
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="#all"
                version="3.0">

    <xsl:output method="xml" />

    <xsl:template match="/">
        <xsl:stream href="../huge-file-Korea/south_korea.xml">
            <English-Korean>
                <xsl:for-each select="osm">
                    <xsl:iterate select="node">
                        <xsl:variable name="thisNode" select="copy-of(.)"/>
                        <xsl:if test="$thisNode[tag[@k eq 'name:en'] and tag[@k eq 'name:ko']]">
                            <translation>
                                <English><xsl:value-of select="$thisNode/tag[@k eq 'name:en']/@v" /></English>
                                <Korean><xsl:value-of select="$thisNode/tag[@k eq 'name:ko']/@v" /></Korean>
                            </translation>
                            <xsl:next-iteration />
                        </xsl:if>
                    </xsl:iterate>
                </xsl:for-each>
            </English-Korean>
        </xsl:stream>
    </xsl:template>

</xsl:stylesheet>

/Roger



