I used XSLT streaming to generate a training corpus for English-Korean language translation

Hi Folks,

 

The OpenStreetMap XML file for South Korea

http://downloads.cloudmade.com/asia/eastern_asia/south_korea/south_korea.osm.bz2

is quite interesting. Each <node> element contains data about a thing (airport, university, office, bus stop, etc.) in South Korea. A <node> element may contain a <tag> element that gives the name of the thing in English and another <tag> element that gives its name in Korean. For example, this <node> element contains the name of an airport in English and Korean:

 

<node lat="37.5582" lon="126.7906">
    <tag k="name:en" v="Gimpo International Airport"/>
    <tag k="name:ko" v="김포국제공항"/>
</node>

 

The English name is identified by @k="name:en" and the Korean name is identified by @k="name:ko" (@k means ‘key’ and @v means ‘value’).
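As a small illustration of the key/value convention, here is a minimal Python sketch that reads the <node> element shown above and looks up both names by key. The helper name tags_of is made up for this example, and the XML is inlined as a string rather than read from the real file.

```python
import xml.etree.ElementTree as ET

# The <node> element from the message, inlined for illustration.
NODE_XML = """
<node lat="37.5582" lon="126.7906">
    <tag k="name:en" v="Gimpo International Airport"/>
    <tag k="name:ko" v="김포국제공항"/>
</node>
"""

def tags_of(node):
    """Return the <tag> children of a node as a {key: value} dict.

    Hypothetical helper: @k becomes the dict key, @v the dict value.
    """
    return {tag.get("k"): tag.get("v") for tag in node.findall("tag")}

node = ET.fromstring(NODE_XML)
tags = tags_of(node)

# A node only yields a translation pair when both keys are present.
english = tags.get("name:en")
korean = tags.get("name:ko")
```

Using dict.get means a node without one of the two keys simply yields None for that name instead of raising an error.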

 

These English-Korean name pairs can be collected and then used as training data for an English-Korean translation tool.

 

The OpenStreetMap XML file is quite large (464 MB), so I elected to extract all the English-Korean pairs using XSLT streaming, which processes the document without loading all of it into memory. I wrote an XSLT streaming program (see below) and ran it. It generated over 30,000 English-Korean pairs. Here is a sample of the output:

 

<English-Korean>
    <translation>
        <English>Gimpo International Airport</English>
        <Korean>김포국제공항</Korean>
    </translation>
    <translation>
        <English>Incheon International Airport</English>
        <Korean>인천국제공항</Korean>
    </translation>
    <translation>
        <English>South Korea</English>
        <Korean>대한민국</Korean>
    </translation>
    <translation>
        <English>Jeju-si</English>
        <Korean>제주시</Korean>
    </translation>
    <translation>
        <English>Munui</English>
        <Korean>문의</Korean>
    </translation>
    <translation>
        <English>Bukcheon Junction</English>
        <Korean>북천교차로</Korean>
    </translation>

    …

    <translation>
        <English>Odong Islet</English>
        <Korean>오동도</Korean>
    </translation>
    <translation>
        <English>To Sinwon, Hapcheon, Chunjeon</English>
        <Korean>신원, 합천, 춘전방면</Korean>
    </translation>
</English-Korean>
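Many translation toolkits take parallel text as plain lines rather than XML, so the output may need flattening before it is used for training. The following is only a sketch under that assumption: the function name corpus_to_tsv and the tab-separated layout are choices made for this example, not something the original program produces, and the sample string holds just two of the pairs shown above.

```python
import xml.etree.ElementTree as ET

# Two pairs copied from the sample output, inlined for illustration.
SAMPLE_OUTPUT = """
<English-Korean>
    <translation>
        <English>Gimpo International Airport</English>
        <Korean>김포국제공항</Korean>
    </translation>
    <translation>
        <English>Jeju-si</English>
        <Korean>제주시</Korean>
    </translation>
</English-Korean>
"""

def corpus_to_tsv(xml_text):
    """Turn <translation> elements into 'English<TAB>Korean' lines.

    Hypothetical helper; pairs with an empty side are skipped.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for translation in root.findall("translation"):
        english = (translation.findtext("English") or "").strip()
        korean = (translation.findtext("Korean") or "").strip()
        if english and korean:
            lines.append(f"{english}\t{korean}")
    return "\n".join(lines)

tsv = corpus_to_tsv(SAMPLE_OUTPUT)
```

The result file here is small enough (roughly 30,000 pairs) to parse in one go; only the 464 MB input needs streaming.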

Here is the streaming XSLT program:

-------------------------------------------------------

   generate-training-corpus.xsl

-------------------------------------------------------

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="#all"
                version="3.0">

    <xsl:output method="xml" />

    <xsl:template match="/">
        <!-- Read the large input in streaming mode. Note: xsl:stream comes from the
             XSLT 3.0 working drafts; the final Recommendation renamed it to
             xsl:source-document with streamable="yes". -->
        <xsl:stream href="../huge-file-Korea/south_korea.xml">
            <English-Korean>
                <xsl:for-each select="osm">
                    <!-- Visit the <node> children one at a time -->
                    <xsl:iterate select="node">
                        <!-- Take a snapshot of the current node so it can be navigated freely -->
                        <xsl:variable name="thisNode" select="copy-of(.)"/>
                        <!-- Keep only nodes that have both an English and a Korean name -->
                        <xsl:if test="$thisNode[tag[@k eq 'name:en'] and tag[@k eq 'name:ko']]">
                            <translation>
                                <English><xsl:value-of select="$thisNode/tag[@k eq 'name:en']/@v" /></English>
                                <Korean><xsl:value-of select="$thisNode/tag[@k eq 'name:ko']/@v" /></Korean>
                            </translation>
                            <xsl:next-iteration />
                        </xsl:if>
                    </xsl:iterate>
                </xsl:for-each>
            </English-Korean>
        </xsl:stream>
    </xsl:template>

</xsl:stylesheet>
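For readers without a streaming XSLT processor, the same idea can be sketched in Python with the standard library's incremental parser. This is not the program that produced the results above, just a rough analogue: the function name iter_name_pairs and the tiny inline sample (including the node that lacks a Korean name) are made up for illustration. Unlike the stylesheet, which selects only osm/node, this sketch matches <node> elements at any depth.

```python
import io
import xml.etree.ElementTree as ET

def iter_name_pairs(source):
    """Yield (english, korean) for each <node> that has both name tags.

    `source` is a filename or a binary file object. Finished nodes are
    discarded as parsing proceeds, so memory use stays small.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "node":
            continue
        tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
        if "name:en" in tags and "name:ko" in tags:
            yield tags["name:en"], tags["name:ko"]
        root.clear()  # drop the nodes already processed

# Made-up miniature input: one node with both names, one with English only.
SAMPLE = """<osm>
    <node lat="37.5582" lon="126.7906">
        <tag k="name:en" v="Gimpo International Airport"/>
        <tag k="name:ko" v="김포국제공항"/>
    </node>
    <node>
        <tag k="name:en" v="Example Without Korean Name"/>
    </node>
</osm>""".encode("utf-8")

pairs = list(iter_name_pairs(io.BytesIO(SAMPLE)))
```

Because the function accepts any binary file object, it should also be possible to pass it the result of bz2.open on the compressed download instead of unpacking the file first, though that has not been tried here.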

 

/Roger


