Hi Folks,
The OpenStreetMap XML file for South Korea
http://downloads.cloudmade.com/asia/eastern_asia/south_korea/south_korea.osm.bz2
is quite interesting. Each <node> element contains data about a thing (airport, university, office, bus stop, etc.) in South Korea. A <node> element may contain a <tag> element that gives the name of the thing in English and another <tag> element that gives its name in Korean. For example, this <node> element contains the name of an airport in English and Korean:
<node lat="37.5582" lon="126.7906">
    <tag k="name:en" v="Gimpo International Airport"/>
    <tag k="name:ko" v="김포국제공항"/>
</node>
The English name is identified by @k="name:en" and the Korean name is identified by @k="name:ko" (@k means ‘key’ and @v means ‘value’).
These pairs of values may be collected and then used to train an English-Korean language translator tool.
The OpenStreetMap XML file is quite large (464 MB), so I elected to extract all the English-Korean pairs using XSLT streaming. I wrote an XSLT streaming program (see below) and ran it. It generated over 30,000 English-Korean pairs.
Here is a sample of the output:
<English-Korean>
    <translation>
        <English>Gimpo International Airport</English>
        <Korean>김포국제공항</Korean>
    </translation>
    <translation>
        <English>Incheon International Airport</English>
        <Korean>인천국제공항</Korean>
    </translation>
    <translation>
        <English>South Korea</English>
        <Korean>대한민국</Korean>
    </translation>
    <translation>
        <English>Jeju-si</English>
        <Korean>제주시</Korean>
    </translation>
    <translation>
        <English>Munui</English>
        <Korean>문의</Korean>
    </translation>
    <translation>
        <English>Bukcheon Junction</English>
        <Korean>북천교차로</Korean>
    </translation>
    …
    <translation>
        <English>Odong Islet</English>
        <Korean>오동도</Korean>
    </translation>
    <translation>
        <English>To Sinwon, Hapcheon, Chunjeon</English>
        <Korean>신원, 합천, 춘전방면</Korean>
    </translation>
</English-Korean>
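Output in this shape is easy to flatten before handing it to a training tool. Here is a minimal sketch of that step, assuming the tool wants one pair per line with a tab between the English and the Korean. The tab-separated layout is my assumption, not something any particular tool requires. The output file is small, so this stylesheet does not need streaming.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

    <!-- Plain text out: one English/Korean pair per line -->
    <xsl:output method="text" encoding="UTF-8"/>

    <xsl:template match="/English-Korean">
        <xsl:for-each select="translation">
            <!-- normalize-space() collapses any line breaks inside a name;
                 the separator is a tab character (assumed layout) -->
            <xsl:value-of select="normalize-space(English), normalize-space(Korean)"
                          separator="&#9;"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>
```

Run it with the <English-Korean> document as the source. Each <translation> becomes one line of text.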
Here is the streaming XSLT program:
-------------------------------------------------------
generate-training-corpus.xsl
-------------------------------------------------------
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="#all"
                version="3.0">

    <xsl:output method="xml" />

    <xsl:template match="/">
        <!-- Read the large OSM file as a stream instead of building it in memory -->
        <xsl:stream href="../huge-file-Korea/south_korea.xml">
            <English-Korean>
                <xsl:for-each select="osm">
                    <xsl:iterate select="node">
                        <!-- Snapshot the current <node> so its <tag> children can be read more than once -->
                        <xsl:variable name="thisNode" select="copy-of(.)"/>
                        <!-- Keep only nodes that carry both an English and a Korean name -->
                        <xsl:if test="$thisNode[tag[@k eq 'name:en'] and tag[@k eq 'name:ko']]">
                            <translation>
                                <English><xsl:value-of select="$thisNode/tag[@k eq 'name:en']/@v"/></English>
                                <Korean><xsl:value-of select="$thisNode/tag[@k eq 'name:ko']/@v"/></Korean>
                            </translation>
                            <xsl:next-iteration/>
                        </xsl:if>
                    </xsl:iterate>
                </xsl:for-each>
            </English-Korean>
        </xsl:stream>
    </xsl:template>

</xsl:stylesheet>
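A note for anyone running this on a processor that implements the final XSLT 3.0 Recommendation: xsl:stream was the instruction's name in the working drafts, and the Recommendation calls it xsl:source-document with streamable="yes". Here is a minimal sketch of the same extraction in that form. The named entry point (xsl:initial-template) and the osm/node/copy-of(.) shorthand are my changes to the program above. The href is copied from it unchanged. I have not timed this variant against the 464 MB file, so treat it as a starting point.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

    <xsl:output method="xml"/>

    <!-- Named entry point, so no principal source document is needed -->
    <xsl:template name="xsl:initial-template">
        <!-- Same relative path as in the program above -->
        <xsl:source-document href="../huge-file-Korea/south_korea.xml" streamable="yes">
            <English-Korean>
                <!-- Each <node> is copied once as the stream passes;
                     the copy is an ordinary in-memory tree that can be read freely -->
                <xsl:for-each select="osm/node/copy-of(.)">
                    <!-- Keep only nodes that carry both an English and a Korean name -->
                    <xsl:if test="tag[@k eq 'name:en'] and tag[@k eq 'name:ko']">
                        <translation>
                            <English><xsl:value-of select="tag[@k eq 'name:en']/@v"/></English>
                            <Korean><xsl:value-of select="tag[@k eq 'name:ko']/@v"/></Korean>
                        </translation>
                    </xsl:if>
                </xsl:for-each>
            </English-Korean>
        </xsl:source-document>
    </xsl:template>

</xsl:stylesheet>
```

Only one <node> is held in memory at a time, so memory use should stay roughly flat regardless of the size of the input file.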
/Roger