Re: [xml-dev] I used XSLT streaming to generate a training corpus for English-Korean language translation
- From: Dimitre Novatchev <dnovatchev@gmail.com>
- To: "Costello, Roger L." <costello@mitre.org>
- Date: Sat, 21 Sep 2013 09:46:49 -0700
Great example, Roger.
Maybe you could try processing bigger files and other kinds of processing.
One example is periodically comparing the current file with the
previous one to determine all the changes that occurred during that
period, and then producing change documents by region.
This involves synchronized (double) streaming and would be both
challenging and instructive.
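
The comparison of two snapshots can be sketched as a merge-join over two incrementally parsed files. This is only an illustration in Python, not XSLT streaming, and it rests on assumptions the message does not state: every <node> carries a numeric id attribute, both files are sorted by ascending id, and a node counts as changed when its <tag> values differ. The names stream_nodes and diff_snapshots are hypothetical, and grouping the changes by region is left out because the message does not define regions.

```python
import io
import xml.etree.ElementTree as ET


def stream_nodes(source):
    """Yield (id, tags) for each <node>, reading the input incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tag":
            continue  # keep <tag> children intact until their parent ends
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            yield int(elem.get("id")), tags  # assumes a numeric id attribute
        elem.clear()  # discard content that has already been handled


def diff_snapshots(previous, current):
    """Merge-join two id-sorted node streams and yield (kind, id) changes."""
    old_nodes, new_nodes = stream_nodes(previous), stream_nodes(current)
    old, new = next(old_nodes, None), next(new_nodes, None)
    while old is not None or new is not None:
        if new is None or (old is not None and old[0] < new[0]):
            yield "deleted", old[0]
            old = next(old_nodes, None)
        elif old is None or new[0] < old[0]:
            yield "added", new[0]
            new = next(new_nodes, None)
        else:
            if old[1] != new[1]:
                yield "modified", new[0]
            old, new = next(old_nodes, None), next(new_nodes, None)


# Two tiny made-up snapshots standing in for the previous and current files.
previous = io.BytesIO(b"""<osm>
  <node id="1"><tag k="name:en" v="Old Stop"/></node>
  <node id="2"><tag k="name:en" v="Munui"/></node>
  <node id="3"><tag k="name:en" v="Jeju-si"/></node>
</osm>""")
current = io.BytesIO(b"""<osm>
  <node id="2"><tag k="name:en" v="Munui Village"/></node>
  <node id="3"><tag k="name:en" v="Jeju-si"/></node>
  <node id="4"><tag k="name:en" v="New Stop"/></node>
</osm>""")

changes = list(diff_snapshots(previous, current))
print(changes)
```

Only one pending node from each file is held at a time, which is the property a synchronized double stream would need.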
Cheers,
Dimitre
On Sat, Sep 21, 2013 at 8:34 AM, Costello, Roger L. <costello@mitre.org> wrote:
> Hi Folks,
>
>
>
> The Open Street Map XML file for South Korea
>
>
>
> http://downloads.cloudmade.com/asia/eastern_asia/south_korea/south_korea.osm.bz2
>
> is quite interesting. Each <node> element contains data about a thing
> (airport, university, office, bus stop, etc.) in South Korea. Within each
> <node> element is a <tag> element that shows the name of the thing in
> English and another <tag> element that shows its name in Korean. For
> example, this <node> element contains the name of an airport in English and
> Korean:
>
>
>
> <node lat="37.5582" lon="126.7906">
>   <tag k="name:en" v="Gimpo International Airport"/>
>   <tag k="name:ko" v="김포국제공항"/>
> </node>
>
>
>
> The English name is identified by @k="name:en" and the Korean name is
> identified by @k="name:ko" (@k means ‘key’ and @v means ‘value’).
>
>
>
> These pairs of values may be collected and then used to train an
> English-Korean language translator tool.
>
>
>
> The Open Street Map XML file is quite large -- 464 MB -- so I elected to
> extract all the English-Korean pairs using XSLT streaming. I wrote an XSLT
> streaming program (see below) and ran it. It generated over 30,000
> English-Korean pairs. Here is a sample of the output:
>
>
>
> <English-Korean>
>   <translation>
>     <English>Gimpo International Airport</English>
>     <Korean>김포국제공항</Korean>
>   </translation>
>   <translation>
>     <English>Incheon International Airport</English>
>     <Korean>인천국제공항</Korean>
>   </translation>
>   <translation>
>     <English>South Korea</English>
>     <Korean>대한민국</Korean>
>   </translation>
>   <translation>
>     <English>Jeju-si</English>
>     <Korean>제주시</Korean>
>   </translation>
>   <translation>
>     <English>Munui</English>
>     <Korean>문의</Korean>
>   </translation>
>   <translation>
>     <English>Bukcheon Junction</English>
>     <Korean>북천교차로</Korean>
>   </translation>
>
>   …
>   <translation>
>     <English>Odong Islet</English>
>     <Korean>오동도</Korean>
>   </translation>
>   <translation>
>     <English>To Sinwon, Hapcheon, Chunjeon</English>
>     <Korean>신원, 합천, 춘전방면</Korean>
>   </translation>
> </English-Korean>
>
> Here is the streaming XSLT program:
>
> -------------------------------------------------------
>
> generate-training-corpus.xsl
>
> -------------------------------------------------------
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                 xmlns:xs="http://www.w3.org/2001/XMLSchema"
>                 exclude-result-prefixes="#all"
>                 version="3.0">
>
>   <xsl:output method="xml" />
>
>   <xsl:template match="/">
>     <xsl:stream href="../huge-file-Korea/south_korea.xml">
>       <English-Korean>
>         <xsl:for-each select="osm">
>           <xsl:iterate select="node">
>             <xsl:variable name="thisNode" select="copy-of(.)"/>
>             <xsl:if test="$thisNode[tag[@k eq 'name:en'] and
>                                     tag[@k eq 'name:ko']]">
>               <translation>
>                 <English><xsl:value-of
>                     select="$thisNode/tag[@k eq 'name:en']/@v" /></English>
>                 <Korean><xsl:value-of
>                     select="$thisNode/tag[@k eq 'name:ko']/@v" /></Korean>
>               </translation>
>               <xsl:next-iteration />
>             </xsl:if>
>           </xsl:iterate>
>         </xsl:for-each>
>       </English-Korean>
>     </xsl:stream>
>   </xsl:template>
>
> </xsl:stylesheet>
>
>
>
> /Roger
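
For readers without a streaming XSLT 3.0 processor, the extraction described above can be approximated with an incremental parse in Python. This is a sketch under stated assumptions, not the method used in the quoted message: it swaps XSLT streaming for xml.etree.ElementTree.iterparse, the function name extract_pairs is hypothetical, and the small inline document stands in for the real 464 MB file.

```python
import io
import xml.etree.ElementTree as ET


def extract_pairs(source):
    """Yield (english, korean) name pairs, handling one <node> at a time."""
    # iterparse reads the input incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tag":
            continue  # keep <tag> children intact until their parent ends
        if elem.tag == "node":
            # Collect this node's <tag k="..." v="..."/> children into a dict.
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if "name:en" in tags and "name:ko" in tags:
                yield tags["name:en"], tags["name:ko"]
        # Drop content already handled so memory stays well below a full parse.
        elem.clear()


# Stand-in input: the airport node from the message plus a node that has
# no Korean name and should therefore be skipped.
sample = io.BytesIO("""<osm>
  <node lat="37.5582" lon="126.7906">
    <tag k="name:en" v="Gimpo International Airport"/>
    <tag k="name:ko" v="김포국제공항"/>
  </node>
  <node lat="37.0" lon="127.0">
    <tag k="name:en" v="English Only"/>
  </node>
</osm>""".encode("utf-8"))

pairs = list(extract_pairs(sample))
print(pairs)
```

As in the stylesheet, each node is examined only once it is complete, and nodes that lack either name are skipped.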
--
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
To avoid situations in which you might make mistakes may be the
biggest mistake of all
------------------------------------
Quality means doing it right when no one is looking.
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play
-------------------------------------
Facts do not cease to exist because they are ignored.
-------------------------------------
Typing monkeys will write all Shakespeare's works in 200yrs.Will they
write all patents, too? :)
-------------------------------------
I finally figured out the only reason to be alive is to enjoy it.