Re: [xml-dev] I used XSLT streaming to generate a training corpus for English-Korean language translation

Great example, Roger,

Maybe you could try bigger files and different kinds of processing.

One example is periodic processing of the current file and the
previous version of that file: determine all of the changes that
occurred during the period, then produce change documents by region.

This involves synchronized (double) streaming and would be both
challenging and instructive.
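
To make the idea concrete, here is a rough sketch of the synchronized
pass. It is written in Python with ElementTree's iterparse rather than
XSLT streaming, so it shows the shape of the algorithm, not a streaming
stylesheet. It assumes that every <node> carries an id attribute (the
sample in the quoted message does not show one) and that both snapshots
list their nodes in ascending id order. The region_of helper, the
1-degree grid it uses, and the sample data other than the Gimpo
coordinates are made up for illustration.

```python
import io
import math
import xml.etree.ElementTree as ET
from collections import defaultdict


def stream_nodes(source):
    """Yield (id, lat, lon, tags) for each <node> as its end tag is parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            yield int(elem.get("id")), elem.get("lat"), elem.get("lon"), tags
            elem.clear()  # release this node's attributes and children


def region_of(lat, lon):
    """Hypothetical region key: the 1-degree grid cell containing the point."""
    return f"{math.floor(float(lat))},{math.floor(float(lon))}"


def changes_by_region(old_source, new_source):
    """Walk two id-sorted snapshots in lockstep and group the changes by region."""
    old_nodes, new_nodes = stream_nodes(old_source), stream_nodes(new_source)
    old, new = next(old_nodes, None), next(new_nodes, None)
    changes = defaultdict(list)
    while old is not None or new is not None:
        if new is None or (old is not None and old[0] < new[0]):
            # The id exists only in the old snapshot.
            changes[region_of(old[1], old[2])].append(("removed", old[0]))
            old = next(old_nodes, None)
        elif old is None or new[0] < old[0]:
            # The id exists only in the new snapshot.
            changes[region_of(new[1], new[2])].append(("added", new[0]))
            new = next(new_nodes, None)
        else:
            # Same id in both snapshots: report it only if something differs.
            if old[1:] != new[1:]:
                changes[region_of(new[1], new[2])].append(("modified", new[0]))
            old, new = next(old_nodes, None), next(new_nodes, None)
    return dict(changes)


# Tiny illustrative snapshots; the ids and most values are invented.
OLD = """<osm>
  <node id="1" lat="37.5582" lon="126.7906"><tag k="name:en" v="Gimpo Airport"/></node>
  <node id="2" lat="33.5" lon="126.5"><tag k="name:en" v="Jeju-si"/></node>
</osm>"""
NEW = """<osm>
  <node id="1" lat="37.5582" lon="126.7906"><tag k="name:en" v="Gimpo International Airport"/></node>
  <node id="3" lat="37.4" lon="126.4"><tag k="name:en" v="Incheon International Airport"/></node>
</osm>"""

result = changes_by_region(io.BytesIO(OLD.encode("utf-8")),
                           io.BytesIO(NEW.encode("utf-8")))
for region, items in sorted(result.items()):
    print(region, items)
```

Only one node from each file is held at a time, which is the point of
the exercise. If the two files are not sorted on the same key, the
lockstep walk breaks down and some kind of sort or index is needed
first.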


Cheers,
Dimitre


On Sat, Sep 21, 2013 at 8:34 AM, Costello, Roger L. <costello@mitre.org> wrote:
> Hi Folks,
>
>
>
> The Open Street Map XML file for South Korea
>
>
>
> http://downloads.cloudmade.com/asia/eastern_asia/south_korea/south_korea.osm.bz2
>
> is quite interesting. Each <node> element contains data about a thing
> (airport, university, office, bus stop, etc.) in South Korea. Within each
> <node> element is a <tag> element that shows the name of the thing in
> English and another <tag> element that shows its name in Korean. For
> example, this <node> element contains the name of an airport in English and
> Korean:
>
>
>
> <node lat="37.5582" lon="126.7906">
>     <tag k="name:en" v="Gimpo International Airport"/>
>     <tag k="name:ko" v="김포국제공항"/>
> </node>
>
>
>
> The English name is identified by @k="name:en" and the Korean name is
> identified by @k="name:ko" (@k means ‘key’ and @v means ‘value’).
>
>
>
> These pairs of values may be collected and then used to train an
> English-Korean language translator tool.
>
>
>
> The Open Street Map XML file is quite large -- 464 MB -- so I elected to
> extract all the English-Korean pairs using XSLT streaming. I wrote an XSLT
> streaming program (see below) and ran it. It generated over 30,000
> English-Korean pairs. Here is a sample of the output:
>
>
>
> <English-Korean>
>     <translation>
>         <English>Gimpo International Airport</English>
>         <Korean>김포국제공항</Korean>
>     </translation>
>     <translation>
>         <English>Incheon International Airport</English>
>         <Korean>인천국제공항</Korean>
>     </translation>
>     <translation>
>         <English>South Korea</English>
>         <Korean>대한민국</Korean>
>     </translation>
>     <translation>
>         <English>Jeju-si</English>
>         <Korean>제주시</Korean>
>     </translation>
>     <translation>
>         <English>Munui</English>
>         <Korean>문의</Korean>
>     </translation>
>     <translation>
>         <English>Bukcheon Junction</English>
>         <Korean>북천교차로</Korean>
>     </translation>
>
>     …
>     <translation>
>         <English>Odong Islet</English>
>         <Korean>오동도</Korean>
>     </translation>
>     <translation>
>         <English>To Sinwon, Hapcheon, Chunjeon</English>
>         <Korean>신원, 합천, 춘전방면</Korean>
>     </translation>
> </English-Korean>
>
> Here is the streaming XSLT program:
>
> -------------------------------------------------------
>
>    generate-training-corpus.xsl
>
> -------------------------------------------------------
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                 xmlns:xs="http://www.w3.org/2001/XMLSchema"
>                 exclude-result-prefixes="#all"
>                 version="3.0">
>
>     <xsl:output method="xml" />
>
>     <xsl:template match="/">
>         <xsl:stream href="../huge-file-Korea/south_korea.xml">
>             <English-Korean>
>                 <xsl:for-each select="osm">
>                     <xsl:iterate select="node">
>                         <xsl:variable name="thisNode" select="copy-of(.)"/>
>                         <xsl:if test="$thisNode[tag[@k eq 'name:en'] and
> tag[@k eq 'name:ko']]">
>                             <translation>
>                                 <English><xsl:value-of
> select="$thisNode/tag[@k eq 'name:en']/@v" /></English>
>                                 <Korean><xsl:value-of
> select="$thisNode/tag[@k eq 'name:ko']/@v" /></Korean>
>                             </translation>
>                             <xsl:next-iteration />
>                         </xsl:if>
>                     </xsl:iterate>
>                 </xsl:for-each>
>             </English-Korean>
>         </xsl:stream>
>     </xsl:template>
>
> </xsl:stylesheet>
>
>
>
> /Roger



-- 
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
To avoid situations in which you might make mistakes may be the
biggest mistake of all
------------------------------------
Quality means doing it right when no one is looking.
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play
-------------------------------------
Facts do not cease to exist because they are ignored.
-------------------------------------
Typing monkeys will write all Shakespeare's works in 200 yrs. Will they
write all patents, too? :)
-------------------------------------
I finally figured out the only reason to be alive is to enjoy it.



Copyright 1993-2007 XML.org. This site is hosted by OASIS