xml-dev - Re: [xml-dev] equivalence of <i><b></b></i> and <b><i></i></b> et. al. ?

To: "Bullard, Claude L (Len)" <clbullar@ingr.com>
Subject: Re: [xml-dev] equivalence of <i><b></b></i> and <b><i></i></b> et. al. ?
From: Stuart A Yeates <stuart.yeates@computing-services.oxford.ac.uk>
Date: Mon, 23 Feb 2004 21:12:48 +0000
Cc: 'Stuart A Yeates' <stuart.yeates@computing-services.oxford.ac.uk>, xml-dev@lists.xml.org
In-reply-to: <15725CF6AFE2F34DB8A5B4770B7334EE03F9F3D2@hq1.pcmail.ingr.com>
References: <15725CF6AFE2F34DB8A5B4770B7334EE03F9F3D2@hq1.pcmail.ingr.com>
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040122 Debian/1.6-1

The tool builds Markov models based on the tags seen in training 
documents and then uses Viterbi search to insert the same tags in other 
documents. These are the same algorithms used to transform speech to 
text in speech recognition, but here they transform completely flat 
XML documents into marked-up XML documents.
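
As a rough illustration of the Viterbi step (this is not the tool's 
code; the function name, the two tags "author" and "title", the 
"Cap"/"lower" token features and all the probabilities are invented for 
the example), a first-order hidden Markov model over tags can be 
decoded like this:

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable sequence of states for the observations.

    Probabilities are combined in log space; `floor` stands in for
    zero or missing probabilities so that log() is always defined.
    """
    if not observations:
        return []

    def lp(p):
        return math.log(max(p, floor))

    # best[t][s] = (log-probability of the best path ending in s, previous state)
    best = [{
        s: (lp(start_p.get(s, 0)) + lp(emit_p[s].get(observations[0], 0)), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p[p].get(s, 0)), p) for p in states
            )
            column[s] = (score + lp(emit_p[s].get(obs, 0)), prev)
        best.append(column)

    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Toy model (all numbers are made up): decide which tokens of a flat
# bibliography entry belong inside an "author" tag and which inside "title".
states = ["author", "title"]
start_p = {"author": 0.9, "title": 0.1}
trans_p = {
    "author": {"author": 0.6, "title": 0.4},
    "title": {"author": 0.05, "title": 0.95},
}
emit_p = {
    "author": {"Cap": 0.9, "lower": 0.1},
    "title": {"Cap": 0.3, "lower": 0.7},
}
tags = viterbi(["Cap", "Cap", "lower", "lower"], states, start_p, trans_p, emit_p)
print(tags)
```

In a real model the transition and emission tables would be estimated 
from the tagged training documents rather than written by hand.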

The kinds of tags that can be inserted range from part-of-speech tags 
and word segmentation tags (very useful for languages which don't 
explicitly write their inter-word spaces) to tags making bibliography 
information explicit. Hierarchical / nested tags can be marked up.

One requirement is a quality set of training documents, with tags used 
consistently throughout. Without such a corpus, the Markov models are 
unable to discriminate between tags. Sequences such as <i><b></b></i> 
need to be marked up consistently to get good models.

Because the tool is very general and the applicable transformations are 
in the application language (of which the tool knows nothing), it seems 
unlikely that the tool can do much to make these consistent, unless 
someone else has already tackled the problem.
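
For a corpus where the application language is known, a separate 
pre-processing pass could put such nestings into one canonical order 
before training. The following is only a sketch under stated 
assumptions: the set COMMUTATIVE, the function names and the choice of 
alphabetical order are all invented here, and whether b, i and u really 
commute is a decision for the application language, not for XML.

```python
import xml.etree.ElementTree as ET

# Assumption: these inline tags may be nested in any order without
# changing the meaning in the application language.
COMMUTATIVE = {"b", "i", "u"}


def is_pure_wrapper(elem):
    """True if elem holds exactly one child element and no text around it."""
    return (
        len(elem) == 1
        and not (elem.text or "").strip()
        and not (elem[0].tail or "").strip()
    )


def canonicalise(elem):
    """Reorder chains of directly nested commutative tags alphabetically."""
    chain = [elem]
    while (
        chain[-1].tag in COMMUTATIVE
        and is_pure_wrapper(chain[-1])
        and chain[-1][0].tag in COMMUTATIVE
    ):
        chain.append(chain[-1][0])

    if len(chain) > 1:
        # Move each tag together with its attributes; text and tails stay put.
        pairs = sorted(((e.tag, dict(e.attrib)) for e in chain), key=lambda p: p[0])
        for e, (tag, attrib) in zip(chain, pairs):
            e.tag = tag
            e.attrib.clear()
            e.attrib.update(attrib)

    # Continue below the innermost element of the chain.
    for child in chain[-1]:
        canonicalise(child)


doc = ET.fromstring("<p><i><b>x</b></i> and <b><i>y</i></b></p>")
canonicalise(doc)
normalised = ET.tostring(doc, encoding="unicode")
print(normalised)
```

Nestings with text between the tags, such as <i>a<b>x</b></i>, are 
deliberately left alone, since swapping the tags there would change 
which text each tag covers.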

The tool will be released under the GPL when complete.

cheers
stuart

Bullard, Claude L (Len) wrote:

>So you are taking data already tagged in XML and inserting 
>more markup into it, as in adding HTML tags to text nodes?
>
>1)  You are right that markup systems are silent about 
>these semantics.  They are in the domain of the 
>application language.   However, in this case, a bold italic  
>item and an italic bold item are rendered identically, yes, 
>and rendering is the semantic yes, so why are these not 
>equivalent semantically if not syntactically?
>
>What do you mean by 'similar classes of constructs'?
>
>2)  An XSLT script could be used to transform this 
>example.  
>
>len
>
>
>From: Stuart A Yeates
>[mailto:stuart.yeates@computing-services.oxford.ac.uk]
>
>I have written a natural language modelling tool which marks up (inserts 
>XML tags into) natural language documents already in XML.
>
>I have come across an issue with this tool: some users and documents 
>have an expectation that <i><b></b></i> and <b><i></i></b> (and similar 
>classes of constructs) are equivalent, whereas my tool sees these as 
>completely distinct.
>
>From looking at the standards, it appears that HTML, XHTML and XML 
>are all silent on the semantics of situations such as this.
>
>Are there any systems or toolkits which have already been written to 
>help systematise documents and corpora into a single, consistent 
>representation?
>
>cheers
>stuart
>
>  
>

-- 
Stuart Yeates            stuart.yeates@computing-services.oxford.ac.uk
OSS Watch                                  http://www.oss-watch.ac.uk/
Oxford Text Archive                             http://ota.ahds.ac.uk/
Humbul Humanities Hub                         http://www.humbul.ac.uk/
