The tool builds Markov models based on the tags seen in training
documents and then uses Viterbi search to insert the same tags in other
documents. These are the same algorithms used to transform speech to
text in speech recognition, but here they transform completely flat
XML documents into marked-up XML documents.
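To make the idea concrete, here is a minimal sketch of the general
technique, not the tool's actual code. It assumes a first-order hidden
Markov model with add-one smoothing; the function names, the tag names
and the tiny corpus are all hypothetical.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-sequence marker


def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumption of this sketch)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for a non-empty word list."""
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words
    # Each column maps tag -> (best log score so far, previous tag).
    best = [{t: (log_prob(trans[START], t, n_tags)
                 + log_prob(emit[t], words[0], n_words), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, n_tags), p)
                for p in tags)
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        best.append(column)
    # Backtrack from the best final tag to recover the whole path.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
model = train(corpus)
print(viterbi(["the", "cat", "barks"], *model))
```

Each predicted tag would then be written back into the flat document
as an element around the corresponding word.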
The kinds of tags that can be inserted range from part-of-speech tags
and word segmentation tags (very useful for languages which don't
explicitly write their inter-word spaces) to tags making bibliography
information explicit. Hierarchical / nested tags can be marked up.
One requirement is a quality set of training documents, with tags used
consistently throughout. Without such a corpus, the Markov models are
unable to discriminate between tags. Sequences such as <i><b></b></i>
need to be marked up consistently to get good models.
Because the tool is very general and the applicable transformations are
in the application language (of which the tool knows nothing), it seems
unlikely that the tool can do much to make these consistent, unless
someone else has already tackled the problem.
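For the narrow case of purely nested inline tags, a separate
normalisation pass over the corpus could be sketched as below. This is
an illustration only, not part of the tool: the COMMUTATIVE set is an
assumption that the corpus owner would have to supply from knowledge
of the application language, and the function names are hypothetical.
Elements with attributes or mixed content are deliberately left alone.

```python
import xml.etree.ElementTree as ET

# Assumption: tags whose nesting order does not matter in this corpus.
COMMUTATIVE = {"b", "i", "u"}


def _chain(elem):
    """Return the run of purely nested commutative elements from elem."""
    chain = []
    while elem.tag in COMMUTATIVE and not elem.attrib:
        chain.append(elem)
        # Continue only if elem wraps exactly one child and nothing else.
        only_child = len(elem) == 1 and not elem.text and not elem[0].tail
        if not only_child:
            break
        elem = elem[0]
    return chain


def normalise(root):
    """Rewrite each purely nested run so its tags are in sorted order."""
    for elem in root.iter():
        chain = _chain(elem)
        if len(chain) > 1:
            # Only tag names change, so iterating the tree stays safe.
            for node, tag in zip(chain, sorted(n.tag for n in chain)):
                node.tag = tag
    return root


doc = ET.fromstring("<p><b><i>x</i></b> and <i><b>y</b></i></p>")
print(ET.tostring(normalise(doc), encoding="unicode"))
```

Sorting alphabetically is an arbitrary choice; any fixed order would
give the models a single representation to learn from.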
The tool will be released under the GPL when complete.
cheers
stuart
Bullard, Claude L (Len) wrote:
>So you are taking data already tagged in XML and inserting
>more markup into it, as in adding HTML tags to text nodes?
>
>1) You are right that markup systems are silent about
>these semantics. They are in the domain of the
>application language. However, in this case, a bold italic
>item and an italic bold item are rendered identically, yes,
>and rendering is the semantic, yes, so why are these not
>equivalent semantically if not syntactically?
>
>What do you mean by 'similar classes of constructs'?
>
>2) An XSLT script could be used to transform this
>example.
>
>len
>
>
>From: Stuart A Yeates
>[mailto:stuart.yeates@computing-services.oxford.ac.uk]
>
>I have written a natural language modelling tool which marks up (inserts
>XML tags into) natural language documents already in XML.
>
>I have come across an issue with this tool: some users and documents
>have an expectation that <i><b></b></i> and <b><i></i></b> (and similar
>classes of constructs) are equivalent, whereas my tool sees these as
>completely distinct.
>
>From looking at the standards, it appears that HTML, XHTML and XML
>are all silent on the semantics of situations such as this.
>
>Are there any systems or toolkits which have already been written to
>help systematise documents and corpora into a single, consistent
>representation?
>
>cheers
>stuart
>
>
>
--
Stuart Yeates stuart.yeates@computing-services.oxford.ac.uk
OSS Watch http://www.oss-watch.ac.uk/
Oxford Text Archive http://ota.ahds.ac.uk/
Humbul Humanities Hub http://www.humbul.ac.uk/