xml-dev - Re: [xml-dev] Re: English sentences, was: Re: [xml-dev] Announce: XML Sc



To: "'xml-dev'" <xml-dev@lists.xml.org>
Subject: Re: [xml-dev] Re: English sentences, was: Re: [xml-dev] Announce: XML Schema,
From: "Rick Jelliffe" <ricko@allette.com.au>
Date: Fri, 28 Jun 2002 00:04:14 +1000
References: <200206271206.IAA10300@mail.reutershealth.com> <00a801c21dd5$5e54cad0$0201a8c0@ne.mediaone.net> <021201c21dda$bfac4790$4bc8a8c0@AlletteSystems.com> <00b401c21dd7$e6398a60$0201a8c0@ne.mediaone.net>

From: "Jonathan Borden" <jborden@attbi.com>

> The issue of detection of human language, on the other hand, is one that
> interests me. 

François Yergeau had a paper on this at a conference. Robin Cover's site
probably has the reference. He told me detection was quite possible, but of
course it must depend on the document size to some extent.

See http://www.alis.com/castil/silc/?AlisTargetHost=http://www.alis.com:8080
for the commercialization.

I have just been looking for public domain tables giving the likelihood of
various trigrams (groups of three letters) occurring in different languages
(this is useful for detecting OCR errors in text which you might not want to
spell-check for various reasons), but it seems that none exist. Lots of papers
reference such tables, but it looks like a definitive collection has not yet
appeared. (One good approach would be to generate them from the spelling
tables in aspell.)
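
As a rough illustration of generating such a table, here is a minimal Python
sketch. It assumes the word list has already been extracted from a spelling
dictionary into a plain list of words; the function names, the space padding
used to mark word boundaries, and the floor value for unseen trigrams are all
my own assumptions, not anything aspell provides.

```python
import math
from collections import Counter


def trigrams(word):
    """Return the letter trigrams of a word, padded with spaces so that
    word-initial and word-final trigrams are counted separately."""
    padded = f" {word.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_table(words):
    """Count trigrams over a word list and turn counts into relative frequencies."""
    counts = Counter(t for w in words for t in trigrams(w))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def score(word, table, floor=1e-6):
    """Average log-likelihood of a word's trigrams.
    Trigrams missing from the table get the (arbitrary) floor probability."""
    grams = trigrams(word)
    if not grams:  # empty input has no trigrams
        return math.log(floor)
    return sum(math.log(table.get(t, floor)) for t in grams) / len(grams)


# Toy word list standing in for a real dictionary dump (hypothetical data).
WORDS = ["the", "then", "there", "other", "this", "that"]
TABLE = build_table(WORDS)
```

A word whose score falls well below that of typical words in the language is
a candidate OCR error: with the toy table above, score("the", TABLE) comes out
higher than score("tbe", TABLE), because none of the trigrams in "tbe" occur
in the word list. A real table would need a full dictionary, and probably
weighting by word frequency, to be useful.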

Cheers
Rick Jelliffe
