From: "Jonathan Borden" <jborden@attbi.com>
> The issue of detection of human language, on the other hand, is one that
> interests me.
François Yergeau had a paper on this at a conference. Probably Robin Cover's
site has the reference. He told me it was quite possible, but of course it must
depend on the document size to some extent.
See http://www.alis.com/castil/silc/?AlisTargetHost=http://www.alis.com:8080
for the commercialization.
I have just been looking for public domain tables giving the likelihood of
various trigrams (groups of three letters) occurring in different languages
(because this is a useful thing for detecting OCR errors in text which you
might not want to spell-check for various reasons) but it seems that none
exist. Lots of papers reference such tables, but it looks like no definitive
collection has been published yet. (One good approach would be to generate
them from the aspell spelling tables.)
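As a rough sketch of what generating such a table from a word list might look
like: the code below assumes you already have a plain list of words for one
language (for example, dumped from a spelling dictionary by whatever means).
The function names, the space padding at word boundaries, and the floor
probability for unseen trigrams are all illustrative assumptions, not anything
aspell itself provides. Note also that a word list counts each word once, so
the resulting figures will differ from tables built from running text, where
common words dominate.

```python
import math
from collections import Counter


def trigram_counts(words):
    """Count letter trigrams over a word list (hypothetical helper).

    Each word is lowercased and padded with one space on each side so
    that word-initial and word-final trigrams are counted too.
    """
    counts = Counter()
    for word in words:
        w = word.strip().lower()
        if not w:
            continue
        padded = f" {w} "
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def trigram_probabilities(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()}


def score(text, probs, floor=1e-6):
    """Average log-probability of the trigrams in `text`.

    Trigrams missing from the table get `floor` (an arbitrary assumed
    value). Lower scores suggest unlikely letter sequences, e.g. OCR errors.
    """
    padded = f" {text.strip().lower()} "
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    if not grams:
        return float("-inf")
    return sum(math.log(probs.get(g, floor)) for g in grams) / len(grams)


if __name__ == "__main__":
    # Tiny stand-in word list; a real table would need a full dictionary.
    table = trigram_probabilities(trigram_counts(["the", "then", "there"]))
    print(score("the", table), score("tbe", table))
```

With one table per language, the same score could in principle be compared
across languages to guess which one a passage is written in, though short
passages will give unreliable results.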
Cheers
Rick Jelliffe