Jonathan Borden scripsit:
> Actually this _is_ the original point, isn't it? You are saying that using a
> specific character set isn't a reliable way to detect a human language
> (because other characters might be correctly present)
This is not my point at all, though I do agree with it, as do you.
My point is that determining THE alphabet of English is a wild goose
chase, because different definitions exist for different uses.
For children's books, the alphabet is unquestionably a-zA-Z and
nothing else. For more complex prose, some rarer letters are required.
Foreign words may retain their accents or not, and quotations can be in
any script at all.
This has absolutely nothing to do with detection as such. It has
to do with *validation* that the text can be handled by some kind of
mechanization or other.
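To make the validation idea concrete, here is a rough Python sketch. The two repertoires and the function name unsupported_letters are invented for illustration and are not drawn from any standard; a real application would pick its own set according to what its tooling can actually handle. The check reports the letters that fall outside the chosen repertoire; it says nothing about what language the text is in.

```python
import string
import unicodedata

# Hypothetical repertoires, chosen purely for illustration.
# BASIC_LETTERS is the a-zA-Z set mentioned above.
BASIC_LETTERS = frozenset(string.ascii_letters)

# An assumed "extended" set with a few accented letters; not authoritative.
EXTENDED_LETTERS = BASIC_LETTERS | frozenset(
    unicodedata.normalize("NFC", "\u00e0\u00e7\u00e8\u00e9\u00ea\u00ef\u00f1\u00f6\u00fc\u00e6")
)


def unsupported_letters(text, repertoire):
    """Return, sorted, the letters in text that are outside repertoire.

    Only characters in a Unicode letter category (L*) are checked, so
    digits, punctuation, and whitespace never count as failures.
    """
    normalized = unicodedata.normalize("NFC", text)
    return sorted(
        {
            ch
            for ch in normalized
            if unicodedata.category(ch).startswith("L") and ch not in repertoire
        }
    )


if __name__ == "__main__":
    sample = "a na\u00efve r\u00e9sum\u00e9"
    print(unsupported_letters(sample, BASIC_LETTERS))
    print(unsupported_letters(sample, EXTENDED_LETTERS))
```

The same text passes or fails depending on which repertoire is chosen, which is the point: the answer depends on the use, not on the language.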
FWIW, Harald Alvestrand has done some work on the subject which can be
found at http://www.alvestrand.no/ietf/lang-chars.txt . This work is
explicitly incomplete, most likely contains errors, and is to be used
at your own risk.
John Cowan <firstname.lastname@example.org> http://www.reutershealth.com
I amar prestar aen, han mathon ne nen, http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith. --Galadriel, _LOTR:FOTR_