   Re: [xml-dev] Re: English sentences, was: Re: [xml-dev] Announce: XML Sc

Jonathan Borden scripsit:

> Actually this _is_ the original point, isn't it? You are saying that using a
> specific character set isn't a reliable way to detect a human language
> (because other characters might be correctly present) 

This is not my point at all, though I do agree with it, as do you.

My point is that determining THE alphabet of English is a wild goose
chase, because different definitions exist for different uses.
For children's books, the alphabet is unquestionably a-zA-Z and
nothing else.  For more complex prose, some rarer letters are required.
Foreign words may retain their accents or not, and quotations can be in
any script at all.

This has absolutely nothing to do with detection as such.  It has
to do with *validation* that the text can be handled by some kind of
mechanization or other.
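
To make the validation/detection distinction concrete, here is a minimal
sketch in Python. It is only an illustration: the two repertoires and the
function name unsupported_letters are hypothetical, and which letters
belong in each repertoire is exactly the policy question discussed above.
The check says nothing about what language the text is in; it only reports
letters that a given process has not declared it can handle.

```python
import unicodedata

# Hypothetical repertoires. The membership of each set is a policy choice
# made for one particular use, not a definition of "the" English alphabet.
CHILDRENS_BOOK = set("abcdefghijklmnopqrstuvwxyz"
                     "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
COMPLEX_PROSE = CHILDRENS_BOOK | set("àâçéèêëïîôöüæœÀÂÇÉÈÊËÏÎÔÖÜÆŒ")


def unsupported_letters(text, repertoire):
    """Return the sorted distinct letters of text that are outside repertoire.

    Only alphabetic characters are checked; digits, punctuation and
    whitespace are ignored. The text is normalized to NFC first, so that
    'e' followed by a combining acute accent compares equal to the
    precomposed letter.
    """
    normalized = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in normalized
                   if ch.isalpha() and ch not in repertoire})


if __name__ == "__main__":
    sample = "The naïve café"
    print(unsupported_letters(sample, CHILDRENS_BOOK))
    print(unsupported_letters(sample, COMPLEX_PROSE))
```

The same sentence fails against one repertoire and passes against the
other, although it is English in both cases.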

FWIW, Harald Alvestrand has done some work on the subject which can be
found at http://www.alvestrand.no/ietf/lang-chars.txt .  This work is
explicitly incomplete, most likely contains errors, and is to be used
at your own risk.

John Cowan <jcowan@reutershealth.com>     http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_


Copyright 2001 XML.org. This site is hosted by OASIS