xml-dev - Re: [xml-dev] Re: English sentences, was: Re: [xml-dev] Announce: XML Sc

To: jborden@attbi.com (Jonathan Borden)
Subject: Re: [xml-dev] Re: English sentences, was: Re: [xml-dev] Announce: XML Schema,
From: John Cowan <jcowan@reutershealth.com>
Date: Thu, 27 Jun 2002 10:48:32 -0400 (EDT)
Cc: jcowan@reutershealth.com (John Cowan), tpassin@comcast.net (Thomas B. Passin), xml-dev@lists.xml.org ('xml-dev')
In-reply-to: <00a801c21dd5$5e54cad0$0201a8c0@ne.mediaone.net> from "Jonathan Borden" at Jun 27, 2002 08:22:56 AM

Jonathan Borden scripsit:

> Actually this _is_ the original point, isn't it? You are saying that using a
> specific character set isn't a reliable way to detect a human language
> (because other characters might be correctly present) 

This is not my point at all, though I do agree with it, as do you.

My point is that determining THE alphabet of English is a wild goose
chase, because different definitions exist for different uses.
For children's books, the alphabet is unquestionably a-zA-Z and
nothing else.  For more complex prose, some rarer letters are required.
Foreign words may retain their accents or not, and quotations can be in
any script at all.

This has absolutely nothing to do with detection as such.  It has
to do with *validation* that the text can be handled by some kind of
mechanization or other.
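
A minimal sketch of validation in this sense, as opposed to detection.
The two repertoires below (BASIC_LETTERS, EXTENDED_LETTERS) and the
function names are hypothetical, made up for illustration only; they are
not taken from any standard or published list.  The check says nothing
about which language the text is in, only whether every letter falls
inside the repertoire a given application has chosen to handle.

```python
import unicodedata

# Hypothetical repertoires, chosen only for illustration.
BASIC_LETTERS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
EXTENDED_LETTERS = BASIC_LETTERS | set(
    unicodedata.normalize("NFC", "éèêëàâçïîôöüñæœÉÈÀÇÖÜÑÆŒ")
)


def unsupported_letters(text, repertoire):
    """Return the sorted list of letters in text that are outside repertoire.

    Only characters whose Unicode general category starts with 'L' are
    checked, so digits, punctuation and whitespace never cause a failure.
    """
    text = unicodedata.normalize("NFC", text)  # compare precomposed forms
    return sorted(
        {
            ch
            for ch in text
            if unicodedata.category(ch).startswith("L") and ch not in repertoire
        }
    )


def validate(text, repertoire):
    """True if every letter in text belongs to the chosen repertoire."""
    return not unsupported_letters(text, repertoire)
```

The same sentence can pass under one repertoire and fail under another,
which is the sense in which there is no single alphabet to validate
against: validate("a naïve résumé", BASIC_LETTERS) is false, while
validate("a naïve résumé", EXTENDED_LETTERS) is true.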

FWIW, Harald Alvestrand has done some work on the subject which can be
found at http://www.alvestrand.no/ietf/lang-chars.txt .  This work is
explicitly incomplete, most likely contains errors, and is to be used
at your own risk.

-- 
John Cowan <jcowan@reutershealth.com>     http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_
