
   Re: [xml-dev] Re: English sentences, was: Re: [xml-dev] Announce: XML Sc


John Cowan wrote:


> Jonathan Borden scripsit:
>
> > It all depends on what exactly you want, or intend the validator to do. What
> > you are saying, in essence, is that an "English sentence" is not defined as
> > a sequence of characters which conform to "text-en" and this is most true.
>
> The original point seems to have gotten lost.

Actually this _is_ the original point, isn't it? You are saying that using a
specific character set isn't a reliable way to detect a human language
(because other characters might be correctly present) and I am agreeing (but
because the _problem_ is way more complicated than character sets).

>
> The publisher's use case was for a datatype representing those letters,
> and only those letters, used in writing the Dutch language.  Formally, of
> course, that's easy: it's an xsd:string type with a pattern facet
> consisting of "[ a-zA-Z...]+".  The question is, just what are those
> other letters represented by the ellipsis in any given case?
>
> I used the examples of "façade" and "coöperate" and "naïve" to
> illustrate that this problem may or may not have a clear-cut answer. These
> are not foreign words; they are standard spellings (though not the only
> standard spellings) of standard English words.
>
> It's perfectly true that a sentence like "Al-Musa said, '<insert
> Arabic here>'." is also an English sentence even if the Arabic text
> is expressed in the Arabic script.  But that isn't my point.

It's another good point, however. What I am saying is that there are lots of
good reasons why what was suggested might not be reliable (it can yield
either false positives or false negatives).
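
To make the false positive / false negative concern concrete, here is a minimal sketch in Python rather than in a schema language. It treats a regular expression as a stand-in for an xsd:string pattern facet of the "[ a-zA-Z...]+" kind quoted above. Both character classes are assumptions made up for illustration: the thread leaves the ellipsis open, and nothing here claims to be the real repertoire of English or Dutch.

```python
import re

# Hypothetical repertoires. The "..." in the quoted pattern is deliberately
# left undefined in the thread, so both classes below are illustrative guesses.
ASCII_ONLY = re.compile(r"[ a-zA-Z]+")
WITH_DIACRITICS = re.compile(r"[ a-zA-Zçöï]+")  # assumed extension, not a definitive list


def conforms(pattern, text):
    """Return True only if the whole string matches, as a pattern facet would require."""
    return pattern.fullmatch(text) is not None


words = ["facade", "façade", "coöperate", "naïve"]

# False negatives: standard English spellings rejected by the ASCII-only class.
ascii_results = {w: conforms(ASCII_ONLY, w) for w in words}

# Widening the class accepts them, but only for the letters we happened to add.
extended_results = {w: conforms(WITH_DIACRITICS, w) for w in words}

# False positive: a Dutch sentence passes a pattern meant to describe "English".
dutch_false_positive = conforms(ASCII_ONLY, "de kat zit op de mat")
```

Under these assumptions the ASCII-only class rejects the three accented spellings while accepting the Dutch sentence, and any widened class simply moves the question to which letters belong in it.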

>
> > Indeed to reliably detect an English sentence the 'recognizer' needs to
> > understand how to form words from characters and sentences from words. This
> > is way outside the capabilities of the XML schema definition languages we
> > have been discussing.
>
> Of course, of course.  But even at the level of characters, there is
> a *definitional* (not implementation) problem in saying just what
> the character repertoire of <insert language here> is.
> Many have come up against this rock and crashed against it.

Agreed.

Jonathan






Copyright 2001 XML.org. This site is hosted by OASIS