   Re: [xml-dev] Announce: XML Schema,


From: "Jonathan Borden" <jborden@attbi.com>

> I suppose that if I went to the trouble to specify "text-en" that I probably
> wouldn't want that to validate. Come to think of it, the French would
> probably pay good money to obtain a reliable validator that fails on words
> that smack of English, so that pattern * - text-en (or something akin) might
> become quite popular  :-))

The company Alis has tools which they say can reliably detect many
different languages (and even some encodings) based on statistics.

But, again, the point of validating a character repertoire would be to 
assert which characters *are* expected, so that you can be told
when an unexpected character is found and so that programmers
can cope.  

The issue of deciding "What is in English?" or "What is in French?"
is a red herring.  An English language document may well have
an unmarked Greek character, for example.  By being able to validate
that, say, only ASCII characters are used for English, we force the
special character to be marked up specially, or we alert the typesetter
or whatever that the data contains something that a programmer was
told not to expect. 
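
As a rough illustration of that kind of check, here is a minimal Python
sketch.  The function name and the default ASCII-only rule are my own
assumptions for the example, not part of any schema language; the idea is
only to report where an unexpected character occurs so that it can be
marked up or flagged.

```python
def unexpected_chars(text, allowed=lambda ch: ord(ch) < 128):
    """Return (offset, character) pairs for characters outside the repertoire.

    `allowed` is a hypothetical predicate describing the expected
    repertoire; the default accepts ASCII only.
    """
    return [(i, ch) for i, ch in enumerate(text) if not allowed(ch)]


if __name__ == "__main__":
    # An English sentence containing an unmarked Greek theta.
    sample = "The angle \u03b8 is small."
    for offset, ch in unexpected_chars(sample):
        print(f"unexpected U+{ord(ch):04X} at offset {offset}")
```

A different repertoire is just a different predicate, so the same
function could serve several document types.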

Another example might be in Chinese.  A military document type
for the Taiwanese army might say, for example, that only characters
found in Big5 or only characters learnt as part of end of year 10 should
be allowed in the body text of training manuals, to correspond to 
baseline literacy of conscripts. 
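
A check against Big5 could be sketched along the following lines.  This
assumes that Python's built-in "big5" codec is an acceptable stand-in for
the Big5 repertoire (the codec also accepts ASCII), and the function name
is made up for the example.  A list of characters learnt by the end of
year 10 would need its own table, which is not shown here.

```python
def outside_big5(text):
    """Return (offset, character) pairs that cannot be encoded as Big5.

    Uses Python's standard "big5" codec as an approximation of the
    Big5 character repertoire.
    """
    bad = []
    for i, ch in enumerate(text):
        try:
            ch.encode("big5")
        except UnicodeEncodeError:
            bad.append((i, ch))
    return bad


if __name__ == "__main__":
    # Two common Chinese characters followed by an Arabic letter.
    print(outside_big5("\u4e2d\u6587\u0628"))
```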

Very few fonts have all Unicode characters. And with good reason:
fonts are large and high-quality publishing fonts will often come
from regional type foundries: we wouldn't expect that a Chinese font from
a Singaporean font foundry will support Polish orthography well or
have Arabic characters, for example.  The modern trend, spearheaded
from Asia and now part of Java, is to have virtual fonts, where you 
mix and match ranges of existing fonts. 
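
To make the virtual font idea concrete, here is a small Python sketch of
a range table that picks an underlying font for each character.  The font
names and the ranges are invented for illustration; this is not how Java
or any particular system configures its virtual fonts.

```python
# Hypothetical virtual font: each entry maps an inclusive code point
# range to the name of an existing font that covers it.
VIRTUAL_FONT = [
    (0x0000, 0x024F, "LatinFoundryFont"),
    (0x0600, 0x06FF, "ArabicFoundryFont"),
    (0x4E00, 0x9FFF, "ChineseFoundryFont"),
]


def font_for(ch, table=VIRTUAL_FONT):
    """Return the name of the font covering `ch`, or None if no range does."""
    cp = ord(ch)
    for start, end, name in table:
        if start <= cp <= end:
            return name
    return None
```

A character for which font_for returns None is exactly the sort of
unexpected character that a repertoire check would want to report.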

So again it comes down to what a schema is for. If it is to express
the static and dynamic constraints that a given production flow
requires to be checked for high quality operation, then 
range-checking of mixed content is something that
*some* schema module should do. 

Rick Jelliffe



Copyright 2001 XML.org. This site is hosted by OASIS