From: "Jonathan Borden" <email@example.com>
> I suppose that if I went to the trouble to specify "text-en" that I probably
> wouldn't want that to validate. Come to think of it, the French would
> probably pay good money to obtain a reliable validator that fails on words
> that smack of English, so that pattern * - text-en (or something akin) might
> become quite popular :-))
The company Alis has tools which they say can reliably detect many
different languages (and even some encodings) based on statistics.
But, again, the point of validating a character repertoire would be to
assert which characters *are* expected, so that you can be told
when an unexpected character is found and so that programmers
The issue of deciding "What is in English?" or "What is in French?"
is a red herring. An English language document may well have
an unmarked Greek character, for example. By being able to validate
that, say, only ASCII characters are used for English, we force the
special character to be marked up specially, or we alert the typesetter
or whatever that the data contains something that a programmer was
told not to expect.
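
As a rough illustration of the repertoire check described above, here is a minimal Python sketch. The function name `unexpected_characters` and the default ASCII range are assumptions made for the example, not part of any schema language.

```python
def unexpected_characters(text, allowed=range(0x00, 0x80)):
    """Return (offset, character) pairs for characters outside `allowed`."""
    # `allowed` defaults to the ASCII code points U+0000..U+007F.
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) not in allowed]


sample = "The angle α is small."
for offset, ch in unexpected_characters(sample):
    print(f"offset {offset}: unexpected U+{ord(ch):04X} {ch!r}")
```

The report gives the typesetter or programmer the offset of each character they were told not to expect, rather than trying to guess the language of the text.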
Another example might be in Chinese. A military document type
for the Taiwanese army might say, for example, that only characters
found in Big5 or only characters learnt by the end of year 10 should
be allowed in the body text of training manuals, to correspond to
baseline literacy of conscripts.
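
A repertoire defined by an existing character set can be approximated by asking whether each character is encodable in it. The sketch below assumes Python's built-in `big5` codec is an acceptable stand-in for "characters found in Big5" (vendor variants of Big5 differ); the name `outside_repertoire` is hypothetical. A school-year list would need an explicit set of characters instead of a codec.

```python
def outside_repertoire(text, encoding="big5"):
    """Return the characters of `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)  # strict mode raises on unmappable characters
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# A Polish letter is outside the Big5 repertoire, so it is reported.
print(outside_repertoire("中文ł"))
```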
Very few fonts have all Unicode characters. And with good reason:
fonts are large and high-quality publishing fonts will often come
from regional type foundries: we wouldn't expect that a Chinese font from
a Singaporean font foundry will support Polish orthography well or
have Arabic characters, for example. The modern trend, spearheaded
from Asia and now part of Java, is to have virtual fonts, where you
mix and match ranges of existing fonts.
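
The idea of a virtual font can be sketched as a table of code point ranges, each mapped to a component font. The font names and ranges below are invented for illustration, and this is not Java's actual configuration format.

```python
# Hypothetical composite ("virtual") font: each entry maps an inclusive
# code point range to the name of the physical font that covers it.
VIRTUAL_FONT = [
    (0x0000, 0x024F, "LatinSerif"),   # Latin, including Polish letters
    (0x0600, 0x06FF, "ArabicNaskh"),  # Arabic
    (0x4E00, 0x9FFF, "ChineseMing"),  # CJK Unified Ideographs
]


def font_for(ch, table=VIRTUAL_FONT):
    """Return the first component font whose range covers `ch`, or None."""
    cp = ord(ch)
    for first, last, font in table:
        if first <= cp <= last:
            return font
    return None  # no component font covers this character


for ch in "ł中α":
    print(ch, font_for(ch))
```

A character that maps to `None` is exactly the kind of thing a repertoire check would want to catch before the document reaches the typesetter.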
So again it comes down to what a schema is for. If it is to express
the static and dynamic constraints that a given production flow
requires to be checked for high quality operation, then
range-checking mixed content is something that
*some* schema module should do.
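
What such a module might do can be sketched with Python's standard `xml.etree.ElementTree`. The `RULES` table, the element names, and the function `check_ranges` are assumptions for the example; no existing schema language is implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: element name -> allowed code point range for its text.
RULES = {"para": range(0x00, 0x80)}


def check_ranges(xml_text, rules=RULES):
    """Return (element, character) pairs that break the element's rule."""
    problems = []
    for elem in ET.fromstring(xml_text).iter():
        allowed = rules.get(elem.tag)
        if allowed is None:
            continue  # no repertoire constraint declared for this element
        # Mixed content: text directly inside the element, plus the text
        # following each child (its tail). Text inside a child belongs
        # to the child and is checked under the child's own rule, if any.
        chunks = [elem.text or ""] + [child.tail or "" for child in elem]
        for ch in "".join(chunks):
            if ord(ch) not in allowed:
                problems.append((elem.tag, ch))
    return problems


doc = "<doc><para>angle α or <greek>β</greek> is fine</para></doc>"
print(check_ranges(doc))
```

Here the unmarked character directly inside `para` is reported, while the one wrapped in the (hypothetical) `greek` element is not, which is the "marked up specially" case.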