- To: xml-dev@lists.xml.org
- Subject: RE: [xml-dev] Specifying Character Sets
- From: Eric van der Vlist <vdv@dyomedea.com>
- Date: Thu, 26 Jan 2006 10:15:18 +0100
- In-reply-to: <20060126090311.13B026D00E3@gwnormandy.dyomedea.com>
- Organization: Dyomedea (http://dyomedea.com)
- References: <20060126090311.13B026D00E3@gwnormandy.dyomedea.com>
Hi Mike,
On Thursday 26 January 2006 at 09:02 +0000, Michael Kay wrote:
> > I am working on a small schema language for an XML language that I
> > will be using in an open source program. In this schema I am defining
> > a text data type. I want the schema developer using my schema language
> > to have the option of specifying the character set of the text data type.
> >
> > A given XML document is only in one character set. To support multiple
> > character sets you'll have to do something like base64-encode the content.
>
> I read the question differently (though people often use "character set" to
> mean "character encoding", so I might be wrong). XML allows the Unicode
> character set (or some version of it). You may want in a schema to restrict
> the user to a subset of the characters in that character set, for example
> the subset of characters defined in iso-8859-1, or the subset defined in
> iso-8859-2, or some subset of your own choosing such as [A-Z][0-9][.,-].
>
> There are international names for character encodings such as iso-8859-1
> (search for IANA register of character sets). They define the encodings of
> the characters, which you aren't interested in, but in doing so they also
> define the repertoire of characters (that is, the character set in its
> strict meaning).
Right, but to be complete, I'd add that they define the repertoire of
characters that can be included directly in an XML document; they still do
not prevent characters outside this repertoire from being added as numeric
character references.
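A minimal Python sketch of that point, assuming the standard library's
xml.etree.ElementTree parser and an arbitrary sample string (neither is from
the thread): characters that ISO-8859-1 cannot encode are written as numeric
references, and the parser still returns them.

```python
import xml.etree.ElementTree as ET

# Sample text (an assumption for illustration): "é" is in ISO-8859-1,
# the four Hebrew letters are not.
text = "caf\u00e9 \u05e9\u05dc\u05d5\u05dd"

# xmlcharrefreplace writes every unencodable character as &#NNNN;
body = text.encode("iso-8859-1", errors="xmlcharrefreplace")

doc = b'<?xml version="1.0" encoding="iso-8859-1"?><t>' + body + b"</t>"

# The parser resolves the references, so the Hebrew letters survive
# even though the document encoding cannot represent them directly.
parsed = ET.fromstring(doc).text
```

So the declared encoding limits only what can appear literally in the byte
stream, not what the document can contain.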
> I would think that a more useful approach, however, is to use the names of
> blocks of characters defined in Unicode, which are available for use in XML
> Schema regular expressions, for example <xs:pattern value="\p{IsHebrew}*"/>
> limits you to characters with Unicode code points U+0590 to U+05FF.
Yep, except that you can't apply this constraint to mixed content models
with W3C XML Schema (nor with RELAX NG), which makes it quite useless
for a lot of real-world applications.
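For a simple (text-only) value, the block restriction quoted above amounts to
a range test on code points. A rough Python equivalent, assuming an explicit
range as a stand-in for \p{IsHebrew} because Python's re module has no Unicode
block escapes (the function name is hypothetical):

```python
import re

# Explicit range standing in for the XML Schema block escape \p{IsHebrew}.
HEBREW_BLOCK = re.compile(r"[\u0590-\u05FF]*")

def in_hebrew_block(value: str) -> bool:
    """True if every character of value lies in U+0590..U+05FF."""
    return HEBREW_BLOCK.fullmatch(value) is not None
```

This only covers a value that is a single string; it says nothing about text
interleaved with child elements, which is where the limitation bites.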
<plug href="http://dsdl.org/" type="shameless">
Solving this specific issue is the goal of DSDL Part 7, Character
Repertoire Description Language (CRDL), and everyone interested in this
issue is welcome to help!
</plug>
Note that this restriction can be expressed with ISO Schematron using
XPath 2.0 as its expression language (or with plain XSLT 2.0).
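As a rough illustration of what such a rule has to do, here is a Python
sketch (not Schematron, and not taken from this thread) that checks every text
node under a mixed-content element against a repertoire. The repertoire
(Hebrew block plus whitespace), the function name and the sample document are
all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical repertoire: the Hebrew block plus whitespace.
ALLOWED_CHAR = re.compile(r"[\u0590-\u05FF\s]")

def outside_repertoire(element):
    """Return the sorted distinct characters, taken from all text nodes
    under element (including text inside child elements), that fall
    outside the repertoire."""
    text = "".join(element.itertext())
    return sorted({ch for ch in text if not ALLOWED_CHAR.fullmatch(ch)})

# Mixed content: text interleaved with an <em> child element.
doc = ET.fromstring(
    "<p>\u05e9\u05dc\u05d5\u05dd <em>\u05e2\u05d5\u05dc\u05dd</em> ok</p>"
)
print(outside_repertoire(doc))  # ['k', 'o']
```

The check runs over the concatenated text nodes rather than over a single
typed value, which is the part a pattern facet cannot express for mixed
content.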
Eric
--
GPG-PGP: 2A528005
Freelance consulting and training.
http://dyomedea.com/english/
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(ISO) RELAX NG ISBN:0-596-00421-4 http://oreilly.com/catalog/relax
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------