xml-dev - RE: [xml-dev] Specifying Character Sets


To: xml-dev@lists.xml.org
Subject: RE: [xml-dev] Specifying Character Sets
From: Eric van der Vlist <vdv@dyomedea.com>
Date: Thu, 26 Jan 2006 10:15:18 +0100
In-reply-to: <20060126090311.13B026D00E3@gwnormandy.dyomedea.com>
Organization: Dyomedea (http://dyomedea.com)
References: <20060126090311.13B026D00E3@gwnormandy.dyomedea.com>

Hi Mike,

On Thursday 26 January 2006 at 09:02 +0000, Michael Kay wrote:
> > I am working on a small schema language for an XML language that I
> > will be using in an open source program. In this schema I am defining
> > a text data type. I want the schema developer using my schema language
> > to have the option of specifying the character set of the text data type.
> > 
> > A given XML document is only in one character set.  To support multiple
> > character sets you'll have to do something like base64-encode the content.
> 
> I read the question differently (though people often use "character set" to
> mean "character encoding", so I might be wrong). XML allows the Unicode
> character set (or some version of it). You may want in a schema to restrict
> the user to a subset of the characters in that character set, for example
> the subset of characters defined in iso-8859-1, or the subset defined in
> iso-8859-2, or some subset of your own choosing such as [A-Z][0-9][.,-].
> 
> There are international names for character encodings such as iso-8859-1
> (search for IANA register of character sets). They define the encodings of
> the characters, which you aren't interested in, but in doing so they also
> define the repertoire of characters (that is, the character set in its
> strict meaning).

Right, but to be exhaustive, I'd add that they define the repertoire of
characters that can be included directly in an XML document; they still do
not prevent characters outside this repertoire from being added as numeric
character references.
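
To make the distinction concrete, here is a minimal Python sketch (not
part of any schema language; the function names are my own). It relies on
the standard "xmlcharrefreplace" error handler, which writes characters
outside the target encoding as numeric character references. It does not
escape "&" or "<", so it is an illustration rather than a serializer.

```python
def serialize_for_encoding(text: str, encoding: str) -> bytes:
    """Encode text, writing characters outside the encoding's repertoire
    as XML numeric character references (&#NNNN;)."""
    return text.encode(encoding, errors="xmlcharrefreplace")


def outside_repertoire(text: str, encoding: str) -> list[str]:
    """Return the characters that the encoding cannot represent directly."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


if __name__ == "__main__":
    sample = "caf\u00e9 \u20ac"  # e-acute is in iso-8859-1, the euro sign is not
    print(serialize_for_encoding(sample, "iso-8859-1"))
    print(outside_repertoire(sample, "iso-8859-1"))
```

The euro sign still ends up in the document, only as a reference, which is
why the encoding alone does not restrict the repertoire.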

> I would think that a more useful approach, however, is to use the names of
> blocks of characters defined in Unicode, which are available for use in XML
> Schema regular expressions, for example <xs:pattern value="\p{IsHebrew}*"/>
> limits you to characters with Unicode code points U+0590-U+05FF.
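
For readers who want to try the same test outside a schema processor, a
rough Python equivalent might look like this. The standard "re" module
does not support \p{IsHebrew}-style block escapes, so this sketch checks
the code point range of the Hebrew block directly; the names are
hypothetical.

```python
# Unicode Hebrew block: U+0590 through U+05FF inclusive.
HEBREW_BLOCK = range(0x0590, 0x0600)


def only_in_block(text: str, block: range = HEBREW_BLOCK) -> bool:
    """True if every character of text lies in the given block.

    Like the pattern \\p{IsHebrew}*, the empty string is accepted.
    """
    return all(ord(ch) in block for ch in text)


if __name__ == "__main__":
    print(only_in_block("\u05e9\u05dc\u05d5\u05dd"))   # Hebrew letters only
    print(only_in_block("\u05e9\u05dc\u05d5\u05dd!"))  # "!" is outside the block
```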

Yep, except that you can't apply this constraint with W3C XML Schema
(nor with RELAX NG) to mixed content models, which makes it quite useless
for a lot of real-world applications.

<plug href="http://dsdl.org/" type="shameless">

Solving this specific issue is the goal of DSDL Part 7, Character
Repertoire Description Language (CRDL), and everyone interested in this
issue is welcome to help!

</plug>

Note that this restriction can be expressed with ISO Schematron using
XPath 2.0 as its expression language (or with plain XSLT 2.0).
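
This is not Schematron, but the idea of the check can be sketched in
Python with the standard library: walk every text node of an element with
mixed content and report characters outside the allowed repertoire. The
function name, the sample document and the "allowed" predicate are all
assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def offending_characters(xml_text, element_name, allowed):
    """Return (element tag, character) pairs for every character found in
    the text nodes of the named elements, at any depth, that the
    'allowed' predicate rejects."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter(element_name):
        # itertext() yields the element's own text and that of its
        # descendants, which is what a mixed content model needs.
        for chunk in elem.itertext():
            for ch in chunk:
                if not allowed(ch):
                    problems.append((elem.tag, ch))
    return problems


if __name__ == "__main__":
    doc = "<doc><p>abc <b>d&#233;f</b> ghi &#8364;</p></doc>"
    # Assumed repertoire: the code points of iso-8859-1 (U+0000 to U+00FF).
    print(offending_characters(doc, "p", lambda ch: ord(ch) < 0x100))
```

Because the parser resolves numeric character references before the
check runs, characters that entered the document as references are
caught as well.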

Eric
-- 
GPG-PGP: 2A528005
Freelance consulting and training.
                                            http://dyomedea.com/english/
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(ISO) RELAX NG   ISBN:0-596-00421-4 http://oreilly.com/catalog/relax
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

