   RE: [xml-dev] Specifying Character Sets

  • To: xml-dev@lists.xml.org
  • Subject: RE: [xml-dev] Specifying Character Sets
  • From: Eric van der Vlist <vdv@dyomedea.com>
  • Date: Thu, 26 Jan 2006 10:15:18 +0100
  • In-reply-to: <20060126090311.13B026D00E3@gwnormandy.dyomedea.com>
  • Organization: Dyomedea (http://dyomedea.com)
  • References: <20060126090311.13B026D00E3@gwnormandy.dyomedea.com>

Hi Mike,

On Thursday 26 January 2006 at 09:02 +0000, Michael Kay wrote:
> > I am working on a small schema language for an XML language that I
> > will be using in an open source program. In this schema I am defining
> > a text data type. I want the schema developer using my schema language
> > to have the option of specifying the character set of the text data type.
> > 
> > A given XML document is only in one character set.  To support multiple
> > character sets you'll have to do something like base64-encode the content.
> I read the question differently (though people often use "character set" to
> mean "character encoding", so I might be wrong). XML allows the Unicode
> character set (or some version of it). You may want in a schema to restrict
> the user to a subset of the characters in that character set, for example
> the subset of characters defined in iso-8859-1, or the subset defined in
> iso-8859-2, or some subset of your own choosing such as [A-Z][0-9][.,-].
> There are international names for character encodings such as iso-8859-1
> (search for IANA register of character sets). They define the encodings of
> the characters, which you aren't interested in, but in doing so they also
> define the repertoire of characters (that is, the character set in its
> strict meaning).

Right, but to be exhaustive, I'd add that they define the repertoire of
characters that can be directly included in an XML document, but they
still do not prevent adding characters external to this repertoire as
numeric character references.
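
A minimal sketch of this point, assuming Python's standard
xml.etree.ElementTree parser (the sample document and variable names are
made up for illustration): a document declared as iso-8859-1 can still
carry a Hebrew letter, as long as it is written as a numeric character
reference.

```python
import xml.etree.ElementTree as ET

# Document declared as ISO-8859-1: the byte 0xE9 is "é" in that encoding.
# The Hebrew letter alef (U+05D0) has no ISO-8859-1 byte, so it is written
# as a numeric character reference instead.
doc = b'<?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9 &#x5D0;</p>'

# After parsing, the text is a plain Unicode string containing both characters.
text = ET.fromstring(doc).text
print([hex(ord(c)) for c in text])

# The parsed text no longer fits in the encoding the document was declared in.
try:
    text.encode("iso-8859-1")
    fits_declared_encoding = True
except UnicodeEncodeError:
    fits_declared_encoding = False
print(fits_declared_encoding)
```

So the declared encoding limits which characters can appear literally, not
which characters the document can contain.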

> I would think that a more useful approach, however, is to use the names of
> blocks of characters defined in Unicode, which are available for use in XML
> Schema regular expressions, for example <xs:pattern value="\p{IsHebrew}*"/>
> limits you to characters with Unicode codepoints 590-5FF.

Yep, except that you can't apply this constraint with W3C XML Schema
(nor with RELAX NG) to mixed content models, which makes it quite useless
for a lot of real-world applications.
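
For comparison, here is a rough Python equivalent of what the pattern facet
checks on a single simple-type value. It is only a sketch: Python's standard
re module has no \p{IsHebrew} escape, so the block is written out as the
explicit range U+0590 to U+05FF, and the function name is hypothetical. XML
Schema patterns are implicitly anchored, hence fullmatch.

```python
import re

# Assumption: the Hebrew block is U+0590..U+05FF, written as an explicit
# range because the standard "re" module does not support \p{IsHebrew}.
HEBREW_ONLY = re.compile(r"[\u0590-\u05FF]*")


def is_hebrew_only(value: str) -> bool:
    """Return True if every character of value lies in the Hebrew block."""
    # fullmatch mirrors the implicit anchoring of XML Schema patterns.
    return HEBREW_ONLY.fullmatch(value) is not None


print(is_hebrew_only("\u05e9\u05dc\u05d5\u05dd"))   # Hebrew letters only
print(is_hebrew_only("\u05e9\u05dc\u05d5\u05dd!"))  # trailing "!" is outside the block
```

This works on one string value at a time, which is exactly why it does not
reach text that is interleaved with child elements.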

<plug href="http://dsdl.org/" type="shameless">

Solving this specific issue is the goal of DSDL Part 7, Character
Repertoire Description Language (CRDL), and everyone interested in this
issue is welcome to help!

</plug>

Note that this restriction can be expressed with ISO Schematron using
XPath 2.0 as its expression language (or with plain XSLT 2.0).
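
The idea behind such a rule is to test every text node, wherever it sits in
the mixed content. The sketch below does the same thing in Python rather
than Schematron, purely as an illustration: the allowed repertoire (Hebrew
block plus whitespace), the function name and the sample markup are all
assumptions, not part of any of the specifications mentioned above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical repertoire: the Hebrew block (U+0590..U+05FF) plus whitespace.
ALLOWED = re.compile(r"[\u0590-\u05FF\s]")


def find_violations(xml_source: str):
    """Return (element tag, offending character) pairs for all text nodes."""
    root = ET.fromstring(xml_source)
    violations = []
    for elem in root.iter():
        # elem.text is the text before the first child; each child's tail is
        # the text that follows it, which belongs to elem's mixed content.
        chunks = [elem.text] + [child.tail for child in elem]
        for chunk in chunks:
            for ch in chunk or "":
                if not ALLOWED.fullmatch(ch):
                    violations.append((elem.tag, ch))
    return violations


sample = "<p>\u05e9\u05dc <em>\u05d5x</em> \u05ddy</p>"
print(find_violations(sample))
```

Each reported pair names the element whose content holds the character, much
as a Schematron assertion would report its context node.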

GPG-PGP: 2A528005
Freelance consulting and training.
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(ISO) RELAX NG   ISBN:0-596-00421-4 http://oreilly.com/catalog/relax
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema

This is a digitally signed message part

