Re: [xml-dev] Allowed characters for NCName

From: Rick Jelliffe <rjelliffe@allette.com.au>
To: Desmond Kirrane <desmond.kirrane@googlemail.com>
Date: Fri, 14 Dec 2007 01:27:27 +0000

XML allows "native language markup", which is where element and attribute names can by and large use the typical graphical characters used for words in any native language. So Chinese can have element names written entirely in ideographic characters and so on. (The only proviso is that spaces and apostrophes are not possible.)
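As a rough illustration of which strings qualify as a name without a colon, here is a minimal Python sketch. It is an assumption on my part to use the NameStartChar and NameChar ranges from XML 1.0 Fifth Edition; those ranges are broader than the Letter, Digit, CombiningChar and Extender tables that xs:NCName in XML Schema 1.0 refers to, so a schema validator may reject some names this sketch accepts. The names NCNAME and is_ncname are hypothetical helpers, not part of any library.

```python
import re

# Name start characters (XML 1.0 Fifth Edition NameStartChar, minus ':').
_START = (
    "A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# Characters allowed after the first one: the above plus '-', '.', digits,
# the middle dot, combining marks and two connector characters.
_REST = _START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

NCNAME = re.compile(f"[{_START}][{_REST}]*")


def is_ncname(text: str) -> bool:
    """Return True if the whole string matches the approximate NCName pattern."""
    return NCNAME.fullmatch(text) is not None


for candidate in ("\u0131d", "\u540d\u524d", "my name", "it's", "1abc"):
    print(repr(candidate), is_ncname(candidate))
```

The dotless i and the ideographic name pass; the names containing a space or an apostrophe, and the one starting with a digit, do not.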

This idea was adopted into XML following the ERCS principles of the CJK DOCP group, an ISO-liaison expert group made up from standards people, industry and academics from East Asia, in the mid 1990s (I wrote it.) After XML, the principles have been consolidated by W3C and Unicode in a joint technical report concerning characters suitable for use in markup.

So Turkish dotless i is certainly allowed as an XML name character. (In fact, it is also the main reason why XML is case-sensitive, IIRC: it means there is no nation-neutral case-mapping strategy for A-Z.)
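The case-mapping problem can be seen with a small Python sketch. It relies only on Python's built-in, locale-independent str.upper() and str.lower(); the helper name round_trips is hypothetical.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# The default Unicode mappings fold dotless i onto plain ASCII 'I' ...
print(DOTLESS_I.upper() == "I")

# ... but lowercasing 'I' gives dotted 'i', not dotless i, so the
# mapping cannot be undone without knowing the language of the text.
print("I".lower() == DOTLESS_I)


def round_trips(ch: str) -> bool:
    """True if upper-casing then lower-casing returns the original character."""
    return ch.upper().lower() == ch


print(round_trips("a"), round_trips(DOTLESS_I))
```

A Turkish reader expects the reverse pairing (I with ı, İ with i), so any case-insensitive name matching would have to pick one convention over the other.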

One possible reason there may be a complaint about that character is that you are using the wrong encoding declaration. Your document should be using UTF-8, or 8859-9 (or 8859-3, or CP1254 etc). Many character sets do not have enough redundant code-points to allow incorrect labeling to be detected (for example, between the 8859-n character sets). So the strict naming rules of XML 1.0 serve as a backup: when a character comes through that is not allowed as part of a name, it is a sign that there has been a bug or data corruption, and rejecting it prevents further infection.
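To make the mislabeling point concrete, here is a minimal Python sketch using only the standard codecs. The choice of ISO-8859-1 as the wrong label is my assumption for illustration; the original report does not say which declaration was in use.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# The bytes actually stored for the same character under each encoding.
for enc in ("utf-8", "iso-8859-9", "iso-8859-3", "cp1254"):
    print(f"{enc:<11}", DOTLESS_I.encode(enc).hex())

# Case 1: a UTF-8 file wrongly declared as ISO-8859-1.
# The two UTF-8 bytes turn into two characters, and the second one
# (U+00B1 PLUS-MINUS SIGN) is not a name character, so the parser complains.
utf8_bytes = DOTLESS_I.encode("utf-8")
mislabeled = utf8_bytes.decode("iso-8859-1")
print([f"U+{ord(c):04X}" for c in mislabeled])

# Case 2: an ISO-8859-9 file wrongly declared as ISO-8859-1.
# The single byte becomes U+00FD (y with acute), which is a letter,
# so this mistake passes silently with the wrong character.
silent = DOTLESS_I.encode("iso-8859-9").decode("iso-8859-1")
print(f"U+{ord(silent):04X}")
```

The first case is the kind of error the naming rules catch; the second is the kind they cannot, because the 8859-n sets have no redundancy to reveal the wrong label.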

When looking at character encoding, the golden rule is USE A HEX EDITOR. Don't open a file in some vanilla text editor unless you are really clear what encodings it reads, how it handles fonts, and what input mappings it may perform.
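If no hex editor is to hand, a few lines of Python give a similar byte-level view. This is only a sketch: hexdump is a hypothetical helper, and the sample element is made up for illustration rather than taken from the original document.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values and printable ASCII, one row per `width` bytes."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-ASCII and control bytes are shown as '.' so no font or
        # editor mapping can disguise them.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3}} {text_part}")
    return "\n".join(lines)


# Made-up sample: an attribute value starting with dotless i.
sample = '<item id="\u0131d"/>'

print(hexdump(sample.encode("utf-8")))       # dotless i stored as two bytes
print(hexdump(sample.encode("iso-8859-9")))  # dotless i stored as one byte
```

For a real file, read it in binary mode (open(path, "rb")) so that Python performs no decoding before the bytes are shown.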

Cheers
Rick Jelliffe

Desmond Kirrane wrote:

Hi,

In my xml docs I have an attribute of type xs:NCName.

When validating the xml against a schema the Turkish lower case i Character: ı is not allowed in the attribute.

From the XML Schema recommendation here http://www.w3.org/TR/xmlschema-2/#NCName
I know that:

NCName     ::= (Letter | '_') (NCNameChar)*
NCNameChar ::= Letter | Digit | '.' | '-' | '_' | CombiningChar | Extender

My questions are:
1. Is Letter = any letter in the English Alphabet (of any case)?
2. What are the CombiningChar(s)?
3. What are the Extender(s)?
4. Obviously Digit = numbers (0-9).

Follow-Ups:
- RE: [xml-dev] Allowed characters for NCName
  - From: "Michael Kay" <mike@saxonica.com>

References:
- Allowed characters for NCName
  - From: "Desmond Kirrane" <desmond.kirrane@googlemail.com>