XML.org OASIS Mailing List Archives

 


Re: [xml-dev] Allowed characters for NCName

XML allows "native language markup", which is where element and attribute names can by and large use the typical graphical characters used for words in any native language. So Chinese can have element names made up entirely of ideographic characters and so on. (The only proviso is that spaces and apostrophes are not possible.)
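
As a rough illustration of the rule above, here is a minimal Python sketch of an NCName check. It is an approximation under stated assumptions: it uses the NameStartChar and NameChar ranges from XML 1.0 (Fifth Edition) with the colon removed, not the longer Letter, CombiningChar and Extender tables of the earlier editions that XML Schema 1.0 refers to, so results can differ for some characters. The helper name `is_ncname` is made up for this example.

```python
import re

# NameStartChar ranges from XML 1.0 (Fifth Edition), minus ':' because an
# NCName is a "no-colon" name. Assumption: these ranges stand in for the
# older Letter | '_' production.
_START = (
    "A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)

# NameChar adds '-', '.', digits, the middle dot and some combining marks.
_REST = _START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

NCNAME_RE = re.compile(f"[{_START}][{_REST}]*")


def is_ncname(text: str) -> bool:
    """Return True if text matches the approximate NCName pattern above."""
    return NCNAME_RE.fullmatch(text) is not None


for candidate in ["title", "\u5143\u7d20", "ad\u0131", "it's", "two words", "1st"]:
    print(repr(candidate), is_ncname(candidate))
```

The ideographic name and the name containing the Turkish dotless i are accepted, while the names with an apostrophe, a space, or a leading digit are rejected.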

This idea was adopted into XML following the ERCS principles of the CJK DOCP group, an ISO-liaison expert group made up from standards people, industry and academics from East Asia, in the mid 1990s (I wrote it.) After XML, the principles have been consolidated by W3C and Unicode in a joint technical report concerning characters suitable for use in markup.

So Turkish dotless i is certainly allowed as an XML name character. (In fact, it is also the main reason why XML is case-sensitive, IIRC: it means there is no nation-neutral case-mapping strategy for A-Z.)
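
The case-mapping problem can be seen with Python's built-in string methods, which apply the default, locale-independent Unicode case mappings. This is only an illustrative sketch; Python is not part of the original discussion.

```python
# str.upper() and str.lower() use the default Unicode case mappings,
# which do not follow Turkish rules.
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# Both lower-case letters map to the same plain capital I.
upper_of_dotless = DOTLESS_I.upper()
upper_of_dotted = "i".upper()

# Lower-casing the capital I can therefore return only one of them.
lower_of_capital = "I".lower()

# So the dotless i does not survive an upper/lower round trip.
round_trip_ok = DOTLESS_I.upper().lower() == DOTLESS_I

print(upper_of_dotless, upper_of_dotted, lower_of_capital, round_trip_ok)
```

A Turkish reader would expect the capital I to lower-case to the dotless form, while an English reader expects the dotted form, so a case-insensitive comparison of names would have to pick one convention over the other.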

One possible reason there may be a complaint about that character is if you are using the wrong encoding declaration. Your document should be using UTF-8, or 8859-9 (or 8859-3, or CP1254 etc). Many character sets do not have enough redundant code-points to allow incorrect labeling to be determined (for example, between the 8859-n character sets). So the strict naming rules of XML 1.0 serve as a backup to detect when code comes through that is not allowed as part of a name: it is a sign that there has been a bug or data corruption and prevents further infection.
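
The effect of a wrong encoding label can be sketched in Python as follows. The assumptions here are that the document is mislabeled as ISO 8859-1, and that `str.isalpha()` is an acceptable rough proxy for "could be a letter in a name"; it is not the real XML name rule, and `looks_like_letters` is a made-up helper.

```python
# The Turkish dotless i (U+0131) as bytes in three encodings.
DOTLESS_I = "\u0131"

utf8_bytes = DOTLESS_I.encode("utf-8")        # two bytes: c4 b1
latin3_bytes = DOTLESS_I.encode("iso8859_3")  # one byte: b9
latin5_bytes = DOTLESS_I.encode("iso8859_9")  # one byte: fd


def looks_like_letters(text: str) -> bool:
    """Rough proxy only: True if every character is alphabetic."""
    return all(ch.isalpha() for ch in text)


# Mislabeling UTF-8 or ISO 8859-3 data as ISO 8859-1 produces symbols
# that are not letters, so a strict name check can flag the problem.
bad_from_utf8 = utf8_bytes.decode("latin-1")      # capital A-diaeresis + plus-minus sign
bad_from_latin3 = latin3_bytes.decode("latin-1")  # superscript one

# Mislabeling ISO 8859-9 data as ISO 8859-1 yields another letter
# (y with acute), so this particular mix-up goes undetected.
silent = latin5_bytes.decode("latin-1")

for label, text in [("utf-8 as latin-1", bad_from_utf8),
                    ("8859-3 as latin-1", bad_from_latin3),
                    ("8859-9 as latin-1", silent)]:
    print(label, [hex(ord(ch)) for ch in text], looks_like_letters(text))
```

The first two cases show the name rules acting as a backup check; the third shows why the 8859-n family cannot always be told apart from the bytes alone.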

When looking at character encoding, the golden rule is USE A HEX EDITOR. Don't open a file in some vanilla text editor unless you are really clear what encodings it reads, how it handles fonts, and what input mappings it may perform.
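
If no hex editor is at hand, a few lines of Python can show the raw bytes. This is a minimal sketch; the `hex_dump` helper and the sample element are made up for illustration.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable ASCII."""
    pad = width * 3 - 1  # width of a full row of hex values
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk).ljust(pad)
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part}  {ascii_part}")
    return "\n".join(lines)


# Hypothetical sample: an attribute value containing the dotless i, as UTF-8.
sample = '<a id="\u0131"/>'.encode("utf-8")
print(hex_dump(sample))
```

For a real file, read it in binary mode, for example `open(path, "rb").read(64)` with your own `path`, and pass the result to `hex_dump`. The bytes `c4 b1` indicate UTF-8, a single `fd` suggests 8859-9 or CP1254, and a single `b9` suggests 8859-3.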

Cheers
Rick Jelliffe


Desmond Kirrane wrote:
Hi,

In my XML docs I have an attribute of type xs:NCName.

When validating the XML against a schema the Turkish lower case i Character: ı is not allowed in the attribute.

From the XML Schema recommendation here http://www.w3.org/TR/xmlschema-2/#NCName
I know that:

NCName ::= (Letter | '_') (NCNameChar)*
NCNameChar ::= Letter | Digit | '.' | '-' | '_' | CombiningChar | Extender

My questions are:
1. Is Letter = any letter in the English Alphabet (of any case)?
2. What are the CombiningChar(s)?
3. What are the Extender(s)?
4. Obviously Digit = numbers (0-9).


