- To: xml-dev@lists.xml.org
- Subject: Re: Unicode and attribute URI values?
- From: "Felix Sasaki" <fsasaki@w3.org>
- Date: Fri, 16 Sep 2005 15:58:18 +0900
- Organization: W3C
- User-agent: Opera M2/8.0 (Win32, build 7561)
Hi,
This is a reply to the following message:
> As part of designing a digital publication open standard (OpenReader),
> we're now discussing the issue of allowed characters within URI
> attribute values in UTF-8 encoded XML documents.
> Reading XML 1.0 and RFC 3986, it is not at all clear (at least to me)
> what is allowed, or how much leeway exists. Specifically, when the
> attribute URI value includes non-ASCII characters (e.g., Greek
> characters), must these non-ASCII characters be percent-encoded in the
> attribute value (effectively "ascii-zing" the attribute value), or can
> the characters be kept natively encoded in the attribute value per the
> text encoding of the document?
> I guess this issue comes under the moniker "International URIs".
> Thanks.
> Jon Noring
Do you know RFC 3987, "Internationalized Resource Identifiers (IRIs)"?
It may address many of your problems:
http://www.ietf.org/rfc/rfc3987.txt
Section 6.3 of RFC 3987 says:
Document formats that transport URIs may have to be upgraded to allow
the transport of IRIs. In cases where the document as a whole has a
native character encoding, IRIs MUST also be encoded in this
character encoding and converted accordingly by a parser or
interpreter. IRI characters not expressible in the native character
encoding SHOULD be escaped by using the escaping conventions of the
document format if such conventions are available. Alternatively,
they MAY be percent-encoded according to section 3.1. For example, in
HTML or XML, numeric character references SHOULD be used. If a
document as a whole has a native character encoding and that
character encoding is not UTF-8, then IRIs MUST NOT be placed into
the document in the UTF-8 character encoding.
Note: Some formats already accommodate IRIs, although they use
different terminology. HTML 4.0 [HTML4] defines the conversion from
IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink
[XLink], XML Schema [XMLSchema], and specifications based upon them
allow IRIs. Also, it is expected that all relevant new W3C formats
and protocols will be required to handle IRIs [CharMod].
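To illustrate the point about numeric character references, here is a minimal Python sketch. The element name "link", the attribute name "href", and the example.org address are made up for illustration; they come from neither OpenReader nor the RFC. It only shows that an XML parser hands the application the same attribute value whether the Greek characters are written natively in UTF-8 or as numeric character references.

```python
import xml.etree.ElementTree as ET

# Hypothetical element/attribute names and address, for illustration only.
# Variant 1: Greek letters written natively (document encoded as UTF-8).
native_doc = '<link href="http://example.org/\u03b1\u03b2\u03b3"/>'.encode("utf-8")

# Variant 2: the same letters written as numeric character references,
# which works even when the document encoding cannot express them.
escaped_doc = b'<link href="http://example.org/&#x3B1;&#x3B2;&#x3B3;"/>'

# The parser decodes the bytes and resolves the character references,
# so the application sees identical attribute values.
href_native = ET.fromstring(native_doc).get("href")
href_escaped = ET.fromstring(escaped_doc).get("href")

print(href_native == href_escaped)
```

Whether the resulting value is then accepted as an IRI is still up to the specification of the XML application in question.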
So, to answer your question ("it is not at all clear (at least to me)
what is allowed, or how much leeway exists"): what is allowed, and
whether e.g. escaping is necessary, depends on the specific XML
application. Some applications rely on the escaping rules described
in section 3.1 of RFC 3987.
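As a rough sketch of the core of the section 3.1 mapping, the following Python function converts each non-ASCII character to UTF-8 and percent-encodes the resulting octets. The function name iri_to_uri is my own, and the sketch is simplified: it assumes the IRI is already a sequence of Unicode characters, it skips the normalization step the RFC describes for IRIs coming from non-Unicode sources, and it does not give the host part any special treatment.

```python
def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping (hypothetical helper, not a full
    implementation of RFC 3987 section 3.1)."""
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters, including existing %XX escapes, are kept as-is.
            pieces.append(ch)
        else:
            # Non-ASCII character: encode as UTF-8, then percent-encode
            # each octet with uppercase hexadecimal digits.
            pieces.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(pieces)


# Example with Greek letters alpha, beta, gamma in the path.
print(iri_to_uri("http://example.org/\u03b1\u03b2\u03b3"))
# prints http://example.org/%CE%B1%CE%B2%CE%B3
```

An application that only accepts URIs could apply such a mapping after reading the attribute value, while the document itself keeps the characters in its native encoding.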
Hope that helps. Best,
Felix