   Re: Unicode and attribute URI values?

  • To: xml-dev@lists.xml.org
  • Subject: Re: Unicode and attribute URI values?
  • From: "Felix Sasaki" <fsasaki@w3.org>
  • Date: Fri, 16 Sep 2005 15:58:18 +0900
  • Organization: W3C
  • User-agent: Opera M2/8.0 (Win32, build 7561)

Hi,

This is a reply to the following message:

> As part of designing a digital publication open standard (OpenReader),
> we're now discussing the issue of allowed characters within URI
> attribute values in UTF-8 encoded XML documents.

> Reading XML 1.0 and RFC 3986, it is not at all clear (at least to me)
> what is allowed, or how much leeway exists. Specifically, when the
> attribute URI value includes non-ASCII characters (e.g., Greek
> characters), must these non-ASCII characters be percent-encoded in the
> attribute value (effectively "ascii-zing" the attribute value), or can
> the characters be kept natively encoded in the attribute value per the
> text encoding of the document?

> I guess this issue comes under the moniker "International URIs".

> Thanks.

> Jon Noring

Do you know RFC 3987? It defines "Internationalized Resource
Identifiers" (IRIs) and may address many of your problems.

http://www.ietf.org/rfc/rfc3987.txt

Section 6.3 of RFC 3987 says:

    Document formats that transport URIs may have to be upgraded to allow
    the transport of IRIs.  In cases where the document as a whole has a
    native character encoding, IRIs MUST also be encoded in this
    character encoding and converted accordingly by a parser or
    interpreter.  IRI characters not expressible in the native character
    encoding SHOULD be escaped by using the escaping conventions of the
    document format if such conventions are available. Alternatively,
    they MAY be percent-encoded according to section 3.1. For example, in
    HTML or XML, numeric character references SHOULD be used.  If a
    document as a whole has a native character encoding and that
    character encoding is not UTF-8, then IRIs MUST NOT be placed into
    the document in the UTF-8 character encoding.

    Note: Some formats already accommodate IRIs, although they use
    different terminology.  HTML 4.0 [HTML4] defines the conversion from
    IRIs to URIs as error-avoiding behavior.  XML 1.0 [XML1], XLink
    [XLink], XML Schema [XMLSchema], and specifications based upon them
    allow IRIs.  Also, it is expected that all relevant new W3C formats
    and protocols will be required to handle IRIs [CharMod].
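
To make the numeric character reference case concrete, here is a minimal
Python sketch. The IRI and the helper name attribute_bytes are made up for
illustration, and the sketch only shows how characters that the document
encoding cannot express end up as references; it does not escape "&", "<"
or quotes, which a real XML serializer would also have to handle.

```python
# Hypothetical IRI containing Greek letters (alpha, beta, gamma).
iri = "http://example.org/αβγ.xml"


def attribute_bytes(value: str, encoding: str) -> bytes:
    """Encode an attribute value in the document's character encoding.

    Characters the encoding cannot express are replaced by decimal
    numeric character references (&#NNN;) via Python's built-in
    "xmlcharrefreplace" error handler.
    """
    return value.encode(encoding, errors="xmlcharrefreplace")


# ISO-8859-1 has no Greek letters, so they become character references.
print(attribute_bytes(iri, "iso-8859-1"))

# UTF-8 can express them, so they stay natively encoded.
print(attribute_bytes(iri, "utf-8"))
```

In a UTF-8 document nothing needs escaping for encoding reasons; the
references only appear when the document encoding lacks the character.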

So, to answer your question about what is allowed and how much leeway
exists: it depends on the specific XML application, including whether
any escaping is necessary at all. Some applications rely on the escaping
rules described in section 3.1 of RFC 3987.
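
As a rough illustration of the percent-encoding step in section 3.1, here
is a simplified Python sketch. The function name iri_to_uri and the example
IRI are my own assumptions. The sketch percent-encodes every non-ASCII
character as its UTF-8 octets; it does not perform the normalization step,
does not check that the characters are actually permitted in an IRI, and
does not treat the host name specially.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 octets.

    Simplified sketch of the mapping in RFC 3987 section 3.1:
    ASCII characters are copied unchanged (existing percent-escapes
    are therefore left alone); everything else is converted to UTF-8
    and each resulting octet is written as %HH.
    """
    parts = []
    for ch in iri:
        if ord(ch) < 0x80:
            parts.append(ch)
        else:
            parts.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(parts)


# Hypothetical IRI with Greek letters alpha, beta, gamma.
print(iri_to_uri("http://example.org/αβγ.xml"))
# prints http://example.org/%CE%B1%CE%B2%CE%B3.xml
```

An application that accepts IRIs in attribute values would apply a mapping
like this only when it needs a plain URI, for example before dereferencing.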

Hope that helps. Best,

Felix




 
