OASIS Mailing List Archives

URI references and UTF-8 based escaping

Section 4.2.2 of the XML 1.0 Recommendation (2nd Ed.) states that:

1. a SystemLiteral is a URI reference
2. a URI reference is defined by RFCs 2396 and 2732

It then goes on to provide informative guidance about URI syntax, and
mentions UTF-8 based escaping of non-ASCII characters.

However, RFCs 2396 and 2732 do not mandate UTF-8 based escaping. In fact,
the decision about how to handle non-ASCII characters and how to communicate
that information is left to the scheme specifications. (ref: RFC 2396 sec
2.1, toward the end of that section).

For example, to find out how to handle non-ASCII characters in URIs that use
the http: scheme, you would consult the HTTP specification. The URN spec
mandates UTF-8 for the urn: scheme, but this does not apply to URIs in
general. The HTTP spec does not address the issue at all, nor does HTML. Consequently,
you'll find URL-encoding that is based on non-Unicode encodings,
particularly in submissions of HTML form data from the major browsers. 
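
To make the difference concrete, here is a minimal Python sketch (my
choice of language, not anything the specs prescribe) that escapes the
same name under two encodings. The variable names are illustrative only.

```python
from urllib.parse import quote

name = "César"

# Escaping based on the UTF-8 bytes of the string
utf8_escaped = quote(name, encoding="utf-8")

# Escaping based on the iso-8859-1 bytes, as a non-Unicode form
# submission might produce
latin1_escaped = quote(name, encoding="iso-8859-1")

print(utf8_escaped)    # C%C3%A9sar
print(latin1_escaped)  # C%E9sar
```

Both results are legal escaped sequences; nothing in the escaped text
itself says which encoding was used.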

XML 1.0 (2nd Ed.) Erratum E4 says:

   Replace the last sentence of the paragraph beginning
   with "URI references require encoding and escaping of
   certain characters." with the following: "The XML
   processor must escape disallowed characters as follows:"

This clarifies that UTF-8 based escaping is required for the processing of
SystemLiterals by XML parsers, and thus a SystemLiteral is a URI reference
that always uses UTF-8 based escaping, rather than what the appropriate
scheme spec may mandate or implicitly allow.
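
As a rough sketch of what that escaping amounts to for non-ASCII
characters, assuming Python: the helper name escape_non_ascii is
hypothetical, and it deliberately handles only non-ASCII characters. The
Recommendation's full list of disallowed characters is not reproduced
here.

```python
def escape_non_ascii(literal: str) -> str:
    """Replace each non-ASCII character with the %HH escapes of its UTF-8 bytes."""
    out = []
    for ch in literal:
        if ord(ch) < 128:
            # Simplification: every ASCII character is passed through untouched
            out.append(ch)
        else:
            # Convert the character to UTF-8, then escape each byte as %HH
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)

print(escape_non_ascii("César"))  # C%C3%A9sar
```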

Here is a scenario that illustrates how the assumption of UTF-8 based
escaping could conflict with the URI spec's deference to the scheme specs:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE mydoc [
   <!ELEMENT mydoc (#PCDATA)>
   <!ENTITY greeting SYSTEM

The name César is represented here as C%C3%A9sar in the UTF-8 based
escaping. But the getgreeting resource at http://somewhere/getgreeting is
iso-8859-1 centric (as it is allowed to be) and is expecting to be able to
interpret the escaped characters as iso-8859-1, not UTF-8 (since HTTP
doesn't care). It returns an entity containing a localized greeting phrase,
having interpreted the %C3%A9 as U+00C3 U+00A9:

<?xml version="1.0" encoding="iso-8859-1"?>
¡Hola, CÃ©sar!

...and thus you end up with the contents of the mydoc element having César's
name misspelled.
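
The mismatch is easy to reproduce. This is only a sketch in Python of the
decoding step, not a claim about how any particular server is written; it
unescapes the same C%C3%A9sar sequence under both assumptions.

```python
from urllib.parse import unquote

escaped = "C%C3%A9sar"

# What the XML processor intended: the bytes C3 A9 read as UTF-8
as_utf8 = unquote(escaped, encoding="utf-8")

# What an iso-8859-1 centric resource sees: two separate characters,
# U+00C3 and U+00A9
as_latin1 = unquote(escaped, encoding="iso-8859-1")

print(as_utf8)    # César
print(as_latin1)  # CÃ©sar
```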

In practice, I don't think it's a major issue, but it's something to be
aware of. As always, please tell me if I'm full of crap. Thanks.

   - Mike
Mike J. Brown, software engineer at            My XML/XSL resources: 
webb.net in Denver, Colorado, USA              http://skew.org/xml/