URI references and UTF-8 based escaping
- From: Mike Brown <firstname.lastname@example.org>
- To: "'email@example.com'" <firstname.lastname@example.org>
- Date: Thu, 11 Jan 2001 13:22:00 -0700
Section 4.2.2 of the XML 1.0 Recommendation (2nd Ed.) states that:
1. a SystemLiteral is a URI reference
2. a URI reference is defined by RFCs 2396 and 2732
It then goes on to provide informative notes about URI syntax, mentioning
UTF-8 based escaping of non-ASCII characters.
However, RFCs 2396 and 2732 do not mandate UTF-8 based escaping. In fact,
the decision about how to handle non-ASCII characters and how to communicate
that information is left to the scheme specifications. (ref: RFC 2396 sec
2.1, toward the end of that section).
For example, to find out how to handle non-ASCII characters in URIs that use
the http: scheme, one would consult the HTTP specification. The URN spec
mandates UTF-8 for the urn: scheme, but this is not applicable to URIs in
general. The HTTP spec does not address the issue at all, nor does HTML.
Consequently, you'll find URL-encoding that is based on non-Unicode encodings,
particularly in submissions of HTML form data from the major browsers.
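To make the ambiguity concrete, here is a minimal Python sketch (my choice of language, not anything the specs prescribe) that percent-encodes the same string under two different character encodings. The variable names are purely illustrative.

```python
from urllib.parse import quote

name = "C\u00e9sar"  # "César"; U+00E9 is the only non-ASCII character

# The same character yields different escape sequences depending on
# which character encoding is applied before percent-encoding.
utf8_escaped = quote(name, encoding="utf-8")
latin1_escaped = quote(name, encoding="iso-8859-1")

print(utf8_escaped)    # C%C3%A9sar
print(latin1_escaped)  # C%E9sar
```

Nothing in either escaped form says which encoding produced it, so the recipient has to know or guess.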
XML 1.0 (2nd Ed.) Erratum E4 says:
Replace the last sentence of the paragraph beginning
with "URI references require encoding and escaping of
certain characters." with the following: "The XML
processor must escape disallowed characters as follows:"
This clarifies that UTF-8 based escaping is required when XML parsers process
SystemLiterals. A SystemLiteral is thus a URI reference that always uses
UTF-8 based escaping, regardless of what the applicable scheme spec may
mandate or implicitly allow.
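As a rough sketch of the escaping that section 4.2.2 describes (convert each disallowed character to its UTF-8 bytes, then escape each byte as %HH), something like the following Python would do. The function name escape_system_literal is hypothetical, and this simplified version treats only non-ASCII characters as disallowed; the Recommendation's full set also includes certain ASCII characters excluded by RFC 2396.

```python
def escape_system_literal(literal: str) -> str:
    """Escape non-ASCII characters as %HH sequences of their UTF-8 bytes.

    Simplified illustration: ASCII characters are passed through
    unchanged, even those the full rules would also escape.
    """
    out = []
    for ch in literal:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # One %HH triplet per UTF-8 byte of the character.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(escape_system_literal("C\u00e9sar"))  # C%C3%A9sar
```

Note that the scheme of the URI never enters into it: the escaping is always UTF-8 based.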
Here is a scenario that illustrates how the assumption of UTF-8 based
escaping could conflict with the URI spec's deference to the scheme specs:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE mydoc [
<!ELEMENT mydoc (#PCDATA)>
<!ENTITY greeting SYSTEM
The name César is represented here as C%C3%A9sar in the UTF-8 based
escaping. But the getgreeting resource at http://somewhere/getgreeting is
iso-8859-1 centric (as it is allowed to be) and is expecting to be able to
interpret the escaped characters as iso-8859-1, not UTF-8 (since HTTP
doesn't care). It returns an entity containing a localized greeting phrase,
having interpreted the %C3%A9 as U+00C3 U+00A9:
<?xml version="1.0" encoding="iso-8859-1"?>
...and thus you end up with the contents of the mydoc element having César's
name garbled as CÃ©sar.
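The mismatch in this scenario can be reproduced with a short Python sketch. It only simulates the decoding step; the variable names are illustrative and no real getgreeting service is involved.

```python
from urllib.parse import unquote

escaped = "C%C3%A9sar"  # UTF-8 based escaping of "César"

# What the XML processor intended: bytes C3 A9 decoded as UTF-8.
as_utf8 = unquote(escaped, encoding="utf-8")

# What an iso-8859-1 centric resource sees: each byte is one character.
as_latin1 = unquote(escaped, encoding="iso-8859-1")

print(as_utf8)                                  # César
print([f"U+{ord(c):04X}" for c in as_latin1[1:3]])  # ['U+00C3', 'U+00A9']
```

The second result is a six-character string in which the single é has become the two characters U+00C3 and U+00A9.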
In practice, I don't think it's a major issue, but it's something to be
aware of. As always, please tell me if I'm full of crap. Thanks.
Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
My XML/XSL resources: http://skew.org/xml/