Re: [xml-dev] Doesn't the list of allowable characters shown in the XML specification assume a Unicode character encoding scheme? What if the XML isn't using Unicode?
- From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
- To: Roger L Costello <costello@mitre.org>,"xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Thu, 15 Apr 2021 10:04:52 -0400
At 2021-04-15 12:51 +0000, Roger L Costello wrote:
> The XML specification says that these are the codepoints for the
> characters that are allowed in XML documents:
Not quite.
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
>          [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>
> But, but, but, ....
>
> Doesn't that list of codepoints assume the XML documents are encoded
> using a Unicode character encoding scheme?
The specification says that a "parsed entity contains text, a
sequence of characters", and "a character is an atomic unit of text
as specified by ISO/IEC 10646. Legal characters are tab, carriage
return, line feed, and legal characters of Unicode and ISO/IEC 10646".
https://www.w3.org/TR/xml/#charsets
Separately, 4.3.3 states "In the document entity, the encoding
declaration is part of the XML declaration".
https://www.w3.org/TR/xml/#charencoding
> What if the XML documents aren't encoded using a Unicode character
> encoding scheme, then what are the allowable characters?
The encoding of the document entity is independent of the repertoire
of allowable characters. If the document entity expresses a character
that is not in the list of allowable characters, then the document is
not well-formed.
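
As a rough illustration of that separation, here is a minimal Python sketch that first decodes the bytes with the declared encoding and only then tests each resulting character against the Char production. The helper names is_xml_char and first_illegal_char are hypothetical, and cp037 is just one EBCDIC codec that Python happens to ship; a real XML processor does considerably more than this.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_illegal_char(raw: bytes, encoding: str):
    """Decode raw with the given codec, then return (index, char) for the
    first character outside Char, or None if every character is allowed."""
    text = raw.decode(encoding)  # bytes -> Unicode characters
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index, ch
    return None
```

The check never looks at byte values: first_illegal_char(b"\x05", "cp037") returns None, because the EBCDIC byte 05 decodes to the tab character, which Char allows.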
> For example, in Unicode the codepoint #x9 corresponds to the
> "horizontal tab" character but in EBCDIC hex 9 corresponds to the
> "begin superscript" character. Is the XML specification saying that
> an XML document using EBCDIC can use the invisible "begin
> superscript" character but not the "horizontal tab" character? Or
> is it saying that I am expected, when using a character encoding
> scheme other than Unicode, to convert the above list of Unicode
> codepoints to the corresponding characters in the non-Unicode
> character encoding scheme? For example, in EBCDIC the "horizontal
> tab" character is 5.
Neither. The specification is saying that a document entity has an
encoding that is independent of the definition of the text allowed in
XML parsed entities. To get the character you want in XML (as defined
by Unicode) use the encoding you need in your document (as defined by
the XML Declaration).
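
To make that concrete, here is a small Python sketch that writes the same document text out as EBCDIC bytes and reads it back. The sample document and the "IBM037" label in its declaration are illustrative assumptions, and cp037 is the name of the Python codec used here; other EBCDIC code pages map bytes differently.

```python
# Illustrative document text, expressed as Unicode characters.
text = '<?xml version="1.0" encoding="IBM037"?>\n<a>\tb</a>'

# Serialize the characters as EBCDIC bytes with Python's cp037 codec.
raw = text.encode("cp037")

# The tab character is stored as a different byte value in EBCDIC ...
tab_byte = "\t".encode("cp037")

# ... but decoding with the same codec yields the same Unicode characters.
decoded = raw.decode("cp037")
tab_codepoint = ord(decoded[decoded.index("\t")])
```

The byte on disk for the tab is hex 05, yet the character the XML specification talks about is still the one at codepoint #x9.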
If you try to say it using your own words, you may end up confusing
the reader. I suggest you cite the specification.
I hope this is helpful.
. . . . . Ken
> /Roger
--
Contact info, blog, articles, etc. http://www.CraneSoftwrights.com/x/ |
Check our site for free XML, XSLT, XSL-FO and UBL developer resources |
Streaming hands-on XSLT/XPath 2 training class @US$125 (5 hours free) |
Essays (UBL, XML, etc.) http://www.linkedin.com/today/author/gkholman |