Re: [xml-dev] Doesn't the list of allowable characters shown in the XML specification assume a Unicode character encoding scheme? What if the XML isn't using Unicode?
- From: Amelia A Lewis <amyzing@talsever.com>
- To: Roger L Costello <costello@mitre.org>
- Date: Thu, 15 Apr 2021 09:18:29 -0400
Hey, Roger,
XML is a stream of XML characters (per the spec) or, to be more
precise, of Unicode codepoints. So there is no such thing as an XML
document, post-parse, that is anything other than a stream or array of
Unicode codepoints. A parser that accepts (one of) the EBCDIC
encoding(s) as input converts the EBCDIC input to Unicode: either
really, if it's running on a machine that uses a different codeset, or
theoretically, to conform to the spec. Likewise, output is just
serialization of that content (either actual Unicode, or a
platform-specific charset mapped to Unicode) to whatever the
(supported) target encoding is.
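To make the decode-then-serialize idea concrete, here is a minimal
sketch in Python. It is only an illustration, not how any particular
parser does it. It assumes that Python's "cp037" codec (EBCDIC
US/Canada) stands in for "one of the EBCDIC encodings"; the helper name
transcode is made up for this example.

```python
# Assumption: cp037 (EBCDIC US/Canada) stands in for "an EBCDIC encoding".
EBCDIC = "cp037"


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Hypothetical helper: decode bytes to Unicode, then re-serialize."""
    text = data.decode(source)   # input bytes -> Unicode codepoints
    return text.encode(target)   # Unicode codepoints -> output bytes


# The same document, serialized two ways.
doc = "<note>\tindented</note>"
ebcdic_bytes = doc.encode(EBCDIC)
utf8_bytes = transcode(ebcdic_bytes, EBCDIC, "utf-8")

# The byte sequences differ, but both decode to the same codepoints.
print(ebcdic_bytes != utf8_bytes)
print(ebcdic_bytes.decode(EBCDIC) == utf8_bytes.decode("utf-8"))
```

The bytes on disk are an encoding detail; what the XML rules talk about
is the str in the middle.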
But it's all defined as Unicode, so before you can reason about XML,
you have to turn the (presumably serialized) stream of not-Unicode
characters into Unicode (or you can have a platform-native XML tool, in
some cases, but it conceptually operates over Unicode codepoints, if
it's an XML tool).
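As a sketch of what "reason about it as Unicode" means in practice, the
Char production from XML 1.0 (quoted below) can be written as a check
on codepoints, applied only after decoding. Again this assumes Python's
"cp037" codec as the example EBCDIC encoding; is_xml_char and
first_illegal are names invented for this illustration.

```python
# Codepoint ranges of the XML 1.0 Char production.
XML_CHAR_RANGES = (
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
)


def is_xml_char(ch: str) -> bool:
    """True if the Unicode codepoint of ch matches Char."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML_CHAR_RANGES)


def first_illegal(data: bytes, encoding: str):
    """Decode first, then test codepoints; never test the raw bytes.

    Returns (index, codepoint) of the first disallowed character,
    or None if every character is allowed.
    """
    text = data.decode(encoding)
    for i, ch in enumerate(text):
        if not is_xml_char(ch):
            return i, ord(ch)
    return None


# EBCDIC (cp037) byte 0x05 decodes to U+0009, horizontal tab: allowed.
print(first_illegal(b"\x05", "cp037"))
# A vertical tab (U+000B) is not in Char, whatever bytes carried it.
print(first_illegal("a\x0bb".encode("cp037"), "cp037"))
```

The check never looks at the EBCDIC byte values themselves, which is
the point: the list in the spec is a list of codepoints, not of bytes.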
Amy!
On Thu, 15 Apr 2021 12:51:38 +0000, Roger L Costello wrote:
> Hi Folks,
>
> The XML specification says that these are the codepoints for the
> characters that are allowed in XML documents:
>
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]
>
> But, but, but, ....
>
> Doesn't that list of codepoints assume the XML documents are encoded
> using a Unicode character encoding scheme?
It's not an assumption, it's a requirement.
> What if the XML documents aren't encoded using a Unicode character
> encoding scheme, then what are the allowable characters?
>
> For example, in Unicode the codepoint #x9 corresponds to the
> "horizontal tab" character but in EBCDIC hex 9 corresponds to the
> "begin superscript" character. Is the XML specification saying that
> an XML document using EBCDIC can use the invisible "begin
> superscript" character but not the "horizontal tab" character? Or, is
> it saying that I am expected, when using a character encoding scheme
> other than Unicode, to convert the above list of Unicode codepoints
> to the corresponding characters in the non-Unicode character encoding
> scheme? For example, in EBCDIC the "horizontal tab" character is 5.