Re: [xml-dev] Doesn't the list of allowable characters shown in the XML specification assume a Unicode character encoding scheme? What if the XML isn't using Unicode?

From: Amelia A Lewis <amyzing@talsever.com>
To: Roger L Costello <costello@mitre.org>
Date: Thu, 15 Apr 2021 09:18:29 -0400

Hey, Roger,

XML is a stream of XML characters (per the spec) or, to be more 
precise, Unicode code points. So, post-parse, there is no such thing 
as an XML document that is anything other than a stream or array of 
Unicode code points. A parser that accepts (one of) the EBCDIC 
encoding(s) as input converts that input to Unicode: either really, if 
it's running on a machine that uses a different codeset, or notionally, 
to conform to the spec. Likewise, output is just serialization of those 
code points (whether actual Unicode or a platform-specific charset 
mapped to Unicode) to whatever the (supported) target encoding is.
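
A rough sketch of that decode/serialize round trip, in Python. The 
function names and the sample document are made up for illustration, 
and cp037 is just one of several EBCDIC code pages; a real parser does 
a lot more than this.

```python
def parse_input(data: bytes, encoding: str) -> list[int]:
    """Decode serialized bytes into Unicode code points (hypothetical helper)."""
    return [ord(ch) for ch in data.decode(encoding)]


def serialize(code_points: list[int], encoding: str) -> bytes:
    """Encode Unicode code points into the target encoding (hypothetical helper)."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)


# Hypothetical sample document containing a horizontal tab.
sample = "<a>\tb</a>"

# Pretend this arrived from an EBCDIC system (code page 037).
ebcdic_bytes = sample.encode("cp037")

# Input side: bytes in some encoding become Unicode code points.
code_points = parse_input(ebcdic_bytes, "cp037")

# Output side: the same code points can be written in any supported encoding.
utf8_bytes = serialize(code_points, "utf-8")
```

The bytes differ between the two encodings, but the code points in the 
middle are the same, and those are what the spec talks about.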

But it's all defined in terms of Unicode, so before you can reason 
about XML, you have to turn the (presumably serialized) stream of 
non-Unicode characters into Unicode. In some cases you can have a 
platform-native XML tool, but if it's an XML tool, it conceptually 
operates over Unicode code points.

Amy!

On Thu, 15 Apr 2021 12:51:38 +0000, Roger L Costello wrote:
> Hi Folks,
> 
> The XML specification says that these are the codepoints for the 
> characters that are allowed in XML documents:
> 
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> 
> But, but, but, ....
> 
> Doesn't that list of codepoints assume the XML documents are encoded 
> using a Unicode character encoding scheme? 

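It's not an assumption, it's a requirement: the allowed characters are 
defined as Unicode code points, whatever encoding the serialized bytes 
happen to use.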

> What if the XML documents aren't encoded using a Unicode character 
> encoding scheme, then what are the allowable characters? 
> 
> For example, in Unicode the codepoint #x9 corresponds to the 
> "horizontal tab" character but in EBCDIC hex 9 corresponds to the 
> "begin superscript" character. Is the XML specification saying that 
> an XML document using EBCDIC can use the invisible "begin 
> superscript" character but not the "horizontal tab" character? Or, is 
> it saying that I am expected, when using a character encoding scheme 
> other than Unicode, to convert the above list of Unicode codepoints 
> to the corresponding characters in the non-Unicode character encoding 
> scheme? For example, in EBCDIC the "horizontal tab" character is 5.
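
To make the tab example concrete, here is a small sketch that applies 
the Char production to code points rather than to raw bytes. The helper 
names are hypothetical, and it assumes Python's cp037 codec as the 
EBCDIC variant in question; other EBCDIC code pages may map bytes 
differently.

```python
def is_xml_char(cp: int) -> bool:
    """True if the Unicode code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def ebcdic_byte_allowed(byte_value: int, encoding: str = "cp037") -> bool:
    """Decode one byte to Unicode first, then test the resulting code point."""
    ch = bytes([byte_value]).decode(encoding)
    return is_xml_char(ord(ch))


# EBCDIC byte 0x05 is the horizontal tab; it decodes to U+0009.
tab_from_ebcdic = bytes([0x05]).decode("cp037")
tab_is_allowed = ebcdic_byte_allowed(0x05)
```

So the question to ask of an EBCDIC byte is never "is this byte value 
in the list?" but "is the code point it decodes to in the list?".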

