XML.org: OASIS Mailing List Archives
Re: [xml-dev] Doesn't the list of allowable characters shown in the XML specification assume a Unicode character encoding scheme? What if the XML isn't using Unicode?

Hey, Roger,

XML is a stream of XML characters (per the spec), or, to be more 
precise, of Unicode codepoints. So, post-parse, there is no such thing 
as an XML document that is anything other than a stream or array of 
Unicode codepoints. A parser that accepts one of the EBCDIC encodings 
as input converts that input to Unicode (either really, if it's 
running on a machine that uses a different codeset, or notionally, to 
conform to the spec). Likewise, output is just serialization of those 
codepoints (held either as actual Unicode or as a platform-specific 
charset mapped to Unicode) to whatever the supported target encoding 
is.
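
As a rough illustration of that round trip, here is a small Python 
sketch. It assumes Python's cp037 codec as a stand-in for "EBCDIC" 
(there are many EBCDIC variants, and this is only one of them); the 
sample document is made up for the example.

```python
# Assumption: cp037 is used here as one representative EBCDIC encoding.
doc = "<a>\tx</a>"                    # the abstract document: Unicode codepoints

ebcdic_bytes = doc.encode("cp037")    # serialized to an EBCDIC encoding
utf8_bytes = doc.encode("utf-8")      # serialized to a Unicode encoding

# The serialized bytes differ...
assert ebcdic_bytes != utf8_bytes
# ...but decoding either one yields the same codepoints.
assert ebcdic_bytes.decode("cp037") == utf8_bytes.decode("utf-8") == doc

# The tab (fourth character) is byte 0x05 in cp037 and byte 0x09 in
# UTF-8; after decoding, both are the codepoint U+0009.
print(hex(ebcdic_bytes[3]), hex(utf8_bytes[3]))   # prints: 0x5 0x9
```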

But it's all defined as Unicode, so before you can reason about XML, 
you have to turn the (presumably serialized) stream of non-Unicode 
characters into Unicode. (In some cases you can have a platform-native 
XML tool, but if it's an XML tool, it conceptually operates over 
Unicode codepoints.)
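
To make "decode first, then reason" concrete, here is a minimal 
Python sketch of checking the Char production quoted below. The 
function names is_xml_char and disallowed_codepoints are hypothetical 
helpers invented for this example, and cp037 is again assumed as one 
EBCDIC variant; a real parser does much more than this.

```python
def is_xml_char(cp: int) -> bool:
    """True if codepoint cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def disallowed_codepoints(raw: bytes, encoding: str) -> list[int]:
    """Decode the raw bytes first, then test the resulting codepoints."""
    text = raw.decode(encoding)
    return [ord(ch) for ch in text if not is_xml_char(ord(ch))]


# EBCDIC (cp037) byte 0x05 decodes to U+0009, horizontal tab: allowed.
assert disallowed_codepoints(b"\x05", "cp037") == []
# Byte 0x00 decodes to U+0000, which Char excludes.
assert disallowed_codepoints(b"\x00", "cp037") == [0]
```

The check never looks at the byte values themselves, only at the 
codepoints they decode to, so the same function works unchanged for 
any encoding the platform can decode.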

Amy!
On Thu, 15 Apr 2021 12:51:38 +0000, Roger L Costello wrote:
> Hi Folks,
> 
> The XML specification says that these are the codepoints for the 
> characters that are allowed in XML documents:
> 
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> 
> But, but, but, ....
> 
> Doesn't that list of codepoints assume the XML documents are encoded 
> using a Unicode character encoding scheme? 

It's not an assumption, it's a requirement.

> What if the XML documents aren't encoded using a Unicode character 
> encoding scheme, then what are the allowable characters? 
> 
> For example, in Unicode the codepoint #x9 corresponds to the 
> "horizontal tab" character but in EBCDIC hex 9 corresponds to the 
> "begin superscript" character. Is the XML specification saying that 
> an XML document using EBCDIC can use the invisible "begin 
> superscript" character but not the "horizontal tab" character? Or, is 
> it saying that I am expected, when using a character encoding scheme 
> other than Unicode, to convert the above list of Unicode codepoints 
> to the corresponding characters in the non-Unicode character encoding 
> scheme? For example, in EBCDIC the "horizontal tab" character is 5.

