(char)0 handling proposal
- From: Brendan Macmillan <bren@mail.csse.monash.edu.au>
- To: xml-dev@lists.xml.org
- Date: Fri, 17 Aug 2001 10:32:41 +1000
Hi,
I just joined the list to ask about a &#0; issue, and the first three
posts I see are about it! Serendipity!
Is there any standard convention for representing a character of value 0 in XML
(and other control characters)? I understand that we can't actually *have*
such a character - that's why &#0; is illegal - but sometimes we want to output
data that includes such characters. (I'm thinking Java, which doesn't use the
nul char as a string terminator.)
Is there a convention for doing this (albeit informal)?
It's not really "binary data", but the rare control characters that sometimes
appear in strings that are otherwise mostly printable characters.
Below is a (short) essay on the problem, and possible solutions. I'd most
welcome further comment/criticism. ;-)
Cheers,
Brendan
-----------------------------------------------------------------------------
Here's the problem:
How can we represent (char)0 and control characters in XML, in a way
that standard XML tools (like SAX and DOM) can read them?
Here's the proposal:
Encode them Java style, like this: \u0000
Here's more detail:
JSX needs to represent characters of 0 value (and other control
characters), because Java permits them to occur in Strings. In
practice, they rarely occur there - but are very common in
StringBuffers for example, where they pad out the unused portion.
Because JSX needs to be able to map *all* objects to XML, it needs
to be able to do this.
But XML doesn't allow 0 characters - the decimal "&#0;" and hex "&#x0;"
character reference syntax explicitly forbids it. Of course, JSX has the option of
encoding it in any way that it can read it back in - for example, at
present it simply writes the control characters directly as is.
But we want it to be legitimate XML, so it can interoperate with other
tools, such as SAX and DOM and XSLT and so on.
This is a real issue!
Note that the problem is not exactly "binary data", like a set of
pixel values for an image. For that, an array of bytes might be
more appropriate. But for Strings, most of the characters are regular
human readable characters - it will usually be only a few that are
control characters or (char)0 etc.
Some potential solutions to encode individual (char)0 are:
(1). external unparsed entities
(2). introduce a new scheme to XML, like Java's \u0000
(3). use a different range of Unicode characters for this purpose
(4). treat the whole String as containing binary data
(1). "External unparsed entities"
---------------------------------
An external unparsed entity can appear as an attribute value, if
the DTD declares it as such - but SAX and DOM won't know that they
are supposed to include it in the document.
We could have a list of all the chars, like nul, bel, etc
This is how SAX would deal with reading it: when it is notified of an
external unparsed reference, it needs to read in the appropriate
value (would need to have a list of what they mean).
A limitation is that external unparsed entities can only be referenced
from within an attribute (not embedded within a String) - and
furthermore, the type of the attribute must be ENTITY or ENTITIES.
To use this scheme, every possible character would need an unparsed
entity - since chars are two bytes, that's 2**16 or 65536 possible
values!
(2). A New Scheme, like \u0000
------------------------------
Include our own proprietary encoding scheme, like: \u0000 - but it's
not XML.
(3). shift range of Unicode characters
--------------------------------------
Represent the ASCII control chars (ie 0x00 - 0x1F) with chars
permitted in XML (eg: 0x7F - 0x9F).
"Encodings using the *bytes* 0x7F to 0x9F aren't the issue. What
counts here is the Unicode *characters* U+007F to U+009F, which are
solely the control characters."
Citation:
http://lists.xml.org/archives/xml-dev/200006/msg00502.html
The big problem with this approach is how to encode characters which
were already in the range 0x7F - 0x9F... it might not happen often,
but a bijective mapping (ie reversible) needs to be able to handle
all cases!
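To make the bijectivity problem concrete, here is a minimal sketch of the shift in Java (the class and method names are mine, purely for illustration): the round trip works for a control char, but silently corrupts any character that was *already* in the target range.

```java
// Sketch of the option-(3) shift, and why it isn't reversible on its own:
// a character that was *already* in the target range decodes wrongly.
public class ShiftEncoding {
    public static char shiftEncode(char c) {
        return (c < 0x20) ? (char) (c + 0x80) : c;  // 0x00-0x1F -> 0x80-0x9F
    }
    public static char shiftDecode(char c) {
        return (c >= 0x80 && c <= 0x9F) ? (char) (c - 0x80) : c;
    }
    public static void main(String[] args) {
        // Round-trips fine for (char)0...
        System.out.println(shiftDecode(shiftEncode('\u0000')) == '\u0000'); // true
        // ...but corrupts a char that was already in 0x80-0x9F:
        System.out.println(shiftDecode(shiftEncode('\u0085')) == '\u0085'); // false
    }
}
```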
(4). Treat as Binary Data
-------------------------
This approach is probably keeping more in spirit with XML: if a
String or StringBuffer etc contains *any* control characters, it
is no longer character data (from the point of view of XML), but
really is "binary data". Therefore, encode it as such - for example,
treat it as an array of short: each char can become a short (both are
stored in two bytes), with some kind of markup saying it should be
converted back into an array of char. Thus, <ArrayOf-char ... />
becomes <ArrayOf-short reallyChar="true" ... />.
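A sketch of what the char-to-short conversion in option (4) might look like in Java (the helper class and method names here are hypothetical; the reallyChar attribute follows the ArrayOf convention described above):

```java
// Hypothetical helpers for option (4): if a String contains any control
// character, serialise it as an array of shorts instead of character data,
// e.g. <ArrayOf-short reallyChar="true" .../>.
public class BinaryStrings {

    // Does this String need the binary treatment at all?
    public static boolean needsBinary(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) < 0x20) return true;
        }
        return false;
    }

    public static short[] toShorts(String s) {
        short[] out = new short[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (short) s.charAt(i);    // char and short are both 16 bits
        }
        return out;
    }

    public static String fromShorts(short[] a) {
        char[] out = new char[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (char) (a[i] & 0xFFFF); // mask to undo sign extension
        }
        return new String(out);
    }
}
```

Note the mask in fromShorts: short is signed in Java, so chars above 0x7FFF come back negative and must be masked before the cast.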
But this raises an interesting issue... after all, the whole ArrayOf
convention is an invention of JSX - why should we worry about other
XML conventions, if we are happy to make up this one? Aha! The
key thing is that the ArrayOf convention is built *on top* of the
XML conventions, and is consistent with them. SAX and DOM can read
them in fine, even though they don't know what to do with them - that
is, it takes additional code to parse them fully (handling ArrayOf
etc).
Thus, the important factors in how to handle (char)0 are to build on
top of the XML conventions (consistently), and to use a scheme that is
easy to understand and to write code to parse and unpack fully.
Which scheme is easier to parse? Let's review the four choices in
this light:
(1). external unparsed entities don't seem too bad; though they can't
be embedded in Strings
(2). the "\u0000" scheme is also not too bad: we just need to check
all char data (including Strings) for \u0000 (etc), and if present,
convert to a char of that value. This may be a bit inefficient, since
SAX will have already done this kind of test for "&" etc. It
can be embedded in Strings.
(3). A shifted range is very easy to parse back; and it can be
embedded in Strings.
(4). binary data is a little complex, and the code would need to
understand how JSX handles arrays in some depth: if reallyChar's
value is true, then cast the remaining attributes to char. It can't be
embedded in Strings.
OK! We've looked at 4 different possibilities, and considered what
factors are important in the choice. Here's a conclusion:
It seems that \u0000 would be best, because:
- it is *obvious* what this means to any (Java-aware) human.
- It is easy to write a parser for it.
- It can be embedded in the middle of a String.
- It only affects the parts of the String that are "binary" - the
rest is still rendered as perfectly readable text, instead of the
whole thing being treated as binary (not one apple spoiling the lot!).
- It doesn't require any extra mucking about (like a DTD, or
strange variation on String encoding, or an initial pass on the
entire String to check if it does contain any binary data etc).
Here's a sketch of an implementation for JSX:
To encode:
(1). If in the control char range, we need to convert to hex, and then
output exactly 4 chars... [is there an existing Java method for this?]
preceded by "\u".
(2). If "\", then write "\\" - we need to escape the escape char!
To decode:
(1). If we see a "\" followed by a "u", then grab the next four
characters, parse them as an int, and cast to char. [is there a
sign-unsigned issue here?]
(2). If we see a "\" followed by a "\", then return a '\'.
We'd put these in with the same code that presently encodes and
decodes the "&" etc.
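The sketch above can be fleshed out roughly as follows (the class and method names are mine, not JSX's; on the bracketed questions: Integer.toHexString plus manual padding does the hex conversion, and there is no sign problem, since four hex digits parse to at most 0xFFFF, which fits an int before the cast):

```java
// A minimal sketch of the proposed \u0000 escaping. Control chars are
// written as "\u" plus exactly four hex digits; '\' itself is doubled.
public class CharEscape {

    public static String encode(String s) {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                out.append("\\\\");               // escape the escape char
            } else if (c < 0x20) {                // control range, incl. (char)0
                String hex = Integer.toHexString(c);
                out.append("\\u");
                for (int pad = hex.length(); pad < 4; pad++) {
                    out.append('0');              // left-pad to 4 digits
                }
                out.append(hex);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static String decode(String s) {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                char next = s.charAt(++i);
                if (next == 'u') {
                    // parse the four hex digits as an int, then cast to char
                    out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                    i += 4;
                } else {
                    out.append(next);             // "\\" -> '\'
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "pad\u0000\u0000ded\\path";
        String encoded = encode(original);
        System.out.println(encoded);                          // pad\u0000\u0000ded\\path
        System.out.println(decode(encoded).equals(original)); // true
    }
}
```

In practice this would hook into the same pass that already escapes "&", "<" etc, so the extra scan the conclusion worries about is avoided.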
This is quite exciting! As always, your thoughts are not only
welcome, but actively sought and requested!
Hope all this wasn't too much of an ordeal to get through!