Re: [xml-dev] 15 elementary truths about XML
- From: John Cowan <cowan@mercury.ccil.org>
- To: Michael Kay <mike@saxonica.com>
- Date: Mon, 31 Oct 2011 15:45:55 -0400
Michael Kay scripsit:
> This raises the interesting if somewhat academic question of what XML
> would look like on a machine architecture using bytes or characters of
> a length other than 8 bits.
On the DEC PDP-10, words are 36 bits, but bytes can be any size from 1
to 36 bits. Bytes are always stored in big-endian order. The standard
representation of ASCII used 7-bit bytes, five per word with one bit
of wastage. Some kinds of text, like filenames, were stored in six
6-bit bytes by folding ASCII lower case to upper case and subtracting
octal 40, so that the 64 codes from space through underscore fit in
6 bits.
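To make the five-per-word layout concrete, here is a rough Python sketch of the packing. The function names are mine, not anything from DEC software, and padding a short final word with NULs is an assumption made for illustration. The first character goes in the high-order 7 bits and the spare bit is the low-order one.

```python
def pack_ascii7(text):
    """Pack ASCII text into 36-bit words, five 7-bit bytes per word."""
    codes = text.encode("ascii")          # raises on non-ASCII input
    words = []
    for i in range(0, len(codes), 5):
        chunk = codes[i:i + 5].ljust(5, b"\0")   # assumed NUL padding
        word = 0
        for c in chunk:
            word = (word << 7) | c        # big-endian: first byte ends up highest
        words.append(word << 1)           # low-order bit of the word is the wasted one
    return words


def unpack_ascii7(words):
    """Inverse of pack_ascii7; strips the trailing NUL padding."""
    out = bytearray()
    for word in words:
        word >>= 1                        # discard the unused low-order bit
        for shift in (28, 21, 14, 7, 0):
            out.append((word >> shift) & 0x7F)
    return bytes(out).rstrip(b"\0").decode("ascii")
```

Every value in the returned list fits in 36 bits, so "HELLO" occupies exactly one word.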
To bring the PDP-10 into the Unicode age, Mark Crispin designed two new
Unicode encodings suited to its architecture. In brief, UTF-9 stores
each successive octet of a Unicode scalar value in the 8 low-order bits
of one to three nonets, using big-endian ordering. The top bit is 0 in
the final nonet and 1 in non-final nonets. UTF-18 stores the low-order
16 bits of a Unicode scalar value in the low-order 16 bits of an 18-bit
halfword, and uses the top two bits to encode Plane 0, Plane 1, Plane 2,
or Plane 14, the other planes being unrepresentable in this encoding.
See RFC 4042 for details.
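For the curious, the two encodings as described above can be sketched in a few lines of Python. This is an illustration of the description in this message, not a reference implementation of the RFC. The function names are made up, nonets and 18-bit values are modelled as plain Python integers, and rejecting surrogates is my assumption, based on the encodings being defined over scalar values.

```python
def _check_scalar(cp):
    # Assumption: only Unicode scalar values are accepted (no surrogates).
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")


def utf9_encode(cp):
    """Return the list of 9-bit nonets for one scalar value."""
    _check_scalar(cp)
    n_octets = max(1, (cp.bit_length() + 7) // 8)   # one to three octets
    octets = cp.to_bytes(n_octets, "big")
    # Top bit (0x100) is 1 in non-final nonets and 0 in the final one.
    return [0x100 | o for o in octets[:-1]] + [octets[-1]]


def utf9_decode(nonets):
    """Decode a sequence of nonets into a list of scalar values."""
    result, cp = [], 0
    for n in nonets:
        cp = (cp << 8) | (n & 0xFF)
        if not n & 0x100:                 # final nonet of this character
            result.append(cp)
            cp = 0
    return result


def utf18_encode(cp):
    """Return the single 18-bit value for a scalar value, if representable."""
    _check_scalar(cp)
    plane = cp >> 16
    if plane in (0, 1, 2):
        return cp                         # top two bits are the plane number
    if plane == 14:
        return 0x30000 | (cp & 0xFFFF)    # top two bits 11 stand for Plane 14
    raise ValueError(f"plane {plane} is unrepresentable in UTF-18")
```

For example, utf9_encode(0x0391) gives two nonets, octal 403 and 221, and utf18_encode(0xE0041) gives 0x30041.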
Ken Thompson once said that the reason Unix was never ported to the
PDP-10 was that there are no 9-bit magtapes.
> As far as I can see, it would be entirely conformant to use an
> encoding in which each Unicode character is mapped to a sequence of
> one or more 13-bit bytes. The only slight problem is that an XML
> parser that understands this encoding would not be conformant unless
> it also understood UTF-8 and UTF-16; and it's not entirely clear to me
> how UTF-8 and UTF-16 would look when stored on a machine with a 13-bit
> byte length.
I agree, although on such a machine it would probably be best to just
stick to octets and waste the other 5 bits. That's essentially what the
RFC recommends when you must use UTF-8 or UTF-16 on non-8-bit architectures.
--
How they ever reached any conclusion at all <cowan@ccil.org>
is starkly unknowable to the human mind. http://www.ccil.org/~cowan
--"Backstage Lensman", Randall Garrett