
   Re: Char & Java implementation

  • From: Richard Tobin <richard@cogsci.ed.ac.uk>
  • To: Jeni Tennison <jft@Psychology.Nottingham.AC.UK>, xml-dev@ic.ac.uk
  • Date: Wed, 4 Mar 1998 10:52:48 GMT

> [2]  Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
>               | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>                                   ^^^^^^^^^^^^^^^^^^
> 
> Am I right in thinking that, since the indicated characters are longer than
> 16 bits, they can't be represented in Java with the char data type, and int
> must be used instead?

The answer to this explains the otherwise mysterious missing range
D800 to DFFF.  These 2 * 2^10 missing characters can be used in pairs
to represent the first 2^20 characters above FFFF.  The character
10000 + x is represented by the pair D800 + (x >> 10), DC00 + (x & 3FF).
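The pair arithmetic above can be sketched in Java as follows.  This is
only an illustration: the class and method names (SurrogateSketch, high,
low, combine) are made up for this example, and the methods assume the
argument is already known to be in the range 10000-10FFFF.

```java
final class SurrogateSketch {
    // First half of the pair: D800 + (x >> 10), where x = codePoint - 10000 (hex).
    // Assumes 0x10000 <= codePoint <= 0x10FFFF; no range check is done here.
    public static char high(int codePoint) {
        int x = codePoint - 0x10000;            // 20-bit offset above FFFF
        return (char) (0xD800 + (x >> 10));     // top 10 bits of the offset
    }

    // Second half of the pair: DC00 + (x & 3FF).
    public static char low(int codePoint) {
        int x = codePoint - 0x10000;
        return (char) (0xDC00 + (x & 0x3FF));   // bottom 10 bits of the offset
    }

    // Inverse mapping: rebuild the character from a pair of surrogates.
    public static int combine(char hi, char lo) {
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x1D11E;
        // prints "D834 DD1E"
        System.out.printf("%X %X%n", (int) high(cp), (int) low(cp));
    }
}
```

The two halves carry 10 bits each, which is why 2 * 2^10 reserved values
are enough to address 2^20 characters.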

Since none of the characters above FFFF are name characters, they are
irrelevant to the syntax of XML, and you don't need to convert the
pairs of "surrogates" into the characters they represent - you can
just pass them through to the application.

So, working in 16-bit units (Java chars), you can treat the range of
legal characters as being 9, A, D, 20-FFFD.
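As a rough sketch, the per-unit test then fits in a Java char comparison.
The names XmlCharSketch and isLegalUnit are invented for this example.
Note that it deliberately accepts the surrogate values D800-DFFF, on the
assumption that pairing is checked separately.

```java
final class XmlCharSketch {
    // True if the 16-bit unit is 9, A, D, or in 20-FFFD (all hex).
    // Surrogates (D800-DFFF) pass this test; whether they are correctly
    // paired has to be verified elsewhere.
    public static boolean isLegalUnit(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xFFFD);
    }
}
```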

There are a few things you have to take account of:

- the surrogates must appear in pairs in the input, one in the range
  D800-DBFF followed by one in the range DC00-DFFF

- if a character reference (e.g. &#x10000;) refers to a character in
  the range 10000-10FFFF, it should be converted to a pair of surrogates
  before it is passed to the application

- a character reference must not expand to a character in the surrogate
  range D800-DFFF.
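
The three points above might be handled along these lines.  This is a
sketch under assumptions, not a reference implementation: the class
CharRefSketch and its methods surrogatesPaired and expandCharRef are
hypothetical names, the input is assumed to be already decoded into a
Java String of 16-bit units, and expandCharRef takes the numeric value
of the reference after the digits have been parsed.

```java
final class CharRefSketch {
    // True if every surrogate in s belongs to a well-formed pair:
    // one unit in D800-DBFF immediately followed by one in DC00-DFFF.
    public static boolean surrogatesPaired(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF) {
                if (i + 1 >= s.length()) return false;       // nothing follows
                char next = s.charAt(i + 1);
                if (next < 0xDC00 || next > 0xDFFF) return false;
                i++;                                         // skip second half
            } else if (c >= 0xDC00 && c <= 0xDFFF) {
                return false;                                // second half alone
            }
        }
        return true;
    }

    // Expands the numeric value of a character reference into 16-bit units.
    public static String expandCharRef(int value) {
        if (value >= 0xD800 && value <= 0xDFFF) {
            throw new IllegalArgumentException("reference to a surrogate");
        }
        if (value >= 0x10000 && value <= 0x10FFFF) {
            int x = value - 0x10000;
            char hi = (char) (0xD800 + (x >> 10));
            char lo = (char) (0xDC00 + (x & 0x3FF));
            return new String(new char[] { hi, lo });
        }
        boolean legal = value == 0x9 || value == 0xA || value == 0xD
                || (value >= 0x20 && value <= 0xFFFD);
        if (!legal) {
            throw new IllegalArgumentException("not a legal XML character");
        }
        return String.valueOf((char) value);
    }
}
```

With this, expandCharRef(0x1D11E) yields the two units D834, DD1E, and
expandCharRef(0xD800) is rejected.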

I think, but I'm not certain, that this encoding only applies to UTF-16
and not UCS-2 (which would mean that the surrogate characters are an
error if encountered in a UCS-2 stream).  Can anyone confirm/deny this?

-- Richard






 
