A SAX TransformerHandler encoding question

Hi,

I've got some interesting problems with the JDK's (1.4 and 1.5)
TransformerHandler and surrogate pairs:

Consider:

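   import java.io.ByteArrayOutputStream;
   import javax.xml.transform.OutputKeys;
   import javax.xml.transform.sax.SAXTransformerFactory;
   import javax.xml.transform.sax.TransformerHandler;
   import javax.xml.transform.stream.StreamResult;
   import org.xml.sax.helpers.AttributesImpl;

   public void testOut() throws Exception {
     ByteArrayOutputStream out = new ByteArrayOutputStream();
     SAXTransformerFactory stf =
         (SAXTransformerFactory) SAXTransformerFactory.newInstance();

     // Identity transformer that serializes the SAX events to 'out'.
     TransformerHandler th = stf.newTransformerHandler();
     th.getTransformer().setOutputProperty(
         OutputKeys.OMIT_XML_DECLARATION, "yes");
     th.setResult(new StreamResult(out));

     // Emit <foo> containing the two surrogate code units.
     th.startDocument();
     th.startElement("", "foo", "foo", new AttributesImpl());
     char[] c = "\udc00\ud800".toCharArray();
     th.characters(c, 0, c.length);
     th.endElement("", "foo", "foo");
     th.endDocument();

     // Dump every serialized byte with its index.
     byte[] bytes = out.toByteArray();
     for (int i = 0; i < bytes.length; i++) {
       System.out.println(i + ": " + bytes[i] + " " + ((char) bytes[i]));
     }
   }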

This yields:

0: 60 <
1: 102 f
2: 111 o
3: 111 o
4: 62 >
5: -19 ?
6: -80 ?
7: -128 ?
8: -19 ?
9: -96 ?
10: -128 ?
11: 60 <
12: 47 /
13: 102 f
14: 111 o
15: 111 o
16: 62 >

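That is, the two surrogate code units have each been serialized as a
separate three-byte sequence (bytes 5-7 and 8-10), as if they were two
independent Unicode characters, instead of being combined into a single
character. It seems that this problem is old (see
<http://issues.apache.org/jira/browse/XALANJ-2132>), so why does it
still occur in recent JDKs?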
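
To make the byte dump easier to read, here is a minimal sketch that
decodes the two three-byte sequences by hand and compares them with what
the JDK's own UTF-8 charset encoder produces for a well-formed pair. The
class name SurrogateCheck and its helper methods are made up for
illustration, and StandardCharsets assumes Java 7 or later, so this will
not compile on the 1.4/1.5 JDKs discussed here. It only inspects the
bytes; it does not exercise the Xalan serializer itself.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateCheck {

    // Decodes one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
    // into the 16-bit value it carries.
    static int decodeThreeBytes(byte b0, byte b1, byte b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    // UTF-8 bytes produced by the JDK's charset encoder for a string.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes 5-7 and 8-10 from the listing above.
        int first = decodeThreeBytes((byte) -19, (byte) -80, (byte) -128);
        int second = decodeThreeBytes((byte) -19, (byte) -96, (byte) -128);
        System.out.println(Integer.toHexString(first));   // dc00
        System.out.println(Integer.toHexString(second));  // d800

        // Two code units form a supplementary character only in
        // high-surrogate, low-surrogate order.
        System.out.println(
            Character.isSurrogatePair((char) second, (char) first)); // true
        System.out.println(
            Character.isSurrogatePair((char) first, (char) second)); // false

        // A well-formed pair (here U+10000) encodes to one four-byte sequence.
        for (byte b : utf8("\ud800\udc00")) {
            System.out.println(b);
        }
    }
}
```

The decoded values show that the serializer wrote the code units in the
order they were passed in, low surrogate first. Character.isSurrogatePair
only accepts the high, low order, so it may be worth repeating the test
with the two code units swapped to see whether the output changes.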

Best regards, Julian

