   Re: SAX/C++: UTF-8 v UTF-16

  • From: James Clark <jjc@jclark.com>
  • To: David Megginson <david@megginson.com>
  • Date: Fri, 03 Dec 1999 09:50:11 +0700

David Megginson wrote:

> 4. Hold my nose and use UTF-8 rather than UTF-16, for compatibility
>    with most existing C++ code.

I would say there was at least as much C++ code using UTF-16 as using
UTF-8. On Windows at least, UTF-16 is much more common. The DOM mandates
UTF-16, so if SAX mandated UTF-8 there would be an unfortunate mismatch.
This is a tough one, because there's a lot more diversity in the C++
world.  My preference would be not to mandate either UTF-8 or UTF-16
exclusively.  There are lots of apps using UTF-8 and there are lots of
apps using UTF-16; if you exclude either, then a lot of apps will take a
major performance/convenience hit. Expat allows a choice at compile-time
between UTF-8 and UTF-16, and there are big projects using both (e.g. Perl
uses UTF-8 and Mozilla uses UTF-16).

There are a couple of possible solutions:

1. A lo-tech solution.  Provide a SAXChar typedef, and define everything
in terms of SAXChar.  SAXChar gets typedefed to either char or unsigned
short depending on whether SAX_UNICODE is defined or not.  It's up to
implementations to decide whether to support both or just one, and up to
clients to decide whether to work with both or to require one.
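
A minimal sketch of the typedef approach described above. SAXChar and SAX_UNICODE are the names from the message; the AttributeList shape and the saxStrlen helper are hypothetical and only there to show client code written against SAXChar.

```cpp
#include <cassert>
#include <cstddef>

// SAX_UNICODE is the macro named in the message. Defining it (for example
// with -DSAX_UNICODE) selects 16-bit code units; otherwise plain char is used.
#ifdef SAX_UNICODE
typedef unsigned short SAXChar;  // one UTF-16 code unit
#else
typedef char SAXChar;            // one UTF-8 code unit
#endif

// Hypothetical interface defined purely in terms of SAXChar, so the same
// declaration serves whichever encoding was chosen at compile time.
class AttributeList {
public:
  virtual ~AttributeList() {}
  virtual const SAXChar *getName(int pos) = 0;
};

// Hypothetical helper: length of a NUL-terminated SAXChar string, counted
// in code units (not characters).
inline std::size_t saxStrlen(const SAXChar *s) {
  assert(s != 0);
  std::size_t n = 0;
  while (s[n] != 0)
    ++n;
  return n;
}
```

The cost of this approach is that a given build of a library or client sees only one of the two encodings, which is what the variation below tries to address.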

A variation on this is to allow both UTF-8 and UTF-16 variants to exist
in a single library.  To do this, you can do something along the lines
of

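// UTF-16 variant: names are returned as 16-bit code units.
class AttributeList16 {
public:
  virtual const unsigned short *getName(int pos) = 0;
};

// UTF-8 variant: names are returned as plain char.
class AttributeList8 {
public:
  virtual const char *getName(int pos) = 0;
};

// Both variants are always declared; SAX_UNICODE only selects which one
// the unsuffixed name AttributeList refers to.
#ifdef SAX_UNICODE
typedef AttributeList16 AttributeList;
#else
typedef AttributeList8 AttributeList;
#endif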

2. A hi-tech solution.  Do what the Standard C++ library does and make
the interface a template in the character type.  This is the cleanest
solution, but lots of C++ projects eschew templates on portability
grounds.
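
A rough sketch of what the templated interface might look like, by analogy with std::basic_string. Every name here (BasicAttributeList, SingleAttribute, nameLength) is hypothetical, and the const qualifiers and getLength are illustrative additions rather than anything proposed in the message.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical interface, parameterised on the code-unit type.
template <class CharT>
class BasicAttributeList {
public:
  virtual ~BasicAttributeList() {}
  virtual int getLength() const = 0;
  virtual const CharT *getName(int pos) const = 0;
};

// The two concrete flavours fall out as typedefs of the same template.
typedef BasicAttributeList<char> AttributeList8;
typedef BasicAttributeList<unsigned short> AttributeList16;

// Toy implementation holding a single attribute name, for illustration only.
template <class CharT>
class SingleAttribute : public BasicAttributeList<CharT> {
public:
  explicit SingleAttribute(const CharT *name) : name_(name) {
    assert(name != 0);
  }
  virtual int getLength() const { return 1; }
  // Returns a null pointer for any position other than 0.
  virtual const CharT *getName(int pos) const { return pos == 0 ? name_ : 0; }
private:
  const CharT *name_;
};

// Client code written once and usable with either encoding: length of the
// name at pos in code units, or 0 if there is no such attribute.
template <class CharT>
std::size_t nameLength(const BasicAttributeList<CharT> &atts, int pos) {
  const CharT *p = atts.getName(pos);
  std::size_t n = 0;
  while (p != 0 && p[n] != 0)
    ++n;
  return n;
}
```

A client that only cares about one encoding can use AttributeList8 or AttributeList16 directly and never mention the template, which may soften the portability objection somewhat.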

If you feel that one needs to be mandated, I would pick UTF-16.

James



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo@ic.ac.uk the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)






 
