   Re: [xml-dev] Specifying a Unicode subset

What *exactly* do you hope to accomplish?

Because I'm not seeing any value at all here and as a programmer I feel 
like you're compelling me to stare into the fires of hell.

Unicode has arrived to kill off all of the short-sighted legacy 
character encodings and while Unicode has a *lot* of problems for Asian 
languages (Han unification was *NOT* a good idea), it remains 
infinitely better than the tower of Babel we had before.

Besides, there are good libraries for dealing with internationalization 
and the legacy encodings, and once they are done I hope never to revisit 
this nightmare again.

Anybody building any kind of development environment that does not take 
advantage of this extensive body of code is a fool who deserves interop 
with nothing more than his navel.

Let's move on.  UTF-8 is your transfer encoding, use UCS-2 in memory 
(unless planning to process ancient Sumerian or something - then use 
UCS-4) and let's all move on to something remotely interesting.
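
For illustration only, here is a minimal Python sketch of the trade-off 
in that advice: how many bytes a string takes in each encoding form, and 
whether it fits in a 16-bit-per-character (UCS-2 style) representation at 
all. The function names are made up for this example, and U+12000 is 
simply one code point outside the Basic Multilingual Plane.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in three Unicode encoding forms."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # big-endian, no BOM
        "utf-32": len(text.encode("utf-32-be")),  # big-endian, no BOM
    }


def fits_in_ucs2(text: str) -> bool:
    """True if every code point is in the Basic Multilingual Plane.

    UCS-2 stores one 16-bit unit per character, so anything above
    U+FFFF needs UCS-4 (or UTF-16 surrogate pairs) instead.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    for sample in ("A", "A\u00e9\u4e2d", "\U00012000"):
        print(repr(sample), encoded_sizes(sample), fits_in_ucs2(sample))
```

A string for which fits_in_ucs2 returns False is the "then use UCS-4" 
case.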

On Monday, October 21, 2002, at 06:03  PM, Gustaf Liljegren wrote:

> One thing I remember from SGML is the flexibility it allows in defining
> the character repertoire and even map characters from a BASESET to a
> While there are many longtime SGML users here, there are probably many
> without this experience too, so here's a quick review:
>
> In the SGML declaration (that's a file apart from the document and the
> with settings for a certain application), you first declare a BASESET
> that closely resembles the characters you'll use. The BASESET is given
> by a name which is understood by the system:
>
> BASESET "ISO 646:1983//CHARSET ..."
>
> The information carried in this string is a numbered character
> repertoire (a.k.a. coded character set, or CCS). ASCII is one numbered
> character repertoire, where the number 65 is assigned to the character
> 'A'. EBCDIC is another, where the character 'A' is assigned the number
> 193.
>
> In a DESCSET you map characters encountered in the document to
> positions in the BASESET. So if the parser reads a document using
> EBCDIC and encounters a character numbered 193, it may be mapped
> automatically to 65, if your tools prefer ASCII:
>
> DESCSET   193     1     65
>
> This means you map 1 character in the document, starting at position
> 193, to character position 65 in the BASESET. You can map several
> characters at the same time, by increasing the number in the middle.
> The last number may be set to 'UNUSED' to indicate that the parser
> should exclude characters with these numbers:
>
> DESCSET     0     9     UNUSED  -- 0 to 8 are not used --
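
As a rough illustration of the two DESCSET lines quoted above, here is a 
simplified Python model of the mapping they describe. It is not an SGML 
parser: build_descset and translate are hypothetical names, each entry is 
just a (start, count, base) triple, and None stands in for UNUSED. Real 
SGML declarations have more rules than this sketch covers.

```python
UNUSED = None  # stand-in for the SGML keyword UNUSED


def build_descset(entries):
    """Expand (start, count, base) triples into a lookup table.

    Each triple maps `count` document character numbers, beginning at
    `start`, onto consecutive BASESET positions beginning at `base`,
    or marks them all as unused when `base` is UNUSED.
    """
    table = {}
    for start, count, base in entries:
        for offset in range(count):
            table[start + offset] = UNUSED if base is UNUSED else base + offset
    return table


def translate(numbers, table):
    """Map document character numbers to BASESET positions.

    Raises ValueError for numbers that are undeclared or declared UNUSED.
    """
    result = []
    for number in numbers:
        if number not in table:
            raise ValueError(f"character number {number} is not declared")
        if table[number] is UNUSED:
            raise ValueError(f"character number {number} is declared UNUSED")
        result.append(table[number])
    return result


if __name__ == "__main__":
    table = build_descset([(193, 1, 65), (0, 9, UNUSED)])
    print(translate([193], table))  # EBCDIC 'A' mapped to ASCII 'A'
```

With count 9 starting at 0, the second entry covers numbers 0 through 8, 
which matches the comment in the quoted declaration.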
>
> Today, everyone seems to support the idea of one true CCS (Unicode).
> Therefore, with XML we don't have the kind of problem illustrated in
> the first DESCSET example; a character number can have only one meaning
> in XML. However, there's no way to specify which characters to include
> or exclude in XML, as illustrated in the second example.
>
> With XML 1.1 (here's my point), there's a proposal to include more
> characters from Unicode in XML. So while people nowadays agree on which
> CCS to use, there's still discussion about which *part* of that CCS
> should be included in XML. Maybe XML needs a more flexible solution?
>
> I see three aspects in this:
>
> 1. Which CCS is used?
> 2. Which subset from the CCS is used?
> 3. Which algorithm is used to encode character numbers to binary
>    sequences?
>
> As far as I'm concerned, it's a good thing that XML clearly specifies
> the unconditional use of Unicode as its CCS. By doing so, XML removes
> one level of complexity and most of the character conversion headaches.
>
> The third aspect, if I'm not mistaken, is exactly what is specified in
> the 'encoding' attribute in the XML declaration. That is good too.
>
> However, some want more characters in XML, while others don't want
> them. Perhaps we can allow for both by letting documents declare their
> own subset of Unicode?
>
> <?xml version="1.0" encoding="iso-8859-1"?>
> <?xml-characters plain="add_nel.xml" charref="add_c0.xml"?>
> <doc>
>   <p><!-- Unicode characters, some not standard in XML --></p>
> </doc>
>
> The PI would point to one or two files that (one way or the other)
> specify a subset of Unicode. The 'plain' subset is for characters that
> may be written directly (i.e. acts as a replacement for the 'Char'
> production in the specification). The 'charref' subset is for
> characters that may be represented as character references.
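
A hedged sketch of what checking text against such a declared subset 
could look like. The base ranges are those of the Char production in XML 
1.0. Everything else is an assumption: the proposal does not say what a 
file like add_c0.xml would contain, so the extra subset is passed in 
directly as (low, high) code point pairs, and disallowed_chars is a 
hypothetical helper name.

```python
# Code point ranges of the XML 1.0 `Char` production.
XML10_CHAR_RANGES = [
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def in_ranges(code_point, ranges):
    """True if code_point falls inside any (low, high) inclusive range."""
    return any(low <= code_point <= high for low, high in ranges)


def disallowed_chars(text, extra_ranges=()):
    """Return the characters of text that are outside the permitted set.

    The permitted set is the XML 1.0 Char production plus any extra
    ranges a document would have declared (an assumption of this sketch).
    """
    allowed = XML10_CHAR_RANGES + list(extra_ranges)
    return [ch for ch in text if not in_ranges(ord(ch), allowed)]


if __name__ == "__main__":
    sample = "a\x01b"
    print(disallowed_chars(sample))                  # plain XML 1.0 rules
    print(disallowed_chars(sample, [(0x1, 0x1F)]))   # with C0 controls added
```

Under the plain XML 1.0 ranges the control character U+0001 is rejected; 
once a document has declared the C0 range as an addition, the same text 
passes.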
>
> I need help in understanding the implications of this solution. Would
> it break something fundamental?
>
> Gustaf

