xml-dev - Re: [xml-dev] Specifying a Unicode subset


To: Gustaf Liljegren <gustaf.liljegren@xml.se>
Subject: Re: [xml-dev] Specifying a Unicode subset
From: tblanchard@mac.com
Date: Mon, 21 Oct 2002 18:25:16 +0200
Cc: xml-dev@lists.xml.org
In-reply-to: <3.0.6.32.20021021180358.0098f730@m1.858.telia.com>

What *exactly* do you hope to accomplish?

Because I'm not seeing any value at all here and as a programmer I feel 
like you're compelling me to stare into the fires of hell.

Unicode has arrived to kill off all of the short-sighted legacy 
character encodings, and while Unicode has a *lot* of problems for Asian 
languages (Han unification was *NOT* a good idea), it remains 
infinitely better than the tower of Babel we had before.

Besides, there are good libraries 
(http://oss.software.ibm.com/developerworks/opensource/icu/project/) 
for dealing with internationalization and the legacy encodings and once 
they are done I hope never to revisit this nightmare again.

Anybody building any kind of development environment that does not take 
advantage of this extensive body of code is a fool who deserves interop 
with nothing more than his navel.

Let's move on.  UTF-8 is your transfer encoding, use UCS-2 in memory 
(unless planning to process ancient Sumerian or something - then use 
UCS-4) and let's all move on to something remotely interesting.
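
For what it's worth, the UCS-2 versus UCS-4 trade-off is easy to see 
from code. A minimal sketch in Python (my choice of language, nothing 
from this thread), using only the built-in codecs. The sample 
characters and the helper names utf16_units and fits_ucs2 are made up 
for illustration:

```python
# Compare storage needs for a few arbitrarily chosen sample characters.
samples = {
    "LATIN CAPITAL LETTER A": "A",      # U+0041
    "CJK ideograph": "\u4e2d",          # U+4E2D, inside the BMP
    "CUNEIFORM SIGN A": "\U00012000",   # U+12000, outside the BMP
}

def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed for one character in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2

def fits_ucs2(ch: str) -> bool:
    """True if the character number fits in a single 16-bit unit."""
    return ord(ch) <= 0xFFFF

for name, ch in samples.items():
    # Columns: name, UTF-8 bytes, UTF-16 code units, fits in UCS-2?
    print(name, len(ch.encode("utf-8")), utf16_units(ch), fits_ucs2(ch))
```

Anything above U+FFFF needs two 16-bit units (a surrogate pair) or a 
32-bit representation, which is the only case where the UCS-4 advice 
applies.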

On Monday, October 21, 2002, at 06:03  PM, Gustaf Liljegren wrote:

> One thing I remember from SGML is the flexibility it allows in 
> defining the
> character repertoire and even map characters from a BASESET to a 
> DESCSET.
> While there are many longtime SGML users here, there are probably many
> without this experience too, so here's a quick review:
>
> In the SGML declaration (that's a file apart from the document and the 
> DTD
> with settings for a certain application), you first declare a BASESET, 
> that
> closely resembles the characters you'll use. The BASESET is given by a 
> name
> which is understood by the system:
>
> BASESET "ISO 646:1983//CHARSET ..."
>
> The information carried in this string is a numbered character 
> repertoire
> (a.k.a. coded character set, or CCS). ASCII is one numbered character
> repertoire, where the number 65 is assigned to the character 'A'. 
> EBCDIC is
> another, where the character 'A' is assigned the number 193.
>
> In a DESCSET you map characters encountered in the document to 
> positions in the BASESET. So if the parser reads a document encoded 
> in EBCDIC and encounters a character numbered 193, that character 
> may be mapped automatically to 65, if your tools prefer ASCII:
>
> DESCSET   193     1     65
>
> This means you map 1 character in the document, starting at position 
> 193, to character position 65 in the BASESET. You can map several 
> characters at the same time, by increasing the number in the middle. 
> The last number may be set to 'UNUSED' to indicate that the parser 
> should exclude characters with these numbers:
>
> DESCSET     0     9     UNUSED  -- 0 to 8 are not used --
>
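
The two DESCSET lines quoted above can be modelled in a few lines. 
This is only a sketch of the mapping as described in the message, not 
of how any real SGML parser works. The table layout and the describe 
function are hypothetical, and cp037 is just one EBCDIC code page that 
happens to ship with Python:

```python
# Each entry mirrors one DESCSET line:
# (first document code, count, first BASESET code); None stands for UNUSED.
DESCSET = [
    (0, 9, None),   # DESCSET 0 9 UNUSED: codes 0 to 8 are excluded
    (193, 1, 65),   # DESCSET 193 1 65: EBCDIC 'A' maps to ASCII 'A'
]

def describe(code: int):
    """Map a document character number to its BASESET number.

    Returns None for codes declared UNUSED and raises KeyError for codes
    that no entry covers (what a real parser does there is not modelled).
    """
    for start, count, base in DESCSET:
        if start <= code < start + count:
            return None if base is None else base + (code - start)
    raise KeyError(code)

ebcdic_a = "A".encode("cp037")[0]   # 193 in this EBCDIC code page
print(ebcdic_a, describe(ebcdic_a), chr(describe(ebcdic_a)))
print(describe(5))                  # None, because 5 falls in the UNUSED run
```

Raising the count in the middle column maps a whole run of consecutive 
codes, which is why the function adds the offset (code - start) to the 
base.
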
> Today, everyone seems to support the idea of one true CCS (Unicode).
> Therefore, with XML we don't have the kind of problem illustrated in 
> the
> first DESCSET example; a character number can have only one meaning in 
> XML.
> However, there's no way to specify which characters to include or 
> exclude
> in XML, as illustrated in the second example.
>
> With XML 1.1 (here's my point), there's a proposal to include more
> characters from Unicode in XML. So while people nowadays agree on 
> which CCS
> to use, there's still discussion about which *part* of that CCS should 
> be
> included in XML. Maybe XML needs a more flexible solution?
>
> I see three aspects in this:
>
> 1. Which CCS is used?
> 2. Which subset from the CCS is used?
> 3. Which algorithm is used to encode character numbers to binary 
> sequences?
>
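
Aspects 1 and 3 in the list above are easy to keep apart in code: the 
character number stays the same while the byte sequence depends on the 
encoding. A small Python sketch, where the sample character and the 
helper name encodings_of are arbitrary choices for illustration:

```python
def encodings_of(ch: str, names):
    """Return {encoding name: hex of the encoded bytes} for one character."""
    return {name: ch.encode(name).hex() for name in names}

ch = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
print(ord(ch))      # aspect 1: the character number in the CCS (Unicode)

# Aspect 3: the same number serialized by three different algorithms.
for name, hexbytes in encodings_of(ch, ("iso-8859-1", "utf-8", "utf-16-be")).items():
    print(name, hexbytes)
```

Aspect 2, the subset, is independent of both: it restricts which 
character numbers may appear at all, whatever encoding carries them.
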
> As far as I'm concerned, it's a good thing that XML clearly specifies 
> the
> unconditional use of Unicode as its CCS. By doing so, XML removes one 
> level
> of complexity and most of the character conversion headaches.
>
> The third aspect, if I'm not mistaken, is exactly what is specified in 
> the
> 'encoding' attribute in the XML declaration. That is good too.
>
> However, some want more characters in XML, while others don't want 
> them.
> Perhaps we can allow for both by letting documents declare their own 
> subset
> of Unicode?
>
> <?xml version="1.0" encoding="iso-8859-1"?>
> <?xml-characters plain="add_nel.xml" charref="add_c0.xml"?>
> <doc>
>   <p><!-- Unicode characters, some not standard in XML --></p>
> </doc>
>
> The PI would point to one or two files that (one way or the other)
> specify a subset of Unicode. The 'plain' subset is for characters 
> that may be written directly (i.e. acts as a replacement for the 
> 'Char' production in the specification). The 'charref' subset is for 
> characters that may be represented as character references.
>
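
To make the proposal concrete, here is a rough sketch of what a 
processor would have to do with the two subsets. The message does not 
say what add_nel.xml or add_c0.xml contain, so the ranges below are 
assumptions: 'plain' is taken to be the XML 1.0 Char production, and 
'charref' is guessed from the file name to be the C0 controls that 
Char leaves out. The function names are hypothetical too:

```python
# XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
PLAIN = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
         (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

# Assumed content of the 'charref' subset: C0 controls except NUL.
CHARREF = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F)]

def in_ranges(cp: int, ranges) -> bool:
    """True if the code point lies in any inclusive (low, high) range."""
    return any(low <= cp <= high for low, high in ranges)

def classify(cp: int, plain=PLAIN, charref=CHARREF) -> str:
    """Say how a code point may appear under the declared subsets."""
    if in_ranges(cp, plain):
        return "plain"        # may be written directly
    if in_ranges(cp, charref):
        return "charref"      # only as a character reference such as &#1;
    return "forbidden"        # not allowed in either form

def not_plain(text: str):
    """List (code point, verdict) for every character that is not 'plain'."""
    found = []
    for ch in text:
        verdict = classify(ord(ch))
        if verdict != "plain":
            found.append((ord(ch), verdict))
    return found

print(not_plain("A\x01\x00"))
```

The point of the sketch is that every parser would need these tables 
before it can read the first character of the document, so two 
documents with different declarations are effectively written in two 
different languages.
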
> I need help in understanding the implications of this solution. Would 
> it
> break something fundamental?
>
> Gustaf

Follow-Ups:
- Re: [xml-dev] Specifying a Unicode subset
  - From: Gustaf Liljegren <gustaf.liljegren@xml.se>
- Re: [xml-dev] Specifying a Unicode subset
  - From: John Cowan <jcowan@reutershealth.com>

References:
- Specifying a Unicode subset
  - From: Gustaf Liljegren <gustaf.liljegren@xml.se>
