What *exactly* do you hope to accomplish?
Because I'm not seeing any value at all here, and as a programmer I feel
like you're compelling me to stare into the fires of hell.
Unicode has arrived to kill off all of the short-sighted legacy
character encodings, and while Unicode has a *lot* of problems for Asian
languages (Han unification was *NOT* a good idea), it remains
infinitely better than the Tower of Babel we had before.
Besides, there are good libraries
(http://oss.software.ibm.com/developerworks/opensource/icu/project/)
for dealing with internationalization and the legacy encodings, and once
they are done I hope never to revisit this nightmare again.
Anybody building any kind of development environment that does not take
advantage of this extensive body of code is a fool who deserves interop
with nothing more than his navel.
Let's move on. UTF-8 is your transfer encoding; use UCS-2 in memory
(unless you plan to process ancient Sumerian or something, in which case
use UCS-4), and let's all move on to something remotely interesting.
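
To make the size trade-off concrete, here is a minimal Python sketch. It
is an illustration only: the sample characters are arbitrary, the helper
name encoded_sizes is made up for this example, and because Python ships
no strict UCS-2 codec, "utf-16-be" and "utf-32-be" stand in for UCS-2 and
UCS-4. Outside the Basic Multilingual Plane the UTF-16 codec emits a
surrogate pair, which is exactly the case plain UCS-2 cannot represent.

```python
import unicodedata

# One character inside the Basic Multilingual Plane and one outside it.
SAMPLES = ["A", "\U00012000"]  # U+0041 and U+12000 (a cuneiform sign)

def encoded_sizes(ch: str) -> dict:
    """Return the byte length of one character in each encoding form."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(ch.encode(enc)) for enc in encodings}

for ch in SAMPLES:
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {name}: {encoded_sizes(ch)}")
```

The 16-bit form stays at two bytes only for BMP characters; anything
beyond U+FFFF costs four bytes in every one of the three forms.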
On Monday, October 21, 2002, at 06:03 PM, Gustaf Liljegren wrote:
> One thing I remember from SGML is the flexibility it allows in defining
> the character repertoire, and even in mapping characters from a BASESET
> to a DESCSET. While there are many longtime SGML users here, there are
> probably many without this experience too, so here's a quick review:
>
> In the SGML declaration (a file separate from the document and the DTD,
> with settings for a particular application), you first declare a
> BASESET that closely resembles the characters you'll use. The BASESET
> is given by a name which is understood by the system:
>
> BASESET "ISO 646:1983//CHARSET ..."
>
> The information carried in this string is a numbered character
> repertoire (a.k.a. coded character set, or CCS). ASCII is one numbered
> character repertoire, where the number 65 is assigned to the character
> 'A'. EBCDIC is another, where the character 'A' is assigned the number
> 193.
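
The two numbers quoted above can be checked with a minimal Python
sketch. It assumes "cp037", one common EBCDIC code page bundled with
Python, as the EBCDIC variant (others exist), and code_number is a
helper name invented for this example.

```python
def code_number(ch: str, charset: str) -> int:
    """Return the number a single-byte coded character set assigns to ch."""
    encoded = ch.encode(charset)
    if len(encoded) != 1:
        raise ValueError(f"{ch!r} is not a single byte in {charset}")
    return encoded[0]

print(code_number("A", "ascii"))  # number assigned by ASCII
print(code_number("A", "cp037"))  # number assigned by this EBCDIC code page

# Converting between the two sets is a decode followed by an encode.
print(bytes([193]).decode("cp037").encode("ascii"))
```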
>
> In a DESCSET you map characters encountered in the document to
> positions in the BASESET. So if you parse a document that uses EBCDIC
> and the parser encounters a character numbered 193, it may be mapped
> automatically to 65, if your tools prefer ASCII:
>
> DESCSET 193 1 65
>
> This means you map 1 character in the document, starting at position
> 193, to character position 65 in the BASESET. You can map several
> characters at the same time by increasing the number in the middle. The
> last number may be set to 'UNUSED' to indicate that the parser should
> exclude characters with these numbers:
>
> DESCSET 0 9 UNUSED -- 0 to 8 are not used --
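
The mapping rule as described above can be sketched in a few lines of
Python. This follows only the description in this message, not the full
SGML declaration syntax; DescEntry and map_char_number are hypothetical
names, and None stands in for UNUSED.

```python
from typing import List, Optional, Tuple

# (first document number, count, first BASESET number or None for UNUSED)
DescEntry = Tuple[int, int, Optional[int]]

def map_char_number(number: int, descset: List[DescEntry]) -> Optional[int]:
    """Map a document character number to its BASESET number.

    Returns None when the number is marked UNUSED or is not described.
    """
    for start, count, base in descset:
        if start <= number < start + count:
            return None if base is None else base + (number - start)
    return None

# The two declarations from the text: "193 1 65" and "0 9 UNUSED".
descset = [(193, 1, 65), (0, 9, None)]
print(map_char_number(193, descset))  # mapped to the BASESET position
print(map_char_number(5, descset))    # falls in the UNUSED range
```

Raising the middle number extends the run: with an entry of (193, 9, 65),
document number 195 would land on BASESET position 67.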
>
> Today, everyone seems to support the idea of one true CCS (Unicode).
> Therefore, with XML we don't have the kind of problem illustrated in
> the first DESCSET example; a character number can have only one meaning
> in XML. However, there's no way to specify which characters to include
> or exclude in XML, as illustrated in the second example.
>
> With XML 1.1 (here's my point), there's a proposal to include more
> characters from Unicode in XML. So while people nowadays agree on which
> CCS to use, there's still discussion about which *part* of that CCS
> should be included in XML. Maybe XML needs a more flexible solution?
>
> I see three aspects in this:
>
> 1. Which CCS is used?
> 2. Which subset of the CCS is used?
> 3. Which algorithm is used to encode character numbers as binary
>    sequences?
>
> As far as I'm concerned, it's a good thing that XML clearly specifies
> the unconditional use of Unicode as its CCS. By doing so, XML removes
> one level of complexity and most of the character conversion headaches.
>
> The third aspect, if I'm not mistaken, is exactly what is specified in
> the 'encoding' attribute in the XML declaration. That is good too.
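
A minimal Python sketch of that third aspect, using the standard
library's ElementTree parser: the same character number (U+00E9) is
written with two different byte-level encodings, and the encoding
declaration tells the parser which one was used. The two sample
documents and the helper name first_char_number are made up for this
example.

```python
import xml.etree.ElementTree as ET

# One character, U+00E9, serialized as one byte and as two bytes.
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><p>\xe9</p>'
utf8_doc = b'<?xml version="1.0" encoding="utf-8"?><p>\xc3\xa9</p>'

def first_char_number(doc: bytes) -> int:
    """Parse the document and return the code point of its first text character."""
    return ord(ET.fromstring(doc).text[0])

print(hex(first_char_number(latin1_doc)))
print(hex(first_char_number(utf8_doc)))
```

Both documents yield the same character number; only the byte sequences
differ.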
>
> However, some want more characters in XML, while others don't want
> them. Perhaps we can allow for both by letting documents declare their
> own subset of Unicode?
>
> <?xml version="1.0" encoding="iso-8859-1"?>
> <?xml-characters plain="add_nel.xml" charref="add_c0.xml"?>
> <doc>
> <p><!-- Unicode characters, some not standard in XML --></p>
> </doc>
>
> The PI would point to one or two files that (one way or the other)
> specify a subset of Unicode. The 'plain' subset is for characters that
> may be written directly (i.e. it acts as a replacement for the 'Char'
> production in the specification). The 'charref' subset is for
> characters that may be represented as character references.
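
What a processor would have to do with such a subset can be sketched in
Python. The ranges in XML10_CHAR are the Char production from XML 1.0;
everything else is an assumption made for illustration: the proposal
does not say what the subset files contain, so EXTRA_C0 is only a guess
at what a file like add_c0.xml might declare, and in_ranges and
disallowed are invented helper names. The same check would serve for
either the 'plain' or the 'charref' subset.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]  # inclusive code point range

# The Char production from XML 1.0.
XML10_CHAR: List[Range] = [
    (0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
    (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]

# Hypothetical stand-in for an extra subset: the remaining C0 controls.
EXTRA_C0: List[Range] = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F)]

def in_ranges(cp: int, ranges: Iterable[Range]) -> bool:
    """Return True if the code point falls inside any of the ranges."""
    return any(lo <= cp <= hi for lo, hi in ranges)

def disallowed(text: str, allowed: Iterable[Range]) -> List[str]:
    """List the characters in text that fall outside the allowed subset."""
    allowed = list(allowed)
    return [f"U+{ord(ch):04X}" for ch in text if not in_ranges(ord(ch), allowed)]

sample = "a\x01b"
print(disallowed(sample, XML10_CHAR))             # checked against XML 1.0 Char
print(disallowed(sample, XML10_CHAR + EXTRA_C0))  # checked with the extra subset
```

Note that the allowed set now varies per document, so a parser can no
longer decide whether a character is legal without first fetching the
declared subset.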
>
> I need help in understanding the implications of this solution. Would
> it break something fundamental?
>
> Gustaf
>
>
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>
>