   Specifying a Unicode subset


One thing I remember from SGML is the flexibility it allows in defining the
character repertoire, and even in mapping the characters found in a document
(the DESCSET) onto a BASESET. While there are many longtime SGML users here,
there are probably many without this experience too, so here's a quick
review:

In the SGML declaration (a file separate from the document and the DTD,
holding the settings for a particular application), you first declare a
BASESET that closely matches the characters you'll use. The BASESET is given
by a name that the system understands:

BASESET "ISO 646:1983//CHARSET ..."

This string identifies a numbered character repertoire (a.k.a. coded
character set, or CCS). ASCII is one such repertoire, in which the number 65
is assigned to the character 'A'. EBCDIC is another, in which 'A' is
assigned the number 193.
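
To see the difference concretely, here is a quick check (sketched in
Python, using the cp037 codec as one common EBCDIC variant):

    >>> ord('A')             # ASCII/Unicode gives 'A' the number 65
    65
    >>> 'A'.encode('cp037')  # EBCDIC (cp037) gives 'A' the number 193
    b'\xc1'
    >>> 0xc1
    193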

In a DESCSET you map the characters encountered in the document to positions
in the BASESET. So if a parser reads an EBCDIC document and encounters a
character numbered 193, that character can be mapped automatically to 65, if
your tools prefer ASCII:

DESCSET   193     1     65

This means you map 1 character in the document, starting at position 193,
to character position 65 in the BASESET. You can map several characters at
once by increasing the number in the middle. The last number may be set to
'UNUSED' to indicate that the parser should exclude characters with these
numbers:

DESCSET     0     9     UNUSED  -- 0 to 8 are not used --
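
To make the semantics concrete, here is a rough sketch in Python of what a
parser does with these two DESCSET lines (the function is mine, purely
illustrative):

    def remap(code):
        # DESCSET 0 9 UNUSED -- reject character numbers 0 through 8 --
        if 0 <= code <= 8:
            raise ValueError("character number %d is UNUSED" % code)
        # DESCSET 193 1 65 -- one character, document 193 -> BASESET 65 --
        if code == 193:
            return 65
        return code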

Today, everyone seems to support the idea of one true CCS (Unicode).
Therefore, with XML we don't have the kind of problem illustrated in the
first DESCSET example; a character number can have only one meaning in XML.
However, there's no way in XML to specify which characters to include or
exclude, as illustrated in the second example.
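
For reference, XML 1.0 hard-wires that choice in its 'Char' production
(quoting the 1.0 Recommendation):

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
             | [#x10000-#x10FFFF]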

With XML 1.1 (here's my point), there's a proposal to include more
characters from Unicode in XML. So while people nowadays agree on which CCS
to use, there's still discussion about which *part* of that CCS should be
included in XML. Maybe XML needs a more flexible solution?

I see three aspects in this:

1. Which CCS is used?
2. Which subset from the CCS is used?
3. Which algorithm is used to encode character numbers into binary sequences?

As far as I'm concerned, it's a good thing that XML clearly specifies the
unconditional use of Unicode as its CCS. By doing so, XML removes one level
of complexity and most of the character conversion headaches.

The third aspect, if I'm not mistaken, is exactly what is specified in the
'encoding' attribute in the XML declaration. That is good too.
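
For example, the same character number becomes different byte sequences
under different encodings (again sketched in Python):

    >>> '\u00e9'.encode('utf-8')       # U+00E9 as two bytes in UTF-8
    b'\xc3\xa9'
    >>> '\u00e9'.encode('iso-8859-1')  # ... and as one byte in ISO 8859-1
    b'\xe9'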

However, some want more characters in XML, while others don't want them.
Perhaps we can allow for both by letting documents declare their own subset
of Unicode?

<?xml version="1.0" encoding="iso-8859-1"?>
<?xml-characters plain="add_nel.xml" charref="add_c0.xml"?>
<doc>
  <p><!-- Unicode characters, some not standard in XML --></p>
</doc>

The PI would point to one or two files that (one way or another) specify a
subset of Unicode. The 'plain' subset is for characters that may be written
directly (i.e., it acts as a replacement for the 'Char' production in the
specification). The 'charref' subset is for characters that may be
represented as character references.
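
To check my own understanding, the test a processor applies might look
roughly like this (Python, with invented names; 'plain' and 'charref' stand
for the sets of code points loaded from the files named in the PI):

    def allowed(codepoint, written_as_charref, plain, charref):
        # A character typed directly must be in the 'plain' subset;
        # a character reference may also draw on the 'charref' subset.
        if written_as_charref:
            return codepoint in plain or codepoint in charref
        return codepoint in plain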

I need help in understanding the implications of this solution. Would it
break something fundamental?

Gustaf