
RE: Historical I18n Note

From: Tony Graham <Tony.Graham@ireland.sun.com>
To: xml-dev@lists.xml.org
Date: Tue, 17 Jul 2001 16:21:16 +0100 (BST)

Bullard, Claude L (Len) wrote at 16 Jul 2001 14:25:22 -0500:
 > While SDATA is interesting in its own right, the more applicable 
 > part of the SGML Declaration is the document character set 
 > clause that enables a document to contain characters 
 > that are not defined in the document's concrete syntax.  
 > This uses the reserved name 
 > 
 > CHARSET 
 > 
 > followed by one or more character set descriptions. Again 
 > from Martin Bryan: 
 > 
 > "Each character set description consists of a base character 
 > set statement followed by a described character set 
 > portion identifying the roles of individual characters. 
 > 
 > More than one reference (base) character set can be used 
 > to build up a character set description...
 > 
 > When using the document character set clause to create 
 > a translation table for an incoming document it is important 
 > to remember that character references to reassigned codes 
 > will also need to be changed during translation.  For example, 
 > if a document prepared ... is to be transferred to an 
 > EBCDIC-based system, an ISO 646 character reference such as 
 > &#34; in an entity declaration will need to be changed to 
 > &#125;, the EBCDIC code for a quotation mark."
 > 
 > Ok, now, which parts of that are hard and expensive?  Feel 
 > free to fill in details I missed.

Yes, the document character set is defined in terms of characters from
one or more base character sets, but your SGML system works by mapping
the characters in those base character sets to characters in the (one
or more) base character sets that are referenced in the "syntax
reference character set" later in the SGML Declaration.  More
precisely: in the syntax portion of the SGML Declaration, you assign
roles to character numbers, and each character number is equated to a
character in a base character set.  In the document character set
portion, you define the character numbers that can be used in your
document and map them to characters in a base character set (I'm
ignoring characters defined in terms of minimum literals).  The whole
thing works because of the correspondence between characters in the
two lots of base character sets.
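
To make that correspondence concrete, here is a minimal sketch of the
lookup, not how any particular parser does it.  The assumptions are
mine: Python's "ascii" and "cp037" codecs stand in for an ISO 646 and
an EBCDIC base character set, the BASE_SETS labels and the helper
names are made up, and the role table is cut down to three delimiter
roles.

```python
# Toy stand-ins for base character sets (assumption: Python codecs
# play the part of the registered sets a real declaration would name).
BASE_SETS = {"ISO 646": "ascii", "EBCDIC": "cp037"}


def char_for(base, number):
    """Return the abstract character that `number` denotes in a base set."""
    return bytes([number]).decode(BASE_SETS[base])


# Syntax portion: roles are assigned to character numbers, and each
# number is equated to a character in the syntax's base character set.
SYNTAX_BASE = "ISO 646"
SYNTAX_ROLES = {38: "ERO", 60: "STAGO", 34: "LIT"}  # illustrative subset


def role_of(doc_base, doc_number):
    """Find the role of a document character number, if it has one.

    The document number is resolved to a character through the
    document's base set, then matched against the characters that the
    syntax numbers resolve to through the syntax's base set.
    """
    ch = char_for(doc_base, doc_number)
    for number, role in SYNTAX_ROLES.items():
        if char_for(SYNTAX_BASE, number) == ch:
            return role
    return None


if __name__ == "__main__":
    for n in (80, 76, 127, 193):
        print(n, repr(char_for("EBCDIC", n)), role_of("EBCDIC", n))
```

The point of the sketch is that the two tables never refer to each
other's numbers; they meet only at the characters of the base sets.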

The interesting thing is that there was never great agreement on how
to specify the base character sets.  At least one SGML parser worked
with only the character sets that it could recognise from (the decimal
representation of) the charset's ISO 2022 escape sequence in the
charset identifier, and while OmniMark and nsgmls let you map from the
charset identifier to an external file describing the character set,
they each used a different file format for the external file.

So, aside from the fact that character set definitions in the SGML
declaration are incantations to most people, a "novel" character set
definition in an SGML declaration is not necessarily portable.

Also, the definition of numeric character references such as &#34; and 
&#125; has been subject to reinterpretation in recent years: numeric
character references are evaluated in terms of the syntax reference
character set, not the document character set, which is why you can
use &#38; to represent '&' in any XML or HTML document no matter what
encoding you are using.
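
As a rough illustration of the XML half of that last point, the
following sketch parses the same small document under three different
encodings using Python's standard xml.etree.ElementTree.  The sample
document and the text_of helper are mine, chosen only for the demo.

```python
import xml.etree.ElementTree as ET


def text_of(encoding):
    """Parse one tiny document stored in `encoding`; return its text."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><p>&#38;&#233;</p>'
    # The references themselves are plain ASCII characters in the
    # source, so the bytes differ only in how the file is labelled.
    return ET.fromstring(doc.encode(encoding)).text


if __name__ == "__main__":
    for enc in ("utf-8", "iso-8859-1", "us-ascii"):
        print(enc, repr(text_of(enc)))
```

In each case &#38; comes back as '&' and &#233; as U+00E9, whatever
the encoding declaration says.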

Regards,


Tony Graham
------------------------------------------------------------------------
Tony Graham                           mailto:tony.graham@ireland.sun.com
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708

References:
- RE: Historical I18n Note
  - From: "Bullard, Claude L (Len)" <clbullar@ingr.com>
