RE: Historical I18n Note

From: "Bullard, Claude L (Len)" <clbullar@ingr.com>
To: Tony Graham <Tony.Graham@ireland.sun.com>, xml-dev@lists.xml.org
Date: Tue, 17 Jul 2001 11:18:18 -0500
Thanks for responding, Tony.  I have to open up the books to 
sort this out given that the SGML Declaration was for me 
and most SGML users, I suspect, a black box.  When I've had 
to tweak one in the past, it usually wasn't character set 
descriptions.

Latin liturgies sound like incantations if one isn't trained 
in Latin.  That doesn't absolve us from trying our best to 
explore this aspect of the foundations of XML and asking 
whether we might improve it, given what is emerging from member 
requirements, by looking at what in the SGML Declaration might 
be useful for XML.  Since it is obvious that XML has to be 
changed and the debate is now about how this should be done, 
one might ask:

1.  Should the XML SGML Declaration be real and be open 
to use by XML developers?  Do we go forward only by 
building new Blueberry-capable parsers, or do we 
solve the problem once using SGML facilities more 
deliberately?  It would be prudent to go to a level of 
applying the standard that is deeper than the infamous 
rejoinder from an XML father to the SGML father, 
"I have my own ideas about how standards should be 
used."  That isn't smart.

2.  Should some portion of that remain closed?

3.  Could some portion of it be used for requirements 
such as Blueberry presents? 

4.  Should information about how the declaration could be 
improved be fed back to ISO as part of the review of SGML 
to improve it such that it may better work with XML?  

In other words, pertaining to Leigh Dodds's question as to 
whether XML is pulling away from SGML, it may be the case that 
XML as a subset now has lessons learned that ought to be 
folded back into SGML to converge the international standard 
and the consortium specification.

Bryan states that variant concrete syntax declarations are 
the way to respond when a system is used that is not based on 
the International Reference Version (IRV) character set defined 
in ISO 646, thus requiring alterations to the SYNTAX clause of 
the SGML Declaration.  Three ways are provided:

1.  Specify a public concrete syntax (itself a variant concrete 
syntax) in the SYNTAX clause of the SGML Declaration.

2.  Use the SWITCHES option to modify the reference concrete 
syntax (or another publicly declared syntax)

3.  Completely redefine the SYNTAX clause.  Bryan provides 
an example of an alternative syntax-reference character 
set description for EBCDIC that changes the reference 
concrete syntax.

This makes use of public identifiers.  I am curious whether a 
URI-based identifier might be used, given a stable external 
file format such as you mention, if FORMAL is set to NO in 
the FEATURES clause.

Also, what about the SYSTEM declarations?

Using a SYSTEM declaration we see something such as 
Martin Bryan's example:

SCOPE Instance <!-- indicates system can handle more than one syntax at a time -->

SYNTAX PUBLIC "ISO 8879-1986//SYNTAX Reference//EN"
        CHANGES SWITCHES
SYNTAX  PUBLIC "ISO-1986//SYNTAX MULTICODE Basic//EN"
SYNTAX  PUBLIC "+//IBM//SYNTAX EBCDIC//EN"
        CHANGES  DELIMLEN 3
        SEQUENCE YES
        SRCNT    100
        SRLEN    10

I don't want to trivialize the difficulty.  On the other hand, 
I don't want to see a Blueberry pop up every two years and 
find out "oops, we need yet more of SGML or we need to 
reinvent SGML" or "those HAN characters just aren't business 
requirements so...".  

Len 
http://www.mp3.com/LenBullard

Ekam sat.h, Vipraah bahudhaa vadanti.
Daamyata. Datta. Dayadhvam.h


-----Original Message-----
From: Tony Graham [mailto:Tony.Graham@ireland.sun.com]
Sent: Tuesday, July 17, 2001 10:21 AM
To: xml-dev@lists.xml.org
Subject: RE: Historical I18n Note


Bullard, Claude L (Len) wrote at 16 Jul 2001 14:25:22 -0500:
 > While SDATA is interesting in its own right, the more applicable 
 > part of the SGML Declaration is the document character set 
 > clause that enables a document to contain characters 
 > that are not defined in the document's concrete syntax.  
 > This uses the reserved name 
 > 
 > CHARSET 
 > 
 > followed by one or more character set descriptions. Again 
 > from Martin Bryan: 
 > 
 > "Each character set description consists of a base character 
 > set statement followed by a described character set 
 > portion identifying the roles of individual characters. 
 > 
 > More than one reference (base) character set can be used 
 > to build up a character set description...
 > 
 > When using the document character set clause to create 
 > a translation table for an incoming document it is important 
 > to remember that character references to reassigned codes 
 > will also need to be changed during translation.  For example, 
 > if a document prepared ... is to be transferred to an 
 > EBCDIC-based system, an ISO 646 character reference such as 
 > &#34; in an entity declaration will need to be changed to 
 > &#125;, the EBCDIC code for a quotation mark."
 > 
 > Ok, now, which parts of that are hard and expensive?  Feel 
 > free to fill in details I missed.

Yes, the document character set is defined in terms of characters from
one or more base character sets, but your SGML system works by mapping
the characters in those base character sets to characters in the (one
or more) base character sets that are referenced in the "syntax
reference character set" later in the SGML Declaration.  Actually, in
the syntax portion of the SGML Declaration, you assign roles to
character numbers, and each character number is equated to a character
in a base character set, then in the document character set portion
you define the character numbers that can be used in your document and
map them to characters in a base character set (I'm ignoring
characters defined in terms of minimum literals).  The whole thing
works because of the correspondence of characters in the two lots of
base character sets.
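
As a rough illustration of that correspondence, here is a minimal
Python sketch.  It is not how any real SGML parser is implemented.
It assumes Python's "ascii" codec can stand in for ISO 646 IRV and
its "cp037" codec for one EBCDIC variant; the function and table
names are hypothetical, and only three delimiter roles from the
reference concrete syntax are modelled.

```python
# Simplified model: a "base character set" is a table from character
# number to character. Python codecs stand in for real charset
# definitions (an assumption made for illustration only).
MARKUP_CHARS = "<>&"

def base_set(codec, chars=MARKUP_CHARS):
    """Return {character number: character} for chars in the given codec."""
    return {ch.encode(codec)[0]: ch for ch in chars}

# Syntax side: roles are assigned to character numbers, and each number
# is equated to a character in the syntax reference base set.
SYNTAX_BASE = base_set("ascii")
SYNTAX_ROLES = {60: "STAGO", 62: "TAGC", 38: "ERO"}

# Document side: the character numbers usable in the document, each
# mapped to a character in the document's base set.
DOCUMENT_BASE = base_set("cp037")

def role_of(document_number):
    """Find the syntax role of a document character number, if any."""
    ch = DOCUMENT_BASE.get(document_number)
    if ch is None:
        return None
    # The link between the two sides is the shared character itself.
    for number, syntax_ch in SYNTAX_BASE.items():
        if syntax_ch == ch:
            return SYNTAX_ROLES.get(number)
    return None
```

With these tables, role_of("<".encode("cp037")[0]) finds STAGO even
though the document's number for "<" differs from the number 60 used
on the syntax side.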

The interesting thing is that there was never great agreement on how
to specify the base character sets.  At least one SGML parser worked
with only the character sets that it could recognise from (the decimal
representation of) the charset's ISO 2022 escape sequence in the
charset identifier, and while OmniMark and nsgmls let you map from the
charset identifier to an external file describing the character set,
they each used a different file format for the external file.

So, aside from the fact that character set definitions in the SGML
declaration are incantations to most people, a "novel" character set
definition in a SGML declaration is not necessarily portable.

Also, the definition of numeric character references such as &#34; and 
&#125; has been subject to reinterpretation in recent years: numeric
character references are evaluated in terms of the syntax reference
character set, not the document character set, which is why you can
use &#38; to represent '&' in any XML or HTML document no matter what
encoding you are using.
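
For XML, where character references are defined against ISO/IEC
10646 code points, this can be checked with a small Python sketch
using the standard library's xml.etree.ElementTree.  The helper name
text_of and the sample element are made up for illustration.

```python
import xml.etree.ElementTree as ET

def text_of(encoding):
    """Parse one small document stored in the given encoding and
    return the text content of its root element."""
    source = (
        '<?xml version="1.0" encoding="%s"?>'
        "<t>&#38; &#233;</t>" % encoding
    )
    return ET.fromstring(source.encode(encoding)).text

# The same references are resolved for three different encodings.
results = {enc: text_of(enc) for enc in ("utf-8", "iso-8859-1", "utf-16")}
```

Each entry in results should be the same string: an ampersand, a
space, and U+00E9, whatever bytes the document was stored in.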

Regards,


Tony Graham
------------------------------------------------------------------------
Tony Graham                           mailto:tony.graham@ireland.sun.com
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708
