RE: Historical I18n Note

From: Tony Graham <Tony.Graham@ireland.sun.com>
To: xml-dev@lists.xml.org
Date: Wed, 18 Jul 2001 11:34:52 +0100 (BST)
Bullard, Claude L (Len) wrote at 17 Jul 2001 11:18:18 -0500:
 > Latin liturgies sound like incantations if one isn't trained 
 > for Latin.  It doesn't absolve us from trying our best to 

I guess I shouldn't have included that line about incantations, since
my main point was that the level of support for arbitrary character
sets among SGML parsers was mixed, to put it mildly.

Of course, there was neither the emphasis on nor the knowledge of
multiple character sets when SGML was designed or when most of the
SGML parsers were written.  The SGML Declaration's character set
definitions really only became useful with large character sets when
it adopted the ERCS proposed by Rick Jelliffe (although I and many
other people processed a lot of Chinese, Japanese, and Korean SGML on
8-bit clean SGML parsers without the SGML Declaration being any the
wiser).  SGML was designed with ISO 2022 in mind (see the
definitions of MSOCHAR, MSICHAR, MSSCHAR, and FUNCHAR in the SGML
Handbook), which, in a way, would make SGML well suited for Internet
protocols that use ISO 2022-based character sets.  However, the
current interpretation of SGML normalises all of that character set
switching before the characters are compared against the document
character set, so the SGML Declaration deals with abstract character
numbers (scalar values, in Unicode terms), not the numeric values of
the bytes used to encode the characters.
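
As a rough illustration of that last point (using Python's
iso2022_jp codec as a stand-in, not an SGML parser), the escape
sequences are present in the byte stream but are gone by the time you
have character numbers to compare against a character set:

```python
# Sketch: ISO 2022 escape sequences exist only at the byte level.
text = "SGML \u65e5\u672c"           # "SGML" plus two kanji, U+65E5 and U+672C

encoded = text.encode("iso2022_jp")  # byte stream with ISO 2022 escapes
decoded = encoded.decode("iso2022_jp")

# Abstract character numbers: one per character, whatever the encoding.
char_numbers = [ord(c) for c in decoded]

print(encoded)                       # raw bytes, including ESC (0x1B) sequences
print([hex(n) for n in char_numbers])
```

The byte stream contains ESC (0x1B) to switch character sets, but no
character number 27 appears in the decoded list.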

 > explore this aspect of the foundations of XML to inquire 
 > if we might improve it given what is emerging from member 
 > requirements in the context of asking 
 > what might be useful in the SGML Declaration for XML.  It 
 > being obvious that XML has to be changed and now the debate 
 > is how this should be done, one might ask:
 > 
 > 1.  Should the XML SGML Declaration be real and be open 
 > to use by XML developers?  Do we go forward only by 

No.  There's too much stuff that you would never change, because
changing it would break XML interoperability.

When I described the SGML Declaration for XML 1.0 in my book, I
covered the character set stuff and omitted the rest as not relevant
to SGML systems that support Unicode.

 > building new Blueberry-capable parsers, or do we 
 > solve the problem once using SGML facilities more 
 > deliberately?  It would be prudent to go to a level of 
 > applying the standard that is deeper than the infamous 
 > rejoinder from an XML father to the SGML father, 
 > "I have my own ideas about how standards should be 
 > used."  That isn't smart.
 > 
 > 2.  Should some portion of that remain closed?

You shouldn't use it.  Since I've never understood why the SGML
Declaration isn't written in SGML, I think a hypothetical SGML
Declaration equivalent for XML should be written in XML.  I don't
think you can convince many people of the need for a new SGML
Declaration for XML, and I don't think that you could convince many of 
those to use something that isn't itself XML.

 > 3.  Could some portion of it be used for requirements 
 > such as Blueberry presents? 

You might use some of the ideas that an SGML Declaration represents,
but its syntax is appalling.

 > 4.  Should information about how the declaration could be 
 > improved be fed back to ISO as part of the review of SGML 
 > to improve it such that it may better work with XML?  

An SGML Declaration is capable of expressing the naming rules that
Blueberry proposes, but it seems to me that you can't add &#x85; (NEXT
LINE) as a line delimiter alongside &#xA; (LINE FEED) and &#xD;
(CARRIAGE RETURN) because you can assign only one character to the RS
(Record start character) role and one to the RE (Record end character) 
role, and those are currently assigned to &#xA; and &#xD;,
respectively.  You could, however, declare &#x85; as a SEPCHAR
(Separator character) alongside &#x9; (HORIZONTAL TABULATION) for much 
the same effect.
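
A toy model of that constraint, purely as a sketch: the class and
the function name "NEL" are made up for illustration, and nothing here
is real SGML Declaration syntax.  RS and RE each hold exactly one
character number, while any number of named characters can be given
the SEPCHAR class:

```python
from dataclasses import dataclass, field


@dataclass
class FunctionChars:
    """Toy model of the function-character roles discussed above."""
    rs: int = 0x0A   # Record start: exactly one character number
    re: int = 0x0D   # Record end: exactly one character number
    # Characters of class SEPCHAR: as many named entries as you like.
    sepchars: dict = field(default_factory=lambda: {"TAB": 0x09})

    def record_delimiters(self) -> set:
        # Only two slots exist, so there is nowhere to put a third character.
        return {self.rs, self.re}

    def separators(self) -> set:
        return set(self.sepchars.values())


xml10 = FunctionChars()

blueberry = FunctionChars()
blueberry.sepchars["NEL"] = 0x85   # "NEL" is a hypothetical function name

print(sorted(hex(c) for c in blueberry.record_delimiters()))
print(sorted(hex(c) for c in blueberry.separators()))
```

In this model &#x85; ends up among the separators, not among the
record delimiters, which is the "much the same effect" workaround.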

 > In other words, pertaining to Leigh Dodd's question as to 
 > is this XML pulling away from SGML, it may be the case that 
 > XML as a subset now has lessons learned that ought to be 
 > folded back into SGML to converge the international standard 
 > and the consortium specification.

There's nothing particularly significant about having to change the
set of characters that are allowed in names.

Supporting three line separator characters when there are only two
record separator character roles might be a problem, but it remains to 
be seen whether a majority of the people who can decide the question
for XML think that having three line separator characters is necessary 
for XML.

 > Bryan states that the variant concrete syntax declarations 
 > are the way to respond when a system not based on the International 
 > Reference Version (IRV) character set defined in ISO 646 is used 
 > thus requiring alterations to the SYNTAX clause of the SGML 
 > Declaration.  Three ways are provided:
 > 
 > 1.  in the SYNTAX clause of the SGML Declaration, a public 
 > concrete syntax is specified (itself, a variant concrete syntax)  

That just saves space in the SGML Declaration, since what you would
put in the SYNTAX clause is now in an external file (or built into the 
SGML parser).  Only the SYNTAX clause would differ between XML 1.0
and Blueberry, so you'd end up with separate SGML Declaration files
that refer to separate syntax files.

 > 2.  Use the SWITCHES option to modify the reference concrete 
 > syntax (or another publicly declared syntax)

No.  SWITCHES changes the role of a specific character number.  For
both name characters and line delimiters, Blueberry proposes adding
more characters, but you can't switch in a new name character, for
example, without switching out an old one.
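
To make the one-in, one-out point concrete, here is a minimal sketch
that models SWITCHES as nothing more than pairwise substitution of
character numbers.  The function, the two-character starting set, and
the choice of &#xB7; are assumptions for illustration only:

```python
def apply_switches(role_chars, switches):
    """Swap character numbers pairwise; the role never gains members."""
    result = set(role_chars)
    for old, new in switches:
        if old in result:
            result.remove(old)   # the old character loses the role...
            result.add(new)      # ...and the new character takes it over
    return result


# Illustrative subset of name characters: '-' (0x2D) and '.' (0x2E).
name_chars = {0x2D, 0x2E}

# Try to bring in MIDDLE DOT (0xB7) by switching it with '.'.
switched = apply_switches(name_chars, [(0x2E, 0xB7)])

print(sorted(hex(c) for c in switched))
print(len(name_chars), len(switched))
```

The set is the same size before and after, so a mechanism like this
cannot express "the old name characters plus some new ones".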

 > 3.  Completely redefine the SYNTAX clause.  Bryan provides 
 > an example of an alternative syntax-reference character 
 > set description for EBCDIC that changes the reference 
 > concrete syntax.

That's what you'd have to do.

 > This makes use of public identifiers.  I am curious if a 
 > URI based identifier might be used if a stable external 
 > file format were provided such as you mention if formal 
 > is set to NO in the features clause.

The SGML Declaration has always identified things by name, not by
location (where the ISO 2022 escape sequences in CHARSET identifiers
are really just an alternative name, I suppose).  Also, identifiers in
the SGML declaration are currently limited to "minimum literals",
which is a different set of characters to those allowed in URLs.

 > Also, what about the SYSTEM declarations?

And you thought SGML Declarations weren't widely understood!

 > Using a SYSTEM declaration we see something such as 
 > Martin Bryan's example:
 > 
 > SCOPE Instance <!-- indicates system can handle more than one syntax at a time -->
 > 
 > SYNTAX PUBLIC "ISO 8879-1986//SYNTAX Reference//EN"
 >         CHANGES SWITCHES
 > SYNTAX  PUBLIC "ISO-1986//SYNTAX MULTICODE Basic//EN"
 > SYNTAX  PUBLIC "+//IBM//SYNTAX EBCDIC//EN"
 >         CHANGES  DELIMLEN 3
 >         SEQUENCE YES
 >         SRCNT    100
 >         SRLEN    10

If you wrote separate syntax clauses for XML 1.0 and Blueberry and
gave them separate identifiers, then an XML processor that wanted to
behave like an SGML parser could provide a System Declaration that
stated which syntax clauses it supported.

The System Declaration, even more so than the SGML Declaration, is
meant for people to read, since if you don't read the System
Declaration and give the SGML system something that the software can't
support, the software will just choke and die.

Over the years, people have proposed various schemes for documenting
the capabilities of XML processors that have all reminded me of SGML's
System Declaration, and indicating Blueberry support or lack of it is
probably best left to such an XML mechanism because there's a lot of
stuff in a System Declaration that will never change for XML and that
is of absolutely no interest to someone checking on Blueberry support.

 > I don't want to trivialize the difficulty.  On the other hand, 
 > I don't want to see a Blueberry pop up every two years and 
 > find out "oops, we need yet more of SGML or we need to 
 > reinvent SGML" or "those HAN characters just aren't business 
 > requirements so...".  

Yes, you can describe post-Blueberry XML using an SGML Declaration
(although you might need to fudge on &#x85;), but since there's so
much stuff in an SGML Declaration that will never change for XML, I
question why you'd want to add parsing SGML Declarations to all XML
processors.

As John Cowan pointed out in a post a while ago, in SGML you can now
refer to an SGML Declaration rather than having to include the SGML
Declaration in the input stream the way that you used to.  (I haven't
actually seen that implemented by any SGML parser, but nor have I
looked very hard.)  If you really wanted to base post-Blueberry XML on
a post-Blueberry SGML Declaration, then you could standardise the
identifier for the post-Blueberry SGML Declaration and include the
SGML Declaration reference in every post-Blueberry XML file (which
would certainly be sufficient to stop XML 1.0 processors from using
the file).  The post-Blueberry SGML Declaration could be assumed to be 
built in to the XML processor (or obtainable by dereferencing the
name, for systems that care to implement it that way).

...
 > -----Original Message-----
 > From: Tony Graham [mailto:Tony.Graham@ireland.sun.com]
 > Sent: Tuesday, July 17, 2001 10:21 AM
 > To: xml-dev@lists.xml.org
 > Subject: RE: Historical I18n Note
...
 > Also, the definition of numeric character references such as &#34; and 
 > &#125; has been subject to reinterpretation in recent years: numeric
 > character references are evaluated in terms of the syntax reference
 > character set, not the document character set, which is why you can
 > use &#38; to represent '&' in any XML or HTML document no matter what
 > encoding you are using.

Oops, wrong.  What I should have said (prompted by a post by Lars
Marius Garshol on the Unicode mailing list) is that numeric character
references refer to characters in the document character set.
Whatever "character encoding" or "storage representation of
characters" you use just has to be mappable to whatever character
representation the SGML system cares to use, provided that it can
represent every character in your document character set.  &#38;, no
matter what document it appears in, refers to character number 38 in
the document character set.  What bit or byte value the SGML system
uses internally to represent character number 38 isn't your concern,
just as you don't have to worry about what internal representation
your XML processor uses for characters.
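
A quick way to see this for XML, sketched with the ElementTree parser
from Python's standard library (the element name "t" and the list of
encodings are arbitrary choices for the example): the same document
text is stored in four different encodings, and &#38; and &#233; come
out as the same two characters every time.

```python
import xml.etree.ElementTree as ET

ENCODINGS = ["us-ascii", "iso-8859-1", "utf-8", "utf-16"]


def parse_in(encoding):
    """Store the same document in the given encoding and parse it."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><t>&#38;&#233;</t>'
    # The bytes differ per encoding; the character references do not.
    return ET.fromstring(doc.encode(encoding)).text


results = {enc: parse_in(enc) for enc in ENCODINGS}

for enc, text in results.items():
    print(enc, [ord(c) for c in text])
```

Each line of output lists character numbers 38 and 233, even for
us-ascii, whose bytes could not represent character 233 directly.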

Regards,


Tony Graham
------------------------------------------------------------------------
Tony Graham                           mailto:tony.graham@ireland.sun.com
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708