RE: Historical I18n Note
- From: "Bullard, Claude L (Len)" <clbullar@ingr.com>
- To: Tony Graham <Tony.Graham@ireland.sun.com>, xml-dev@lists.xml.org
- Date: Tue, 17 Jul 2001 11:18:18 -0500
Thanks for responding, Tony. I have to open up the books to
sort this out, given that the SGML Declaration was, for me
and most SGML users, I suspect, a black box. When I've had
to tweak one in the past, it usually wasn't character set
descriptions.
Latin liturgies sound like incantations if one isn't trained
in Latin. That doesn't absolve us from trying our best to
explore this aspect of the foundations of XML and to ask
whether we might improve it, given what is emerging from member
requirements, by asking what might be useful in the SGML
Declaration for XML. It being obvious that XML has to be changed,
and the debate now being how this should be done, one might ask:
1. Should the XML SGML Declaration be real and be open
to use by XML developers? Do we go forward only by
building new Blueberry-capable parsers, or do we
solve the problem once using SGML facilities more
deliberately? It would be prudent to go to a level of
applying the standard that is deeper than the infamous
rejoinder from an XML father to the SGML father,
"I have my own ideas about how standards should be
used." That isn't smart.
2. Should some portion of that remain closed?
3. Could some portion of it be used for requirements
such as Blueberry presents?
4. Should information about how the declaration could be
improved be fed back to ISO as part of the review of SGML
to improve it such that it may better work with XML?
In other words, pertaining to Leigh Dodds' question as to
whether XML is pulling away from SGML, it may be the case that
XML as a subset now has lessons learned that ought to be
folded back into SGML to converge the international standard
and the consortium specification.
Bryan states that variant concrete syntax declarations
are the way to respond when a system not based on the International
Reference Version (IRV) character set defined in ISO 646 is used,
thus requiring alterations to the SYNTAX clause of the SGML
Declaration. Three ways are provided:
1. In the SYNTAX clause of the SGML Declaration, specify a public
concrete syntax (itself a variant concrete syntax)
2. Use the SWITCHES option to modify the reference concrete
syntax (or another publicly declared syntax)
3. Completely redefine the SYNTAX clause. Bryan provides
an example of an alternative syntax-reference character
set description for EBCDIC that changes the reference
concrete syntax.
This makes use of public identifiers. I am curious whether a
URI-based identifier might be used, given a stable external
file format such as you mention, if FORMAL is set to NO in
the FEATURES clause.
Also, what about the SYSTEM declarations?
Using a SYSTEM declaration we see something such as
Martin Bryan's example:
SCOPE Instance <!-- indicates system can handle more than one syntax at a time -->
SYNTAX PUBLIC "ISO 8879-1986//SYNTAX Reference//EN"
CHANGES SWITCHES
SYNTAX PUBLIC "ISO 8879-1986//SYNTAX MULTICODE Basic//EN"
SYNTAX PUBLIC "+//IBM//SYNTAX EBCDIC//EN"
CHANGES DELIMLEN 3
SEQUENCE YES
SRCNT 100
SRLEN 10
I don't want to trivialize the difficulty. On the other hand,
I don't want to see a Blueberry pop up every two years and
find out "oops, we need yet more of SGML or we need to
reinvent SGML" or "those HAN characters just aren't business
requirements so...".
Len
http://www.mp3.com/LenBullard
Ekam sat.h, Vipraah bahudhaa vadanti.
Daamyata. Datta. Dayadhvam.h
-----Original Message-----
From: Tony Graham [mailto:Tony.Graham@ireland.sun.com]
Sent: Tuesday, July 17, 2001 10:21 AM
To: xml-dev@lists.xml.org
Subject: RE: Historical I18n Note
Bullard, Claude L (Len) wrote at 16 Jul 2001 14:25:22 -0500:
> While SDATA is interesting in its own right, the more applicable
> part of the SGML Declaration is the document character set
> clause that enables a document to contain characters
> that are not defined in the document's concrete syntax.
> This uses the reserved name
>
> CHARSET
>
> followed by one or more character set descriptions. Again
> from Martin Bryan:
>
> "Each character set description consists of a base character
> set statement followed by a described character set
> portion identifying the roles of individual characters.
>
> More than one reference (base) character set can be used
> to build up a character set description...
>
> When using the document character set clause to create
> a translation table for an incoming document it is important
> to remember that character references to reassigned codes
> will also need to be changed during translation. For example,
> if a document prepared ... is to be transferred to an
> EBCDIC-based system, an ISO 646 character reference such as
> &#34; in an entity declaration will need to be changed to
> &#125;, the EBCDIC code for a quotation mark."
>
> Ok, now, which parts of that are hard and expensive? Feel
> free to fill in details I missed.
Yes, the document character set is defined in terms of characters from
one or more base character sets, but your SGML system works by mapping
the characters in those base character sets to characters in the (one
or more) base character sets that are referenced in the "syntax
reference character set" later in the SGML Declaration. Actually, in
the syntax portion of the SGML Declaration, you assign roles to
character numbers, and each character number is equated to a character
in a base character set, then in the document character set portion
you define the character numbers that can be used in your document and
map them to characters in a base character set (I'm ignoring
characters defined in terms of minimum literals). The whole thing
works because of the correspondence of characters in the two lots of
base character sets.
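
The correspondence described above can be sketched in a few lines of
Python. This is an illustration only, not how any particular SGML parser
works: the table names (SYNTAX_REFERENCE, DOCUMENT_CHARSET, ROLES), the
helper functions, and the three-character subset are all hypothetical, and
the EBCDIC numbers assume the common code page 037 assignments.

```python
# Hypothetical, heavily simplified model (not a real parser's data structures).
# A "described character set" maps a local character number to a character
# in a base character set, written here as (base_set_name, number).
ISO_646 = "ISO 646 IRV"

# Syntax reference character set: the numbers that roles are assigned to.
SYNTAX_REFERENCE = {
    34: (ISO_646, 34),  # quotation mark
    38: (ISO_646, 38),  # ampersand
    60: (ISO_646, 60),  # less-than sign
}

# Roles assigned to syntax-reference numbers (reference concrete syntax).
ROLES = {34: "LIT", 38: "ERO", 60: "STAGO"}

# Document character set for an EBCDIC-coded document (assumed code page 037
# numbers), described in terms of the same base character set.
DOCUMENT_CHARSET = {
    127: (ISO_646, 34),
    80: (ISO_646, 38),
    76: (ISO_646, 60),
}


def syntax_number_for(doc_number):
    """Return the syntax-reference number sharing a base character with doc_number."""
    base_char = DOCUMENT_CHARSET.get(doc_number)
    if base_char is None:
        return None
    for number, described in SYNTAX_REFERENCE.items():
        if described == base_char:
            return number
    return None


def role_of(doc_number):
    """Return the delimiter role a document character number plays, if any."""
    return ROLES.get(syntax_number_for(doc_number))


if __name__ == "__main__":
    for n in sorted(DOCUMENT_CHARSET):
        print(n, syntax_number_for(n), role_of(n))
```

The two tables never refer to each other directly; they meet only through
the shared base character set, which is the point of the paragraph above.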
The interesting thing is that there was never great agreement on how
to specify the base character sets. At least one SGML parser worked
with only the character sets that it could recognise from (the decimal
representation of) the charset's ISO 2022 escape sequence in the
charset identifier, and while OmniMark and nsgmls let you map from the
charset identifier to an external file describing the character set,
they each used a different file format for the external file.
So, aside from the fact that character set definitions in the SGML
declaration are incantations to most people, a "novel" character set
definition in a SGML declaration is not necessarily portable.
Also, the definition of numeric character references such as &#34; and
&#125; has been subject to reinterpretation in recent years: numeric
character references are evaluated in terms of the syntax reference
character set, not the document character set, which is why you can
use &#38; to represent '&' in any XML or HTML document no matter what
encoding you are using.
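
A small Python sketch of that last point, under stated assumptions: it
treats the fixed reference character set as Unicode (as XML does) and
handles only decimal references. The names resolve_char_refs, round_trip,
and SOURCE are hypothetical; the codec names ascii, utf-16, and cp037 (an
EBCDIC code page) are standard Python codecs.

```python
import re


def resolve_char_refs(text):
    """Replace decimal references such as &#38; with the character they number.

    The number is looked up in one fixed character set (Unicode, via chr),
    never in the encoding the document happened to be stored in.
    """
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)


SOURCE = "AT&#38;T"


def round_trip(encoding):
    """Store SOURCE in the given encoding, read it back, then resolve references."""
    stored = SOURCE.encode(encoding)
    return resolve_char_refs(stored.decode(encoding))


if __name__ == "__main__":
    for encoding in ("ascii", "utf-16", "cp037"):
        print(encoding, SOURCE.encode(encoding).hex(), round_trip(encoding))
```

The stored bytes differ from one encoding to the next, but the reference
resolves to the same character each time because the number is never
interpreted against the storage encoding.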
Regards,
Tony Graham
------------------------------------------------------------------------
Tony Graham mailto:tony.graham@ireland.sun.com
Sun Microsystems Ireland Ltd Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3 x(70)19708