RE: Historical I18n Note

From: "Bullard, Claude L (Len)" <clbullar@ingr.com>
To: Tim Bray <tbray@textuality.com>, xml-dev@lists.xml.org
Date: Mon, 16 Jul 2001 14:25:22 -0500
Yes, Tim, you have explained your position on this before, based on 
your background as a consultant and programmer. 

 http://lists.w3.org/Archives/Public/w3c-sgml-wg/1996Nov/0141.html

"I have always explained all the benefits of SGML (ISO, 
vendor-independent, platform-independent, content not presentation,
you know the drill).  When I do that, I almost always get the 
Sounds Good Maybe Later response: "SGML is this great big 
complicated technology and we're going to have to hire consultants 
and buy huge expensive pieces of software and it won't work with the 
Web."  I sometimes feel that the SGML community is unaware how 
prevalent this mind-set is.  I've always argued against this, but
have felt to some degree like I'm swimming up-hill.

Lately, I have also been explaining that there is an SGML starter-kit 
called XML, which is small, lightweight (I wave a printout of
the draft spec at them), easy to understand, and designed to work on 
the web.  But you still get data safety and constrained-authoring 
because it's SGML." 

So, SGML still is where we start when looking for extensibility 
solutions that keep us in the standards world and protected 
from privatization.  It isn't always the case that programmer 
sensibility dominates requirements. When management 
issues are worked out, that is, what is a business decision 
and what is done for the good of "principles", one evaluates 
all solutions, reckons costs, then picks a value. 

Extensibility by character set description is the issue.  SGML provides 
for this in the SGML Declaration with character set descriptions. 
Systems with a frozen SGML Declaration (e.g., XML) make this 
unavailable.  At this layer of SGML, XML is 
non-extensible because it hides the Declaration or says, in effect, 
that none exists.  The extensible solution is nasty, but 
so is the cost of the do-overs based on limited perspectives 
about evolutionary requirements.  That solution does exist insofar 
as XML is an SGML subset, so let's look at it.

Just for reasons of historical accuracy, let's see what 
a text available at the time said, and perhaps answer 
Mike's query about what SGML might provide to XML.

From Martin Bryan, SGML:  An Author's Guide to the Standard 
Generalized Markup Language.  Because you bring it up, 
let's start with SDATA:

"It is sometimes necessary to use system-specific information in the 
replacement text for entities.  To allow receiving programs to 
identify expansions that they may not handle in exactly the 
same way on each system, the reserved word SDATA can be used 
to identify entity declarations containing system-specific 
replacement text.  For example the entity declaration used to 
identify the AE ligature might be:

<!ENTITY AElig SDATA "[AElig]" --=capital AE diphthong (ligature) -->

In this case, the program will expand &AElig; to give [AElig] which the 
text formatter will recognize as the coding that generates the 
character AE.  When this declaration is sent to another system, 
however, the SDATA reserved word can be recognized and the receiving 
program can ask its operator to provide the coding needed to generate 
the relevant replacement character(s).

It should be noted, however, that while characters defined as valid 
in the document's character set but as invalid in the document's 
concrete syntax can be included in SDATA entities, non-SGML characters 
that have been declared as unused in the document's character set 
cannot be specified in an SDATA declaration...

(4) specific character data (SDATA) entities that contain characters 
whose role is specific to the local system.

Where the retrieved entity contains data that is not coded in SGML, 
(ie, consists of non-SGML characters, non-parsable character data 
or system-specific information), the entity must be declared as a 
data entity.  This is indicated by placing the appropriate reserved 
word (CDATA, NDATA, or SDATA) immediately after the system identifier
(or the word SYSTEM if no identifier is present) followed by a compulsory 
notation name identifying the type of coding used within the data entity.

...

When the system has finished sending the decoded data to the document, 
it will transmit a special, system dependent, signal known as an 
entity end signal to the SGML parser.  This signal is output by the 
system at the end of each entity to tell the parser that it can 
continue processing the rest of the text.

Note:  The entity end signal is not a control code and need not be 
one of the codes declared within the document's character set.  It 
can be any signal or group of signals recognized by the SGML 
program as an indication that the end of an entity's replacement 
text has been received.

Where an external entity contains character data or other 
system-specific information, its declaration must be 
qualified by a suitable notation name:

<!ENTITY special SYSTEM "b:logotype.174" SDATA "logo" >
<!NOTATION logo SYSTEM "logo generation subsystem" >"
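
To make the AElig case above concrete, here is a minimal sketch of how a 
receiving system might resolve an SDATA entity.  It is not taken from 
Bryan's book or from any SGML parser.  The names ENTITIES, LOCAL_SDATA 
and expand are hypothetical, and the choice of U+00C6 as the local 
coding for "[AElig]" is an assumption for illustration only.

```python
import re

# Hypothetical entity table: name -> (kind, replacement text),
# mirroring <!ENTITY AElig SDATA "[AElig]">.
ENTITIES = {"AElig": ("SDATA", "[AElig]")}

# Hypothetical per-system table: SDATA text -> local coding.
# Each receiving system would supply its own version of this.
LOCAL_SDATA = {"[AElig]": "\u00c6"}

# Simplified entity reference pattern (assumption, not the full SGML rule).
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, entities=ENTITIES, local_sdata=LOCAL_SDATA):
    """Expand entity references, mapping SDATA text to the local coding."""
    def replace(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)  # leave undeclared references untouched
        kind, replacement = entities[name]
        if kind != "SDATA":
            return replacement
        if replacement not in local_sdata:
            # Stands in for "ask the operator to provide the coding".
            raise LookupError(f"no local coding for SDATA text {replacement!r}")
        return local_sdata[replacement]
    return ENTITY_REF.sub(replace, text)
```

The point of the sketch is that the document carries only "[AElig]"; 
the table that turns it into something renderable lives on each system, 
and a missing entry has to be filled in by hand.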

While SDATA is interesting in its own right, the more applicable 
part of the SGML Declaration is the document character set 
clause that enables a document to contain characters 
that are not defined in the document's concrete syntax.  
This uses the reserved name 

CHARSET 

followed by one or more character set descriptions. Again 
from Martin Bryan: 

"Each character set description consists of a base character 
set statement followed by a described character set 
portion identifying the roles of individual characters. 

More than one reference (base) character set can be used 
to build up a character set description...

When using the document character set clause to create 
a translation table for an incoming document it is important 
to remember that character references to reassigned codes 
will also need to be changed during translation.  For example, 
if a document prepared ... is to be transferred to an 
EBCDIC-based system, an ISO 646 character reference such as 
&#34; in an entity declaration will need to be changed to 
&#125, the EBCDIC code for a quotation mark."
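
The renumbering step described in that passage can be sketched as 
follows.  This is only an illustration under stated assumptions: 
Python's "ascii" codec stands in for ISO 646, "cp037" stands in for 
the EBCDIC system, and translate_char_refs is a hypothetical helper, 
not part of any SGML tool.  Only decimal references of the form &#n; 
are handled.

```python
import re

# Decimal character references such as &#34; (assumed form).
CHAR_REF = re.compile(r"&#(\d+);")


def translate_char_refs(text, source="ascii", target="cp037"):
    """Renumber decimal character references from source to target coding."""
    def renumber(match):
        code = int(match.group(1))
        # Interpret the number in the source character set ...
        char = bytes([code]).decode(source)
        # ... and look up the same character's code in the target set.
        new_code = char.encode(target)[0]
        return f"&#{new_code};"
    return CHAR_REF.sub(renumber, text)


print(translate_char_refs("&#34;"))  # prints &#127; under cp037
```

Under cp037 the quotation mark lands on 127, so the result differs from 
the figure in the quoted passage.  Either way, the references inside 
entity declarations have to be rewritten along with the data, which is 
the part of the cost that is easy to forget.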

Ok, now, which parts of that are hard and expensive?  Feel 
free to fill in details I missed.


Len 
http://www.mp3.com/LenBullard

Ekam sat.h, Vipraah bahudhaa vadanti.
Daamyata. Datta. Dayadhvam.h


-----Original Message-----
From: Tim Bray [mailto:tbray@textuality.com]
Sent: Monday, July 16, 2001 12:45 PM
To: xml-dev@lists.xml.org
Subject: Historical I18n Note


At 08:16 AM 16/07/01 -0500, Bullard, Claude L (Len) wrote:
>This is easy.  SGML preserves options using the SGML 
>Declaration.  The options have costs and require skill 
>to handle.  XML is simpler but it removes the options. 
>In SGML, Blueberry is a non-issue

Sorry, Len has now said this about 8 times and just for
reasons of historical accuracy, I have to make the point
that i18n in the SGML context was not quite the bundle
of sweetness-and-light that's presented here.  Anyone
who's ever tried to 

(a) understand what an SDATA entity is, and/or
(b) take a file full of them produced by Vendor A and
    try to figure out how to get them rendered on screen
    or paper by software from another vendor

will know what I'm talking about.  SGML handles these 
issues *in principle* fully & completely by abstracting 
away the notion of a character.  SGML handled a lot of 
issues in principle.  XML's decision to say "a character 
is an atomic unit of text as defined by Unicode, and you
have to support at least these 2 bit encodings" has less
abstract beauty but it's there for a reason, and it buys 
a huge amount of real-world interoperability that no 
previous markup-language system, including SGML, ever 
came close to. -Tim