RE: Historical I18n Note
- From: "Bullard, Claude L (Len)" <email@example.com>
- To: Tim Bray <firstname.lastname@example.org>, email@example.com
- Date: Mon, 16 Jul 2001 14:25:22 -0500
Yes, Tim, you have explained your position on this before, based
on your background as a consultant and programmer.
"I have always explained all the benefits of SGML (ISO,
vendor-independent, platform-independent, content not presentation,
you know the drill). When I do that, I almost always get the
Sounds Good Maybe Later response: "SGML is this great big
complicated technology and we're going to have to hire consultants
and buy huge expensive pieces of software and it won't work with the
Web." I sometimes feel that the SGML community is unaware how
prevalent this mind-set is. I've always argued against this, but
have felt to some degree like I'm swimming up-hill.
Lately, I have also been explaining that there is an SGML starter-kit
called XML, which is small, lightweight (I wave a printout of
the draft spec at them), easy to understand, and designed to work on
the web. But you still get data safety and constrained-authoring
because it's SGML."
So, SGML still is where we start when looking for extensibility
solutions that keep us in the standards world and protected
from privatization. It isn't always the case that programmer
sensibility dominates requirements. When management
issues are worked out, that is, what is a business decision
and what is done for the good of "principles", one evaluates
all solutions, reckons costs, then picks a value.
Extensibility by character set description is the issue. SGML provides
for this in the SGML Declaration with character set descriptions.
Systems with frozen SGML Declarations (e.g., XML) make this
unavailable. At this layer of SGML, XML is
non-extensible, as it hides the Declaration or says, in effect,
that none exists. The extensible solution is nasty, but
so is the cost of the do-overs based on limited perspectives
about evolutionary requirements. The Declaration does exist insofar
as XML is an SGML subset, so let's look at it.
Just for reasons of historical accuracy, let's see what
a text available at the time said, and perhaps answer
Mike's query about what SGML might provide to XML.
From Martin Bryan, SGML: An Author's Guide to the Standard
Generalized Markup Language. Because you bring it up,
let's start with SDATA:
"It is sometimes necessary to use system-specific information in the
replacement text for entities. To allow receiving programs to
identify expansions that they may not handle in exactly the
same way on each system, the reserved word SDATA can be used
to identify entity declarations containing system-specific
replacement text. For example the entity declaration used to
identify the AE ligature might be:
<!ENTITY AElig SDATA "[AElig]" --=capital AE diphthong (ligature) -->
In this case, the program will expand &AElig; to give [AElig] which the
text formatter will recognize as the coding that generates the
character AE. When this declaration is sent to another system,
however, the SDATA reserved word can be recognized and the receiving
program can ask its operator to provide the coding needed to generate
the relevant replacement character(s).
It should be noted, however, that while characters defined as valid
in the document's character set but as invalid in the document's
concrete syntax can be included in SDATA entities, non-SGML characters
that have been declared as unused in the document's character set
cannot be specified in an SDATA declaration...
(4) specific character data (SDATA) entities that contain characters
whose role is specific to the local system.
Where the retrieved entity contains data that is not coded in SGML,
(ie, consists of non-SGML characters, non-parsable character data
or system-specific information), the entity must be declared as a
data entity. This is indicated by placing the appropriate reserved
word (CDATA, NDATA, or SDATA) immediately after the system identifier
(or the word SYSTEM if no identifier is present) followed by a compulsory
notation name identifying the type of coding used within the data entity.
When the system has finished sending the decoded data to the document,
it will transmit a special, system dependent, signal known as an
entity end signal to the SGML parser. This signal is output by the
system at the end of each entity to tell the parser that it can
continue processing the rest of the text.
Note: The entity end signal is not a control code and need not be
one of the codes declared within the document's character set. It
can be any signal or group of signals recognized by the SGML
program as an indication that the end of an entity's replacement
text has been received.
Where an external entity contains character data or other
system-specific information, its declaration must be
qualified by a suitable notation name:
<!ENTITY special SYSTEM "b:logotype.174" SDATA "logo" >
<!NOTATION logo SYSTEM "logo generation subsystem" >"
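To make the SDATA mechanism in that passage concrete, here is a minimal
Python sketch of how a receiving system might resolve SDATA entities
through its own lookup table. It is an illustration under stated
assumptions, not how any particular SGML parser works: the names
ENTITIES, LOCAL_SDATA and expand, and the "TEXT" kind label, are all
hypothetical.

```python
import re

# Declared entities as a parser might hold them: name -> (kind, replacement text).
# The "TEXT" label for ordinary entities is an assumption made for this sketch.
ENTITIES = {
    "AElig": ("SDATA", "[AElig]"),
    "sgml": ("TEXT", "Standard Generalized Markup Language"),
}

# Hypothetical per-system table: SDATA replacement text -> local coding.
# Every receiving system would have to supply its own version of this.
LOCAL_SDATA = {"[AElig]": "\u00c6"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, entities=ENTITIES, local_sdata=LOCAL_SDATA):
    """Expand entity references, routing SDATA text through the local table."""
    def replace(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)      # undeclared: leave the reference alone
        kind, value = entities[name]
        if kind != "SDATA":
            return value               # ordinary replacement text
        if value not in local_sdata:
            # The point where the book says the operator must be asked.
            raise LookupError("no local coding for SDATA text %r" % value)
        return local_sdata[value]
    return ENTITY_REF.sub(replace, text)
```

Calling expand("&AElig;sop") on a system whose table knows "[AElig]"
yields the ligature followed by "sop"; on a system whose table lacks
that key it raises LookupError, which is the vendor-to-vendor gap
described in the message quoted below.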
While SDATA is interesting in its own right, the more applicable
part of the SGML Declaration is the document character set
clause that enables a document to contain characters
that are not defined in the document's concrete syntax.
This uses the reserved name CHARSET
followed by one or more character set descriptions. Again
from Martin Bryan:
"Each character set description consists of a base character
set statement followed by a described character set
portion identifying the roles of individual characters.
More than one reference (base) character set can be used
to build up a character set description...
When using the document character set clause to create
a translation table for an incoming document it is important
to remember that character references to reassigned codes
will also need to be changed during translation. For example,
if a document prepared ... is to be transferred to an
EBCDIC-based system, an ISO 646 character reference such as
&#34; in an entity declaration will need to be changed to
&#125;, the EBCDIC code for a quotation mark."
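The renumbering step in that last paragraph can be sketched in a few
lines of Python. This is only an illustration under assumptions: it
uses Python's built-in "ascii" codec as a stand-in for ISO 646 and
"cp037" as one EBCDIC code page among several, and the function name
remap_char_refs is hypothetical.

```python
import re

NUMERIC_REF = re.compile(r"&#(\d+);")


def remap_char_refs(text, source="ascii", target="cp037"):
    """Rewrite numeric character references from one code set to another."""
    def replace(match):
        # Find which character the number names in the source set...
        char = bytes([int(match.group(1))]).decode(source)
        # ...then emit that character's code in the target set.
        return "&#%d;" % char.encode(target)[0]
    return NUMERIC_REF.sub(replace, text)
```

With these two codecs, remap_char_refs("&#34;") gives "&#127;" for the
double quotation mark, and "&#39;" gives "&#125;" for the apostrophe.
Other EBCDIC code pages may assign different numbers, so the table has
to be chosen per receiving system.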
Ok, now, which parts of that are hard and expensive? Feel
free to fill in details I missed.
Ekam sat.h, Vipraah bahudhaa vadanti.
Daamyata. Datta. Dayadhvam.h
From: Tim Bray [mailto:firstname.lastname@example.org]
Sent: Monday, July 16, 2001 12:45 PM
Subject: Historical I18n Note
At 08:16 AM 16/07/01 -0500, Bullard, Claude L (Len) wrote:
>This is easy. SGML preserves options using the SGML
>Declaration. The options have costs and require skill
>to handle. XML is simpler but it removes the options.
>In SGML, Blueberry is a non-issue
Sorry, Len has now said this about 8 times and just for
reasons of historical accuracy, I have to make the point
that i18n in the SGML context was not quite the bundle
of sweetness-and-light that's presented here. Anyone
who's ever tried to
(a) understand what an SDATA entity is, and/or
(b) take a file full of them produced by Vendor A and
try to figure out how to get them rendered on screen
or paper by software from another vendor
will know what I'm talking about. SGML handles these
issues *in principle* fully & completely by abstracting
away the notion of a character. SGML handled a lot of
issues in principle. XML's decision to say "a character
is an atomic unit of text as defined by Unicode, and you
have to support at least these 2 bit encodings" has less
abstract beauty but it's there for a reason, and it buys
a huge amount of real-world interoperability that no
previous markup-language system, including SGML, ever
came close to. -Tim