RE: Historical I18n Note

From: Tony Graham <Tony.Graham@ireland.sun.com>
To: xml-dev@lists.xml.org
Date: Wed, 18 Jul 2001 19:19:04 +0100 (BST)
Bullard, Claude L (Len) wrote at 18 Jul 2001 09:43:35 -0500:
 > From: Tony Graham [mailto:Tony.Graham@ireland.sun.com]
...
 > >Of course, there was neither the emphasis on nor the knowledge of
 > >multiple character sets when SGML was designed or when most of the
 > >SGML parsers were written.  
 > 
 > But the abstractions "in principle" reveal some foresight in design. 

That's true, but, for example, the original need to specify character
numbers for both uppercase and lowercase forms of any characters that
you add to names (and the lack of a mechanism to specify that any of
A-Z and a-z are not allowed in names) shows that the foresight only
saw so far.  I think my previous statement still stands, but the fact
that SGML did later change to include the ERCS that made it easier to
make declarations for large, caseless character sets does show that
the SGML designers followed through on the intent of the original
design (even if maybe only one parser implemented ERCS).

 > Again, the Declaration is the ultimate escape hatch:  use wisely 
 > and with regard to costs.   CALS systems usually had to specify 
 > the Declaration in effect.  No one said it was simple but no one assumed 
 > a priori a single universal system.  I think it is that assumption 
 > by Berners-Lee et al that drives W3C design.  I think it is an 
 > optimistic assumption even if necessary.  But assuming we don't 
 > need that escape hatch is beyond optimistic and into foolhardy. 

I'm not doubting that the full generality of the SGML Declaration's
character set definition mechanism is useful, I'm just doubting that
you'll see it implemented in every XML processor.

That's not to say that everything that an SGML Declaration lets you
specify is wonderful.  I used to really like being able to do
concurrent markup (even if the one SGML parser that supported it
didn't support it according to the standard), but anyone who uses
DATATAG ("to some extent an accident of history" according to the SGML
Handbook) or RANK ("a concession to application design practices in
the early days of generic coding") needs to have a long, hard look at
their requirements.

...
 >  > 2.  Should some portion of that remain closed?
 > 
 > >You shouldn't use it.  Since I've never understood why the SGML
 > >Declaration isn't written in SGML, I think a hypothetical SGML
 > >Declaration equivalent for XML should be written in XML. 
 > 
 > It requires the reference concrete syntax.  

I think that you're confusing syntax and syntax.  The reference
concrete syntax is the default rules about what's a name character,
the maximum length of names, the maximum number of attributes on an
element, the maximum length of an attribute value, etc.

The syntax of the SGML Declaration as keywords and values separated by
whitespace was a design decision by SGML's designers.  The main
argument that I used to hear against using SGML markup in the SGML
declaration was that you would need an SGML parser to bootstrap an
SGML parser.

Right now the SGML Declaration is in a format that you have to parse
with a hardwired parser.  That hardwired parser has to recognise '<'
and '>' in the SGML Declaration because the Declaration is delimited
by them.  I've never understood why the SGML Declaration isn't some
really limited SGML markup format.  That would require a different
hardwired parser, but the SGML Declaration would have been in a form
that the people who use SGML were familiar with.
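
To make the "hardwired parser" point concrete, here is a minimal sketch
of what hand-parsing one keywords-and-whitespace portion involves.  The
fragment is the FUNCTION portion as it appears in the reference
concrete syntax; the helper names tokenize and parse_function are made
up for illustration, and comments, quoted literals, and the rest of the
Declaration are ignored.

```python
# FUNCTION portion as given in the reference concrete syntax.
FRAGMENT = """
FUNCTION RE          13
         RS          10
         SPACE       32
         TAB SEPCHAR  9
"""

# Class keywords allowed for added function characters.
ADDED_CLASSES = {"FUNCHAR", "MSICHAR", "MSOCHAR", "MSSCHAR", "SEPCHAR"}


def tokenize(text):
    """Split on whitespace (simplification: no comments, no literals)."""
    return text.split()


def parse_function(tokens):
    """Hand-written parse: RE, RS, SPACE in that order, then added functions."""
    pos = 0

    def expect(keyword):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != keyword:
            found = tokens[pos] if pos < len(tokens) else "end of input"
            raise ValueError(f"expected {keyword}, found {found}")
        pos += 1

    def number():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise ValueError("expected a character number")
        value = int(tokens[pos])
        pos += 1
        return value

    expect("FUNCTION")
    required = {}
    for name in ("RE", "RS", "SPACE"):  # the order is fixed
        expect(name)
        required[name] = number()

    added = []
    while pos < len(tokens):
        # Each added function is: name, class keyword, character number.
        if pos + 1 >= len(tokens) or tokens[pos + 1] not in ADDED_CLASSES:
            raise ValueError(f"malformed added function at {tokens[pos]}")
        name, cls = tokens[pos], tokens[pos + 1]
        pos += 2
        added.append((name, cls, number()))
    return required, added


if __name__ == "__main__":
    print(parse_function(tokenize(FRAGMENT)))
```

Every ordering rule and keyword is baked into the code, which is the
sense in which such a parser is hardwired: none of it can be driven by
a grammar that the Declaration itself supplies.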

 > >I don't think you can convince many people of the need for a new SGML
 > >Declaration for XML, and I don't think that you could convince many of 
 > >those to use something that isn't itself XML.
 > 
 > Times change and so do requirements.  Today the alternative is
 > yetAnotherMagicName 
 > inside the file or to turn the names into syntax puree (relax the 
 > draconian parse).   Again, one might really want to use the standard 
 > as intended instead of how personally interpreted.   That is the 
 > Bad Thing About XML: privatization of public assets by consortia 
 > with a follow on distortion of the perception of the need for 
 > international standards.   We aren't doing ourselves 
 > or our heirs any favors with that policy or practice.   We can 
 > logically justify something based on current systems, 
 > but that won't make it right.
 > 
 >  >> 3.  Could some portion of it be used for requirements 
 >  >> such as Blueberry presents? 
 > 
 > >You might use some of the ideas that an SGML Declaration represents,
 > >but its syntax is appalling.
 > 
 > Please clarify:  the reference concrete syntax is appalling?  Why?

The reference concrete syntax is appalling because names are limited
to eight characters, you can only use A-Z, a-z, '-', and '.' in names,
and '_' isn't allowed in names.
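
As a rough sketch of how tight those limits are, the following check
applies the reference concrete syntax naming rules to a few candidate
names.  Two details are assumptions drawn from my reading of ISO 8879
rather than from the paragraph above: a name must start with a letter,
and digits are accepted after the first character.  The function name
is_rcs_name is made up for illustration.

```python
import string

NAMELEN = 8  # reference quantity set limit on name length

NAME_START = set(string.ascii_letters)
# Assumption: digits count as name characters alongside '-' and '.'.
NAME_CHARS = NAME_START | set(string.digits) | set("-.")


def is_rcs_name(candidate):
    """Return True if candidate is a valid name under the rules above."""
    return (
        0 < len(candidate) <= NAMELEN
        and candidate[0] in NAME_START
        and all(ch in NAME_CHARS for ch in candidate)
    )


if __name__ == "__main__":
    for name in ("chapter", "para-1", "my_name", "paragraph", "-abc"):
        print(name, is_rcs_name(name))
```

Here "my_name" fails because of the underscore, "paragraph" fails
because it is nine characters long, and "-abc" fails because it does
not start with a letter.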

However, what I was saying was appalling is the keywords and
whitespace nature of the SGML declaration itself.

Quick Quiz (answers below):

1. What is the correct order of SPACE, RS, and RE in the FUNCTION
   portion:

   (a) It doesn't matter
   (b) RS, RE, SPACE
   (c) RE, RS, SPACE
   (d) SPACE, RE, RS
   (e) SPACE, RS, RE

2. What does "GENERAL YES" mean:

   (a) Names are case sensitive
   (b) Names are not case sensitive

3. What is the correct order of GENERAL and ENTITY in the NAMECASE
   portion:

   (a) It doesn't matter
   (b) GENERAL, ENTITY
   (c) ENTITY, GENERAL

4. What is the correct order of the General Delimiters in the GENERAL
   portion:

   (a) It doesn't matter
   (b) It does matter but the list is too long to go into here

5. What's the difference between the two uses of GENERAL in the SGML
   Declaration?

6. What's the difference between the two uses of CHARSET in the SGML
   Declaration?

7. The SYNTAX portion starts with the SYNTAX keyword.  Where does it
   end?


FWIW, I had to look up the answers to some of my own questions, and I
used to give tutorials on this stuff.

I contend that part of why the SGML Declaration is seen as so
unapproachable is that its format is so unapproachable.  Yes, the
keywords are all eight characters or less because that's what's allowed
by the reference concrete syntax, but the meanings of some of the YES
or NO options are hard to remember, as are the rules for when things
have a required order and when they don't. For many people, the stuff
in the SGML Declaration is
yetAnotherMagicName
inside the file.

 >  >> Bryan states that the variant concrete syntax declarations 
 >  >> are the way to respond when a system not based on the International 
 >  >> Reference Version (IRV) character set defined in ISO 646 is used 
 >  >> thus requiring alterations to the SYNTAX clause of the SGML 
 >  >> Declaration.  Three ways are provided:
...
 > >> 3.  Completely redefine the SYNTAX clause.  Bryan provides 
 > >> an example of an alternative syntax-reference character 
 > >> set description for EBCDIC that changes the reference 
 > >> concrete syntax.
 > 
 > >That's what you'd have to do.
 > 
 > It seems useful at the very least as the normative way to document the
 > differences.

Yes, but do you want every XML processor to have to parse and act on
that document?

 >  >> This makes use of public identifiers.  I am curious if a 
 >  >> URI based identifier might be used if a stable external 
 >  >> file format were provided such as you mention if formal 
 >  >> is set to NO in the features clause.
 > 
 > >The SGML Declaration has always identified things by name, not by
 > >location (where the ISO 2022 escape sequences in CHARSET identifiers
 > >are really just an alternative name, I suppose).  Also, identifiers in
 > >the SGML declaration are currently limited to "minimum literals",
 > >which is a different set of characters to those allowed in URLs.
 > 
 > That might be worth changing.  The URN is a name, so enabling 
 > it in the declaration should be viable.

Whether or not it's worth changing is a separate discussion to whether 
or not XML processors should use SGML Declarations.

...
 > >Over the years, people have proposed various schemes for documenting
 > >the capabilities of XML processors that have all reminded me of SGML's
 > >System Declaration, and indicating Blueberry support or lack of it is
 > >probably best left to such an XML mechanism because there's a lot of
 > >stuff in a System Declaration that will never change for XML and that
 > >is of absolutely no interest to someone checking on Blueberry support.
 > 
 > Again, it seems best to use the standard as intended rather than 
 > building in system-specific flags.  There will be no end of it.

The intent of the standard is that it would be followed a year or two
after its publication by a companion formatting standard, that SGML
documents would be shipped around as ASN.1 data streams, that styles
would be associated with elements using link sets, and that you would
use a FSV classification code to rate your system's conformance and
its support of the four concrete syntaxes that were all that you were
ever going to need.

We got DSSSL eventually, and there were a few people passionate about
link sets, but some of the other intents never really influenced the
way we work.  Following the intent of the standard would be very
lonely, I think.

I contend that the pre-1986 intent of the standard was that passing
documents from my SGML system to your SGML system would be a major
event and that you'd carefully examine my SGML Declaration --
especially its capacities, quantities, concrete syntax, and required
features -- and carefully compare them against your System Declaration
before you even thought about processing my document on your system.

That changed in time, and I doubt that many of the thousands of people 
who've downloaded nsgmls ever analysed its System Declaration before
parsing their first document.  Nor would they have had to, since
nsgmls had capacities greater than you were allowed to specify in an
SGML Declaration.  The System Declaration was only ever useful to a
fraction of a percent of SGML users, and now you want to require it
for the majority of XML users.  I wish you luck.

...
 > Why not?  Do we change XML or change the requirement for the Blueberry 
 > support such that only Blueberry systems have to recognize Blueberry 
 > documents?

No comment.  I joined this thread because you gave half the story on
the SGML Declaration's character set definition and didn't mention how 
well or badly those character set definitions were handled by the
available software.

Regards,


Tony Graham
------------------------------------------------------------------------
Tony Graham                           mailto:tony.graham@ireland.sun.com
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708



Answers:

1. (c)  It does matter.  The order of any FUNCHAR, SEPCHAR, MSOCHAR,
        MSICHAR, or MSSCHAR characters after RE, RS, and SPACE,
        however, doesn't matter and they all require a unique added
        function name in addition to their keyword.

2. (b)  YES means replace lowercase letters with uppercase.

3. (b)

4. (a)  Which sometimes seems a bit odd considering how many other
        things in the SGML Declaration can only appear in a prescribed 
        order.

        Actually, you have to have the SGMLREF keyword, but if you're
        changing any from the default, they can follow the keyword in
        any order.

5. GENERAL in the NAMECASE portion controls case folding of names
   (other than entity names), name tokens, number tokens, and
   delimiter strings.  GENERAL in the DELIM portion is where you
   specify which character numbers (in the syntax-reference character
   set) are assigned to which roles.  For example, '&' is typically
   assigned the AND and ERO roles, and '&#' is assigned the CRO role.

6. One defines the document character set and the other defines the
   syntax-reference character set.

7. The SYNTAX portion ends after the declaration of the quantity set,
   but since that can drag on a bit with no real sign that it's ended, 
   it's simpler to consider that the SYNTAX portion ends before the
   FEATURES keyword.
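
The case folding described in answers 2 and 5 can be sketched as
follows.  The defaults mirror the reference concrete syntax (GENERAL
YES, ENTITY NO); the function name fold_name is made up for
illustration, and str.upper() stands in for the lowercase-to-uppercase
mapping that a real Declaration spells out character by character.

```python
def fold_name(name, *, is_entity, general=True, entity=False):
    """Apply NAMECASE substitution to a name.

    general and entity correspond to the YES (True) or NO (False)
    values of NAMECASE GENERAL and NAMECASE ENTITY.
    """
    substitute = entity if is_entity else general
    # YES means lowercase letters are replaced with uppercase ones.
    return name.upper() if substitute else name


if __name__ == "__main__":
    # Element names "para" and "Para" fold to the same name.
    print(fold_name("para", is_entity=False), fold_name("Para", is_entity=False))
    # Entity names "amp" and "Amp" stay distinct under ENTITY NO.
    print(fold_name("amp", is_entity=True), fold_name("Amp", is_entity=True))
```
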
References:
- RE: Historical I18n Note
  - From: "Bullard, Claude L (Len)" <clbullar@ingr.com>