Re: [xml-dev] Many different syntaxes in XML - is that good language design?

From: Norman Gray <norman.gray@glasgow.ac.uk>
To: Pete Cordell <pete++xmldev@codalogic.com>
Date: Mon, 07 Mar 2022 16:57:16 +0000

Pete, hello.

On 7 Mar 2022, at 15:52, Pete Cordell wrote:

> Viewed like that it seems a fairly minimal and efficient syntax.

Indeed.  SGML is a thing of some beauty, viewed through the right (rather special) spectacles.

>  (It does make me wonder why the CDATA section 'directive' wasn't just <!CDATA[...]>.  Even more curious, given all the SGML things that got dropped, is how it got included in XML.  It creates just as many problems as it solves.)

It's certainly pretty orthogonal, in all sorts of directions.

Regarding <![CDATA[...]]> vs <!CDATA[ ... ]>, the sequence of tokens here *in SGML* is

  <!     : markup declaration open (mdo)
  [      : declaration subset open (dso)
  CDATA  : status keyword
  [      : dso again
  ...    : data
  ]]     : marked section close (msc)
  >      : markup declaration close (mdc), which happens to be the same character, by default, as the start-tag close, and a few others

I'm not 100% clear why the 'declaration subset open' is so-called.  This token also opens the declaration subset holding the DTD, full or partial, at the very top of a document (which is written in the 'other' syntax, in the terms of this thread), and it seems to have been reused here _partly_ as a sort-of gesture towards the declaration language of DTDs -- i.e., the first '[' is effectively signalling an escape inside the escape, in a different direction.  The SGML status keywords, alongside CDATA, were/are INCLUDE and IGNORE (which respectively include and ignore the text inside the construct), RCDATA (which is like CDATA except that entities (only) are recognised and expanded (have I got that right?)), and TEMP (which did nothing other than mark the contained text as temporary).  I presume the duplication of the ']' in the marked-section-close is partly to keep the brackets balanced, and partly because it's a string that's unlikely to appear in normal text.  It was possible to have whitespace either side of the status keyword, so that '<![ CDATA   [...]]>' would be a legal SGML marked section.
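To make the token sequence concrete, here is a minimal Python sketch that splits a single marked section into the tokens listed above, tolerating whitespace either side of the status keyword.  It is an illustration only, not a real SGML lexer: the function name `lex_marked_section` is made up for this example, it assumes the default delimiters, it treats keywords case-insensitively, and it ignores the fuller SGML rules (multiple keywords, parameter entity references, nesting).

```python
import re

# Default SGML delimiter strings for the roles discussed in this message.
MDO, DSO, MSC, MDC = "<!", "[", "]]", ">"

# The status keywords mentioned in this message.
STATUS_KEYWORDS = {"CDATA", "RCDATA", "INCLUDE", "IGNORE", "TEMP"}

# mdo dso, optional whitespace, keyword, optional whitespace, dso, data, msc mdc
_MARKED_SECTION = re.compile(
    re.escape(MDO) + re.escape(DSO)
    + r"\s*([A-Za-z]+)\s*"
    + re.escape(DSO)
    + r"(.*?)"                       # data runs up to the first msc + mdc
    + re.escape(MSC) + re.escape(MDC),
    re.DOTALL,
)


def lex_marked_section(text):
    """Return (token-name, text) pairs for one complete marked section."""
    m = _MARKED_SECTION.match(text)
    if m is None or m.end() != len(text):
        raise ValueError("not a single marked section")
    keyword = m.group(1).upper()     # assumption: keywords are case-insensitive
    if keyword not in STATUS_KEYWORDS:
        raise ValueError("unknown status keyword: " + keyword)
    return [
        ("mdo", MDO),
        ("dso", DSO),
        ("status-keyword", keyword),
        ("dso", DSO),
        ("data", m.group(2)),
        ("msc", MSC),
        ("mdc", MDC),
    ]


if __name__ == "__main__":
    for token in lex_marked_section("<![ CDATA   [x < y]]>"):
        print(token)
```

The point of the sketch is that the SGML view is a sequence of separately recognised tokens, which is why the whitespace around the keyword is harmless.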

I think that, within document content, all of these except CDATA were dropped in XML (INCLUDE and IGNORE survive only as conditional sections in the external DTD subset), along with the different lexical classes, so that (*checks*...) the start of a CDATA section is just '<![CDATA[' as an otherwise unintelligible magic string.  Why that particular magic string and not a saner one?  Purely, I think, to retain the status of XML documents as being also parseable as SGML.  That is, SGML would lex this string differently, but react in the same way.
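The 'magic string' behaviour is easy to check with an XML parser.  The following is a small sketch using Python's standard `xml.etree.ElementTree`; the helper names `cdata_text` and `is_well_formed` are invented for the illustration.  It shows that the exact string '<![CDATA[' opens a CDATA section, while the whitespace variant that SGML would accept is rejected.

```python
import xml.etree.ElementTree as ET


def cdata_text(document):
    """Parse a document and return the text content of its root element."""
    return ET.fromstring(document).text


def is_well_formed(document):
    """Report whether the parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Inside the CDATA section, '<' and '&' are plain character data.
    print(cdata_text("<a><![CDATA[x < y & z]]></a>"))
    # Whitespace around the keyword is legal in SGML, but not in XML.
    print(is_well_formed("<a><![ CDATA [x]]></a>"))
```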

The other gasp-worthy thing about SGML was that all of these lexical items, such as '<', '<!', and so on and very much on, were configurable, so you could prefix your document with declarations (in the 'other' syntax) which changed these, and have different character sequences open and close start-tags, processing instructions, and so on.  The angle brackets and ampersands we're familiar with are just the SGML defaults.
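As a rough illustration of what configurable delimiters mean for a lexer, here is a Python sketch in which the delimiter strings are data rather than hard-coded characters.  The field names follow the SGML delimiter role names (stago, etago, tagc, ero, refc), but the `Delimiters` class and `find_start_tags` function are hypothetical, the curly-brace alternative is invented for the example, and this is not how an SGML declaration actually expresses the change.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Delimiters:
    """A few delimiter roles, with the familiar defaults."""
    stago: str = "<"    # start-tag open
    etago: str = "</"   # end-tag open
    tagc: str = ">"     # tag close
    ero: str = "&"      # entity reference open
    refc: str = ";"     # reference close


def find_start_tags(text, delims=Delimiters()):
    """Return the element names of simple, attribute-free start-tags."""
    pattern = (
        re.escape(delims.stago)
        + r"([A-Za-z][A-Za-z0-9.-]*)"
        + re.escape(delims.tagc)
    )
    return re.findall(pattern, text)


if __name__ == "__main__":
    # The defaults give the syntax we are used to.
    print(find_start_tags("<p>hello</p>"))
    # A made-up alternative: the same lexer, different delimiter strings.
    curly = Delimiters(stago="{", etago="{/", tagc="}")
    print(find_start_tags("{p}hello{/p}", curly))
```

Both calls find the same single start-tag, 'p'; only the delimiter table differs.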

Enough (slightly deranged) nostalgia!

Best wishes,

Norman

-- 
Norman Gray  :  https://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
