Hello Ihe,
let me attempt to give a flavor of SGML for those who, like me, came
to markup via XML:
SGML has every single bit that XML has - elements, model groups,
attributes, processing instructions, markup declarations, and so on -
because XML is specified as a proper subset of SGML. Well, not quite:
XML brought XML-style empty elements (eg "<x/>") and "generic" markup
(ie markup without a DTD: traditional SGML requires one, whereas XML
makes it optional). However, these features were added to
SGML in Annex K by the Extended Review Board revision (the same group
that spec'd XML), precisely so that XML could remain a proper SGML
subset. Worth noting that the separate namespaces spec adds god-damned
namespaces, which SGML lacks (though SGML allows CONCUR and multiple
DOCTYPEs).
The major things left out from the XML subset of SGML:
- tag inference and element/attribute short forms: this is one of the
things that makes traditional SGML require declarations/DTDs, and
which reveals an original vision for markup that is subtly different
from XML's, namely that you as an author get to pick a vocabulary, and
possibly map it into an output/delivery vocabulary via SGML LINK,
rather than expect a given vocabulary designed by committee; tag
inference can have a major influence over where fragments of markup
text can be
- marked sections other than CDATA sections: where the additional
keywords can be used to annotate editorial content or to conditionally
include/exclude content and/or markup declarations
- short references: a feature letting you define tokens that SGML
replaces into other text (typically tags), and which significantly
extends SGML's capabilities (for example, it can parse a fragment of
markdown Wiki syntax, or CSV into canonical angle-bracket markup)
- link declaration sets (LINK): an additional type of declaration set
that can be used to redefine entities based on a processing profile,
and that can remap elements to other ones in the same or other
vocabularies in comparatively complex forms using an automaton
encoding (covering much of CSS core selectors), and form pipelines and
"views" of markup streams
- notations and advanced forms of entities: these are part of XML too,
but rarely used there; in SGML they are not only a means to annotate
non-SGMLish text such as TeX-like equations, but a general-purpose
extension mechanism
- CONCUR: a (rarely used) feature to specify overlapping markup, as in
"<(x)a>foo<(x|y)c>bar</(x)a></(x|y)c>"
- SGML declaration: an (archaic) piece of text that lets you redefine
markup delimiters/function characters and limits (such as maximal
expansion depth for entities) and specify features and details about
used character encodings and other things; before you ask, the
character encoding part of an SGML declaration is *not* obsolete in
times of Unicode, since Unicode is such a vast character repertoire
that it is almost useless to say a doc is using any Unicode
(encoding), since no font will cover all of those characters anyway
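
To make the short references item above a bit more tangible, here is a
toy Python emulation of the effect (it is not SGML, and no SGML parser
is involved). The names SHORTREF_MAP and csv_to_markup, and the
table/row/cell vocabulary, are made up for illustration; in real SGML
the mapping would be declared with SHORTREF and USEMAP declarations and
applied by the parser itself.

```python
# Toy emulation of an SGML short reference map (not SGML itself).
# Each delimiter token maps to replacement markup, loosely the way a
# SHORTREF declaration maps tokens to entities that expand to tags.

# Hypothetical mapping: a comma ends one cell and starts the next.
SHORTREF_MAP = {
    ",": "</cell><cell>",
}


def csv_to_markup(text):
    """Turn simple CSV (no quoting or escaping) into angle-bracket markup."""
    rows = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        cells = line
        for token, replacement in SHORTREF_MAP.items():
            cells = cells.replace(token, replacement)
        # Line start and line end play the role of the record start and
        # record end short references, which open and close a row.
        rows.append("<row><cell>" + cells + "</cell></row>")
    return "<table>" + "".join(rows) + "</table>"


print(csv_to_markup("a,b\nc,d"))
```

The point is only that the author types "a,b" and the processor sees
canonical markup; which tokens are recognized in which element context
is something SGML lets the DTD author decide.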
Worth noting that the SGML specification doesn't say anything about
APIs (such as SAX) or processing; only mere markup rules and languages
are specified, and everything else is "up to the application".
Overall, many features that were left out from XML focus on typing
markup via a plain text editor, while others let you organize
boilerplate content better than is possible in XML, or can even amount
to implementing small non-Turing-complete content applications such as
paging, filtering, creating a table of contents/outline, mail-merge
applications, and similar. And while these features don't cover modern
animated or interactive text apps, the complementary (historic) HyTime
spec, which does, brought a couple of additional features seen as
belonging to the broader SGML landscape, such as FSIDR (Formal System
Identifier Definition Requirements), which lets you pull content from
HTTP, SQL, or other sources in ways that URIs cannot.
You can read up on all the details in eg James Clark's Comparison of
SGML and XML [1], but SGML amounts to more than the sum of its parts and
conveys an original vision for markup absent from its red-headed
stepchild XML ;) And SGML is the only standardized markup language that
can deal with HTML (modulo minor oddities), the most important
application of markup by far.
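
HTML's habit of leaving out end tags is a good example of why tag
inference matters here. Below is a minimal Python sketch of the idea,
under loud assumptions: the IMPLIED_END table and the infer_end_tags
function are hypothetical, the rules are hard-coded rather than derived
from a DTD's content models as a real SGML parser would do, and tags
with attributes are not handled.

```python
import re

# Hypothetical, hard-coded rules standing in for what a DTD would imply:
# an open element of the given type is closed implicitly when one of the
# listed tags arrives ("/ul" means the end tag </ul>).
IMPLIED_END = {
    "li": {"li", "/ul"},
    "p": {"p", "/body"},
}


def infer_end_tags(markup):
    """Insert omitted end tags for a toy subset of SGML-style markup."""
    out = []
    stack = []  # names of currently open elements
    for token in re.findall(r"<[^>]+>|[^<]+", markup):
        if token.startswith("</"):
            name = token[2:-1]
            # Close open elements that this end tag implicitly ends.
            while (stack and stack[-1] != name
                   and "/" + name in IMPLIED_END.get(stack[-1], ())):
                out.append("</" + stack.pop() + ">")
            if stack and stack[-1] == name:
                stack.pop()
            out.append(token)
        elif token.startswith("<"):
            name = token[1:-1]
            # A new start tag may implicitly end the current element.
            if stack and name in IMPLIED_END.get(stack[-1], ()):
                out.append("</" + stack.pop() + ">")
            stack.append(name)
            out.append(token)
        else:
            out.append(token)  # character data passes through unchanged
    return "".join(out)


print(infer_end_tags("<ul><li>a<li>b</ul>"))
```

With the rules above, the call prints "<ul><li>a</li><li>b</li></ul>",
ie the fully tagged form a downstream XML-ish tool would expect.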
As a markup expert, you owe it to yourself to know some SGML. Learning
SGML itself is a fascinating experience and can give you inspiration,
insight, and judgement over markup applications not possible with
knowing the XML stack alone. Attempting to learn SGML is also a
tragicomic experience, in that materials on SGML are few and far
between, and buried under search engine mis-rankings, making it
painfully obvious why we must keep insisting on formats for text
preservation. Considering SGML covers almost 40 years of digital text
and the good parts of the Web, SGML is extremely influential and vast
like very few other things, on the scale of Titus Livius and Garamond.
Yet SGML also demarcates and informs us about the outer limit of what
a non-CompSci nerd can swallow and accept as a reasonable authoring
format, considering it was invented by a lawyer.
[1]: https://www.w3.org/TR/NOTE-sgml-xml-971215/
Best regards,
M. Reichardt
On 9/15/21, Ihe Onwuka <ihe.onwuka@gmail.com> wrote:
> On Tue, Sep 14, 2021 at 9:24 PM Rick Jelliffe <rjelliffe@allette.com.au>
> wrote:
>
>> Is a Swiss Army knife (SGML) over-engineered compared to a knife and fork
>> (XML)? For the people who need to do complex things, no. For the people
>> who need to do a few simple things, yes.
>>
>
> I like the sentiment of the post but I have always seen the Swiss Army
> Knife as the epitome of under-engineering. Capable of doing everything
> .....badly.
>
> No dis on SGML over which I plead ignorance, more wondering whether that's
> the apt analogy for it.
>