John Cowan <email@example.com> wrote:
| Am I right in thinking that a SHUNCHAR should not appear as-is in an
Unless it is a function character (e.g. RS, RE and SEPCHAR, which is how
0x0a, 0x0d and 0x09 can appear literally).
| but may be referred to by a character reference?
Yes - any character in the document character set can be encoded as a
character reference and will be treated unconditionally as data at the
point of occurrence. This *includes* NONSGML characters (the ones mapped
For example, this document is valid with the doc char set mapped to
Latin-1, Unicode etc. (where decimal 128-159 are UNUSED):
<!DOCTYPE foo [
<!ELEMENT foo - - (#PCDATA) >
<!ENTITY bar CDATA "&#156;" >
The character reference is unconditionally data in the replacement text of
the entity declaration. The CDATA keyword now says that the replacement
text is still data anywhere the entity reference occurs. Take out the
CDATA modifier, and now nsgmls will throw an error like this:
nsgmls:ex.txt:5:11:E: non SGML character number 156
Note that the error is at the point where the entity reference occurs in
the instance (5:11), not in the entity declaration.
| If not, is there any way in the SGML declaration to specify characters
| that have this property?
If you're asking whether there is a way to *require* that a character
reference always be used for a character, the answer is to make sure the
character is in the non-SGML character class (because a literal
occurrence is then an error). But that isn't foolproof: if the context
gets reparsed for markup (as the entity replacement text in the example
above would be without the CDATA modifier), you still get an error.
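
To make the reparsing caveat concrete, here is a toy Python sketch. It
is not how nsgmls works internally; the function names are made up for
illustration, and the non-SGML range 128-159 is simply taken from the
example above. It only shows that the reference itself is harmless as
written, while the resolved character trips the check as soon as the
text is scanned again.

```python
import re

# Decimal numeric character references such as &#156;
CHAR_REF = re.compile(r"&#([0-9]+);")

def literal_nonsgml(text, nonsgml):
    """List (offset, character number) for each non-SGML character
    that appears literally in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) in nonsgml]

def replacement_text(literal):
    """Toy stand-in for building entity replacement text: resolve
    character references to the characters they name."""
    return CHAR_REF.sub(lambda m: chr(int(m.group(1))), literal)

# Assumption: decimal 128-159 are UNUSED, as in the example document.
NONSGML = set(range(128, 160))

declared = "&#156;"
as_written = literal_nonsgml(declared, NONSGML)
reparsed = literal_nonsgml(replacement_text(declared), NONSGML)
```

Here `as_written` is empty, because the reference is made of ordinary
characters, while `reparsed` reports character number 156 at offset 0.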
| Also, how strong is that "should not appear" in practice?
Absolute. 13.1.2 "Non-SGML Character Identification" (p.455 in the
: Each _character number_ to which no meaning is assigned by the
: _character set description_ is assigned to NONSGML, thereby
: identifying it as a non-SGML character.
: A shunned character must be identified as a non-SGML character, unless
: it is a significant SGML character.
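
The two quoted rules can be sketched in Python as follows. This is only
an illustration under stated assumptions: the (start, count, base)
triples loosely mimic character set description entries, the
Latin-1-like description is hypothetical and much simpler than a real
one, and treating 156 as a shunned character is made up for the example.

```python
UNUSED = None  # stands in for "no meaning assigned"

def non_sgml_numbers(description):
    """Collect every character number in a range marked UNUSED."""
    result = set()
    for start, count, base in description:
        if base is UNUSED:
            result.update(range(start, start + count))
    return result

def shunned_ok(shunned, significant, nonsgml):
    """A shunned character must be non-SGML unless it is significant."""
    return all(c in nonsgml or c in significant for c in shunned)

# Hypothetical description: 0-127 and 160-255 map to themselves,
# 128-159 have no meaning assigned.
latin1_like = [(0, 128, 0), (128, 32, UNUSED), (160, 96, 160)]
nonsgml = non_sgml_numbers(latin1_like)

# SEPCHAR, RS and RE (0x09, 0x0a, 0x0d) are function characters.
significant = {0x09, 0x0A, 0x0D}
good = shunned_ok({0x09, 0x0A, 0x0D, 156}, significant, nonsgml)
bad = shunned_ok({0x00}, significant, nonsgml)
```

With this description `good` is true, and `bad` is false because
character number 0 is shunned but is neither non-SGML nor significant.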