OASIS Mailing List Archives





   Re: [xml-dev] SGML query: SHUNCHAR


John Cowan <jcowan@reutershealth.com> wrote:

| Am I right in thinking that a SHUNCHAR should not appear as-is in an 
| entity,

Right, unless it is a function character (e.g. RS, RE and SEPCHAR, which is
how 0x0a, 0x0d and 0x09 can appear literally).

| but may be referred to by a character reference? 

Yes - any character in the document character set can be encoded as a
character reference and will be treated unconditionally as data at the
point of occurrence.  This *includes* NONSGML characters (the ones mapped
to UNUSED).  

For example, this document is valid with the doc char set mapped to
Latin-1, Unicode etc. (where decimal 128-159 are UNUSED):

  <!DOCTYPE foo [
      <!ELEMENT foo  - - (#PCDATA) >
      <!ENTITY  bar  CDATA  "&#156;" >
  ]>
  <foo>&bar;</foo>

The character reference is unconditionally data in the replacement text of
the entity declaration.  The CDATA keyword now says that the replacement
text is still data anywhere the entity reference occurs.  Take out the
CDATA modifier, and now nsgmls will throw an error like this:

nsgmls:ex.txt:5:11:E: non SGML character number 156

Note that the error is at the point where the entity reference occurs in
the instance (5:11), not in the entity declaration.
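To make the literal-versus-reference distinction concrete, here is a minimal
Python sketch. It is not part of nsgmls or any SGML toolkit: the helper names
are hypothetical, and the UNUSED range (decimal 128-159) is simply the
assumption used in the example above. It only finds or rewrites literal
occurrences; it does not model what happens when replacement text is reparsed.

```python
# Hypothetical helpers for illustration only; not an nsgmls API.
# Assumption: the document character set maps decimal 128-159 to UNUSED.
UNUSED = frozenset(range(128, 160))


def escape_non_sgml(text, unused=UNUSED):
    """Replace each literal non-SGML character with a decimal character reference."""
    return "".join(f"&#{ord(ch)};" if ord(ch) in unused else ch for ch in text)


def find_literal_non_sgml(text, unused=UNUSED):
    """Return (line, column, code point) for each literal non-SGML character.

    Line and column are 1-based here; this is a choice made for the sketch
    and is not claimed to match how any parser counts columns.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) in unused:
                hits.append((line_no, col_no, ord(ch)))
    return hits
```

For instance, escape_non_sgml("a\x9cb") yields the string a&#156;b, and
running find_literal_non_sgml on that result reports nothing.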

| If not, is there any way in the SGML declaration to specify characters 
| that have this property?

If you're asking, is there a way to *require* that a character reference
always be used for a character, then the answer would be to ensure that
it's in the non-SGML character class (because then showing up directly
would throw an error.)   But that isn't foolproof, because if the context
gets reparsed for markup (as the entity replacement text in the example
above would without the CDATA modifier) you would still get an error.

| Also, how strong is that "should not appear" in practice?

Absolute.  13.1.2 "Non-SGML Character Identification" (p.455 in the

: Each _character number_[64] to which no meaning is assigned by the 
: _character set description_[73] is assigned to NONSGML, thereby
: identifying it as a non-SGML character.
: [...] 
: A shunned character must be identified as a non-SGML character, unless
: it is a significant SGML character.
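
The last sentence of that clause amounts to a set difference, which can be
sketched in Python as below. The function name is hypothetical, and the
sample values are assumptions chosen for illustration (controls 0-31 plus 127
and 255 as shunned; 0x09, 0x0a and 0x0d as the significant characters
mentioned earlier in this message), not a transcription of any particular
SGML declaration.

```python
def must_be_non_sgml(shunned, significant):
    """Shunned character numbers that have to be identified as non-SGML.

    A shunned character is exempt only if it is a significant SGML character.
    """
    return sorted(set(shunned) - set(significant))


# Assumed sample values, for illustration only.
shunned = list(range(32)) + [127, 255]
significant = {0x09, 0x0A, 0x0D}  # SEPCHAR (TAB), RS, RE as discussed above

non_sgml = must_be_non_sgml(shunned, significant)
```

With these assumed inputs, 9, 10 and 13 drop out of the result and every
other shunned number remains.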



Copyright 2001 XML.org. This site is hosted by OASIS