xml-dev - RE: [xml-dev] UTF-8+names


To: "'David Carlisle'" <davidc@nag.co.uk>
Subject: RE: [xml-dev] UTF-8+names
From: "Alessandro Triglia" <sandro@mclink.it>
Date: Sun, 19 Oct 2003 03:21:00 -0400
Cc: <xml-dev@lists.xml.org>
Importance: Normal
In-reply-to: <200310182112.WAA26397@e3000>

David Carlisle wrote: 
> 
> 
> > As I understand, in UTF-8+name, an ampersand is represented 
> as  &&;  
> > which means that, if UTF-8+name is used for XML, "normal" entity 
> > references will look like:
> > 
> > 	&&;myentity;
> 
> Not necessarily, &myentity; would also work so long as it 
> wasn't one of the predefined names. If the entity isn't 
> "known" then it expands to itself in the character encoding, 
> leaving the entity to be expanded by the XML parser in the usual way.

I agree, but please see what I wrote in my previous email about a program
that has to produce a UTF-8+names encoding from a string of Unicode
characters.  What do you think the recommended behavior of such a program
should be with respect to encoding AMPERSAND characters?
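
To make the question concrete, here is a minimal sketch of two possible
encoder policies.  It assumes only what has been said in this thread: that
&&; stands for an ampersand and that an unknown name expands to itself.
KNOWN_NAMES is a hypothetical two-entry stand-in for the real list of
replacement names, and the function names are made up for illustration.

```python
import re

# Hypothetical stand-in for the replacement names a decoder would recognise.
KNOWN_NAMES = {"eacute", "rightarrow"}


def encode_always(text: str) -> bytes:
    """Policy A: write every AMPERSAND as '&&;' without looking at context."""
    return text.replace("&", "&&;").encode("utf-8")


# An '&' is ambiguous only if the decoder would consume it: either it starts
# the sequence '&&;' or it starts '&name;' for a known replacement name.
_NEEDS_ESCAPE = re.compile(
    r"&(?=&;|(?:%s);)" % "|".join(re.escape(n) for n in sorted(KNOWN_NAMES))
)


def encode_minimal(text: str) -> bytes:
    """Policy B: escape an AMPERSAND only where the decoder would misread it."""
    return _NEEDS_ESCAPE.sub("&&;", text).encode("utf-8")


print(encode_always("&myentity;"))   # b'&&;myentity;'
print(encode_minimal("&myentity;"))  # b'&myentity;'
print(encode_minimal("&eacute;"))    # b'&&;eacute;'
```

Policy A is simpler and does not depend on the name list, but it produces
the &&;myentity; form for every XML entity reference.  Policy B keeps XML
references readable, but its output changes whenever the list of known
names changes.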

> 
> > and numeric character references will look like:
> > 
> > 	&&;#12345;
> 
> similarly only one & is needed here as well.
> 
> > 	&lt;
> > 
> > but this can be confusing because it would denote a **literal** < 
> > character,
> 
> No it's defined to have the definition in xhtml and mathml 
> which is the definition given in the xml spec, double 
> escaped, so it would expand to a character reference to a < 
> character, not a literal <.

Yes, I noticed that I had missed this.  Anyway, what you say above may mean
one of two different things:

1) &lt; is defined as a replacement name in UTF-8+names, which implies that
the bytes will be decoded into the characters  & # 6 0 ;  (following XML
1.0) and the XML processor will substitute the character  <  on parsing
those characters

2) &lt; is *not* defined as a replacement name in UTF-8+names, which implies
that the bytes will be decoded one by one into the characters  & l t ;   and
the XML processor will "include" the predefined entity lt and eventually
substitute the character  <

Although the effect of (1) and (2) will be the same when parsing an XML
document, it will not be the same when decoding a sequence of bytes in a
non-XML context.  I am not sure the document is clear on this.  At any rate,
I don’t think it would be a good idea to decode   & l t ;  into the
characters   & # 6 0 ;   because this sequence of characters is meaningless
outside of XML.  So  &lt;  should really not be a defined replacement name
in UTF-8+names.
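
The difference between (1) and (2) can be sketched as follows.  This is
only an illustration under my reading of the proposal, not the decoder the
document specifies: decode(), TABLE_1 and TABLE_2 are hypothetical names,
and the tables contain nothing but the one entry under discussion.  The
XML step uses Python's xml.etree.ElementTree as the XML processor.

```python
import re
import xml.etree.ElementTree as ET

# Matches '&&;' (the escaped ampersand) or '&name;'.
_REF = re.compile(r"&(&|[A-Za-z][A-Za-z0-9]*);")


def decode(data: bytes, table: dict) -> str:
    """Decode UTF-8, then expand '&&;' and any name found in `table`."""
    def expand(match):
        name = match.group(1)
        if name == "&":
            return "&"
        # A name that is not in the table expands to itself.
        return table.get(name, match.group(0))
    return _REF.sub(expand, data.decode("utf-8"))


TABLE_1 = {"lt": "&#60;"}  # reading (1): lt is a defined replacement name
TABLE_2 = {}               # reading (2): lt is not defined


def xml_text(chars: str) -> str:
    """Text content an XML parser reports for the decoded characters."""
    return ET.fromstring("<r>" + chars + "</r>").text


raw = b"a &lt; b"
d1 = decode(raw, TABLE_1)
d2 = decode(raw, TABLE_2)
print(d1, "|", d2)                      # a &#60; b | a &lt; b
print(xml_text(d1), "|", xml_text(d2))  # a < b | a < b
```

After XML parsing the two readings agree, but the decoded character
strings differ, and that difference is all a non-XML consumer will see.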

I have a question about all the other entities defined in XHTML and MathML.
Do all of them resolve to actual characters, or do some of them resolve to
escaped references (like &lt; does)?  If some entities resolve to escaped
character references, they need an XML context to work correctly, and
therefore should not be included among the defined replacements in
UTF-8+names (because a Unicode encoding should not rely on XML to work
correctly).
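
One way to answer this mechanically is to scan the replacement text of each
entity and flag those that are themselves character references.  The sketch
below does this for the five entities predefined in XML 1.0, whose
declarations are given in the XML specification, plus eacute as an ordinary
example.  I have not run it over the full XHTML and MathML sets, so the
table here is only a sample, and needs_xml_context() is a hypothetical
helper name.

```python
import re

# Replacement text after the entity declaration itself has been processed.
# XML 1.0 declares lt as "&#38;#60;" and amp as "&#38;#38;", so their
# replacement text is still a character reference.
REPLACEMENT_TEXT = {
    "lt": "&#60;",
    "gt": ">",
    "amp": "&#38;",
    "apos": "'",
    "quot": '"',
    "eacute": "\u00e9",
}

_CHAR_REF = re.compile(r"&#(?:[0-9]+|x[0-9A-Fa-f]+);")


def needs_xml_context(replacement: str) -> bool:
    """True if the replacement is an escaped reference, not a character."""
    return _CHAR_REF.fullmatch(replacement) is not None


xml_only = sorted(
    name for name, text in REPLACEMENT_TEXT.items() if needs_xml_context(text)
)
print(xml_only)  # ['amp', 'lt']
```

In this sample only lt and amp depend on a later XML parsing step; the
others resolve directly to characters.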

Alessandro

> 
> > It is not very clear to me where UTF-8+name would be useful, as I 
> > don't think it is useful in XML.  Is it being proposed for use in 
> > areas where, for some reason, XML cannot be used?
> 
> No its whole point is to allow the use of &rightarrow; or 
> &eacute; _with_ XML but _without_ a DTD to allow for relax or 
> xsd schema use, or just simply well formed fragments with no 
> schema at all.
> 
> 
> some other people have suggested not using & as the delimiter 
> but again that would break the main use case of this, the 
> FFFFAQ question on xsl-list asking why "& n b s p ;" 
> generates an error in xsl.
> 
> David
>

Follow-Ups:
- Re: [xml-dev] UTF-8+names
  - From: John Cowan <cowan@mercury.ccil.org>
