Re: Well-formed Blueberry

On Mon, 16 Jul 2001, Joel Rees wrote:

> However, I am having a hard time figuring out why the standard should treat
> authors of nonstandard XML documents better than people who simply want to
> use their own language in markup.

I'm confused.  What nonstandard documents?

> In a corollary point of confusion for me, you seem to assume in your
> posts that, even without your wall, a blueberry capable parser must have
> both the pre-blueberry character classification tables and the blueberry
> character classification tables. In my naive point of view, a document
> that is valid XML 1.0 ought to be valid blueberry, thus, the complete
> table should be the only necessary table, unless you want to build a
> wall. 

No, I think Elliotte is right here.  There are XML 1.0 documents, which
lack the Magic Blueberry Mark (whatever it's going to be), and then there
are Blueberry documents.  XML-1.0-only parsers MUST reject Blueberry
documents: they are not well-formed.  Blueberry parsers SHOULD accept
both Blueberry and 1.0 documents, but MUST apply the 1.0 well-formedness
rules to 1.0 documents.  If a document lacks the Magic Blueberry Mark but
contains Blueberry names, it is not well-formed and must be rejected.
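To make those rules concrete, here is a rough sketch of the accept/reject decision as I read it. The function name and the three boolean parameters are my own invention for illustration, and it assumes the document is otherwise well-formed.

```python
def parser_accepts(parser_is_blueberry: bool,
                   has_blueberry_mark: bool,
                   uses_blueberry_names: bool) -> bool:
    """Decide whether a parser should accept a document.

    All three parameters are hypothetical flags for illustration; the
    document is assumed to be well-formed in every other respect.
    """
    if has_blueberry_mark:
        # A Blueberry document: a 1.0-only parser must reject it,
        # a Blueberry parser accepts it.
        return parser_is_blueberry
    # No mark, so the 1.0 well-formedness rules apply, whichever
    # parser is reading: Blueberry-only names make it ill-formed.
    return not uses_blueberry_names
```

Note that an unmarked document gets the same verdict from both kinds of parser, which is the whole point of the mark.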

Therefore, Blueberry parsers have to keep both sets of tables.  Luckily,
the Blueberry table is a strict superset of the 1.0 table, so it suffices
to have four tables (or one table that maps Unicode values to one of
four enumerated values):  xml10_name_start, xml10_name_part,
blueberry_only_name_start, blueberry_only_name_part.
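As a minimal sketch of the one-table variant: the enum below carries the four values named above plus a fifth NONE value, which is my assumption for characters that are not name characters at all. The ASCII entries follow XML 1.0; the two non-ASCII ranges are placeholders picked for illustration, not the real Blueberry tables, and "name_part" is taken to mean characters allowed only in non-initial positions.

```python
from enum import Enum


class NameClass(Enum):
    NONE = 0  # assumption: not a name character under either rule set
    XML10_NAME_START = 1
    XML10_NAME_PART = 2
    BLUEBERRY_ONLY_NAME_START = 3
    BLUEBERRY_ONLY_NAME_PART = 4


# (first, last, class) ranges. ASCII entries follow XML 1.0; the last two
# ranges are hypothetical placeholders, not actual Blueberry data.
RANGES = [
    (0x2D, 0x2E, NameClass.XML10_NAME_PART),    # '-' and '.'
    (0x30, 0x39, NameClass.XML10_NAME_PART),    # digits
    (0x3A, 0x3A, NameClass.XML10_NAME_START),   # ':'
    (0x41, 0x5A, NameClass.XML10_NAME_START),   # A-Z
    (0x5F, 0x5F, NameClass.XML10_NAME_START),   # '_'
    (0x61, 0x7A, NameClass.XML10_NAME_START),   # a-z
    (0x1200, 0x1248, NameClass.BLUEBERRY_ONLY_NAME_START),  # placeholder
    (0x17B6, 0x17D3, NameClass.BLUEBERRY_ONLY_NAME_PART),   # placeholder
]

START_CLASSES = (NameClass.XML10_NAME_START,
                 NameClass.BLUEBERRY_ONLY_NAME_START)
BLUEBERRY_ONLY = (NameClass.BLUEBERRY_ONLY_NAME_START,
                  NameClass.BLUEBERRY_ONLY_NAME_PART)


def classify(ch: str) -> NameClass:
    """Look up the class of a single character in the combined table."""
    cp = ord(ch)
    for first, last, cls in RANGES:
        if first <= cp <= last:
            return cls
    return NameClass.NONE


def is_name(name: str, blueberry_document: bool) -> bool:
    """Check a name under 1.0 rules, or Blueberry rules if the mark is present."""
    if not name:
        return False
    for i, ch in enumerate(name):
        cls = classify(ch)
        if cls is NameClass.NONE:
            return False
        if not blueberry_document and cls in BLUEBERRY_ONLY:
            return False  # 1.0 rules: Blueberry-only characters are errors
        if i == 0 and cls not in START_CLASSES:
            return False  # first character must be a name-start character
    return True
```

With this layout a single lookup serves both rule sets: the parser only has to decide, per document, whether the two blueberry_only values count as name characters or as errors.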

Elliotte Rusty Harold writes:

> There are not that many encodings that can
> handle the Blueberry characters,
> basically just several variants of Unicode, one Japanese character set, and
> possibly a couple of Chinese character sets.

IMHO the snag here would be getting an absolutely authoritative and
permanent list of such character sets, since they would have to be
hard-coded (contrary to previous practice) into the Blueberry

Limiting Blueberry to just Unicode would probably work for most of the
new (to Unicode) scripts, as you say, but would not be so good for
Han characters.