Re: Well-formed Blueberry
- From: Joel Rees <rees@server.mediafusion.co.jp>
- To: jcowan@reutershealth.com
- Date: Tue, 17 Jul 2001 14:47:22 +0900
Thanks.
But I still don't understand.
jcowan@reutershealth.com clarified:
[snipped]
> No, I think Elliotte is right here. There are XML 1.0 documents, which
> lack the Magic Blueberry Mark (whatever it's going to be), and then there
> are Blueberry documents. XML-1.0-only parsers MUST reject Blueberry
> documents: they are not well-formed. Blueberry parsers SHOULD accept
> both Blueberry and 1.0 documents, but MUST apply the 1.0 well-formedness
> rules to 1.0 documents.
This is where I get confused. You say MUST. I still don't understand why. Is
NEL the culprit?
> If a document lacks the Magic Blueberry Mark but
> contains Blueberry names, it is not well-formed and must be rejected.
>
> Therefore, Blueberry parsers have to keep both sets of tables. Luckily,
> the Blueberry table is a strict superset of the 1.0 table,
I read "strict superset", and I think that anything that passed the XML 1.0
parser should pass the Blueberry parser. Is this correct? If it is, why
should a Blueberry capable parser care if a doc that labels itself XML 1.0
slips in a blueberry? I missed the posts that explained the specific damage.
(Or maybe I'm just brain-dead, anyway. It's been a hot, humid summer here.)
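To make sure I have the rule straight, here is a rough sketch of what I think
a Blueberry-aware parser would have to do. The enum, classify(), and
is_name_start() are all my own inventions for illustration, not anything from
a spec:

    enum name_class {
        NC_NONE,               /* not a name character at all      */
        NC_XML10_START,        /* name-start under XML 1.0         */
        NC_XML10_PART,         /* name-part under XML 1.0          */
        NC_BLUEBERRY_START,    /* name-start only under Blueberry  */
        NC_BLUEBERRY_PART      /* name-part only under Blueberry   */
    };

    extern enum name_class classify(unsigned long codepoint);

    /* Nonzero if c may start a name in this document. */
    int is_name_start(unsigned long c, int doc_is_blueberry)
    {
        switch (classify(c)) {
        case NC_XML10_START:
            return 1;
        case NC_BLUEBERRY_START:
            /* Legal only if the Magic Blueberry Mark was present;
             * in a document labeled 1.0 this name is ill-formed. */
            return doc_is_blueberry;
        default:
            return 0;
        }
    }

If that is the intended shape, the MUST reject falls out of the
doc_is_blueberry flag, which is what I am trying to confirm.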
Okay, I can see that developers will want to have the wall available to check
against when developing for a context in which some users may be restricted to
XML 1.0. But end users won't need the wall, will they?
> so it suffices
> to have four tables (or one table that maps Unicode values to one of
> four enumerated values): xml10_name_start, xml10_name_part,
> blueberry_only_name_start, blueberry_only_name_part.
Were we to use four tables, I assume we would want to pack them. But bit
addressing is another choice that can slow things down a bit, so we might
prefer the option of a single table with four bits per entry instead. There
should be some long runs of identical values, so we should be able to use
sparse-table redundancy-reduction techniques without much of a performance
hit. We should be able to end up with well under 64K consumed by these
tables.
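As a rough sketch of the sort of layout I mean (the page size, names, and
types here are all assumptions on my part, not a proposal):

    #define PAGE_BITS  8
    #define PAGE_SIZE  (1 << PAGE_BITS)       /* 256 code points per page */

    /* 256 entries at four bits each: two entries share a byte. */
    typedef unsigned char name_page[PAGE_SIZE / 2];

    extern const name_page      pages[];      /* unique pages only          */
    extern const unsigned short page_index[]; /* codepoint >> 8 -> page no. */

    /* Returns one of the four-bit class values (0 = not a name char). */
    unsigned int name_class_of(unsigned long c)
    {
        const unsigned char *page = pages[page_index[c >> PAGE_BITS]];
        unsigned char pair = page[(c & (PAGE_SIZE - 1)) >> 1];
        return (c & 1) ? (pair >> 4) : (pair & 0x0Fu);
    }

Long runs of identical classes collapse into shared pages, so the cost is the
index (0x110000 >> 8 = 4352 entries, roughly 8.5K at two bytes each) plus 128
bytes per unique page; a few dozen unique pages keeps the whole thing well
under 64K.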
So there would probably be no need to remove the XML 1.0 tables from the
parsers that end users will use, and that would keep things a little more
orderly. Parsers for end users would not be _required_ to keep both tables,
but we might as well. Would it be appropriate to suggest that such a table
could absorb two more extensions to Unicode without physically growing? Four
bits give us sixteen class values, and we only need five so far (non-name
plus the four above), so two more rounds of start/part additions would still
fit.
[snipped]
> IMHO the snag here would be getting an absolutely authoritative and
> permanent list of such character sets,
[snipped]
This is definitely going to be a problem. If I read it one way, I want to
push the jump to Blueberry now, before best practices really muddy things
up.
But I will admit this: I personally would prefer to handle Kanji with a small
standard set of radicals and something like the ideographic description
sequences. I am pretty sure this would be enough to uniquely identify every
current Kanji, and it would open the way for creative non-standard writings,
comparable to the ability we have in English for creative spellings.
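For concreteness, Unicode 3.0 already defines the Ideographic Description
Characters (U+2FF0 through U+2FFB) that such a scheme would build on. A toy
sketch (the choice of 明 is just an example of mine):

    #include <stdio.h>

    int main(void)
    {
        /* The sequence U+2FF0 U+65E5 U+6708 describes 明 (U+660E)
         * as a left-right composition of 日 and 月. It describes
         * the shape without assigning it a single code point. */
        const char ids[] = "\xE2\xBF\xB0"   /* U+2FF0: IDC left-to-right */
                           "\xE6\x97\xA5"   /* U+65E5: 日 (sun)          */
                           "\xE6\x9C\x88";  /* U+6708: 月 (moon)         */
        printf("%s\n", ids);                /* prints the UTF-8 sequence */
        return 0;
    }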
How many implementations of such a scheme have I seen? Zero.
(The present overabundance of code points dedicated to Kanji makes more sense
as a set of internal references to a pre-rendered font than as standardized
character code points. One early reference for the JIS
character set in my possession indicates that the JIS committee originally
assumed that something like ideographic description sequences would be the
ultimate approach for general information encoding, and that the JIS
character set was intended primarily as an internal reference set for
predefined font tables for the printing industry. At least, that's the way I
read it.)