I thought XML-DEVers might be interested in a summary of what the new draft
W3C "Character Model for the World Wide Web" says.
Keep in mind that the primary readership of the character model is standards
developers first, and implementers and users of standards second. If your software
does not follow these guidelines, you cannot claim that it conforms fully to the
"Character Model for the WWW". (Of course, there may be good reasons not to: the
model sets the bar for W3C specs and a goal for implementers.)
1-7 deal with ASCII assumptions that are no longer appropriate for Unicode programs.
This is basic I18n. I would say that in Java now it is hard not to do this.
8-14 deal with issues that are built into XML or are best practice for XML.
15-18 deal with handling legacy data. This is the contentious one.
19-23 deal with strings, indexing and matching.
1) Specifications and software MUST NOT assume that there is a one-to-one
correspondence between characters and the sounds of a language.
2) Specifications and software MUST NOT assume a one-to-one mapping between
character codes and units of displayed text.
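A minimal Java sketch of what 2) means in practice (my own illustration, not from
the Character Model): one displayed unit can be several char code units, and the
standard java.text.BreakIterator finds the user-perceived character boundaries.

```java
import java.text.BreakIterator;

public class GraphemeDemo {
    public static void main(String[] args) {
        // "é" encoded as base letter 'e' plus combining acute accent (U+0301)
        String s = "e\u0301";
        System.out.println(s.length());          // 2 char code units

        // BreakIterator walks user-perceived character (grapheme) boundaries
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int graphemes = 0;
        while (it.next() != BreakIterator.DONE) {
            graphemes++;
        }
        System.out.println(graphemes);           // 1 unit of displayed text
    }
}
```

Counting code units and counting displayed units gives different answers for the
same visible text, which is exactly the assumption 2) forbids.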
3) Specifications and software MUST NOT assume that a single keystroke results
in a single character, nor that a single character can be input with a single
keystroke (even with modifiers), nor that keyboards are the same all over the world.
4) Software that sorts or searches text for users MUST do so on the basis of appropriate
collation units and ordering rules for the relevant language and/or application.
5) Software that allows users to sort or search text SHOULD allow the user to select
alternative rules for collation units and ordering.
6) When sorting and searching in the context of a particular language, it MUST be
possible to deal gracefully with strings being compared that contain Unicode
characters not normally associated with that language.
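To make 4)-6) concrete, here is a small Java sketch using the standard
java.text.Collator; the German/Swedish contrast is my own illustrative choice,
not an example from the Character Model.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        Collator german  = Collator.getInstance(Locale.GERMAN);
        Collator swedish = Collator.getInstance(new Locale("sv", "SE"));

        // In German, "ä" collates with "a", so it sorts before "z";
        // in Swedish, "ä" is a distinct letter that sorts after "z".
        System.out.println(german.compare("\u00e4", "z") < 0);  // true
        System.out.println(swedish.compare("\u00e4", "z") > 0); // true
    }
}
```

The same two strings order differently per locale, which is why collation rules
must come from the relevant language rather than from code-point order.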
7) Specifications and software MUST NOT assume a one-to-one relationship between
characters and units of physical storage.
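A short Java illustration of 7) (my own example): the single character U+1D11E
occupies two UTF-16 code units in a String and four bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class StorageDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF: one character, stored as a surrogate pair
        String treble = "\uD834\uDD1E";
        System.out.println(treble.codePointCount(0, treble.length()));      // 1 character
        System.out.println(treble.length());                                // 2 UTF-16 code units
        System.out.println(treble.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8
    }
}
```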
8) Receiving software MUST determine the encoding of data from available information
according to appropriate specifications. When an IANA-registered charset name is
recognized, receiving software MUST interpret the received data according to the
encoding associated with the name in the IANA registry. When no charset is
provided, receiving software MUST adhere to the default encoding(s) specified in
the relevant specification.
9) Receiving software MAY recognize as many encodings (names and aliases) as
appropriate.
10) Software MUST completely implement the mechanisms for character encoding
identification and SHOULD implement them in such a way that they are easy to
use (for instance in HTTP servers). On interfaces to other protocols, software
SHOULD support conversion between Unicode encoding forms as well as any
other necessary conversions.
11) Software and content MUST carefully follow conflict-resolution mechanisms
where there is multiple or conflicting information about character encoding.
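Why 8) and 11) matter can be shown in a two-line Java sketch (my own example):
the same bytes decode to different text under different IANA charsets, so getting
the encoding identification wrong silently corrupts the data.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        byte[] data = {(byte) 0xC3, (byte) 0xA9}; // the UTF-8 encoding of "é"

        // Decoded with the correct charset: é
        System.out.println(new String(data, StandardCharsets.UTF_8));
        // Decoded with the wrong charset: Ã©
        System.out.println(new String(data, StandardCharsets.ISO_8859_1));
    }
}
```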
12) Escapes SHOULD be avoided when the characters to be expressed are
representable in the character encoding of the document.
13) Since character set standards usually list character numbers as hexadecimal,
content SHOULD use the hexadecimal form of character escapes when there is one.
14) Choose an encoding for the document that maximizes the opportunity to directly
represent characters and minimizes the need to represent characters by markup means
such as character escapes. In general, if the first encoding choice is not satisfactory,
Unicode is the next best choice, for its large character repertoire and its wide base of
support.
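For 13), producing the hexadecimal rather than the decimal form of an escape is
trivial; an illustrative Java fragment (the numeric character reference syntax
shown is XML's).

```java
public class EscapeDemo {
    public static void main(String[] args) {
        int cp = 0x00E9; // LATIN SMALL LETTER E WITH ACUTE

        // Hexadecimal NCR matches the Unicode code charts directly...
        System.out.println(String.format("&#x%X;", cp)); // &#xE9;
        // ...while the decimal form needs a mental conversion
        System.out.println(String.format("&#%d;", cp));  // &#233;
    }
}
```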
15) A text-processing component that receives suspect text MUST NOT perform any
normalization-sensitive operations unless it has first confirmed through inspection
that the text is in normalized form, and MUST NOT normalize the suspect text.
Private agreements MAY, however, be created within private systems which are
not subject to these rules, but any externally observable results MUST be the same
as if the rules had been obeyed.
16) A text-processing component which modifies text and performs normalization-sensitive
operations MUST behave as if normalization took place after each modification, so that
any subsequent normalization-sensitive operations always behave as if they were dealing
with normalized text.
17) Authoring tool implementations for a (formal) language that does not mandate full-
normalization SHOULD prevent users from creating content with composing characters
at the beginning of constructs that may be significant, such as at the beginning of an
entity that will be included, immediately after a construct that causes inclusion or
immediately after markup, or SHOULD warn users when they do so.
18) Implementations which transcode text from a legacy encoding to a Unicode encoding
form MUST use a normalizing transcoder.
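Java's standard java.text.Normalizer covers the mechanics behind 15)-18); a sketch
(my own) of inspecting suspect text, and of what a normalizing transcoder would
produce for decomposed input.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String suspect = "e\u0301"; // decomposed form, not in NFC

        // Per 15): inspect suspect text rather than normalizing it in place
        System.out.println(Normalizer.isNormalized(suspect, Normalizer.Form.NFC)); // false

        // Per 18): a normalizing transcoder yields the composed form U+00E9
        String nfc = Normalizer.normalize(suspect, Normalizer.Form.NFC);
        System.out.println(nfc.equals("\u00e9")); // true
    }
}
```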
19) String identity matching MUST be performed as if the following steps were followed:
*Early uniform normalization to fully-normalized form, as defined in 4.2.3 Fully-normalized
text. In accordance with section 4 Early Uniform Normalization, this step MUST be
performed by the producers of the strings to be compared.
*Conversion to a common encoding of UCS, if necessary.
*Expansion of all recognized character escapes and includes.
*Testing for bit-by-bit identity.
20) Forms of string matching other than identity matching SHOULD be performed as
if the following steps were followed:
*Steps 1 to 3 for string identity matching.
*Matching the strings in a way that is appropriate to the application.
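A minimal Java sketch (my own) of why the normalization step in 19) matters: two
representations of é fail bit-by-bit comparison until both are normalized to the
same form.

```java
import java.text.Normalizer;

public class MatchDemo {
    public static void main(String[] args) {
        String composed   = "\u00e9";   // é as a single code point
        String decomposed = "e\u0301";  // é as 'e' plus combining acute

        // Bit-by-bit identity fails on the raw strings
        System.out.println(composed.equals(decomposed)); // false

        // After normalizing both to NFC, identity matching succeeds
        String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true
    }
}
```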
21) The character string is RECOMMENDED as a basis for string indexing.
22) A code unit string MAY be used as a basis for string indexing if this results in
a significant improvement in the efficiency of internal operations when compared to
the use of character string.
23) Users of specifications (software developers, content developers) SHOULD, whenever
possible, prefer ways other than string indexing to identify substrings or points within
a string.
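The difference between 21) and 22) shows up directly in Java, where String indexing
is by UTF-16 code units; a sketch (my own) of indexing by character (code point)
instead.

```java
public class IndexDemo {
    public static void main(String[] args) {
        // 'a', then U+1D11E (a surrogate pair), then 'b'
        String s = "a\uD834\uDD1Eb";

        // Code unit indexing (22): 'b' sits at char index 3
        System.out.println(s.charAt(3)); // b

        // Character indexing (21): 'b' is the third character, code point offset 2
        int i = s.offsetByCodePoints(0, 2);
        System.out.println(s.codePointAt(i) == 'b'); // true
    }
}
```

The two index spaces disagree as soon as a supplementary character appears, which
is why 21) recommends the character string as the default basis.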
So how does an implementer respond to Charmod? We use code audits at Topologi:
we have had them for internationalization, accessibility, font metrics, design and unit tests.
I think Charmod provides a useful checklist for code inspection and awareness-raising:
almost all of it can be achieved simply by using the most recent version of standard APIs:
the guidelines suggest which ones to use.
On the other hand, we deliberately chose to violate 20 for our searches, because that is
appropriate: we use a third-party regular expression library so it is outside our capability
to expand numeric character references in text first. That is something that the regex
library developer should think about.
If you look at requirement 2) "Specifications and software MUST NOT assume a
one-to-one mapping between character codes and units of displayed text." you can
see it does not go very far: for example, handling a 'u' followed by a combining
umlaut is one thing, but handling Indic scripts, where a vowel sign can be displayed
before the consonant it logically follows, is another. The requirement as specified is actually
a really *low* bar: hence my characterization of it as being aimed more at
challenging ASCII assumptions rather than guaranteeing universal applications.
Note, for instance, that the Character Model does not place any requirement
on implementations that they handle discontinuous selection in mixed Arabic/Latin
text.