OASIS Mailing List Archives





   Summary of draft W3C Character Model for programmers


I thought XML-DEVers might be interested in a summary of what the new draft
W3C "Character Model for the World Wide Web"[1] says.  

Keep in mind that the character model is aimed first at standards developers, 
and second at implementers and users of standards.   If you don't do the things 
listed below, you cannot claim that your software fully conforms to the 
"Character Model for the WWW". (Of course, there may be good reasons not to: the 
model sets the bar for W3C specs and a goal for implementers.)


1-7 deal with ASCII assumptions that are no longer appropriate for Unicode programs.  
  This is basic I18n. I would say that in Java now it is hard not to do this.
8-14 deal with issues that are built into XML or are best practice for XML. 
15-18 deal with handling legacy data. This is the contentious one.
19-23 deal with strings, indexing and matching.


1) Specifications and software MUST NOT assume that there is a one-to-one
correspondence between characters and the sounds of a language.

2) Specifications and software MUST NOT assume a one-to-one mapping between 
character codes and units of displayed text.

3) Specifications and software MUST NOT assume that a single keystroke results 
in a single character, nor that a single character can be input with a single 
keystroke (even with modifiers), nor that keyboards are the same all over the world.

4) Software that sorts or searches text for users MUST do so on the basis of appropriate 
collation units and ordering rules for the relevant language and/or application.

5) Software that allows users to sort or search text SHOULD allow the user to select 
alternative rules for collation units and ordering.

6) When sorting and searching in the context of a particular language, it MUST be 
possible to deal gracefully with strings being compared that contain Unicode 
characters not normally associated with that language. 
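
To illustrate 4) to 6), here is a minimal Python sketch. The function name 
sort_for_user and the locale name "de_DE.UTF-8" are my own assumptions, and 
whether a given locale is installed depends on the system, so the sketch falls 
back to plain code point order when it is not.

```python
import locale

def sort_for_user(words, locale_name=None):
    """Sort words with the collation rules of locale_name if available.

    Falls back to plain code point order when the locale is not installed.
    """
    if locale_name is None:
        return sorted(words)
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(words, key=locale.strxfrm)
    except locale.Error:
        return sorted(words)  # locale not available on this system
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)

words = ["zebra", "Äpfel", "apple"]
print(sorted(words))  # code point order puts "Äpfel" after "zebra"
print(sort_for_user(words, "de_DE.UTF-8"))  # result depends on installed locales
```

Letting the caller pass the locale is one simple way to give the user a choice 
of ordering rules, and strings with characters from outside that language 
still sort without error.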

7) Specifications and software MUST NOT assume a one-to-one relationship between 
characters and units of physical storage.
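
A small Python sketch of why 7) matters: cutting a string at a byte count can 
split a character in two. The helper truncate_utf8 is a hypothetical name of 
mine, and UTF-8 is assumed as the storage encoding.

```python
def truncate_utf8(text, max_bytes):
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    raw = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops the incomplete byte sequence left at the cut point
    return raw.decode("utf-8", errors="ignore")

sample = "a\u20acb"  # "a", EURO SIGN, "b": 3 characters but 5 bytes in UTF-8
print(len(sample), len(sample.encode("utf-8")))
print(truncate_utf8(sample, 3))  # the euro sign does not fit, so it is dropped whole
```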

8) Receiving software MUST determine the encoding of data from available information 
according to appropriate specifications. When an IANA-registered charset name is 
recognized, receiving software MUST interpret the received data according to the 
encoding associated with the name in the IANA registry. When no charset is 
provided receiving software MUST adhere to the default encoding(s) specified in 
the specification
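
A rough Python sketch of 8), under some assumptions: decode_received is a 
hypothetical helper, the "utf-8" default stands in for whatever default the 
relevant specification actually names, and Python's codec registry is used as 
an approximation of the IANA registry (it knows many, but not all, registered 
names and aliases).

```python
import codecs

def decode_received(data, charset=None, default="utf-8"):
    """Decode bytes using the declared charset label, else the format default."""
    label = charset if charset else default
    try:
        codec = codecs.lookup(label)  # resolves names and many aliases
    except LookupError:
        raise ValueError(f"unrecognized charset label: {label!r}")
    return codec.decode(data)[0]  # decode returns (text, bytes_consumed)

print(decode_received(b"caf\xe9", "ISO-8859-1"))
print(decode_received("caf\u00e9".encode("utf-8")))  # no label: use the default
```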

9) Receiving software MAY recognize as many encodings (names and aliases) as 

10) Software MUST completely implement the mechanisms for character encoding 
identification and SHOULD implement them in such a way that they are easy to 
use (for instance in HTTP servers). On interfaces to other protocols, software 
SHOULD support conversion between Unicode encoding forms as well as any 
other necessary conversions.

11) Software and content MUST carefully follow conflict-resolution mechanisms 
where there is multiple or conflicting information about character encoding.

12) Escapes SHOULD be avoided when the characters to be expressed are 
representable in the character encoding of the document.

13) Since character set standards usually list character numbers as hexadecimal, 
content SHOULD use the hexadecimal form of character escapes when there is one.
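
For 12) and 13), a serializer can write each character directly when the 
document encoding can represent it and fall back to a hexadecimal escape only 
when it cannot. This is a minimal Python sketch; escape_minimal is a 
hypothetical name, and it assumes XML-style numeric character references. It 
does not escape markup characters such as "&" or "<".

```python
def escape_minimal(text, encoding):
    """Write characters directly when the encoding allows, else as &#xHHHH;."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)  # representable: no escape needed
            out.append(ch)
        except UnicodeEncodeError:
            out.append(f"&#x{ord(ch):X};")  # hexadecimal form of the escape
    return "".join(out)

# e-acute exists in ISO-8859-1, the euro sign does not
print(escape_minimal("caf\u00e9 \u20ac", "iso-8859-1"))
```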

14) Choose an encoding for the document that maximizes the opportunity to directly 
represent characters and minimizes the need to represent characters by markup means 
such as character escapes. In general, if the first encoding choice is not satisfactory, 
Unicode is the next best choice, for its large character repertoire and its wide base of 

15) A text-processing component that receives suspect text MUST NOT perform any 
normalization-sensitive operations unless it has first confirmed through inspection 
that the text is in normalized form, and MUST NOT normalize the suspect text. 
Private agreements MAY, however, be created within private systems which are 
not subject to these rules, but any externally observable results MUST be the same 
as if the rules had been obeyed.
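
One way to read 15) in Python terms: inspect, but do not repair. The sketch 
below assumes Python 3.8 or later for unicodedata.is_normalized, and it checks 
NFC only, which is just one part of what Charmod calls fully-normalized text. 
The function name is hypothetical.

```python
import unicodedata

def accept_if_normalized(text):
    """Return text unchanged if it is NFC; refuse it otherwise."""
    if not unicodedata.is_normalized("NFC", text):
        # The suspect text is rejected, not silently normalized.
        raise ValueError("suspect text is not in NFC; refusing to process it")
    return text

print(accept_if_normalized("caf\u00e9"))  # precomposed e-acute passes the check
```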

16)  A text-processing component which modifies text and performs normalization-sensitive 
operations MUST behave as if normalization took place after each modification, so that 
any subsequent normalization-sensitive operations always behave as if they were dealing 
with normalized text.
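
The classic case for 16) is concatenation: two strings can each be normalized 
and still combine into something that is not. A minimal Python sketch, with 
append_text as a hypothetical helper and NFC standing in for the model's 
normalization form:

```python
import unicodedata

def append_text(existing, addition):
    """Concatenate, then renormalize so later operations see NFC text."""
    return unicodedata.normalize("NFC", existing + addition)

# "cafe" and a lone combining acute accent are each NFC on their own,
# but their plain concatenation is not.
joined = append_text("cafe", "\u0301")
print(len(joined), unicodedata.is_normalized("NFC", joined))
```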

17) Authoring tool implementations for a (formal) language that does not mandate full-
normalization SHOULD prevent users from creating content with composing characters 
at the beginning of constructs that may be significant, such as at the beginning of an 
entity that will be included, immediately after a construct that causes inclusion or 
immediately after markup, or SHOULD warn users when they do so.

18) Implementations which transcode text from a legacy encoding to a Unicode encoding 
form MUST use a normalizing transcoder.
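
A normalizing transcoder for 18) can be sketched in Python as a decode 
followed by normalization. The helper name is hypothetical, NFC is assumed as 
the target form, and windows-1258 (Python's "cp1258") is used as the example 
because it stores some accents as separate combining characters.

```python
import unicodedata

def transcode_normalizing(data, legacy_encoding):
    """Decode legacy bytes to Unicode and normalize the result to NFC."""
    return unicodedata.normalize("NFC", data.decode(legacy_encoding))

# Build legacy bytes for "e" followed by a combining acute accent.
legacy = "e\u0301".encode("cp1258")
plain = legacy.decode("cp1258")                   # two code points
fixed = transcode_normalizing(legacy, "cp1258")   # one precomposed code point
print(len(plain), len(fixed))
```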

19) String identity matching MUST be performed as if the following steps were followed: 

*Early uniform normalization to fully-normalized form, as defined in 4.2.3 Fully-normalized 
text. In accordance with section 4 Early Uniform Normalization, this step MUST be 
performed by the producers of the strings to be compared.

*Conversion to a common encoding of UCS, if necessary.

*Expansion of all recognized character escapes and includes.

*Testing for bit-by-bit identity.
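
Seen from the consumer's side, the last three steps can be sketched in Python 
as follows. The names are hypothetical, only XML-style numeric character 
references are treated as escapes, and includes are not handled. Comparing 
Python strings compares code points, which is equivalent to bit-by-bit 
identity once both sides are in one encoding.

```python
import re

_NCR = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def expand_escapes(text):
    """Expand XML-style numeric character references."""
    def repl(m):
        return chr(int(m.group(1), 16) if m.group(1) else int(m.group(2)))
    return _NCR.sub(repl, text)

def identity_match(a, a_encoding, b, b_encoding):
    """Compare two byte strings as if converted to one encoding, escapes expanded."""
    # No normalization here: the first step is the job of the producers.
    return expand_escapes(a.decode(a_encoding)) == expand_escapes(b.decode(b_encoding))

same = identity_match("caf\u00e9".encode("utf-8"), "utf-8",
                      "caf&#xE9;".encode("utf-16"), "utf-16")
different = identity_match("cafe\u0301".encode("utf-8"), "utf-8",
                           "caf\u00e9".encode("utf-8"), "utf-8")
print(same, different)
```

The second comparison fails because one producer did not normalize early, 
which is exactly the situation the first step is meant to rule out.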

20) Forms of string matching other than identity matching SHOULD be performed as 
if the following steps were followed:

*Steps 1 to 3 for string identity matching.
*Matching the strings in a way that is appropriate to the application.

21) The character string is RECOMMENDED as a basis for string indexing. 

22) A code unit string MAY be used as a basis for string indexing if this results in a significant improvement in the efficiency of internal operations when compared to the use of character string. 

23) Users of specifications (software developers, content developers) SHOULD whenever 
possible prefer ways other than string indexing to identify substrings or point within a string.
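
The difference between indexing by character string and by code unit string 
in 21) and 22) shows up as soon as a character outside the Basic Multilingual 
Plane appears. A minimal Python sketch, where utf16_offset is a hypothetical 
helper and UTF-16 is assumed as the code unit form:

```python
def utf16_offset(text, char_index):
    """Map a character (code point) index to a UTF-16 code unit offset."""
    return len(text[:char_index].encode("utf-16-le")) // 2

s = "a\U0001D11Eb"  # U+1D11E MUSICAL SYMBOL G CLEF needs two UTF-16 code units
print(len(s), len(s.encode("utf-16-le")) // 2)  # characters vs code units
print(utf16_offset(s, 2))  # where "b" sits when counting code units
```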

So how does an implementer respond to Charmod?  At Topologi we use code audits: 
we have had them for internationalization, accessibility, font metrics, design and unit tests. 
I think Charmod provides a useful checklist for code inspection and awareness-raising. 
Almost all of it can be achieved simply by using the most recent version of standard APIs, 
and the guidelines suggest which ones to use.  

On the other hand, we deliberately chose to violate 20 for our searches, because that is 
appropriate for us: we use a third-party regular expression library, so expanding numeric 
character references in the text first is outside our capability. That is something that 
the regex library developer should think about. 

If you look at requirement 2) "Specifications and software MUST NOT assume a 
one-to-one mapping between character codes and units of displayed text." you can
see it does not go very far. For example, handling a U character followed by an
umlaut character is one thing, but handling Indic languages, where accents can
go before the character, is another.  The requirement as specified is actually 
a really *low* bar: hence my characterization of it as being aimed more at
challenging ASCII assumptions than at guaranteeing universal applications.
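
To make the two cases concrete, here is a rough Python sketch that groups each 
base character with the marks that follow it. It is only an approximation of 
proper text segmentation (Unicode's grapheme cluster rules are more involved), 
and rough_clusters is a hypothetical name.

```python
import unicodedata

def rough_clusters(text):
    """Group each base character with the marks (Mn, Mc, Me) that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters

# U followed by a combining diaeresis: two character codes, one displayed unit.
print(len("u\u0308"), len(rough_clusters("u\u0308")))
# DEVANAGARI LETTER KA + VOWEL SIGN I: stored after the consonant,
# but the vowel sign is displayed before it.
print(len("\u0915\u093f"), len(rough_clusters("\u0915\u093f")))
```

Counting units is the easy part; getting the Devanagari case right on screen 
is a rendering problem that this requirement does not address.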

Note, for instance, that the Character Model does not place any requirement
on implementations that they handle discontinuous selection in Arabic/Latin

Rick Jelliffe

[1] http://www.w3.org/TR/charmod/



Copyright 2001 XML.org. This site is hosted by OASIS