Re: [xml-dev] I understand "codepoints" ... hurrah!


Others have provided some clarifications/corrections to your text, so 
I will avoid repeating those.

My remaining comment deals with your first paragraph, in which you 
say that "the two letters look exactly the same [on the 
screen]".  The word "look" in this context refers to the appearance 
of the glyph, which is of course a font issue.  That is, in a 
collection of, say, English-language documents, the lower-case 
version of the third letter of the Latin alphabet as used in English, 
pronounced "see", can have a very wide variety of appearances due to 
the use of different fonts.  The same is true of the letter 
pronounced "ess" in the Cyrillic alphabet.

It is fair to say that the "canonical" shapes of the Latin/English 
"see" and the Cyrillic "ess" are substantially the same.

(It's worth noting parenthetically that two characters (indeed, in a 
number of cases, far more than two) can share the same "canonical" 
shape while remaining, by definition, different characters, since 
they have different codepoints.  That is a potential cause of major 
security problems in URIs.)

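As a rough illustration of that parenthetical point, here is a small Python sketch that compares the Latin "see" and the Cyrillic "ess". The host name "example.com" and the helper name describe are made up for the illustration. The codepoints U+0063 and U+0441 are the ones given in the quoted message below.

```python
import unicodedata

latin_c = "\u0063"      # Latin small letter c ("see")
cyrillic_es = "\u0441"  # Cyrillic small letter es ("ess")


def describe(ch):
    """Return the codepoint and the Unicode character name of ch."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Hypothetical host names: the second swaps the Latin c for the Cyrillic es.
latin_host = "example.com"
mixed_host = latin_host.replace(latin_c, cyrillic_es)

print(describe(latin_c))
print(describe(cyrillic_es))
print(latin_host == mixed_host)  # compared by codepoints, not by appearance
```

In many fonts the two host names are indistinguishable on the screen, yet the comparison treats them as different strings, because it looks at codepoints rather than glyphs.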
With that in mind, my summary of codepoints is:

* Digital devices process only bits, which are usually collected into 
units (typically units of 8, 16, 32, 64, or 128 bits).  Those units 
-- groups of bits -- are used to represent several kinds of data, 
including integer values and, using schemes in which the bits of a 
unit are subdivided, approximate numeric values.  Textual data, made 
up of characters, is represented by assigning each character a code 
made up of one or more of those units.  Those codes are treated as 
integer values (which can, of course, be expressed in binary, octal, 
decimal, hexadecimal, and any other radix).

* A character encoding scheme (frequently the subject of a de jure or 
de facto standard) is an assignment of a single unique number to each 
character encompassed by the scheme.  Within such a scheme, the code 
assigned to a character is that character's codepoint.  There are 
many such schemes.

* Assignment of numbers to characters does not relate to the visual 
appearance of characters, but to the semantics of the 
characters.  For example, characters that have similar appearances in 
some "canonical" form, but that belong to different scripts or have 
significantly different usages, are not merely variations of the same 
character, but are unique characters and consequently have different 
numbers assigned to them.

* Unicode is a scheme (actually, a de facto standard) that assigns a 
single unique number to each character in every script and every 
language.  (Observe that not all scripts and languages have yet been 
incorporated into the scheme, but it is anticipated that the work 
will eventually be completed.)
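
The points above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the helper names radix_views and code_units are invented here, and UTF-8 and UTF-16 are not mentioned in this message; they are used simply as two familiar ways of storing the same codepoint in units of different sizes.

```python
def radix_views(ch):
    """Show one character's codepoint as an integer in several radixes."""
    cp = ord(ch)  # the codepoint, as a plain integer
    return {
        "binary": format(cp, "b"),
        "octal": format(cp, "o"),
        "decimal": str(cp),
        "hex": format(cp, "X"),
    }


def code_units(ch, encoding):
    """Return the stored bytes for ch in the given encoding, as hex digits."""
    return ch.encode(encoding).hex()


for ch in ("\u0063", "\u0441"):  # Latin c, Cyrillic es
    print(radix_views(ch), code_units(ch, "utf-8"), code_units(ch, "utf-16-be"))
```

The codepoint is the same number whichever radix is used to write it down. The bits actually stored for it depend on the encoding chosen: the Cyrillic letter takes two 8-bit units in UTF-8 but one 16-bit unit in UTF-16.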

Hope this helps,

At 11/27/2012 06:42 AM, Costello, Roger L. wrote:
>Hi Folks,
>Below is my understanding of codepoints.
>Is it correct? Easy to understand? Complete?
>Suggestions welcome.
>Unicode Codepoints
>In the English alphabet there is a letter 'c'. In the Cyrillic 
>alphabet there is also a letter 'c'. On computer screens the two 
>letters look exactly the same, yet they are completely different 
>letters. A URL containing the letter 'c' from the English alphabet 
>will cause browsers to go to a completely different location than 
>the same URL containing the letter 'c' from the Cyrillic alphabet. 
>The URLs appear exactly the same on the screen but cause browsers to 
>go to completely different locations.
>Let's see why.
>Inside a computer there are no letters. There are only zeroes and 
>ones. Each zero and one is called a bit. When you see the letter 'c' 
>on the screen it is a visual representation of a sequence of bits. 
>That sequence of bits is the encoding for the letter 'c'.  The 
>Unicode Consortium is a standards organization that has devised 
>encodings for every character in every language. This set of 
>encodings is called the Unicode encodings.
>Each encoding is called a codepoint.
>Recall that an encoding is a sequence of zeroes and ones. If we 
>think of the sequence of zeroes and ones as representing a binary 
>number, then the encoding corresponds to a number.
>Unicode provides a unique number for every character, no matter what 
>the platform, no matter what the program, no matter what the language.
>Through simple arithmetic a binary number can be converted to a 
>decimal or a hexadecimal value. So, rather than referring to an 
>encoding by its sequence of bits or by a binary number, it can be 
>referred to by a decimal or hexadecimal value.
>Example: suppose that inside the computer is this sequence of bits: 0110 0011
>The Unicode Consortium has decided that that sequence of bits is to 
>be the encoding for the letter 'c' in the English alphabet. If we 
>interpret that sequence of bits as a number, then it is the decimal 
>number 99. In hexadecimal it is the number 63. The Unicode 
>Consortium typically uses the hexadecimal (hex) value. So the letter 
>'c' in the English alphabet is encoded as hex 63. The Unicode 
>Consortium typically shows encodings in this form: U+xxxx, where 
>xxxx is a 4-digit hex value. The letter 'c' in the English alphabet 
>is encoded as U+0063. That encoding -- U+0063 -- is the codepoint 
>for the letter 'c' in the English alphabet.
>Remember I said that the letter 'c' in the English alphabet is 
>completely different than the letter 'c' in the Cyrillic alphabet. 
>Here's why: the codepoint for  the letter 'c' in the English 
>alphabet is U+0063 whereas the codepoint for the letter 'c' in the 
>Cyrillic alphabet is U+0441. While the two letters appear identical 
>on the computer screen, inside the computer are completely different 
>sequences of bits.

Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
   Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG    Fax : +1.801.942.3345
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Alternate email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA  Personal email: SheltieJim at xmission dot com
=  Facts are facts.   But any opinions expressed are the opinions      =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
