I understand "codepoints" ... hurrah!
- From: "Costello, Roger L." <costello@mitre.org>
- To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Tue, 27 Nov 2012 13:42:53 +0000
Hi Folks,
Below is my understanding of codepoints.
Is it correct? Easy to understand? Complete?
Suggestions welcome.
-------------------------
Unicode Codepoints
-------------------------
In the English alphabet there is a letter 'c'. In the Cyrillic alphabet there is also a letter 'c'. On computer screens the two letters look exactly the same, yet they are completely different letters. A URL containing the English 'c' will cause browsers to go to a completely different location than the same URL containing the Cyrillic 'c', even though the two URLs look identical on the screen.
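To make this concrete, here is a small Python sketch (the language and the host name are my own choices for illustration; \u0441 writes the Cyrillic letter, whose codepoint is explained later in this note):

    # Two host names that render identically on screen.
    latin = "cat.example.org"          # starts with the English letter c
    cyrillic = "\u0441at.example.org"  # starts with the Cyrillic lookalike

    print(latin == cyrillic)   # False -- different letters, different URLs
    print(latin, cyrillic)     # cat.example.org  сat.example.org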
Let's see why.
Inside a computer there are no letters. There are only zeroes and ones. Each zero and one is called a bit. When you see the letter 'c' on the screen it is a visual representation of a sequence of bits. The Unicode Consortium is a standards organization that has assigned a unique number to every character in every language. This set of assignments is called the Unicode standard.
Each number assigned to a character is called a codepoint.
A codepoint is just a number, and any number can be written in binary, as a sequence of zeroes and ones. (Strictly speaking, the bits actually stored in the computer are produced from the codepoint by an encoding such as UTF-8 or UTF-16; for ASCII-range characters like the 'c' in the example below, the UTF-8 bits happen to equal the binary form of the codepoint.)
As the Unicode Consortium's own slogan puts it: Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Through simple arithmetic a binary number can be converted to a decimal or a hexadecimal value. So, rather than referring to a codepoint by its sequence of bits, we can refer to it by a decimal or hexadecimal value.
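For instance, Python can do the same arithmetic directly (a sketch; the bit pattern here is arbitrary):

    bits = "01011010"    # an arbitrary 8-bit pattern
    n = int(bits, 2)     # read the bits as a binary number
    print(n)             # 90 -- the same number in decimal
    print(hex(n))        # 0x5a -- the same number in hexadecimal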
Example: suppose that inside the computer is this sequence of bits: 0110 0011
If we interpret that sequence of bits as a binary number, it is the decimal number 99, or 63 in hexadecimal. The Unicode Consortium has assigned that number to the letter 'c' in the English alphabet. The Consortium typically uses the hexadecimal (hex) value and writes it in the form U+xxxx, where xxxx is a 4-digit hex value. So the letter 'c' in the English alphabet is written U+0063. That number -- U+0063 -- is the codepoint for the letter 'c' in the English alphabet.
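You can check this in Python, whose built-in ord() function returns a character's codepoint as a number (a sketch):

    c = "c"
    print(ord(c))                     # 99 -- the codepoint in decimal
    print(hex(ord(c)))                # 0x63 -- the codepoint in hex
    print("U+{:04X}".format(ord(c)))  # U+0063 -- Unicode's usual notation
    print(chr(0x63))                  # c -- chr() maps a codepoint back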
Remember I said that the letter 'c' in the English alphabet is completely different from the letter 'c' in the Cyrillic alphabet. Here's why: the codepoint for the English letter is U+0063, whereas the codepoint for the Cyrillic letter is U+0441. The two letters appear identical on the screen, but inside the computer they are stored as completely different sequences of bits.
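The difference is easy to demonstrate with Python's standard unicodedata module (a sketch):

    import unicodedata

    latin_c = "\u0063"     # the English letter
    cyrillic_c = "\u0441"  # the Cyrillic lookalike

    print(latin_c == cyrillic_c)         # False -- different codepoints
    print(unicodedata.name(latin_c))     # LATIN SMALL LETTER C
    print(unicodedata.name(cyrillic_c))  # CYRILLIC SMALL LETTER ES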