How, you might ask, do we pack 32-bit Unicode characters into 8-bit bytes? Quite prevalent on the Web and the Internet generally is the UTF-8 encoding, which allows any of the Unicode characters to be represented as a string of one or more 8-bit bytes.
First, some definitions:
A code point is a number representing a unique member of the Unicode character set.
The Unicode code points are visualized as a three-dimensional structure made of planes, each of which has a range of 65536 code points organized as 256 rows of 256 columns each.
The low-order eight bits of the code point select the column; the next eight more significant bits select the row; and the remaining most significant bits select the plane.
This diagram shows how UTF-8 encoding works. The first
128 code points (hexadecimal 00 through 7F) are encoded
as in normal 7-bit ASCII, with the high-order bit always 0. For
code points above hex 7F, all bytes have the high-order
0x80) bit set, and the bits of the code
point are distributed through two, three, or four
bytes, depending on the number of bits needed to
represent the code point value.
To encode a Unicode string
, use this method:
To decode a regular
that contains a
UTF-8 encoded value, use this method:
>>> tilde='~' >>> tilde.encode('utf_8') '~' >>> u16 = u'\u0456' >>> s = u16.encode('utf_8') >>> s '\xd1\x96' >>> s.decode('utf_8') u'\u0456' >>> u32 = u'\U000E1234' >>> s = u32.encode('utf_8') >>> s '\xf3\xa1\x88\xb4' >>> s.decode('utf_8') u'\U000e1234'
UTF-8 is not the only encoding method. For more
details, consult the documentation
for the Python module