Next / Previous / Contents / TCC Help System / NM Tech homepage

10.1. The UTF-8 encoding

How, you might ask, do we pack 32-bit Unicode characters into 8-bit bytes? Quite prevalent on the Web and the Internet generally is the UTF-8 encoding, which allows any of the Unicode characters to be represented as a string of one or more 8-bit bytes.

First, some definitions:

This diagram shows how UTF-8 encoding works. The first 128 code points (hexadecimal 00 through 7F) are encoded as in normal 7-bit ASCII, with the high-order bit always 0. For code points above hex 7F, all bytes have the high-order (0x80) bit set, and the bits of the code point are distributed through two, three, or four bytes, depending on the number of bits needed to represent the code point value.

To encode a Unicode string U, use this method:

U.encode('utf_8')

To decode a regular str value S that contains a UTF-8 encoded value, use this method:

S.decode('utf_8')

Examples:

>>> tilde='~'
>>> tilde.encode('utf_8')
'~'
>>> u16 = u'\u0456'
>>> s = u16.encode('utf_8')
>>> s
'\xd1\x96'
>>> s.decode('utf_8')
u'\u0456'
>>> u32 = u'\U000E1234'
>>> s = u32.encode('utf_8')
>>> s
'\xf3\xa1\x88\xb4'
>>> s.decode('utf_8')
u'\U000e1234'

UTF-8 is not the only encoding method. For more details, consult the documentation for the Python module codecs.