Next / Previous / Contents / TCC Help System / NM Tech homepage

10. Type unicode: Strings of 32-bit characters

With the advent of the Web as medium for worldwide information interchange, the Unicode character set has become vital. For general background on this character set, see the Unicode homepage.

To get a Unicode string, prefix the string with u. For example:

u'klarn'

is a five-character Unicode string.

To include one of the special Unicode characters in a string constant, use these escape sequences:

\xHH For a code with the 8-bit hexadecimal value HH.
\uHHHH For a code with the 16-bit hexadecimal value HHHH.
\UHHHHHHHH For a code with the 32-bit hexadecimal value HHHHHHHH.

Examples:

>>> u'Klarn.'
u'Klarn.'
>>> u'Non-breaking-\xa0-space.'
u'Non-breaking-\xa0-space.'
>>> u'Less-than-or-equal symbol: \u2264'
u'Less-than-or-equal symbol: \u2264'
>>> u"Phoenician letter 'wau': \U00010905"
u"Phoenician letter 'wau': \U00010905"
>>> len(u'\U00010905')
1

All the operators and methods of str type are available with unicode values.

Additionally, for a Unicode value U, use this method to encode its value as a string of type str:

U.encode ( encoding[, error )

Return the value of U as type str. The encoding argument is a string that specifies the encoding method. In most cases, this will be 'utf_8'. For discussion and examples, see Section 10.1, “The UTF-8 encoding”.

The optional error string specifies what to do with characters that do not have exact equivalents. For example, if you are converting to the ASCII character set, the encoding argument is 'ascii'. Values of the error argument are given in the table below.

'strict' Raise a UnicodeError exception if any character has no ASCII equivalent. This is the default behavior.
'ignore' Leave out characters that have no equivalent.
'replace' Substitute a '?' for each character that has no equivalent.
'xmlcharrefreplace' Use the XML character entity escape sequence for characters with no ASCII equivalent. The general form of this sequence is "&#N;", where N is the decimal value of the character's code point. This feature is very handy for generating internationalized Web pages.
'backslashreplace' Use Python backslash escape sequences to represent characters with no equivalent.

Here are some examples to demonstrate error argument values.

>>> s = u"a\u262ez"
>>> len(s)
3
>>> s
u'a\u262ez'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u262e'
in position 1: ordinal not in range(128)
>>> s.encode('ascii', 'ignore')
'az'
>>> s.encode('ascii', 'replace')
'a?z'
>>> s.encode('ascii', 'xmlcharrefreplace')
'a&#9774;z'
>>> hex(9774)
'0x262e'
>>> t = s.encode('ascii', 'backslashreplace')
>>> t
'a\\u262eb'
>>> print t
a\u262eb
>>> len(t)
8
>>> t[1]
'\\'