Using Unicode Literals in Python

By: Guido Viewed: 161 times  Printer Friendly Format    


In Python source code, specific Unicode code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four:

>>> s = "a\xac\u1234\u20ac\U00008000"
          ^^^^ two-digit hex escape
               ^^^^^ four-digit Unicode escape
                          ^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s:  print(ord(c), end=" ")
...
97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you’re using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the chr() built-in function, but this is even more tedious.

Ideally, you’d want to be able to write literals in your language’s natural encoding. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = 'abcdé'
print(ord(u[-1]))

The syntax is inspired by Emacs’s notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports ‘coding’. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.

If you don’t include such a comment, the default encoding used will be UTF-8.



Most Viewed Articles (in Python )

Latest Articles (in Python)

Comment on this tutorial