Strings and unicode in Python
20 Sep 2010
This post will be about strings and encodings in Python, with an eye to Python 3 migration. Hopefully, you will find some useful rules of thumb to navigate these sometimes perplexing questions.
'String' types
What you commonly call a string can be two different things:
- bytes
  - more low level, it's the actual byte sequence representing the data
- unicode
  - more abstract, it can be thought of as the sequence of human-meaningful symbols which we use in written communication and which you learn at school.
Now this is extremely rough and even incorrect (unicode is just one standard, and there are non-unicode encodings), but the aim is to get a working mental model. The important thing is that, seen from this very high altitude, both these concepts exist in both versions of Python, but they have changed name.
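To make the two views concrete, here is a minimal Python 3 sketch. The sample word and the variable names are my own choices for illustration; it simply compares how many symbols a reader sees with how many bytes are needed to represent them in utf-8.

```python
# Python 3 sketch: the same word seen as text and as bytes.
text = "na\u00efve"          # the word "naive" with a diaeresis on the i
data = text.encode("utf-8")  # the byte sequence that represents it

print(type(text).__name__, len(text))  # str 5
print(type(data).__name__, len(data))  # bytes 6
print(list(data))                      # the raw byte values
```

The accented letter is a single symbol for the reader but takes two bytes in utf-8, which is why the two lengths differ.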
Encoding/decoding
Where the practical problems arise is the encoding/decoding dance.
To get the unicode from the bytes, you decode. To get the bytes from unicode, you encode. This is confusing, as unicode is a kind of encoding, so you might think that to get the unicode you need to encode. No! You decode the bytes to get unicode! And you encode the unicode to get bytes!
This might be another way of defining bytes and unicode: if it needs decoding to become more abstract, alphabet or ideogram-like, it's bytes. If it needs encoding to become low level, it's unicode.
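The direction of the dance can be sketched as a round trip. This is Python 3 code, and the sample string is an assumption of mine; the only point is which method lives on which type.

```python
# Python 3 sketch of the encode/decode round trip.
text = "peacock \u2713"          # unicode: abstract characters (ends with a check mark)
data = text.encode("utf-8")      # encode: unicode -> bytes
restored = data.decode("utf-8")  # decode: bytes -> unicode

print(type(data).__name__)       # bytes
print(type(restored).__name__)   # str
print(restored == text)          # True
```

As long as the same encoding name is used in both directions, the text comes back unchanged.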
Now, things are much the same in Python 3 and 2, except that in Python 2 what you think of as strings (things between quotation marks) are bytes by default (without u in front), while in Python 3 they are unicode by default (and putting a u in front is invalid syntax in Python 3.0 through 3.2; Python 3.3 and later accept it again, with no effect). It's almost as if the unicode type in Python 2 had become the string type in Python 3, and the Python 2 str had simply evaporated (actually not really, we have bytes, which have a much simplified API). The Python 2 strings need to be decoded (they are more like bytes), the Python 3 strings are already decoded (they are unicode).
Examples
In your interactive interpreter:
Python 2:
>>> type('peacock')
<type 'str'>
>>> 'peacock'.decode('utf-8')
u'peacock'
>>> type('peacock'.decode('utf-8'))
<type 'unicode'>
>>> type(u'peacock'.encode('utf-8'))
<type 'str'>
You might have heard that to prepare a migration to Python 3, you can use
from __future__ import unicode_literals
in your Python 2 code.
Now, let it be clear what this does. It does not replace the string type in Python 2 with the one in Python 3: objects of type str are still encoded strings (read: bytes). What it does is change the type of unadorned string literals: things such as 'peacock', which in Python 2 are (Python 2) str, become (Python 2) unicode with unicode_literals.
>>> from __future__ import unicode_literals
>>> type('peacock')
<type 'unicode'>
>>> type('peacock'.encode('utf-8'))
<type 'str'>
Now Python 3:
>>> type('peacock')
<class 'str'>
>>> 'peacock'.decode('utf-8')
Throws an AttributeError. Even though the type is still named 'str', it is now a unicode string and is already 'decoded'. But:
>>> type('peacock'.encode('utf-8'))
<class 'bytes'>
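The same asymmetry can be checked in a script rather than at the prompt. The following is a small Python 3 sketch (variable names are mine): each type only carries the method that moves it towards the other one, and the two types refuse to be mixed.

```python
# Python 3 sketch: str can only encode, bytes can only decode.
text = "peacock"
data = b"peacock"

print(hasattr(text, "encode"), hasattr(text, "decode"))  # True False
print(hasattr(data, "decode"), hasattr(data, "encode"))  # True False

# Mixing the two types is an error instead of a silent conversion.
try:
    text + data
except TypeError as exc:
    mixing_error = exc
    print("cannot mix str and bytes:", exc)
```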
Now some input/output. Python 2:
>>> with open('LICENSE.txt', 'r') as f:
...     text = f.read()
...
>>> type(text)
<type 'str'>
>>> type(text.decode('utf-8'))
<type 'unicode'>
Python 3:
>>> with open('example.txt', 'r') as f:
...     text = f.read()
...
>>> type(text)
<class 'str'>
>>> type(text.decode('utf-8'))
Throws AttributeError; files opened in text mode return unicode by default.
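If you do want the raw bytes in Python 3, you can open the file in binary mode. Here is a minimal sketch of both modes; the temporary file and its content are made up for the demonstration, and the encoding is passed explicitly rather than relying on the platform default.

```python
import os
import tempfile

# Hypothetical file, created in a temporary directory just for this demo.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Text mode with an explicit encoding: we write and read unicode (str).
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")

with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: we get the undecoded bytes and decode them ourselves.
with open(path, "rb") as f:
    raw = f.read()

print(type(text).__name__)          # str
print(type(raw).__name__)           # bytes
print(raw.decode("utf-8") == text)  # True
```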
Conclusion
To sum up, there is still a type called 'str' in Python 3, but it behaves much more like the Python 2 'unicode'.
As a consequence, there is no 'unicode' type anymore in Python 3.
from __future__ import unicode_literals will make quotation marks-enclosed thingies 'unicode' objects in Python 2. It will not transform them into Python 3 'str', nor will it introduce Python 3 'str' anywhere.
In short, if something expects unicode in Python 2, it can take a Python 3 str without too many problems. The problem is when something expects a Python 2 'str', as the Python 3 'str' is a totally different thing. The closest approximation is the Python 3 'bytes' type, but it lacks many of the Python 2 'str' methods (which may cause problems, see the WSGI discussion).
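One common way to cope with this, offered only as a sketch and not as the one true method, is to normalise values at the boundaries of your program. The helper names to_text and to_bytes below are hypothetical, the code targets Python 3, and the utf-8 default is an assumption you should adapt to your data.

```python
def to_text(value, encoding="utf-8"):
    """Return value as unicode text (Python 3 str), decoding bytes if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding unicode text if needed."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


print(to_text(b"peacock"))   # peacock
print(to_bytes("peacock"))   # b'peacock'
```

Code that expects text calls to_text on what it receives, and code that talks to the byte-oriented world calls to_bytes, so the rest of the program handles only one of the two types.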
Postscript: some more encodings
You might wonder what that 'utf-8' parameter means. Well, as I said at the beginning, there are various ways of encoding characters. Most of the time in new programs you will want to use utf-8, but you might also need to handle other encodings. Some are defined in Unicode, most are not. It really does not matter from a practical standpoint. For instance, if you work with files originating from Western Europe, you will often be confronted with ISO-8859-1 (commonly named latin-1) and Windows-1252 (which, characteristically, is almost the same). These are not defined in the Unicode standard, but they are handled in exactly the same way in Python programs. That is, replace 'utf-8' with the appropriate string you can obtain from this handy reference.
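To see what "almost the same" means in practice, here is a short Python 3 sketch using the codec names 'utf-8', 'latin-1' and 'cp1252' (Python's name for Windows-1252). The two sample characters are my own choice.

```python
e_acute = "\u00e9"  # e with an acute accent
euro = "\u20ac"     # the euro sign

print(e_acute.encode("utf-8"))    # b'\xc3\xa9'
print(e_acute.encode("latin-1"))  # b'\xe9'
print(e_acute.encode("cp1252"))   # b'\xe9'
print(euro.encode("cp1252"))      # b'\x80'

# latin-1 has no euro sign, so encoding it fails.
try:
    euro.encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# Decoding with the wrong encoding can fail too: a lone 0xe9 is not valid utf-8.
try:
    b"\xe9".decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

print(latin1_has_euro, decoded_as_utf8)  # False False
```

The accented letter gets the same byte in latin-1 and Windows-1252, while the euro sign exists only in the latter, so knowing which encoding a file really uses still matters.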