Apr 2009

Why I love Python 3.0: Unicode + UTF-8

tl;dr summary

Python pre-3.0              Python post-3.0
str.encode                  bytes.translate or (new) str.encode
str.decode                  bytes.decode
unicode                     str
unicode.encode              str.encode
unicode.decode              n/a
str("x") == unicode("x")    bytes("x") != str("x")

This change in Python 3.0 should be more than welcome to anybody intending to write programs that use more than the ASCII characters (A-Z, a-z, 0-9, and some symbols), which, given how i18n'ed most applications are today, is the norm rather than the exception. I also hope to encourage my fellow Pythoneers to update to 3.0 as soon as humanly possible, not only because of this change, but because of the general advantages of Python 3.0 (aka "nowhere near 3000"...).

In case you do not understand the difference between Unicode and String arrays, here is a short paragraph to get you started. A String (str in pre-3.0 Python, bytes/bytearray in Python 3.x) is a byte array already bound to a specific character-lookup table (e.g. ASCII, Latin-1, UTF-8) that defines the correct representation of each byte. Note that this is not the glyph itself you see on screen, as that depends on, e.g., what font you are using, and is handled by the GUI toolkit or the terminal. A Unicode array (unicode in pre-3.0, str in 3.x), on the other hand, is an array of "universal" values, so-called code points, usually managed as two-byte units, and has no native byte representation. Therefore, to create something readable from a Unicode object, you have to encode its code points using a codetable, such as ASCII or UTF-16, into the corresponding String representation ("bind the Unicode array to a code table"). Conversely, to create a Unicode array from a String array, you need to decode ("unbind") the String's encoding to get back to the "universal" (in quotation marks, as not all programming languages have to use two-byte integers for this) Unicode values. If you are not used to thinking in these terms, a general tip for pre-3.0 Python: your program should, when handling String input (SAX parsers, for example, already do the conversion for you), convert it to Unicode (decode the Strings), work with Unicode internally to avoid bugs and possible exploits, and, when outputting, convert your Unicode arrays back to the desired String representation (encode them). A (rather contrived, but you can extrapolate the danger, I hope) snippet from Python's Unicode HOWTO might exemplify this:

def read_file(filename, encoding):
    if '/' in filename:
        raise ValueError(u"'/' not allowed in filename")
    return open(filename.decode(encoding), 'r')

Looks good at first, but what about sending that function a String in a non-standard encoding? For example, the UTF-7 encoding of u"/etc/passwd" is "+AC8-etc+AC8-passwd": a nasty mistake if that file is presented to a user... (The work-around in this trivial example is obvious: just decode before the if-clause, or, even better, when the string enters your program, and compare against u'/'.) To summarize, in Python (not so in C, for example!) a Unicode array consists of two-byte elements called code points, while Strings are arrays of bytes bound to a codetable that helps the Python interpreter look up the bytes' character representations and send them to your terminal or GUI. Unicode-to-String conversion is called encoding ("binding"), String-to-Unicode conversion is decoding ("unbinding"). The fact that, when using the Python shell, you see "real" characters for a String or Unicode object is pure convenience and should not distract you from how they truly work internally.
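The exploit is easy to reproduce yourself. Here is a minimal sketch in Python 3 syntax (bytes/str), where the same trap is visible whenever you run checks on raw bytes before decoding:

```python
# The UTF-7 byte sequence below hides two slashes: "+AC8-"
# decodes to U+002F ("/"), so a naive check on the raw bytes
# finds no b"/" and lets the filename through.
raw = b"+AC8-etc+AC8-passwd"

assert b"/" not in raw                 # the naive check passes...
decoded = raw.decode("utf-7")
assert decoded == "/etc/passwd"        # ...but the decoded name is a path
```

The lesson is the same as in the article: decode at the boundary, validate the decoded Unicode, never the raw bytes.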

After this lengthy Unicode vs. String intro, the best news first: if you can afford the luxury of programming with any Python version and are not dependent on external libraries, Python 3.0 is just made for you: the new native String object is always a Unicode representation, and the default encoding chosen for representing your strings is UTF-8. In other words, if you use Python 3.0 and are happy with UTF-8, you no longer have to worry about decoding your (byte) strings to Unicode arrays or binding your Unicode code points to the right (byte) string representations. While this might seem like something that should have been done long ago, for historic reasons older programming languages (plus Python pre-3.0) use ASCII as the default encoding, meaning you had to look after de-/encoding the whole time when working with your programs' input/output functionality in most languages other than English; and even there you might want special characters (don't be so naïve...). The sad side to this: what I am talking about here has long been standard in Java...
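A quick sketch of what this buys you in a Python 3 shell: string literals are Unicode out of the box, and UTF-8 is the default codec for both encode and decode:

```python
s = "naïve"                           # a Unicode str, no u"" prefix needed
b = s.encode()                        # encodes with UTF-8 by default
assert b == b"na\xc3\xafve"           # the ï is two bytes in UTF-8
assert b.decode() == s                # decoding with UTF-8 round-trips
assert len(s) == 5 and len(b) == 6    # code points vs. bytes
```

Note how the lengths differ: str counts code points, bytes counts raw bytes, which is exactly the distinction blurred in pre-3.0.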

However, you no longer need to worry with 3.0. First, the old String object (str) has been removed (to be exact, it could be said it is now "integrated" into the bytes and bytearray objects), including the even more ridiculous "encode" method of the old str: bytes and bytearray only support a "decode" message (yielding the new Unicode str objects), while the intended use of str.encode in pre-3.0, transforming byte sequences that happened to be represented as str objects (think zip or base64), now goes through the "translate" method on the new bytes and bytearray objects in 3.0, or via encode on the new str object. Having str.encode in pre-3.0 was a dangerous duck-typing strategy: Unicode objects can and should have this method, too, but you could not tell whether you were calling encode on a Unicode object or a String object (without writing something like:

assert isinstance(my_obj, unicode)

before every call to encode, at least), so you could have been decoding Unicode and encoding Strings; and because Python was (yes, was! see below) so "nice" as to do auto-coercion for you, such a bug could go unnoticed for a long time in pre-3.0 without very thorough test suites. So, my praise to whomever was responsible for that decision!

On the other hand, the unicode object is now the new str object, minus the even more useless and dangerous "decode" functionality: the new (Unicode) str object only supports str.encode (for cases where you want something other than UTF-8), while str.decode is finally dropped from the Python Standard Library. Obviously, you might target a system that does not want UTF-8, and encoding your Unicode str to whatever scheme you need with str.encode all the time would be a pain. For your source files, Python uses the "coding" declaration in the first lines of your program to tell the interpreter how to decode the file, and thus all your new, shiny Unicode str literals in it. I.e., writing:

# -*- coding: funny-arab-dialect -*-

will be enough if your source is written in some encoding sporting glyphs outside the default (UTF-8), or you might want to set it back to ASCII (the default in pre-3.0) if you really need to ensure nothing but good, old "7-bit" appears in your source. On a side note: UTF-8 is backwards compatible with ASCII, while UTF-16 is not; i.e., an ASCII string encoded using the UTF-8 codetable yields the exact same bytes, while trying this with UTF-16 does not. That is also a good explanation of why we have still not moved to UTF-16 in general.
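Encoding to something other than UTF-8 is a one-liner on the new str, and the ASCII-compatibility claim is easy to check yourself (a minimal sketch):

```python
s = "héllo"
assert s.encode("latin-1") == b"h\xe9llo"     # one byte per character
assert s.encode("utf-8") == b"h\xc3\xa9llo"   # the é takes two bytes

# Pure ASCII text is byte-identical under UTF-8, but not under UTF-16
assert "ABC".encode("utf-8") == b"ABC"
assert "ABC".encode("utf-16") != b"ABC"       # BOM plus two bytes per char
```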

Finally, the really dangerous auto-coercion between Strings, Unicode representations, and byte arrays is gone for good. Your message's argument types must now match the receiving object's type, and comparisons between the different types always evaluate to False. This last change might sound drastic from a purely rapid-prototyping view, but everybody intent on not going crazy while programming will greatly appreciate it. The bugs and exploits stemming from wrong (en/de-)coding, or, let's say, too much duck typing of the str and unicode objects in pre-3.0 Python (yeah, I love to put the fault on somebody else...) are finally gone! Also, as all Strings are now represented as Unicode str objects, you no longer need to worry whether two str objects you are comparing use the same encoding, which was another fountain of bugs in pre-3.0 Python, as any String is internally managed as universal Unicode.
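Both halves of that claim are easy to verify in a Python 3 shell (a quick sketch):

```python
# bytes and str never compare equal, and never mix silently
assert b"x" != "x"
assert ("x" == b"x") is False

try:
    "a" + b"b"                        # mixing types raises instead of coercing
except TypeError:
    pass                              # exactly what we want: fail loudly
else:
    raise AssertionError("expected a TypeError")
```

Compare this with pre-3.0, where 'a' + u'b' silently decoded the str via ASCII and only blew up at runtime once a non-ASCII byte appeared.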

What is left to say? These changes are dramatic (even if they should have been made long ago, with 2.0), and it will take a while until Python 3.0 has replaced 2.7 (the final, upcoming stable 2.x release, which will warn you about code that will break with 3.0). But the message should be clear: the effort of converting your libraries to the next generation of Python is more than worth it, and the 2to3 converter should help if you got your encoding/decoding right. If not, converting to 3.0 might help you uncover some nasty bugs you were not even aware of! Other reasons to "convert" would be:

  • no more longs, as ints are now unlimited in size (think of what happened when you hit maxint before...),
  • generators/views returned from most operations that formerly returned lists (think: time spent creating and garbage-collecting those temporary lists),
  • function annotations (useful for advanced decorators and metaprogramming),
  • nonlocal scope declarations (similar to Lisp's lexical scope),
  • dictionary comprehensions ("{k: v for k, v in my_dict.items()}") and set literals ("my_set = {1, 2}"),
  • and tons of streamlining the syntax and Standard Library.
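A few of those bullet points in working 3.x code (a quick sketch with made-up names):

```python
# dict comprehension and set literal (new 3.0 syntax)
my_dict = {"a": 1, "b": 2}
doubled = {k: v * 2 for k, v in my_dict.items()}
assert doubled == {"a": 2, "b": 4}
assert {1, 2, 2} == {1, 2}            # set literal; duplicates collapse

# nonlocal: rebinding a name in the enclosing function's scope
def counter():
    n = 0
    def bump():
        nonlocal n                    # without this, n would be a new local
        n += 1
        return n
    return bump

c = counter()
assert (c(), c(), c()) == (1, 2, 3)
```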