Chapter 4. Text versus Bytes
Humans use text. Computers speak bytes.1
Esther Nam and Travis Fischer, Character Encoding and Unicode in Python
Python 3 introduced a sharp distinction between strings of human text and sequences of raw bytes. Implicit conversion of byte sequences to Unicode text is a thing of the past. This chapter deals with Unicode strings, binary sequences, and the encodings used to convert between them.
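As a quick illustration of that distinction, the sketch below encodes a str to bytes, decodes it back, and shows that Python 3 refuses to mix the two types implicitly. The sample word and the choice of UTF-8 are assumptions made for the example, not something the chapter prescribes.

```python
# A str holds Unicode code points; bytes holds raw 8-bit values.
text = "caf\u00e9"             # 'café' with a precomposed é (4 code points)
data = text.encode("utf-8")    # str -> bytes, using an explicit encoding

print(type(text).__name__, len(text))   # str 4
print(type(data).__name__, len(data))   # bytes 5 (é takes two bytes in UTF-8)

roundtrip = data.decode("utf-8")        # bytes -> str, again explicit

# Python 3 never converts between the two types implicitly.
mixing_error = None
try:
    text + data
except TypeError as exc:
    mixing_error = exc
    print("cannot mix str and bytes:", exc)
```

The character count and the byte count differ because the encoding decides how many bytes each code point needs.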
Depending on your Python programming context, a deeper understanding of Unicode may or may not be of vital importance to you. In the end, most of the issues covered in this chapter do not affect programmers who deal only with ASCII text. But even if that is your case, there is no escaping the str versus byte divide. As a bonus, you’ll find that the specialized binary sequence types provide features that the “all-purpose” Python 2 str type does not have.
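To give a rough idea of what such features look like, here is a minimal sketch using only built-in types. The byte values are arbitrary sample data chosen for this example.

```python
# bytes: immutable, and can be built directly from a hex string.
raw = bytes.fromhex("31 4b ce a9")
print(raw)                     # b'1K\xce\xa9'

# bytearray: a mutable sequence of bytes that can be changed in place.
buffer = bytearray(raw)
buffer[0] = 0x32               # replace the first byte
print(buffer)                  # bytearray(b'2K\xce\xa9')

# memoryview: a window onto the same memory, with no copy made.
view = memoryview(buffer)
tail = view[2:]                # slicing a memoryview also avoids copying
buffer[2] = 0xCF               # a change to the buffer ...
print(tail.tobytes())          # ... is visible through the view: b'\xcf\xa9'
```

The last step is the point of memoryview: the slice shares memory with the bytearray, so it reflects the update instead of holding a stale copy.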
In this chapter, we will visit the following topics:
- Characters, code points, and byte representations
- Unique features of binary sequences: bytes, bytearray, and memoryview
- Codecs for full Unicode and legacy character sets
- Avoiding and dealing with encoding errors
- Best practices when handling text files
- The default encoding trap and standard I/O issues
- Safe Unicode text comparisons with normalization
- Utility functions for normalization, case folding, and brute-force diacritic removal
- Proper sorting of Unicode text with locale and the PyUCA library
- Character metadata in the Unicode database ...