Here are my attempts with error messages. What am I doing wrong?
string.decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 37: ordinal not in range(128)
string.encode('utf-8', "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 37: ordinal not in range(128)
string
? What type is it?
Guessing at all the things omitted from the original question, but, assuming Python 2.x the key is to read the error messages carefully: in particular where you call 'encode' but the message says 'decode' and vice versa, but also the types of the values included in the messages.
In the first example string
is of type unicode
and you attempted to decode it which is an operation converting a byte string to unicode. Python helpfully attempted to convert the unicode value to str
using the default 'ascii' encoding but since your string contained a non-ascii character you got the error which says that Python was unable to encode a unicode value. Here's an example which shows the type of the input string:
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
In the second case you do the reverse attempting to encode a byte string. Encoding is an operation that converts unicode to a byte string so Python helpfully attempts to convert your byte string to unicode first and, since you didn't give it an ascii string the default ascii decoder fails:
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Aside from getting decode
and encode
backwards, I think part of the answer here is actually don't use the ascii
encoding. It's probably not what you want.
To begin with, think of str
like you would a plain text file. It's just a bunch of bytes with no encoding actually attached to it. How it's interpreted is up to whatever piece of code is reading it. If you don't know what this paragraph is talking about, go read Joel's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets right now before you go any further.
Naturally, we're all aware of the mess that created. The answer is to, at least within memory, have a standard encoding for all strings. That's where unicode
comes in. I'm having trouble tracking down exactly what encoding Python uses internally for sure, but it doesn't really matter just for this. The point is that you know it's a sequence of bytes that are interpreted a certain way. So you only need to think about the characters themselves, and not the bytes.
The problem is that in practice, you run into both. Some libraries give you a str
, and some expect a str
. Certainly that makes sense whenever you're streaming a series of bytes (such as to or from disk or over a web request). So you need to be able to translate back and forth.
Enter codecs
: it's the translation library between these two data types. You use encode
to generate a sequence of bytes (str
) from a text string (unicode
), and you use decode
to get a text string (unicode
) from a sequence of bytes (str
).
For example:
>>> s = "I look like a string, but I'm actually a sequence of bytes. \xe2\x9d\xa4"
>>> codecs.decode(s, 'utf-8')
u"I look like a string, but I'm actually a sequence of bytes. \u2764"
What happened here? I gave Python a sequence of bytes, and then I told it, "Give me the unicode
version of this, given that this sequence of bytes is in 'utf-8'
." It did as I asked, and those bytes (a heart character) are now treated as a whole, represented by their Unicode codepoint.
Let's go the other way around:
>>> u = u"I'm a string! Really! \u2764"
>>> codecs.encode(u, 'utf-8')
"I'm a string! Really! \xe2\x9d\xa4"
I gave Python a Unicode string, and I asked it to translate the string into a sequence of bytes using the 'utf-8'
encoding. So it did, and now the heart is just a bunch of bytes it can't print as ASCII; so it shows me the hexadecimal instead.
We can work with other encodings, too, of course:
>>> s = "I have a section \xa7"
>>> codecs.decode(s, 'latin1')
u'I have a section \xa7'
>>> codecs.decode(s, 'latin1')[-1] == u'\u00A7'
True
>>> u = u"I have a section \u00a7"
>>> u
u'I have a section \xa7'
>>> codecs.encode(u, 'latin1')
'I have a section \xa7'
('\xa7'
is the section character, in both Unicode and Latin-1.)
So for your question, you first need to figure out what encoding your str
is in.
Did it come from a file? From a web request? From your database? Then the source determines the encoding. Find out the encoding of the source and use that to translate it into a unicode. s = [get from external source] u = codecs.decode(s, 'utf-8') # Replace utf-8 with the actual input encoding
Or maybe you're trying to write it out somewhere. What encoding does the destination expect? Use that to translate it into a str. UTF-8 is a good choice for plain text documents; most things can read it. u = u'My string' s = codecs.encode(u, 'utf-8') # Replace utf-8 with the actual output encoding [Write s out somewhere]
Are you just translating back and forth in memory for interoperability or something? Then just pick an encoding and stick with it; 'utf-8' is probably the best choice for that: u = u'My string' s = codecs.encode(u, 'utf-8') newu = codecs.decode(s, 'utf-8')
In modern programming, you probably never want to use the 'ascii'
encoding for any of this. It's an extremely small subset of all possible characters, and no system I know of uses it by default or anything.
Python 3 does its best to make this immensely clearer simply by changing the names. In Python 3, str
was replaced with bytes
, and unicode
was replaced with str
.
x.decode(some_encoding).encode(some_encoding) == x
be True for any byte string x and any encoding for which the decoding does not fail (with UnicodeDecodeError)? (using the py3 notation not codecs, type(x) == bytes
That's because your input string can’t be converted according to the encoding rules (strict by default).
I don't know, but I always encoded using directly unicode() constructor, at least that's the ways at the official documentation:
unicode(your_str, errors="ignore")
unicode("\xe2\x9d\xa4", errors='ignore')
gives u''
.) If that's an acceptable outcome, then this might be okay. I can't imagine that losing data is okay in the majority of situations, though. At the very least, this answer needs to expound on the appropriateness of doing that.
str
is really a byte sequence (in Python 2.x). You decode a UTF-8 byte sequence into a Unicode string, and you encode a Unicode string into a UTF-8 byte sequence. In other words, what's wrong is the namestr
(instead of the encode/decode order), which they fixed in 3.x.