TypeError: 'str' does not support the buffer interface suggests two possible methods to convert a string to bytes:
b = bytes(mystring, 'utf-8')
b = mystring.encode('utf-8')
Which method is more Pythonic?
bytes(item, "utf8"), as explicit is better than implicit, so... str.encode() defaults silently to bytes, making you more Unicode-zen but less Explicit-Zen. Also, "common" is not a term that I like to follow. Also, bytes(item, "utf8") is more like the str() and b"string" notations. My apologies if I am too much of a noob to understand your reasons. Thank you.
encode() doesn't call bytes(), it's the other way around. Of course that's not immediately obvious, which is why I asked the question.
If you look at the docs for bytes, it points you to bytearray:
bytearray([source[, encoding[, errors]]])
Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.
The optional source parameter can be used to initialize the array in a few different ways:
If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().
If it is an integer, the array will have that size and will be initialized with null bytes.
If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.
If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.
Without an argument, an array of size 0 is created.
So bytes can do much more than just encode a string. It's Pythonic that it would allow you to call the constructor with any type of source parameter that makes sense.
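As a rough illustration of the source types listed in the quoted documentation, the sketch below calls the bytes constructor once per form. The variable names are made up for this example.

```python
# One call per documented form of the bytes() constructor.
from_text = bytes("héllo", "utf-8")      # str source: an encoding is required
from_size = bytes(3)                     # int source: that many null bytes
from_ints = bytes([104, 105])            # iterable of ints in range(256)
from_buffer = bytes(bytearray(b"abc"))   # buffer-protocol object: contents copied
empty = bytes()                          # no argument: empty bytes object

for value in (from_text, from_size, from_ints, from_buffer, empty):
    print(value)

# A str without an encoding is rejected rather than silently encoded.
try:
    bytes("héllo")
    rejected = False
except TypeError:
    rejected = True
print("str without encoding rejected:", rejected)
```

Only the first form has anything to do with text; the others build bytes from sizes, integers, or existing buffers.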
For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self documenting -- "take this string and encode it with this encoding" is clearer than bytes(some_string, encoding) -- there is no explicit verb when you use the constructor.
I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you're just skipping a level of indirection if you call encode yourself.
Also, see Serdalis' comment -- unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding), and symmetry is nice.
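A minimal sketch of the points above: both spellings produce the same bytes, and decode() is the inverse of encode(). The sample text and variable names are only for illustration.

```python
text = "naïve café"
encoding = "utf-8"

via_method = text.encode(encoding)        # "take this string and encode it"
via_constructor = bytes(text, encoding)   # same result through the constructor

# Both routes yield identical bytes.
same_bytes = via_method == via_constructor

# decode() with the same encoding restores the original string.
round_tripped = via_method.decode(encoding)

print(same_bytes, round_tripped == text)
```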
It's easier than you might think:
my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
print(type(my_str_as_bytes)) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
print(type(my_decoded_str)) # ensure it is string representation
You can verify by printing the types. Refer to the output below.
<class 'bytes'>
<class 'str'>
obj.method() syntax instead of cls.method(obj) syntax, i.e., use bytestring = unicode_text.encode(encoding) and unicode_text = bytestring.decode(encoding).
self as the first argument
encode as a bound method on the string. This answer suggests that you should instead call the unbound method and pass it the string. That's the only new information in the answer, and it's wrong.
The absolute best way is neither of the two, but a third. The first parameter to encode defaults to 'utf-8' ever since Python 3.0. Thus the best way is
b = mystring.encode()
This will also be faster, because the default argument results not in the string "utf-8" in the C code, but NULL, which is much faster to check!
Here be some timings:
In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop
In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop
Despite the warning the times were very stable after repeated runs - the deviation was just ~2 per cent.
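For anyone without IPython, a comparable measurement can be sketched with the standard timeit module. The loop count below is an arbitrary choice, and absolute numbers will differ by machine and Python version; only the equality of the two results is guaranteed.

```python
import timeit

sample = "abc"

# In Python 3 the default encoding of str.encode() is UTF-8,
# so both calls return identical bytes.
same_result = sample.encode() == sample.encode("utf-8")

n = 200_000  # arbitrary loop count for this sketch
t_default = timeit.timeit("s.encode()", globals={"s": sample}, number=n)
t_explicit = timeit.timeit("s.encode('utf-8')", globals={"s": sample}, number=n)

print(f"same result: {same_result}")
print(f"encode():        {t_default / n * 1e9:.0f} ns per call")
print(f"encode('utf-8'): {t_explicit / n * 1e9:.0f} ns per call")
```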
Using encode() without an argument is not Python 2 compatible, as in Python 2 the default character encoding is ASCII.
>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
'\u00012345'*10000. Both take 28.8us on my laptop; the extra 50ns is presumably lost in the rounding error. Of course this is a pretty extreme example, but 'abc' is just as extreme in the opposite direction.
'utf-8' parameter is to be preferred. But you've definitely shown that leaving off the parameter is faster. That makes this a good answer, even if it isn't the best one.
int(s, 10) ;-)
Answer for a slightly different problem:
You have a sequence of raw unicode that was saved into a str variable:
s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
You need to be able to get the byte literal of that unicode (for struct.unpack(), etc.)
s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
Solution:
s_new: bytes = bytes(s_str, encoding="raw_unicode_escape")
Reference (scroll up for standard encodings):
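To see why the choice of codec matters here, the sketch below compares raw_unicode_escape with UTF-8 on the same string and then feeds the result to struct.unpack(). The field layout ">HHBI" is purely hypothetical, chosen only because it adds up to nine bytes.

```python
import struct

s_str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

raw = bytes(s_str, encoding="raw_unicode_escape")
latin = s_str.encode("latin-1")
utf8 = s_str.encode("utf-8")

# For code points up to U+00FF, raw_unicode_escape and latin-1 both map
# each character to the single byte with the same value.
one_to_one = raw == latin == b"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# UTF-8 turns "\xc0" into two bytes, so the byte count no longer matches.
lengths = (len(raw), len(utf8))

# Hypothetical layout: big-endian uint16, uint16, uint8, uint32 (9 bytes).
fields = struct.unpack(">HHBI", raw)

print(one_to_one, lengths, fields)
```

One difference worth knowing: for characters above U+00FF, raw_unicode_escape writes a \uXXXX escape sequence into the output, while latin-1 raises UnicodeEncodeError. If the string is supposed to hold only byte values, latin-1 makes a stray character fail loudly.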
How about the Python 3 'memoryview' way?
memoryview is a sort of mishmash of the bytes/bytearray types and the struct module, with several benefits:
Not limited to just text and bytes, handles 16 and 32 bit words too
Copes with endianness
Provides a very low overhead interface to linked C/C++ functions and data
Simplest example, for a byte array:
memoryview(b"some bytes").tolist()
[115, 111, 109, 101, 32, 98, 121, 116, 101, 115]
Or for a unicode string (which is converted to a byte array):
memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).tolist()
[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]
#Another way to do the same
memoryview("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020".encode("UTF-16")).tolist()
[255, 254, 117, 0, 110, 0, 105, 0, 99, 0, 111, 0, 100, 0, 101, 0, 32, 0]
Perhaps you need words rather than bytes?
memoryview(bytes("\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020", "UTF-16")).cast("H").tolist()
[65279, 117, 110, 105, 99, 111, 100, 101, 32]
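# cast() needs a length divisible by the item size; the output below implies a 16-byte input and a 4-byte "L"
memoryview(b"some  more  data").cast("L").tolist()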
[1701670771, 1869422624, 538994034, 1635017060]
Word of caution. Be careful of multiple interpretations of byte order with data of more than one byte:
txt = "\u0075\u006e\u0069\u0063\u006f\u0064\u0065\u0020"
for order in ("", "BE", "LE"):
    mv = memoryview(bytes(txt, f"UTF-16{order}"))
    print(mv.cast("H").tolist())
[65279, 117, 110, 105, 99, 111, 100, 101, 32]
[29952, 28160, 26880, 25344, 28416, 25600, 25856, 8192]
[117, 110, 105, 99, 111, 100, 101, 32]
Not sure if that's intentional or a bug but it caught me out!!
The example used UTF-16. For a full list of codecs, see Codec registry in Python 3.10.
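The differing results in the byte-order example are consistent with memoryview.cast("H") reading 16-bit words in the machine's native byte order, while the plain "UTF-16" codec prepends a byte order mark. As a hedged sketch, one way to avoid the ambiguity is to name the byte order explicitly with struct; the variable names here are illustrative only.

```python
import struct
import sys

txt = "unicode "
expected = [ord(ch) for ch in txt]  # every character fits in one 16-bit unit

data_be = txt.encode("utf-16-be")   # big-endian, no byte order mark
data_le = txt.encode("utf-16-le")   # little-endian, no byte order mark

# struct format prefixes ">" and "<" state the byte order explicitly.
count = len(txt)
words_be = list(struct.unpack(f">{count}H", data_be))
words_le = list(struct.unpack(f"<{count}H", data_le))

# memoryview.cast("H") uses native order, so it only matches when the
# data was encoded in the machine's own byte order.
native_data = data_le if sys.byteorder == "little" else data_be
words_native = memoryview(native_data).cast("H").tolist()

print(words_be == expected, words_le == expected, words_native == expected)
```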
unicode_string.encode(encoding) matches nicely with bytearray.decode(encoding) when you want your string back.
bytearray is used when you need a mutable object. You don't need it for simple str ↔ bytes conversions.
bytearray except that the docs for bytes don't give details, they just say "this is an immutable version of bytearray" so I have to quote from there.
byte_string.decode('latin-1') as utf-8 doesn't cover the entire range 0x00 to 0xFF (0-255), check out the python docs for more info.
tl;dr would be helpful