I captured the standard output of an external program into a bytes
object:
>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>>
>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n'
I want to convert that to a normal Python string, so that I can print it like this:
>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2
I tried the binascii.b2a_qp()
method, but got the same bytes
object again:
>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n'
How do I convert the bytes
object to a str
with Python 3?
str(text_bytes)
work? This seems bizarre to me.
str(text_bytes)
can't specify the encoding. Depending on what's in text_bytes, text_bytes.decode('cp1250') might result in a very different string from text_bytes.decode('utf-8').
str
function does not convert to a real string anymore. One HAS to specify an encoding explicitly, for some reason I am too lazy to read up on. Just decode it as utf-8
and see if your code works, e.g. var = var.decode('utf-8')
unicode_text = str(bytestring, character_encoding)
works as expected on Python 3. However, unicode_text = bytestring.decode(character_encoding)
is more preferable to avoid confusion with just str(bytes_obj)
that produces a text representation for bytes_obj
instead of decoding it to text: str(b'\xb6', 'cp1252') == b'\xb6'.decode('cp1252') == '¶'
and str(b'\xb6') == "b'\\xb6'" == repr(b'\xb6') != '¶'
Decode the bytes
object to produce a string:
>>> b"abcde".decode("utf-8")
'abcde'
The above example assumes that the bytes
object is in UTF-8, because it is a common encoding. However, you should use the encoding your data is actually in!
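To illustrate why the encoding matters, here is a minimal sketch that decodes the same bytes with two different encodings. The sample text and the variable names are made up for this example.

```python
# The same byte sequence, interpreted with two different encodings.
data = "café".encode("utf-8")        # b'caf\xc3\xa9'

as_utf8 = data.decode("utf-8")       # the encoding the data is really in: 'café'
as_latin1 = data.decode("latin-1")   # wrong guess: no error, but mojibake: 'cafÃ©'

print(as_utf8 == "café")
print(as_utf8 == as_latin1)
```

The wrong encoding does not necessarily raise an error; it can silently produce different text.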
Decode the byte string and turn it into a character (Unicode) string.
Python 3:
encoding = 'utf-8'
b'hello'.decode(encoding)
or
str(b'hello', encoding)
Python 2:
encoding = 'utf-8'
'hello'.decode(encoding)
or
unicode('hello', encoding)
variable = b'hello'
, then unicode_text = variable.decode(character_encoding)
variable = variable.decode()
automagically got it into a string format I wanted.
encoding
arg if you do not supply it. See bytes.decode
This joins together a list of bytes into a string:
>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'
a.decode('latin-1')
where a = bytearray([112, 52, 52])
("There Ain't No Such Thing as Plain Text". If you've managed to convert bytes into a text string then you used some encoding—latin-1
in this case)
bytes([112, 52, 52])
- btw bytes is a bad name for a local variable exactly because it's a p3 builtin
If you don't know the encoding, then to read binary input into string in Python 3 and Python 2 compatible way, use the ancient MS-DOS CP437 encoding:
import sys

PY3K = sys.version_info >= (3, 0)
lines = []
for line in stream:  # stream: an iterable of byte strings
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))
Because encoding is unknown, expect non-English symbols to translate to characters of cp437
(English characters are not translated, because they match in most single byte encodings and UTF-8).
Decoding arbitrary binary input to UTF-8 is unsafe, because you may get this:
>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid
start byte
The same applies to latin-1
, which was popular (the default?) for Python 2. See the missing points in Codepage Layout - it is where Python chokes with infamous ordinal not in range
.
UPDATE 20150604: There are rumors that Python 3 has the surrogateescape
error strategy for encoding stuff into binary data without data loss and crashes, but it needs conversion tests, [binary] -> [str] -> [binary]
, to validate both performance and reliability.
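A small round-trip check along the lines of [binary] -> [str] -> [binary] could look like the sketch below. It only demonstrates that the bytes survive for one sample input (the variable names are illustrative); it says nothing about performance.

```python
# Round trip: bytes -> str -> bytes with the surrogateescape error handler.
raw = b"\x00\x01\xffsd"   # 0xff is not valid UTF-8

text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

print(repr(text))          # the undecodable byte becomes a lone surrogate
print(restored == raw)
```

Note that the resulting string contains lone surrogates, so encoding it with the default strict handler raises an error.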
UPDATE 20170116: Thanks to comment by Nearoo - there is also a possibility to slash escape all unknown bytes with backslashreplace
error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions:
import sys

PY3K = sys.version_info >= (3, 0)
lines = []
for line in stream:  # stream: an iterable of byte strings
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))
See Python’s Unicode Support for details.
UPDATE 20170119: I decided to implement slash escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437
solution, but it should produce identical results on every Python version.
# --- preparation
import codecs

def slashescape(err):
    """Codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and a position where decoding should continue."""
    # print(err, dir(err), err.start, err.end, err.object[:err.start])
    thebytes = err.object[err.start:err.end]
    # The reported range may cover more than one byte, so escape each of them
    repl = u''.join(u'\\x%02x' % b for b in bytearray(thebytes))
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing
stream = [b'\x80abc']
lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))
b'\x00\x01\xffsd'.decode('utf-8', 'ignore')
in python 3.
b'\x80abc'.decode("utf-8", "backslashreplace")
will result in '\\x80abc'
. This information was taken from the unicode documentation page which seems to have been updated since the writing of this answer.
In Python 3, the default encoding is "utf-8"
, so you can directly use:
b'hello'.decode()
which is equivalent to
b'hello'.decode(encoding="utf-8")
On the other hand, in Python 2, encoding defaults to the default string encoding. Thus, you should use:
b'hello'.decode(encoding)
where encoding
is the encoding you want.
Note: support for keyword arguments was added in Python 2.7.
I think you actually want this:
>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')
Aaron's answer was correct, except that you need to know which encoding to use. And I believe that Windows uses 'windows-1252'. It will only matter if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.
By the way, the fact that it does matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation (or read it here).
open()
function for text streams or Popen()
if you pass it universal_newlines=True
do magically decide character encoding for you (locale.getpreferredencoding(False)
in Python 3.3+).
'latin-1'
is a verbatim encoding with all code points set, so you can use that to effectively read a byte string into whichever type of string your Python supports (so verbatim on Python 2, into Unicode for Python 3).
'latin-1'
is a good way to get mojibake. Also, there are magical substitutions on Windows: it is surprisingly hard to pipe data from one process to another unmodified, e.g., dir
: \xb6
-> \x14
(the example at the end of my answer)
Since this question is actually asking about subprocess
output, you have more direct approaches available. The most modern would be using subprocess.check_output
and passing text=True
(Python 3.7+) to automatically decode stdout using the system default encoding:
text = subprocess.check_output(["ls", "-l"], text=True)
For Python 3.6, Popen
accepts an encoding keyword:
>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt
The general answer to the question in the title, if you're not dealing with subprocess output, is to decode bytes to text:
>>> b'abcde'.decode()
'abcde'
With no argument, sys.getdefaultencoding()
will be used. If your data is not encoded with sys.getdefaultencoding()
, then you must specify the encoding explicitly in the decode
call:
>>> b'caf\xe9'.decode('cp1250')
'café'
encoding
parameter is given, then the text
parameter is ignored.
subprocess
. Maybe still emphasize how Popen
is almost always the wrong tool if you just want to wait for the subprocess and get its result; like the documentation says, use subprocess.run
or one of the legacy functions check_call
or check_output
.
Set universal_newlines to True, i.e.
command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
To interpret a byte sequence as a text, you have to know the corresponding character encoding:
unicode_text = bytestring.decode(character_encoding)
Example:
>>> b'\xc2\xb5'.decode('utf-8')
'µ'
ls
command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/'
and zero b'\0'
:
>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()
Trying to decode such byte soup using utf-8 encoding raises UnicodeDecodeError
.
It can be worse. The decoding may fail silently and produce mojibake if you use a wrong incompatible encoding:
>>> '—'.encode('utf-8').decode('cp1252')
'â€”'
The data is corrupted but your program remains unaware that a failure has occurred.
In general, what character encoding to use is not embedded in the byte sequence itself. You have to communicate this info out-of-band. Some outcomes are more likely than others and therefore chardet
module exists that can guess the character encoding. A single Python script may use multiple character encodings in different places.
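If a third-party detector is not available, one rough stand-in is to try a short list of candidate encodings in order. The sketch below is not what chardet does; the function name decode_with_fallback and the candidate list are assumptions chosen for illustration.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises,
    # but the result may be mojibake.
    return data.decode("latin-1"), "latin-1"


print(decode_with_fallback(b"plain ascii"))
```

A clean decode only proves that the bytes are valid in that encoding, not that the guess is right, so this is a last resort when the encoding really is unknown.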
ls
output can be converted to a Python string using os.fsdecode()
function that succeeds even for undecodable filenames (it uses sys.getfilesystemencoding()
and surrogateescape
error handler on Unix):
import os
import subprocess
output = os.fsdecode(subprocess.check_output('ls'))
To get the original bytes, you could use os.fsencode()
.
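As a quick sanity check of that round trip, the sketch below assumes a Unix system, where the surrogateescape handler is used. The file name is invented and no file is actually created.

```python
import os

# A file name containing a byte (0xff) that is not valid UTF-8.
raw_name = b"report-\xff.txt"

text_name = os.fsdecode(raw_name)     # str; the undecodable byte is escaped
round_trip = os.fsencode(text_name)   # back to the original bytes

print(round_trip == raw_name)
```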
If you pass universal_newlines=True
parameter then subprocess
uses locale.getpreferredencoding(False)
to decode bytes e.g., it can be cp1252
on Windows.
To decode the byte stream on-the-fly, io.TextIOWrapper()
could be used: example.
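A minimal sketch of the io.TextIOWrapper() approach follows. A Python one-liner stands in for the external command, and the choice of utf-8 with errors="replace" is an assumption for this example, not something the child process dictates.

```python
import io
import subprocess
import sys

# Child process: a Python one-liner standing in for any external command.
cmd = [sys.executable, "-c", "print('line one'); print('line two')"]

with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
    # Wrap the binary pipe so each line is decoded as it arrives.
    reader = io.TextIOWrapper(proc.stdout, encoding="utf-8", errors="replace")
    lines = [line.rstrip("\n") for line in reader]

print(lines)
```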
Different commands may use different character encodings for their output e.g., dir
internal command (cmd
) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):
output = subprocess.check_output('dir', shell=True, encoding='cp437')
The filenames may differ from os.listdir()
(which uses Windows Unicode API) e.g., '\xb6'
can be substituted with '\x14'
—Python's cp437 codec maps b'\x14'
to control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string
While @Aaron Maenpaa's answer just works, a user recently asked:
Is there any more simply way? 'fhand.read().decode("ASCII")' [...] It's so long!
You can use:
command_stdout.decode()
decode()
has a standard argument:
codecs.decode(obj, encoding='utf-8', errors='strict')
.decode()
that uses 'utf-8'
may fail (command's output may use a different character encoding or even return an undecodable byte sequence). Though if the input is ascii (a subset of utf-8) then .decode()
works.
If you should get the following by trying decode()
:
AttributeError: 'str' object has no attribute 'decode'
You can also specify the encoding type straight in a cast:
>>> my_byte_str
b'Hello World'
>>> str(my_byte_str, 'utf-8')
'Hello World'
If you have had this error:
utf-8 codec can't decode byte 0x8a
,
then it is better to use the following code to convert bytes to a string:
data = b"abcdefg"  # avoid naming a variable "bytes": it shadows the builtin
string = data.decode("utf-8", "ignore")
I made a function to clean a list
def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]
    return lista
.strip
, .replace
, .encode
, etc calls in one list comprehension and only iterate over the list once instead of iterating over it five times.
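The single-pass version that this comment suggests might look like the sketch below. It is written as a plain function rather than a method, and the name clean_list is made up for the example.

```python
def clean_list(items):
    """Apply all the clean-up steps to each string in one pass over the list."""
    return [
        x.strip().replace("\n", "").replace("\b", "").encode("utf8").decode("utf8")
        for x in items
    ]


print(clean_list([" a\n", "b\bc "]))
```

For ordinary strings the final encode/decode pair returns the text unchanged, so it could arguably be dropped; it is kept here only to mirror the original steps.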
When working with data from Windows systems (with \r\n
line endings), my answer is
String = Bytes.decode("utf-8").replace("\r\n", "\n")
Why? Try this with a multiline Input.txt:
Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)
All your line endings will be doubled (to \r\r\n
), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n
. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,
Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)
will replicate your original file.
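An alternative to replacing the line endings by hand is to switch off newline translation when writing, by passing newline="" to open(). This is a different technique from the one above. The sketch writes to a temporary directory, and the sample bytes are made up.

```python
import os
import tempfile

original = b"first line\r\nsecond line\r\n"
text = original.decode("utf-8")   # the string still contains "\r\n"

path = os.path.join(tempfile.mkdtemp(), "Output.txt")

# newline="" turns off newline translation when writing text.
with open(path, "w", encoding="utf-8", newline="") as out:
    out.write(text)

with open(path, "rb") as check:
    written = check.read()

print(written == original)
```

With this approach the string keeps its \r\n sequences, so it suits copying data through unchanged rather than normalizing it for further text processing.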
.replace("\r\n", "\n")
addition so long. This is the answer if you want to render HTML properly.
For Python 3, this is a much safer and Pythonic approach to convert from byte
to string
:
def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):  # Check if it's in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")
byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n')
Output:
total 0
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2
byte_to_str
" which implies it will return a str, but it only prints the converted value, and it prints an error message if it fails (but doesn't raise an exception). This approach is also unpythonic and obfuscates the bytes.decode
solution you provided.
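A variant that addresses this criticism could return the decoded string and raise an exception on bad input. This is only a sketch; the name bytes_to_str and the choice of TypeError are assumptions for the example.

```python
def bytes_to_str(data, encoding="utf-8"):
    """Return data decoded to str; raise TypeError if it is not bytes-like."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("expected bytes, got %s" % type(data).__name__)
    # A UnicodeDecodeError from a wrong encoding is left to propagate.
    return data.decode(encoding)


print(bytes_to_str(b"total 0\n"), end="")
```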
For your specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7, you should use subprocess.run
and pass in text=True
(as well as capture_output=True
to capture the output)
command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout # is a `str` containing your program's stdout
text
used to be called universal_newlines
, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True
instead of text=True
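For versions before 3.7, note that capture_output was also added in 3.7, so the pipe has to be requested explicitly. A sketch, with a Python one-liner standing in for the real command:

```python
import subprocess
import sys

# A Python one-liner stands in for the external command.
cmd = [sys.executable, "-c", "print('hello from child')"]

# stdout=PIPE replaces capture_output=True; universal_newlines=True replaces text=True.
result = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True)

print(type(result.stdout).__name__, repr(result.stdout))
```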
From sys — System-specific parameters and functions:
To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc')
.
bytes
value.
Decode with .decode(). This converts the bytes to a string. Pass in 'utf-8' as the argument.
Bytes
m=b'This is bytes'
Converting to string
Method 1
m.decode("utf-8")
or
m.decode()
Method 2
import codecs
codecs.decode(m,encoding="utf-8")
or
import codecs
codecs.decode(m)
Method 3
str(m,encoding="utf-8")
or
str(m)[2:-1]  # slices the repr "b'...'"; only reliable for plain printable ASCII
Result
'This is bytes'
def toString(value):
    try:
        return value.decode("utf-8")
    except (AttributeError, ValueError):
        # Not bytes (no .decode) or not valid UTF-8: return it unchanged
        return value
b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))
If you want to convert any bytes, not just string converted to bytes:
import base64
import json
with open("bytesfile", "rb") as infile:
    text = base64.b85encode(infile.read()).decode("ascii")
with open("bytesfile", "rb") as infile:
    text2 = json.dumps(list(infile.read()))
This is not very efficient, however. It will turn a 2 MB picture into 9 MB.
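To get a feel for the overhead, and to confirm that both representations can be turned back into the original bytes, here is a small sketch using 256 sample bytes in place of a real file. The variable names are illustrative.

```python
import base64
import json

raw = bytes(range(256))                            # arbitrary binary data

b85_text = base64.b85encode(raw).decode("ascii")   # str of printable ASCII
json_text = json.dumps(list(raw))                  # str like "[0, 1, 2, ...]"

# Both representations can be converted back to the original bytes.
assert base64.b85decode(b85_text) == raw
assert bytes(json.loads(json_text)) == raw

print(len(raw), len(b85_text), len(json_text))
```

Base85 turns every 4 bytes into 5 characters, while the JSON list spends several characters per byte.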
try this
bytes.fromhex('c3a9').decode('utf-8')
We can decode the bytes object to produce a string using bytes.decode(encoding='utf-8', errors='strict')
See the documentation for bytes.decode.
Python3
example:
byte_value = b"abcde"
print("Initial value = {}".format(byte_value))
print("Initial value type = {}".format(type(byte_value)))
string_value = byte_value.decode("utf-8")
# utf-8 is used here because it is a very common encoding, but you need to use the encoding your data is actually in.
print("------------")
print("Converted value = {}".format(string_value))
print("Converted value type = {}".format(type(string_value)))
Output:
Initial value = b'abcde'
Initial value type = <class 'bytes'>
------------
Converted value = abcde
Converted value type = <class 'str'>
NOTE: In Python 3, the default encoding is utf-8
. So, <byte_string>.decode("utf-8")
can be also written as <byte_string>.decode()
Try this function; it drops every byte that cannot be decoded with the given character set (utf-8
by default) and returns a clean string. It is tested on python3.6
and above.
def bin2str(text, encoding='utf-8'):
    """Convert bytes to a Unicode string, dropping undecodable bytes.
    text: the bytes to decode
    encoding: encoding of the input (default: utf-8)"""
    return text.decode(encoding, 'ignore')
Here, the function takes the binary data and decodes it (converting it to characters using one of Python's predefined character sets). The ignore
argument skips every byte that is not valid in that character set, and the function returns your desired string
value.
If you are not sure about the encoding, use sys.getdefaultencoding()
to get the default encoding of your device.
"windows-1252"
is not reliable either (e.g., for other language versions of Windows), wouldn't it be best to use sys.stdout.encoding
?
b"\x80\x02\x03".decode("utf-8")
-> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
.
utf-8
conversion is likely to fail. Instead see @techtonik answer (below) stackoverflow.com/a/27527728/198536