Convert bytes to a string

python string python-3.x

I captured the standard output of an external program into a bytes object:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>>
>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

I want to convert that to a normal Python string, so that I can print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I tried the binascii.b2a_qp() method, but got the same bytes object again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

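How do I convert the bytes object to a str with Python 3?

Why doesn't str(text_bytes) work? This seems bizarre to me.

@CharlieParker Because str(text_bytes) can't specify the encoding. Depending on what's in text_bytes, text_bytes.decode('cp1250') might result in a very different string from text_bytes.decode('utf-8').

So the str function no longer converts to a real string. One has to state an encoding explicitly, for some reason I am too lazy to read through. Just convert it to utf-8 and see if your code works, e.g. var = var.decode('utf-8')

@CraigAnderson: unicode_text = str(bytestring, character_encoding) works as expected on Python 3. However, unicode_text = bytestring.decode(character_encoding) is preferable, because it avoids confusion with plain str(bytes_obj), which produces a text representation of bytes_obj instead of decoding it to text: str(b'\xb6', 'cp1252') == b'\xb6'.decode('cp1252') == '¶'.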
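To make the difference in the comment above concrete, here is a minimal sketch (the variable names are only illustrative). It shows that str() without an encoding returns the printable representation of the bytes object, while str() with an encoding behaves like bytes.decode():

```python
data = b'\xb6'

# Without an encoding, str() gives the repr-style text of the bytes object.
as_repr = str(data)

# With an encoding, str() decodes, just like bytes.decode().
as_text = str(data, 'cp1252')
also_text = data.decode('cp1252')

print(as_repr)              # b'\xb6'
print(as_text == also_text) # True
```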

Mateen Ulhaq

Decode the bytes object to produce a string:

>>> b"abcde".decode("utf-8") 
'abcde'

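The above example assumes that the bytes object is in UTF-8, because it is a common encoding. However, you should use the encoding your data is actually in!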
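As a small illustration of why the encoding matters, the sketch below decodes the same bytes twice. The sample text and variable names are made up for the example; the wrong guess does not raise, it just produces different text:

```python
raw = 'caf\u00e9'.encode('utf-8')   # b'caf\xc3\xa9'

as_utf8 = raw.decode('utf-8')       # the encoding the data was written in
as_cp1252 = raw.decode('cp1252')    # wrong guess: no error, but mojibake

print(ascii(as_utf8))    # 'caf\xe9'
print(ascii(as_cp1252))  # 'caf\xc3\xa9'
```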

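Using "windows-1252" is not reliable either (e.g., for other language versions of Windows). Wouldn't it be best to use sys.stdout.encoding?

Maybe this will help somebody further: sometimes you use a byte array, for example for TCP communication. If you want to convert the byte array to a string and cut off trailing '\x00' characters, the answer above is not enough. Use b'example\x00\x00'.decode('utf-8').strip('\x00') instead.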
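A short sketch of that NUL-padding case, assuming a fixed-size buffer whose unused tail is filled with zero bytes (the packet contents are invented). It uses rstrip() rather than strip() so only trailing padding is removed:

```python
packet = b'example\x00\x00'

# Decode first, then drop only the trailing NUL padding.
text = packet.decode('utf-8').rstrip('\x00')

# Alternative: cut at the first NUL before decoding (C-string semantics).
c_string = packet.split(b'\x00', 1)[0].decode('utf-8')

print(text, c_string)  # example example
```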

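I've filed a bug about documenting it at bugs.python.org/issue17860 - feel free to propose a patch. If it is hard to contribute, comments on how to improve that are welcome.

Python 2.7.6 doesn't handle b"\x80\x02\x03".decode("utf-8") -> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte.

If the content is random binary values, the utf-8 conversion is likely to fail. See @techtonik's answer (below) instead: stackoverflow.com/a/27527728/198536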

Mateen Ulhaq

Decode the byte string and turn it into a character (Unicode) string.

Python 3:

encoding = 'utf-8'
b'hello'.decode(encoding)

or

str(b'hello', encoding)

Python 2:

encoding = 'utf-8'
'hello'.decode(encoding)

or

unicode('hello', encoding)

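On Python 3, what if the string is in a variable?

@AlaaM.: the same. If you have variable = b'hello', then unicode_text = variable.decode(character_encoding)

For me, variable = variable.decode() automagically got it into the string format I wanted.

@AlexHall: FWIW, you might be interested to know that the automagic uses utf8, which is the default value for the encoding argument if you do not supply it. See bytes.decode

Using any decode gives me: AttributeError: 'str' object has no attribute 'decode'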

Mateen Ulhaq

This joins together a list of bytes into a string:

>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'

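Thanks, your method worked for me when none of the others did. I had a non-encoded byte array that I needed turned into a string. I was trying to find a way to re-encode it so I could decode it into a string. This method works perfectly!

@leetNightshade: yet it is terribly inefficient. If you have a byte array you only need to decode.

@Martijn Pieters I just did a simple benchmark against these other answers, running multiple 10,000-iteration runs (stackoverflow.com/a/3646405/353094), and the above solution was actually much faster every single time. For 10,000 runs in Python 2.7.7 it takes 8 ms, versus the others at 12 ms and 18 ms. Granted, there could be some variation depending on input, Python version, etc. It doesn't seem too slow to me.

@Sasszem: this method is a roundabout way to express a.decode('latin-1') where a = bytearray([112, 52, 52]) ("There Ain't No Such Thing as Plain Text". If you've managed to convert bytes into a text string, then you used some encoding: latin-1 in this case)

For Python 3, this should be equivalent to bytes([112, 52, 52]). By the way, bytes is a bad name for a local variable, precisely because it is a Python 3 builtin.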
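The point made in the comments can be checked directly. This sketch (with one extra non-ASCII value added to the sample list for illustration) shows that joining chr() of each byte value gives the same string as decoding the bytes as latin-1:

```python
byte_values = [112, 52, 52, 233]

joined = ''.join(map(chr, byte_values))          # the approach from the answer
decoded = bytes(byte_values).decode('latin-1')   # the explicit equivalent

print(joined == decoded)  # True
```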

Peter Mortensen

If you don't know the encoding, then to read binary input into a string in a Python 3 and Python 2 compatible way, use the ancient MS-DOS CP437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

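Because the encoding is unknown, expect non-English symbols to be translated to characters of cp437 (English characters are not translated, because they match in most single-byte encodings and in UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this: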

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid
start byte

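The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout - it is where Python chokes with the infamous ordinal not in range.

UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding stuff into binary data without data loss and crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.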
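The round trip mentioned in that update can be sketched as follows. This is only a minimal check on one sample value, not the performance or reliability test the update asks for. With errors='surrogateescape', each undecodable byte becomes a lone surrogate code point that encodes back to the original byte:

```python
raw = b'\x00\x01\xffsd'

# [binary] -> [str]: the invalid byte 0xff becomes the surrogate U+DCFF.
text = raw.decode('utf-8', errors='surrogateescape')

# [str] -> [binary]: the same handler restores the original byte.
restored = text.encode('utf-8', errors='surrogateescape')

print(ascii(text))       # '\x00\x01\udcffsd'
print(restored == raw)   # True
```

Note that such a string cannot be encoded with the default 'strict' handler, so it is best kept away from code that writes plain UTF-8.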

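UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility to slash-escape all unknown bytes with the backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)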

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

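See Python's Unicode Support for details.

UPDATE 20170119: I decided to implement slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.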

# --- preparation

import codecs

def slashescape(err):
    """Codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and the position where decoding should continue."""
    # bytearray() yields integers on both Python 2 and Python 3
    bad_bytes = bytearray(err.object[err.start:err.end])
    # Escape each offending byte as a two-digit \xNN sequence
    repl = u''.join(u'\\x%02x' % b for b in bad_bytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))

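I really feel that Python should provide a mechanism to replace missing symbols and continue.

@techtonik: This won't work on an array like it worked in Python 2.

You can also ignore Unicode errors in Python 3 with b'\x00\x01\xffsd'.decode('utf-8', 'ignore').

@anatolytechtonik It is possible to leave the escape sequence in the string and move on: b'\x80abc'.decode("utf-8", "backslashreplace") will result in '\\x80abc'. This information was taken from the unicode documentation page, which seems to have been updated since this answer was written.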
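For comparison, here is a small sketch of the built-in error handlers mentioned in these comments, applied to the same sample input (backslashreplace for decoding needs Python 3.5 or later):

```python
raw = b'\x80abc'

ignored = raw.decode('utf-8', errors='ignore')            # drops the bad byte
replaced = raw.decode('utf-8', errors='replace')          # inserts U+FFFD
escaped = raw.decode('utf-8', errors='backslashreplace')  # keeps it as \x80 text

print(ascii(ignored))   # 'abc'
print(ascii(replaced))  # '\ufffdabc'
print(ascii(escaped))   # '\\x80abc'
```

Only the last one keeps enough information to see which byte was wrong; 'ignore' silently loses data.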

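As far as I understand the ISO 8859 family, the ISO definitions only define the printable characters, not the control codes, which is why you see gaps in the tables on Wikipedia. In practice, however, codes 0-31 and 127-159 are mapped to the corresponding Unicode control codes. So decoding arbitrary bytes with ISO-8859-1 (aka latin1) is safe (this is also true for some, but not all, of the other ISO-8859 encodings).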
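That claim about latin-1 is easy to verify in Python 3. The sketch below decodes every possible byte value and checks that each maps to the code point with the same number and encodes back unchanged:

```python
all_bytes = bytes(range(256))

text = all_bytes.decode('latin-1')   # never raises for any byte value
code_points = [ord(ch) for ch in text]
round_trip = text.encode('latin-1')

print(code_points == list(range(256)), round_trip == all_bytes)  # True True
```

This makes latin-1 lossless, though the resulting text is only meaningful if the data really was latin-1.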

Peter Mortensen

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

This is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, the encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.

Peter Mortensen

I think you actually want this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

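Aaron's answer was correct, except that you need to know which encoding to use. And I believe that Windows uses 'windows-1252'. It will only matter if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.

By the way, the fact that it does matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation (or read it here).

The open() function for text streams, or Popen() if you pass it universal_newlines=True, do magically decide the character encoding for you (locale.getpreferredencoding(False) in Python 3.3+).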
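To see which encodings are involved on a given machine, something like the following sketch can be run. The variable names are illustrative, and the second value depends on the system's locale settings:

```python
import locale
import sys

# Default used by bytes.decode() / str.encode() when no encoding is given.
default_for_decode = sys.getdefaultencoding()

# Encoding that text-mode open() and universal_newlines=True pick up.
default_for_text_io = locale.getpreferredencoding(False)

print(default_for_decode, default_for_text_io)
```

On Python 3 the first value is 'utf-8'; the second may be something like 'cp1252' on Windows or 'UTF-8' on most Linux systems.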

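'latin-1' is a verbatim encoding with all code points set, so you can use that to effectively read a byte string into whichever type of string your Python supports (so verbatim on Python 2, into Unicode for Python 3).

@tripleee: 'latin-1' is a good way to get mojibake. Also there is magical substitution on Windows: it is surprisingly hard to pipe data from one process to another unmodified, e.g., dir: \xb6 -> \x14 (the example at the end of my answer)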

wim

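Since this question is actually asking about subprocess output, you have more direct approaches available. The most modern would be to use subprocess.check_output and pass text=True (Python 3.7+) to automatically decode stdout using the system default encoding: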

text = subprocess.check_output(["ls", "-l"], text=True)

For Python 3.6, Popen accepts an encoding keyword:

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

The general answer to the question in the title, if you're not dealing with subprocess output, is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used. If your data is not in the sys.getdefaultencoding() encoding, then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'

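Decoding ls output using the utf-8 encoding may fail (see the example in my answer from 2016).

@Boris: if the encoding parameter is given, then the text parameter is ignored.

This is the proper answer for subprocess. Perhaps still emphasize that Popen is almost always the wrong tool if you just want to wait for the subprocess and get its result; as the documentation says, use subprocess.run or one of the legacy functions check_call or check_output.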
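A minimal sketch of the subprocess.run approach suggested in that comment. To stay runnable on any platform it launches the current Python interpreter instead of ls (that substitution is an assumption of this example), and it passes encoding explicitly so the result does not depend on the locale:

```python
import subprocess
import sys

result = subprocess.run(
    [sys.executable, '-c', "print('hello')"],  # stand-in for ['ls', '-l']
    stdout=subprocess.PIPE,
    encoding='utf-8',   # decode the child's stdout as UTF-8 (Python 3.6+)
    check=True,         # raise CalledProcessError on a non-zero exit status
)
text = result.stdout

print(type(text).__name__, text.strip())  # str hello
```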

Borislav Sabev

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]

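I've been using this method and it works. However, it just guesses the encoding based on user preferences on your system, so it isn't as robust as some other options. This is what it's doing, quoting docs.python.org/3.4/library/subprocess.html: "If universal_newlines is True, [stdin, stdout and stderr] will be opened as text streams in universal newlines mode using the encoding returned by locale.getpreferredencoding(False)."

On 3.7 you can (and should) use text=True instead of universal_newlines=True.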

jfs

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

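Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding: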

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

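The data is corrupted, but your program remains unaware that a failure has occurred.

In general, what character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, and therefore the chardet module exists, which can guess the character encoding. A single Python script may use multiple character encodings in different places.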
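When the encoding really cannot be communicated out-of-band, one crude stand-in for guessing is to try a short list of candidates in order. The helper below is a hypothetical sketch, not part of chardet or the standard library, and the candidate list is an assumption you would adapt to your data:

```python
def decode_with_fallbacks(data, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Return (text, encoding_name) for the first candidate that decodes."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


print(decode_with_fallbacks(b'\xc2\xb5')[1])  # utf-8
print(decode_with_fallbacks(b'\xb5')[1])      # cp1252
print(decode_with_fallbacks(b'\x81')[1])      # latin-1
```

A successful decode only means the bytes were valid in that encoding, not that the guess was right, so the mojibake risk described above still applies.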

ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes, you could use os.fsencode().

If you pass the universal_newlines=True parameter, then subprocess uses locale.getpreferredencoding(False) to decode bytes, e.g., it can be cp1252 on Windows.

To decode the byte stream on-the-fly, io.TextIOWrapper() could be used.
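A minimal sketch of on-the-fly decoding with io.TextIOWrapper. Here an in-memory io.BytesIO stands in for a real binary pipe such as a subprocess's stdout (that substitution, and the sample text, are assumptions of the example):

```python
import io

# Any binary file-like object works; BytesIO stands in for a pipe here.
byte_stream = io.BytesIO('first line\nsecond line \u00b5\n'.encode('utf-8'))

# Wrap the binary stream so that iteration yields decoded str lines.
text_stream = io.TextIOWrapper(byte_stream, encoding='utf-8')
lines = [line.rstrip('\n') for line in text_stream]

print(len(lines), ascii(lines[1]))  # 2 'second line \xb5'
```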

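Different commands may use different character encodings for their output, e.g., the dir internal command (cmd) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):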

output = subprocess.check_output('dir', shell=True, encoding='cp437')

The filenames may differ from os.listdir() (which uses the Windows Unicode API), e.g., '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string

Felipe Augusto

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any simpler way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has a standard argument:

codecs.decode(obj, encoding='utf-8', errors='strict')

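A .decode() that uses 'utf-8' may fail (a command's output may use a different character encoding or even return an undecodable byte sequence). However, if the input is ASCII (a subset of utf-8), then .decode() works.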

Felipe Augusto

If you should get the following by trying decode():

AttributeError: 'str' object has no attribute 'decode'

you can also specify the encoding type straight in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'

LinFelix

If you have had this error:

utf-8 codec can't decode byte 0x8a,

then it is better to use the following code to convert bytes to a string:

data = b"abcdefg"
text = data.decode("utf-8", "ignore")

Tshilidzi Mudau

I made a function to clean a list

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista

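Actually, you can chain all of the .strip, .replace, .encode, etc. calls in one list comprehension and only iterate over the list once instead of iterating over it five times.

@TaylorEdmiston Maybe it saves on allocation, but the number of operations would remain the same.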
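The single-pass version suggested in the comment could look roughly like this. The name clean_list is hypothetical, and the sketch assumes the items are already str, in which case the encode/decode round trip in the original returns the same text and is left out:

```python
def clean_list(items):
    """Strip whitespace and remove newline and backspace characters."""
    return [
        item.strip().replace('\n', '').replace('\b', '')
        for item in items
    ]


print(clean_list(['  a\nb ', 'c\bd\n']))  # ['ab', 'cd']
```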

bers

When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

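All your line endings will be doubled (to \r\r\n), leading to extra empty lines, because writing in text mode on Windows turns each \n into \r\n. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,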

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.
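A related option, shown here as a sketch rather than as part of the answer, is to leave the decoded text alone and switch off newline translation when writing by passing newline='' to open(). The temporary directory and sample bytes are only for the demonstration:

```python
import os
import tempfile

original = b'line one\r\nline two\r\n'
text = original.decode('utf-8')   # still contains '\r\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Output.txt')
    # newline='' disables newline translation on write, on every platform.
    with open(path, 'w', encoding='utf-8', newline='') as out:
        out.write(text)
    with open(path, 'rb') as check:
        written = check.read()

print(written == original)  # True
```

Use the .replace("\r\n", "\n") version when you want \n-only strings in memory, and newline='' when you want a byte-for-byte copy on disk.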

I was looking for the .replace("\r\n", "\n") addition for so long. This is the answer if you want to render HTML properly.

Peter Mortensen

For Python 3, this is a much safer and more Pythonic approach to convert from bytes to a string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): # Check if it's in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

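1) As @bodangly said, type checking is not Pythonic at all. 2) The function you wrote is named "byte_to_str", which implies it will return a str, but it only prints the converted value, and it prints an error message if it fails (but doesn't raise an exception). This approach is also un-Pythonic and obfuscates the bytes.decode solution you provided.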
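Following that criticism, a version that returns the string and lets errors propagate might look like the sketch below. The name bytes_to_str and its parameters are hypothetical; it is just a thin wrapper around bytes.decode:

```python
def bytes_to_str(data, encoding='utf-8', errors='strict'):
    """Return data decoded to str.

    Raises AttributeError for objects without .decode() (such as str)
    and UnicodeDecodeError for bytes that are invalid in the encoding.
    """
    return data.decode(encoding, errors)


listing = bytes_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n')
print(listing.splitlines()[0])  # total 0
```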

Boris Verkhovskiy

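For your specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7 you should use subprocess.run and pass in text=True (as well as capture_output=True to capture the output)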

command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout  # is a `str` containing your program's stdout

text used to be called universal_newlines, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True instead of text=True

Peter Mortensen

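From the sys module documentation (System-specific parameters and functions):

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

The pipe to the subprocess is already a binary buffer. Your answer fails to address how to get a string value from the resulting bytes value.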

Aarav Dave

Use .decode() to decode the bytes into a string. Pass in 'utf-8' as the argument.

Supergamer

Bytes

m=b'This is bytes'

Converting to a string

Method 1

m.decode("utf-8")

or

m.decode()

Method 2

import codecs
codecs.decode(m,encoding="utf-8")

or

import codecs
codecs.decode(m)

Method 3

str(m,encoding="utf-8")

or

str(m)[2:-1]

Result

'This is bytes'

Leonardo Filipe

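def toString(string):
    try:
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        # str has no .decode() (AttributeError); undecodable bytes raise
        # UnicodeDecodeError, which is a subclass of ValueError
        return string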

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))

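While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. Remember that you are answering the question for readers in the future, not just the person asking now! Please edit your answer to add an explanation, and give an indication of what limitations and assumptions apply. It also doesn't hurt to mention why this answer is more appropriate than others.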

Peter Mortensen

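If you want to convert any bytes, not just a string converted to bytes:

import base64
import json

with open("bytesfile", "rb") as infile:
    # b85encode returns bytes; decode the ASCII result to get a str
    text = base64.b85encode(infile.read()).decode("ascii")

with open("bytesfile", "rb") as infile:
    text2 = json.dumps(list(infile.read()))

This is not very efficient, however. It will turn a 2 MB picture into 9 MB.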
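To get a feel for the size overhead, the sketch below encodes the same arbitrary payload three ways and compares lengths. The payload is made up for the example, and base64.b64encode is added only as a familiar point of comparison:

```python
import base64
import json

payload = bytes(range(256)) * 4   # 1024 arbitrary bytes

as_b64 = base64.b64encode(payload).decode('ascii')
as_b85 = base64.b85encode(payload).decode('ascii')
as_json = json.dumps(list(payload))

# Each representation is a str and can be turned back into the same bytes.
print(len(payload), len(as_b85), len(as_b64), len(as_json))
print(base64.b85decode(as_b85) == payload)  # True
```

The JSON list of integers is by far the largest, which is consistent with the blow-up described above.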

Victor Choy

Try this

bytes.fromhex('c3a9').decode('utf-8')
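Spelled out step by step, the one-liner above first builds a bytes object from hexadecimal text and then decodes it. This sketch also shows the reverse direction with bytes.hex() (Python 3.5+); the variable names are illustrative:

```python
raw = bytes.fromhex('c3a9')           # hex text -> bytes b'\xc3\xa9'
text = raw.decode('utf-8')            # bytes -> str
back = text.encode('utf-8').hex()     # str -> bytes -> hex text

print(raw, ascii(text), back)  # b'\xc3\xa9' '\xe9' c3a9
```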

Shubhank Gupta

We can decode the bytes object to produce a string using bytes.decode(encoding='utf-8', errors='strict'), as described in the documentation.

Python 3 example:

byte_value = b"abcde"
print("Initial value = {}".format(byte_value))
print("Initial value type = {}".format(type(byte_value)))
string_value = byte_value.decode("utf-8")
# utf-8 is used here because it is a very common encoding, but you need to use the encoding your data is actually in.
print("------------")
print("Converted value = {}".format(string_value))
print("Converted value type = {}".format(type(string_value)))

Output:

Initial value = b'abcde'
Initial value type = <class 'bytes'>
------------
Converted value = abcde
Converted value type = <class 'str'>

NOTE: In Python 3, the default encoding is utf-8. So <byte_string>.decode("utf-8") can also be written as <byte_string>.decode()

Ratul Hasan

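Try using this one; this function ignores any bytes that are not valid in the chosen character set (utf-8 by default) and returns a clean string. It is tested for Python 3.6 and above.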

def bin2str(text, encoding = 'utf-8'):
    """Converts a binary to Unicode string by removing all non Unicode char
    text: binary string to work on
    encoding: output encoding *utf-8"""

    return text.decode(encoding, 'ignore')

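Here, the function takes the binary data and decodes it (converting the binary data to characters using one of Python's predefined character sets); the ignore argument drops all data that is not valid in that character set, and the function finally returns your desired string value.

If you are not sure about the encoding, use sys.getdefaultencoding() to get the default encoding of your device.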
