
UnicodeEncodeError: 'charmap' codec can't encode characters

I'm trying to scrape a website, but it gives me an error.

I'm using the following code:

import urllib.request
from bs4 import BeautifulSoup

get = urllib.request.urlopen("https://www.website.com/")
html = get.read()

soup = BeautifulSoup(html)

print(soup)

And I'm getting the following error:

File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>

What can I do to fix this?


twasbrillig

I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:

with open(fname, "w") as f:
    f.write(html)

with this:

with open(fname, "w", encoding="utf-8") as f:
    f.write(html)

If you need to support Python 2, then use this:

import io
with io.open(fname, "w", encoding="utf-8") as f:
    f.write(html)

If your file is encoded in something other than UTF-8, specify whatever your actual encoding is for the encoding parameter.
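For example, if you need the output file in Windows-1251 rather than UTF-8 (cp1251 is only a hypothetical stand-in here for whatever your real encoding is), pass that name instead:

# hypothetical: write the output file as Windows-1251 instead of UTF-8
with open(fname, "w", encoding="cp1251") as f:
    f.write(html)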


On Mac (Python 3) it works perfectly with just open and no encoding, but on Windows (Windows 10, Python 3) that is not an option; it only works that way, with the encoding="utf-8" parameter.
Thank you. It worked for me; I was working with XML files and writing the result of xml.toprettyxml() to a new file.
This should be the accepted answer because it will eventually write a string to the output, and not a string representation of bytes.
OP requested to read the file however, not write the file. The issue seems to be console-related.
The comment by @EcksDee pertains to an earlier version of this answer. The current version is correct; the io wrapper is necessary for Python 2, where the regular open function did not permit you to specify an encoding.
twasbrillig

I fixed it by adding .encode("utf-8") to soup.

That means that print(soup) becomes print(soup.encode("utf-8")).


Don't hardcode the character encoding of your environment (e.g., the console) inside your script; print Unicode directly instead.
This is just printing the repr of a bytes object, which will print as a mess of \x sequences if there's a lot of UTF-8 encoded text. I recommend using win_unicode_console, as @J.F.Sebastian suggests.
I used the above solution but am still getting issues: class MyStreamListener(tweepy.StreamListener): def on_status(self, status): print(str(status.encode("utf-8"))) UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 87: character maps to <undefined>
This makes it print out b'\x02x\xc2\xa9' (a bytes object) instead
print(soup.encode("utf-8")) worked for me, but before that I had to also add with open("f_name", encoding="utf-8") as f: soup = BeautifulSoup(f, "html.parser")
MilkyWay90

In Python 3.7, running on Windows 10, this worked (I am not sure whether it will work on other platforms and/or other versions of Python).

Replacing this line:

with open('filename', 'w') as f:

With this:

with open('filename', 'w', encoding='utf-8') as f:

The reason this works is that the encoding used for the file is changed to UTF-8, so the characters can be written out as UTF-8 text instead of raising an error when a character is encountered that is not supported by the current default encoding.
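Put together with the scraping code from the question, a minimal sketch might look like this (the output filename is just a placeholder):

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("https://www.website.com/").read()
soup = BeautifulSoup(html, "html.parser")

# the file is opened as UTF-8, so any character in the page can be written out
with open("scraped.html", "w", encoding="utf-8") as f:
    f.write(str(soup))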


print(soup) returns \xd0\xbf\xd0\xbe\xd0\xb6\xd0\xb0\xd0\xbb\xd1\x83\xd0\xb9\xd
@CoffeeinTime That looks like UTF-16 erroneously converted to some 8-bit encoding, or possibly using Python 2. The string you show is truncated, but it seems to begin with "뿐뻐뛐냐믐菑말" (I don't read Korean so I have no idea if that makes any sense). Demo: ideone.com/092Jnk
Voy
set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8

You may or may not need to set that second environment variable PYTHONLEGACYWINDOWSSTDIO.

Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):

import sys

sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
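Note that reconfigure() only exists on Python 3.7 and newer, and some environments replace sys.stdin/sys.stdout with wrapper objects that don't support it (see the comments below), so a guarded sketch would be:

import sys

# reconfigure() was added in Python 3.7; some replaced streams (IDEs, debuggers) lack it
for stream in (sys.stdin, sys.stdout, sys.stderr):
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")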

Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:

set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252

This is perfect; I was getting this error while using the Python Debugger (pdb) on a Windows system looking at source code that used utf-8 and had lots of emoji in it. Every time I did a "list" command to see where I was, the "charmap" error appeared. Setting these two environment variables made my debugging as smooth as silk.
sys.stdin.reconfigure is invalid on Python 3.9.0, it throws AttributeError: 'StdInputFile' object has no attribute 'reconfigure'
On Windows 10, using GIT BASH, setting the env variables mentioned above did NOT work, however, setting the two lines in the actual python code file DID work: sys.stdin.reconfigure(encoding='utf-8') sys.stdout.reconfigure(encoding='utf-8')
@Suncatcher Try to run this Python script in a different IDE
@PetrL. why I should use IDE at all? all valid Python commands should be interpretable in Python Shell, otherwise they are not valid
Suraj Rao

I got the same error on Python 3.7 on Windows 10 while saving the response of a GET request. The encoding of the response received from the URL was UTF-8, so it is always recommended to check the encoding of the response so the same can be passed when writing the file; such a trivial issue can really kill a lot of time in production.

import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open('NiftyList.txt', 'w') as f:
    f.write(resp.text)

When I added encoding="utf-8" to the open command, it saved the file with the correct response:

with open('NiftyList.txt', 'w', encoding="utf-8") as f:
    f.write(resp.text)
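If you prefer not to hard-code UTF-8, you can also pass the encoding that requests detected straight through (falling back to UTF-8 when the server did not declare one; that fallback is only an assumption in this sketch):

import requests

resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')

# write the file in the same encoding the response text was decoded with
with open('NiftyList.txt', 'w', encoding=resp.encoding or 'utf-8') as f:
    f.write(resp.text)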

Pardhu Gopalam

I faced the same issue with the encoding, which occurs when you try to print it, read/write it, or open it. As others mentioned above, adding .encode("utf-8") will help if you are trying to print it.

soup.encode("utf-8")

If you are trying to open scraped data and maybe write it into a file, then open the file with (......,encoding="utf-8")

with open(filename_csv , 'w', newline='',encoding="utf-8") as csv_file:
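For example, a small sketch of writing scraped rows with the csv module (the filename and rows are placeholders):

import csv

rows = [["title", "price"], ["Héllö wörld", "9,99 €"]]  # placeholder data

# newline='' is what the csv module expects; encoding="utf-8" avoids the charmap error
with open("scraped.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(rows)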


MilkyWay90

For those still getting this error, adding encode("utf-8") to soup will also fix this.

soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)

soup is no longer a BeautifulSoup object after you do this so it cannot be manipulated or searched
tripleee

There are multiple aspects to this problem. The fundamental question is which character set you want to output into. You may also have to figure out the input character set.

Printing (with either print or write) into a file with an explicit encoding="..." will translate Python's internal Unicode representation into that encoding. If the output contains characters which are not supported by that encoding, you will get a UnicodeEncodeError. For example, you can't write Russian or Chinese or Indic or Hebrew or Arabic or emoji or ... anything except a restricted set of some 200+ Western characters to a file whose encoding is "cp1252", because this limited 8-bit character set has no way to represent these characters.
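You can reproduce this in isolation; encoding a string containing characters outside cp1252 fails the same way (the strings here are just examples):

>>> "café".encode("cp1252")     # é exists in cp1252, so this works
b'caf\xe9'
>>> "почта".encode("cp1252")    # Cyrillic does not exist in cp1252
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>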

Basically the same problem will occur with any 8-bit character set, including nearly all the legacy Windows code pages (437, 850, 1250, 1251, etc etc), though some of them support some additional script in addition to or instead of English (1251 supports Cyrillic, for example, so you can write Russian, Ukrainian, Serbian, Bulgarian, etc). An 8-bit encoding has only a maximum of 256 character codes and no way to represent a character which isn't among them.

Perhaps now would be a good time to read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

On platforms where the terminal is not capable of printing Unicode (only Windows these days really, though if you're into retrocomputing, this problem was also prevalent on other platforms in the previous millennium) attempting to print Unicode strings can also produce this error, or output mojibake. If you see something like HÃ©llÃ¶ instead of Héllö, this is your issue.

In short, then, you need to know:

What is the character set of the page you scraped, or the data you received? Was it correctly scraped? Did the originator correctly identify its encoding, or are you able to otherwise obtain this information (or guess it)? Some web sites incorrectly declare a different character set than the page actually contains, some sites have incorrectly configured the connection between the web server and a back-end database. See e.g. scrape with correct character encoding (python requests + beautifulsoup) for a more detailed example with some solutions. (A small sketch of how to inspect this follows after the next point.)

What is the character set you want to write? If printing to the screen, is your terminal correctly configured, and is your Python interpreter configured identically? Perhaps see also How to display utf-8 in windows console
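For the first question, here is a small sketch of how to inspect what the server and the parser think the encoding is, using requests and BeautifulSoup (the URL is the placeholder from the question):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.website.com/")
print(resp.encoding)            # encoding declared in the HTTP headers, if any
print(resp.apparent_encoding)   # encoding guessed from the raw bytes

soup = BeautifulSoup(resp.content, "html.parser")
print(soup.original_encoding)   # what BeautifulSoup decided the bytes were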

If you are here, probably the answer to one of these questions is not "UTF-8". This is increasingly becoming the prevalent encoding for web pages, too, though the former standard was ISO-8859-1 (aka Latin-1) and more recently Windows code page 1252.

Going forward, you basically want all your textual data to be Unicode, outside of a few fringe use cases. Generally, that means UTF-8, though on Windows (or if you need Java compatibility), UTF-16 is also vaguely viable, albeit somewhat cumbersome. (There are several other Unicode serialization formats, which may be useful in specialized circumstances. UTF-32 is technically trivial, but takes up a lot more memory; UTF-7 is used in a few network protocols where 7-bit ASCII is required for transport.) Perhaps see also https://utf8everywhere.org/

Naturally, if you are printing to a file, you also need to examine that file using a tool which can correctly display it. A common pilot error is to open the file using a tool which only displays the currently selected system encoding, or one which tries to guess the encoding, but guesses wrong. Again, a common symptom when viewing UTF-8 text using Windows code page 1252 would be, for example, Héllö displaying as HÃ©llÃ¶.

If the encoding of character data is unknown, there is no simple way to automatically establish it. If you know what the text is supposed to represent, you can perhaps infer it, but this is typically a manual process with some guesswork involved. (Automatic tools like chardet and ftfy can help, but they get it wrong some of the time, too.)

To establish which encoding you are looking at, it can be helpful if you can identify the individual bytes in a character which isn't displayed correctly. For example, if you are looking at H\x8ell\x9a but expect it to represent Héllö, you can look up the bytes in a translation table. I have published one such table at https://tripleee.github.io/8bit where you can see that in this example, it's probably one of the legacy Mac 8-bit character sets; with more data points, perhaps you can narrow it down to just one of them (and if not, any one of them will do in practice, since all the code points you care about map to the same Unicode characters).
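In code, the same lookup is just trying candidate codecs on the raw bytes (mac_roman here is the guess from the table; cp1252 is shown for contrast):

>>> b"H\x8ell\x9a".decode("mac_roman")
'Héllö'
>>> b"H\x8ell\x9a".decode("cp1252")
'HŽllš'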

Python 3 on most platforms defaults to UTF-8 for all input and output, but on Windows, this is commonly not the case. It will then instead default to the system's default encoding (still misleadingly called "ANSI code page" in some Microsoft documentation), which depends on a number of factors. On Western systems, the default encoding out of the box is commonly Windows code page 1252. (Earlier Python versions had somewhat different expectations, and in Python 2, the internal string representation was not Unicode.)
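You can check what your interpreter will actually use by default (the values shown are from one Western Windows setup; yours may differ):

>>> import locale, sys
>>> locale.getpreferredencoding(False)   # what open() uses when no encoding is given
'cp1252'
>>> sys.stdout.encoding                  # what print() uses for the console
'utf-8'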

If you are on Windows and write UTF-8 to a file, maybe specify encoding="utf-8-sig" which adds a BOM sequence at the beginning of the file. This is strictly speaking not necessary or correct, but some Windows tools need it to correctly identify the encoding.
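The only difference is three extra bytes (the BOM) at the very start of the file:

>>> "hi".encode("utf-8-sig")
b'\xef\xbb\xbfhi'
>>> "hi".encode("utf-8")
b'hi'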

Several of the earlier answers here suggest blindly applying some encoding, but hopefully this should help you understand how that's not generally the correct approach, and how to figure out which encoding to use.


Karim Sherif

If you are using Windows, try passing encoding='latin1', encoding='iso-8859-1', or encoding='cp1252'. Example:

import pandas as pd

csv_data = pd.read_csv(csvpath, encoding='iso-8859-1')
print(soup.encode('iso-8859-1'))
