UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

python python-3.x unicode file-io decode

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:

Traceback (most recent call last):
   File "SCRIPT LOCATION", line NUMBER, in <module>
     text = file.read()
   File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>

For the same error, this solution helped me: solution of charmap error

See Processing Text Files in Python 3 to understand why you get this error.

For Python 3.7 and later, add -Xutf8 to the interpreter options (arguments); that should fix it if the file is actually UTF-8.

fat

The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.

You specify the encoding when you open the file:

file = open(filename, encoding="utf8")
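To make the difference concrete, here is a minimal sketch that reproduces the error and then avoids it by naming the encoding. The sample text and the temporary file are assumptions for illustration only; the Cyrillic letter "А" is used because its UTF-8 form happens to contain the byte 0x90.

```python
import os
import tempfile

# Cyrillic "А" (U+0410) is stored as the two bytes D0 90 in UTF-8.
sample = "А"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

# Decoding as cp1252 fails because 0x90 is undefined in that codepage.
try:
    with open(path, encoding="cp1252") as f:
        f.read()
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Decoding with the encoding the file was actually written in succeeds.
with open(path, encoding="utf-8") as f:
    text = f.read()

os.remove(path)
print(cp1252_failed, text)
```

Naming the encoding explicitly also makes the script behave the same on machines with different default codepages.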

Cool, I had that problem with some Python 2.7 code that I tried to run in Python 3.4. Latin-1 worked for me!

If you're using Python 2.7 and getting the same error, try the io module: io.open(filename, encoding="utf8")

@1vand1ng0: of course Latin-1 works; it'll work for any file regardless of what the actual encoding of the file is. That's because all 256 possible byte values in a file have a Latin-1 codepoint to map to, but that doesn't mean you get legible results! If you don't know the encoding, even opening the file in binary mode instead might be better than assuming Latin-1.
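A small sketch of that point, using made-up sample data: Latin-1 decodes every possible byte without raising, yet text that was really UTF-8 comes out garbled.

```python
# Every one of the 256 byte values has a Latin-1 code point, so this never raises.
every_byte = bytes(range(256))
decoded = every_byte.decode("latin-1")

# UTF-8 text read as Latin-1 "works" but produces mojibake.
utf8_bytes = "é".encode("utf-8")         # two bytes: C3 A9
garbled = utf8_bytes.decode("latin-1")   # each byte becomes its own character

print(len(decoded), garbled)
```

The absence of an exception therefore says nothing about whether Latin-1 was the right choice.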

It is unicode by default, but unicode is not an encoding. regebro.wordpress.com/2011/03/23/…

Even after using the following, I am getting the same error:

filename = "C:\Report.txt"
with open(filename, encoding="utf8") as my_file:
    text = my_file.read()
print(text)

I have also tried other encodings, but all in vain. In this code I am also using from geotext import GeoText. Please suggest a solution.

Ben

If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)

Many thanks - I will give this a try. There are some invalid characters in parts of files I do not care about.

Warning: This will result in data loss when unknown characters are encountered (which may be fine depending on your situation).
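To illustrate the data loss, here is a short sketch on an in-memory byte string (the sample bytes are an assumption). The same errors= values apply to open() as to bytes.decode().

```python
raw = b"caf\x90e"  # 0x90 has no mapping in Python's cp1252 codec

# errors="ignore" silently drops the undecodable byte.
ignored = raw.decode("cp1252", errors="ignore")

# errors="replace" keeps a visible marker (U+FFFD) where the byte was.
replaced = raw.decode("cp1252", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

With "ignore" there is no trace that anything was removed, so "replace" may be the safer choice when you want to notice the damage later.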

The suggested encoding string should have a dash and therefore it should be: open(csv_file, encoding='utf-8') (as tested on Python3)

Thanks, ignoring the errors worked for me.

MendelG

Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:

open(filename, 'rb')

where r = reading, b = binary

Perhaps emphasize that the b will produce bytes instead of str data. Like you note, this is suitable if you don't need to process the bytes in any way.
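A minimal sketch of what binary mode returns, using a temporary file with made-up contents: the result is a bytes object, and no decoding (and therefore no UnicodeDecodeError) takes place.

```python
import os
import tempfile

payload = b"header\x90\x00tail"  # arbitrary bytes, including 0x90

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# "rb" reads raw bytes; nothing is decoded.
with open(path, "rb") as f:
    data = f.read()

os.remove(path)
print(type(data), data)
```

If the text is needed later, data.decode(...) can still be called once the encoding is known.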

Stevoisiak

As an extension to @LennartRegebro's answer:

If you can't tell what encoding your file uses, the solution above does not work (the file is not UTF-8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect but usually work just fine. Once you have figured out the encoding, you should be able to use the solution above.

EDIT: (Copied from comment)

A quite popular text editor Sublime Text has a command to display encoding if it has been set...

Go to View -> Show Console (or Ctrl+`)

https://i.stack.imgur.com/TvXZL.png

Type into field at the bottom view.encoding() and hope for the best (I was unable to get anything but Undefined but maybe you will have better luck...)

https://i.stack.imgur.com/yz8nN.png

Some text editors will provide this information as well. I know that with vim you can get this via :set fileencoding (from this link)

Sublime Text, also -- open up the console and type view.encoding().

Alternatively, you can open your file with Notepad. Choose 'Save As' and you will see a drop-down showing the encoding used.
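If no editor or tool is at hand, a rough sketch like the following can narrow things down by trying strict decodes in order. The helper name first_decodable and the candidate list are hypothetical, and a successful decode only shows that the bytes are valid in that encoding, not that it is the right one.

```python
def first_decodable(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes strictly."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None

# UTF-8 is tried first because it rejects most non-UTF-8 input.
print(first_decodable("А".encode("utf-8")))
print(first_decodable(b"caf\xe9"))
print(first_decodable(b"\x90"))
```

Latin-1 is placed last because it accepts any input, so it acts as a catch-all rather than as evidence.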

Olivia Stork

TLDR: Try: file = open(filename, encoding='cp437')

Why? When one uses:

file = open(filename)
text = file.read()

Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it into a Unicode string. If the file contains byte values that are not defined in this codepage (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may not be handled by Python (e.g. cp790), and sometimes the file can contain mixed encodings.

If such characters are unneeded, one may decide to replace them with the replacement character (U+FFFD, usually displayed as a question mark), with:

file = open(filename, errors='replace')

Another workaround is to use:

file = open(filename, errors='ignore')

The undecodable bytes are then silently dropped while the rest of the text is left intact, but other errors will be masked too.

A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):

file = open(filename, encoding='cp437')

Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
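A quick sketch of that claim: every byte value decodes under cp437 and the result encodes back to the same bytes, while the character chosen for a given byte may not be the one the file's author intended.

```python
all_bytes = bytes(range(256))

# No byte is undefined in cp437, so a strict decode succeeds.
as_text = all_bytes.decode("cp437")

# The mapping is one-to-one, so the original bytes can be recovered.
round_trip = as_text.encode("cp437")

# The byte from the question becomes a printable letter.
char_0x90 = b"\x90".decode("cp437")

print(len(as_text), round_trip == all_bytes, char_0x90)
```

This shows why no error is raised, but it does not show that cp437 is the file's real encoding.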

Probably you should emphasize even more that randomly guessing at the encoding is likely to produce garbage. You have to know the encoding of the data.

E.Zolduoarrati

Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:

open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')

Godspeed

Sure sir. Roger that. No time wasted. Thank you. Would you like a cup of coffee or a fine wine?

Before you apply that, be sure that you want your 0x90 to be decoded to 'É'. Check b'\x90'.decode('cp437').

Antoni

For those working in Anaconda on Windows: I had the same problem, and Notepad++ helped me solve it.

Open the file in Notepad++. The bottom right corner shows the current file encoding. In the top menu, next to "View", locate "Encoding". Under "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case, the encoding "Windows-1252" was found under "Western European".

Only the viewing encoding is changed in this way. In order to effectively change the file's encoding, change preferences in Notepad++ and create a new document, as shown here: superuser.com/questions/1184299/….

hanna

Before you apply the suggested solution, you can check which Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching 0x0090)

and then consider removing it from the file.

I have a web page at tripleee.github.io/8bit/#90 where you can look up the character's value in the various 8-bit encodings supported by Python. With enough data points, you can often infer a suitable encoding (though some of them are quite similar, and so establishing exactly which encoding the original writer used will often involve some guesswork, too).
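The same kind of lookup can be sketched locally. The helper name describe_byte and the short list of encodings are assumptions for illustration; any codec name Python supports could be added.

```python
import unicodedata

def describe_byte(value, encodings=("cp1252", "latin-1", "cp437", "cp1251")):
    """Map each encoding name to the character it assigns to one byte, or None."""
    results = {}
    for name in encodings:
        try:
            results[name] = bytes([value]).decode(name)
        except UnicodeDecodeError:
            results[name] = None  # byte is undefined in this encoding
    return results

table = describe_byte(0x90)
for name, char in table.items():
    if char is None:
        label = "undefined"
    else:
        label = f"U+{ord(char):04X} {unicodedata.name(char, '(no name)')}"
    print(f"{name}: {label}")
```

Repeating this for several suspicious bytes from the file gives the data points needed to judge which encoding produces plausible text.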

gabi939

For me, using the utf16 encoding worked:

file = open('filename.csv', encoding="utf16")
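One hint that a file may be UTF-16 is a byte order mark at its start. The sketch below checks for it; the helper name looks_like_utf16 is hypothetical, and the check is only a heuristic, since UTF-16 files can lack a BOM and a UTF-32 little-endian BOM begins with the same two bytes.

```python
import codecs

def looks_like_utf16(data):
    """Heuristic: does the data start with a UTF-16 byte order mark?"""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Python's "utf-16" codec writes a BOM when encoding.
with_bom = "hi".encode("utf-16")

print(looks_like_utf16(with_bom), looks_like_utf16(b"hi"))
```

In practice this would be applied to the first few bytes read from the file in binary mode.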

Arthur MacMillan

In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit configurations (in the Configuration tab, change the value of the Interpreter options field to -Xutf8).

Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.

This assumes that the source data is UTF-8, which is by no means a given.
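To see whether UTF-8 mode is actually active, and which encoding open() would otherwise fall back to, a short check like this can help. It is a sketch that assumes Python 3.7 or later, where sys.flags.utf8_mode exists.

```python
import locale
import sys

# True when the interpreter runs in UTF-8 mode (-X utf8 or PYTHONUTF8=1).
utf8_mode = bool(sys.flags.utf8_mode)

# The encoding open() uses when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)

print(f"UTF-8 mode: {utf8_mode}, default text encoding: {default_encoding}")
```

On a Windows machine like the one in the question, this would typically report cp1252 unless UTF-8 mode is on.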

SuperStormer
