When piping the output of a Python program, the Python interpreter gets confused about encoding and sets it to None. This means a program like this:
# -*- coding: utf-8 -*-
print u"åäö"
will work fine when run normally, but fail with:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
when used in a pipe sequence.
What is the best way to make this work when piping? Can I just tell it to use whatever encoding the shell/filesystem/whatever is using?
The suggestions I have seen thus far is to modify your site.py directly, or hardcoding the defaultencoding using this hack:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
print u"åäö"
Is there a better way to make piping work?
chcp 65001
before executing your script. This can have issues, but it often helps, and doesn't require a lot of typing (less than set PYTHONIOENCODING=utf_8
).
setx PYTHONENCODING utf-8
to make it permanent if you want to save typing.
Your code works when run in an script because Python encodes the output to whatever encoding your terminal application is using. If you are piping you must encode it yourself.
A rule of thumb is: Always use Unicode internally. Decode what you receive, and encode what you send.
# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')
Another didactic example is a Python program to convert between ISO-8859-1 and UTF-8, making everything uppercase in between.
import sys
for line in sys.stdin:
# Decode what you receive:
line = line.decode('iso8859-1')
# Work with Unicode internally:
line = line.upper()
# Encode what you send:
line = line.encode('utf-8')
sys.stdout.write(line)
Setting the system default encoding is a bad idea, because some modules and libraries you use can rely on the fact it is ASCII. Don't do it.
First, regarding this solution:
# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')
It's not practical to explicitly print with a given encoding every time. That would be repetitive and error-prone.
A better solution is to change sys.stdout
at the start of your program, to encode with a selected encoding. Here is one solution I found on Python: How is sys.stdout.encoding chosen?, in particular a comment by "toka":
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
def myprint(unicodeobj): print unicodeobj.encode('utf-8')
-- you automatically detect terminal encoding by inspecting sys.stdout.encoding
, but you should consider the case where it is None
(i.e. when redirecting output to a file) so you need a separate function anyway.
You may want to try changing the environment variable "PYTHONIOENCODING" to "utf_8". I have written a page on my ordeal with this problem.
Tl;dr of the blog post:
import sys, locale, os
print(sys.stdout.encoding)
print(sys.stdout.isatty())
print(locale.getpreferredencoding())
print(sys.getfilesystemencoding())
print(os.environ["PYTHONIOENCODING"])
print(chr(246), chr(9786), chr(9787))
gives you
utf_8
False
ANSI_X3.4-1968
ascii
utf_8
ö ☺ ☻
sys.stdout = codecs.getwriter(encoding)(sys.stdout)
. This can be done from within the python program, so the user is not forced to set an env variable.
PYTHONIOENCODING
does work. How bytes are interpreted as a text is defined by user environment. Your script shouldn't be assuming and dictate the user environment what character encoding to use. If Python doesn't pick up the settings automatically then PYTHONIOENCODING
can be set for your script. You shouldn't need it unless the output is redirected to a file/pipe.
export PYTHONIOENCODING=utf-8
do the job, but can't set it on python itself ...
what we can do is verify if isn't setting and tell the user to set it before call script with :
if __name__ == '__main__':
if (sys.stdout.encoding is None):
print >> sys.stderr, "please set python env PYTHONIOENCODING=UTF-8, example: export PYTHONIOENCODING=UTF-8, when write to stdout."
exit(1)
Update to reply to the comment: the problem just exist when piping to stdout . I tested in Fedora 25 Python 2.7.13
python --version
Python 2.7.13
cat b.py
#!/usr/bin/env python
#-*- coding: utf-8 -*-
import sys
print sys.stdout.encoding
running ./b.py
UTF-8
running ./b.py | less
None
sys.stdout.encoding
is automatically set based on the LC_CTYPE
locale value.
I had a similar issue last week. It was easy to fix in my IDE (PyCharm).
Here was my fix:
Starting from PyCharm menu bar: File -> Settings... -> Editor -> File Encodings, then set: "IDE Encoding", "Project Encoding" and "Default encoding for properties files" ALL to UTF-8 and she now works like a charm.
Hope this helps!
An arguable sanitized version of Craig McQueen's answer.
import sys, codecs
class EncodedOut:
def __init__(self, enc):
self.enc = enc
self.stdout = sys.stdout
def __enter__(self):
if sys.stdout.encoding is None:
w = codecs.getwriter(self.enc)
sys.stdout = w(sys.stdout)
def __exit__(self, exc_ty, exc_val, tb):
sys.stdout = self.stdout
Usage:
with EncodedOut('utf-8'):
print u'ÅÄÖåäö'
I just thought I'd mention something here which I had to spent a long time experimenting with before I finally realised what was going on. This may be so obvious to everyone here that they haven't bothered mentioning it. But it would've helped me if they had, so on that principle...!
NB: I am using Jython specifically, v 2.7, so just possibly this may not apply to CPython...
NB2: the first two lines of my .py file here are:
# -*- coding: utf-8 -*-
from __future__ import print_function
The "%" (AKA "interpolation operator") string construction mechanism causes ADDITIONAL problems too... If the default encoding of the "environment" is ASCII and you try to do something like
print( "bonjour, %s" % "fréd" ) # Call this "print A"
You will have no difficulty running in Eclipse... In a Windows CLI (DOS window) you will find that the encoding is code page 850 (my Windows 7 OS) or something similar, which can handle European accented characters at least, so it'll work.
print( u"bonjour, %s" % "fréd" ) # Call this "print B"
will also work.
If, OTOH, you direct to a file from the CLI, the stdout encoding will be None, which will default to ASCII (on my OS anyway), which will not be able to handle either of the above prints... (dreaded encoding error).
So then you might think of redirecting your stdout by using
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
and try running in the CLI piping to a file... Very oddly, print A above will work... But print B above will throw the encoding error! The following will however work OK:
print( u"bonjour, " + "fréd" ) # Call this "print C"
The conclusion I have come to (provisionally) is that if a string which is specified to be a Unicode string using the "u" prefix is submitted to the %-handling mechanism it appears to involve the use of the default environment encoding, regardless of whether you have set stdout to redirect!
How people deal with this is a matter of choice. I would welcome a Unicode expert to say why this happens, whether I've got it wrong in some way, what the preferred solution to this, whether it also applies to CPython, whether it happens in Python 3, etc., etc.
"fréd"
is a byte sequence and not a Unicode string, so the codecs.getwriter
wrapper will leave it alone. You need a leading u
, or from __future__ import unicode_literals
.
I could "automate" it with a call to:
def __fix_io_encoding(last_resort_default='UTF-8'):
import sys
if [x for x in (sys.stdin,sys.stdout,sys.stderr) if x.encoding is None] :
import os
defEnc = None
if defEnc is None :
try:
import locale
defEnc = locale.getpreferredencoding()
except: pass
if defEnc is None :
try: defEnc = sys.getfilesystemencoding()
except: pass
if defEnc is None :
try: defEnc = sys.stdin.encoding
except: pass
if defEnc is None :
defEnc = last_resort_default
os.environ['PYTHONIOENCODING'] = os.environ.get("PYTHONIOENCODING",defEnc)
os.execvpe(sys.argv[0],sys.argv,os.environ)
__fix_io_encoding() ; del __fix_io_encoding
Yes, it's possible to get an infinite loop here if this "setenv" fails.
I ran into this problem in a legacy application, and it was difficult to identify where what was printed. I helped myself with this hack:
# encoding_utf8.py
import codecs
import builtins
def print_utf8(text, **kwargs):
print(str(text).encode('utf-8'), **kwargs)
def print_utf8(fn):
def print_fn(*args, **kwargs):
return fn(str(*args).encode('utf-8'), **kwargs)
return print_fn
builtins.print = print_utf8(print)
On top of my script, test.py:
import encoding_utf8
string = 'Axwell Λ Ingrosso'
print(string)
Note that this changes ALL calls to print to use an encoding, so your console will print this:
$ python test.py
b'Axwell \xce\x9b Ingrosso'
On Windows, I had this problem very often when running a Python code from an editor (like Sublime Text), but not if running it from command-line.
In this case, check your editor's parameters. In the case of SublimeText, this Python.sublime-build
solved it:
{
"cmd": ["python", "-u", "$file"],
"file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
"selector": "source.python",
"encoding": "utf8",
"env": {"PYTHONIOENCODING": "utf-8", "LANG": "en_US.UTF-8"}
}
Since Python 3.7, we can use Python UTF-8 Mode, by using command line option -X utf8:
python -X utf8 testzh.py
The script testzh.py contains
print("Content-type: text/html; charset=UTF-8\n")
print("地球你好!")
To set Windows 10 Internet Service IIS as CGI Script handler,
We set Executable as this:
"C:\Program Files\Python39\python.exe" -X utf8 %s
https://i.stack.imgur.com/9ezXd.png
This works for Chinese Ideograms as expected on Browser Microsoft.Edge like this screenshot: Otherwise, error occurs.
https://i.stack.imgur.com/dZrFs.png
Please see https://docs.python.org/3/library/os.html#utf8-mode
I'm surprised this answer has not been posted here yet
Since Python 3.7 you can change the encoding of standard streams with reconfigure(): sys.stdout.reconfigure(encoding='utf-8') You can also modify how encoding errors are handled by adding an errors parameter.
https://stackoverflow.com/a/52372390/15675011
Success story sharing
sys.stdout
seems like a more pleasant way.PYTHONIOENCODING
if you are redirecting script's stdout in Python 2.