I want to parse my XML document. So I have stored my XML document as below
class XMLdocs(db.Expando):
id = db.IntegerProperty()
name=db.StringProperty()
content=db.BlobProperty()
Now my below is my code
parser = make_parser()
curHandler = BasketBallHandler()
parser.setContentHandler(curHandler)
for q in XMLdocs.all():
parser.parse(StringIO.StringIO(q.content))
I am getting below error
'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 517, in __call__
handler.post(*groups)
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/base_handler.py", line 59, in post
self.handle()
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 168, in handle
scan_aborted = not self.process_entity(entity, ctx)
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 233, in process_entity
handler(entity)
File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 71, in process
parser.parse(StringIO.StringIO(q.content))
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 136, in characters
print ch
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
print
. Don't use print in a WSGI app!
The actual best answer for this problem depends on your environment, specifically what encoding your terminal expects.
The quickest one-line solution is to encode everything you print to ASCII, which your terminal is almost certain to accept, while discarding characters that you cannot print:
print ch #fails
print ch.encode('ascii', 'ignore')
The better solution is to change your terminal's encoding to utf-8, and encode everything as utf-8 before printing. You should get in the habit of thinking about your unicode encoding EVERY time you print or read a string.
Just putting .encode('utf-8')
at the end of object will do the job in recent versions of Python.
3.x
, or also 2.7
?
It seems you are hitting a UTF-8 byte order mark (BOM). Try using this unicode string with BOM extracted out:
import codecs
content = unicode(q.content.strip(codecs.BOM_UTF8), 'utf-8')
parser.parse(StringIO.StringIO(content))
I used strip
instead of lstrip
because in your case you had multiple occurences of BOM, possibly due to concatenated file contents.
s
which produces the error with s = unicode(s.strip(codecs.BOM_UTF8), 'utf-8')
. s
refers to the name of your strings.
lstrip
with strip
.
This worked for me:
from django.utils.encoding import smart_str
content = smart_str(content)
The problem according to your traceback is the print
statement on line 136 of parseXML.py
. Unfortunately you didn't see fit to post that part of your code, but I'm going to guess it is just there for debugging. If you change it to:
print repr(ch)
then you should at least see what you are trying to print.
The problem is that you're trying to print an unicode character to a possibly non-unicode terminal. You need to encode it with the 'replace
option before printing it, e.g. print ch.encode(sys.stdout.encoding, 'replace')
.
An easy solution to overcome this problem is to set your default encoding to utf8. Follow is an example
import sys
reload(sys)
sys.setdefaultencoding('utf8')
ascii
to remain the default. It is why setdefaultencoding
is not normally available without the reload
trick.
Success story sharing