Python3: Reading file containing Unicode characters... How?
by rnturn from LinuxQuestions.org on (#537YF)
I ran into a problem today while trying to read from a file that turned out to have accented characters in it. (In this case an umlauted "o" and an accented "e").
I have a little function (Thank you, O'Reilly) which seems like it'd work for my needs. The trouble is that the script is actually aborting on the reads from the file before the function gets a crack at translating anything:
Code:
with open(datafile, 'r') as input:
    for record in input:    # <--<< Error occurs here
        if string1 in record:
            pass  # process it...
        if string2 in record:
            pass  # etc.

Since the variable "record" is never assigned anything, I can never get to any place where I can invoke that hopefully handy function.
I've tried an alternate means of reading the records from the file:
Code:
with open(inf_file, 'r') as inf:
    records = (record.strip() for record in inf)
    for raw_rec in records:    # <--<< Now error occurs here
        pass  # process raw_rec...

This gets me past the assignment from the records on disk, but now everything blows up when I try to assign any data to "raw_rec".
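A possible reason the failure only moves from one "for" line to the other: in text mode, open() does not decode anything up front, and a generator expression is lazy too, so the bytes are only decoded once the loop pulls a record. The sketch below assumes the error is a UnicodeDecodeError (the post does not quote the message) and forces it with an ASCII codec; where_does_it_fail() and demo() are hypothetical names used only for illustration.

```python
import os
import tempfile


def where_does_it_fail(path, encoding="ascii"):
    """Report which step raises when the file holds bytes the codec rejects."""
    inf = open(path, "r", encoding=encoding)  # nothing is decoded yet
    try:
        # Building the generator does not read from the file either.
        records = (record.strip() for record in inf)
        try:
            for raw_rec in records:  # decoding happens here, record by record
                pass
        except UnicodeDecodeError:
            return "for loop"
        return "no error"
    finally:
        inf.close()


def demo():
    # Sample data: "café" encoded as UTF-8, which is not valid ASCII.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write("caf\u00e9\n".encode("utf-8"))
        path = tmp.name
    try:
        return where_does_it_fail(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

If that assumption holds, restructuring the loop will not help; the encoding passed to open() is what matters.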
What's the correct, Pythonesque way to read individual records from a file that may contain a Unicode character here and there? These cases are likely going to be few and far between, but I'd sure like to make this as generic and flexible as possible.
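One commonly used approach, sketched here under the assumption that the data file is UTF-8 (the post does not say what its encoding actually is), is to pass an explicit encoding to open() instead of relying on the platform default. read_records() and demo() are hypothetical helper names; the errors parameter is the standard one accepted by open().

```python
import os
import tempfile


def read_records(path, encoding="utf-8", errors="strict"):
    """Yield each line of a text file, decoded with the given encoding and stripped."""
    with open(path, "r", encoding=encoding, errors=errors) as inf:
        for record in inf:
            yield record.strip()


def demo():
    # Write a small sample file containing an umlauted "o" and an accented "e".
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", suffix=".txt", delete=False
    ) as tmp:
        tmp.write("Schr\u00f6dinger\ncaf\u00e9\n")
        path = tmp.name
    try:
        return list(read_records(path))
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Passing errors="replace" instead of the default "strict" would keep the loop running on undecodable bytes, at the cost of substituting a replacement character for them.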
Note: I've tried opening the file using 'rb' but I'm still getting stuck on that "for" construct when assigning anything to "record" or "raw_rec".
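With 'rb', each record comes back as bytes rather than str, so it has to be decoded before it can be compared against str values such as string1. A minimal sketch of decoding per record follows; the candidate encodings (UTF-8 first, then Latin-1) are an assumption, and decode_record() and iter_binary_records() are hypothetical names.

```python
def decode_record(raw, encodings=("utf-8", "latin-1")):
    """Decode one bytes record, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding).strip()
        except UnicodeDecodeError:
            continue
    # No candidate worked: fall back to the first one with replacement characters.
    return raw.decode(encodings[0], errors="replace").strip()


def iter_binary_records(path):
    """Read a file in binary mode and yield each line as a decoded, stripped str."""
    with open(path, "rb") as inf:
        for raw in inf:
            yield decode_record(raw)
```

Because Latin-1 maps every byte to a character, it never raises, so it only makes sense as the last candidate; it can silently produce the wrong characters if the file is really in some other encoding.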
Any hints as to a way out of this dilemma? (Still digging through my local references for clues. Nothing so far.)
Python is pretty nice but dealing with the labyrinth of methods for just trying to read data out of a file -- especially when reading one record at a time -- can be a real headache. This script had been working just fine until Unicode raised its ugly head. :^(
TIA...

