
Python Character Decoding FTP Files

As part of my Air Canary project, I'm downloading and parsing air quality data from the EPA's Air Now FTP service. Pretty simple stuff, right? The problem that I encountered had to do with character encoding and how to properly convert for storage in a Postgres database.

I first noticed the issue as I populated the database with air monitoring sites using a file that I downloaded via FTP. The download code is pretty simple:

from ftplib import FTP

# Connect, log in, and change to the directory that holds the data files
ftp = FTP('ftp.airnowapi.org', ftp_user, ftp_password)
ftp.cwd(remote_dir)

remote_file = 'monitoring_site_locations.dat'
local_file = '/tmp/' + remote_file

# Retrieve the file in binary mode, writing it straight to the local path
ftp_cmd = 'RETR ' + remote_file
ftp.retrbinary(ftp_cmd, open(local_file, 'wb').write)
ftp.quit()

To load the file into the database, I read the file, loop over the lines, parse each line and put the values into the database via SQLAlchemy.
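The loader is basically a read-loop-insert pipeline. Here is a rough sketch of it (the model definition, column index, delimiter, and connection string are stand-ins for illustration, not the exact project code):

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class AirNowMonitoringSite(Base):
    __tablename__ = 'air_now_monitoring_site'
    id = Column(Integer, primary_key=True)
    site_name = Column(String)

engine = create_engine('postgresql://user:password@localhost/air_canary')
Session = sessionmaker(bind=engine)
session = Session()

with open(local_file) as f:
    for line in f:
        # Split the pipe-delimited line and pull out the fields we need
        fields = line.rstrip('\r\n').split('|')
        # Which field holds the site name is a guess for this sketch
        session.add(AirNowMonitoringSite(site_name=fields[3]))

session.commit()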

When I tried to run the load process, I hit an error almost immediately:

sqlalchemy.exc.DataError: (DataError) invalid byte
  sequence for encoding "UTF8": 0x90
 'INSERT INTO
    air_now_monitoring_site (site_name) 
    VALUES (%(site_name)s)'
    {'site_name': 'PARC OC\x90ANIE'}

What happened?

By default, Python 2 reads the file as raw bytes and assumes ASCII when it needs to interpret them. The non-ASCII byte 0x90 passes straight through, and the resulting byte sequence cannot be converted to UTF-8 before it is inserted into the database.
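The failure is easy to reproduce in a Python 2 shell:

>>> '\x90'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 0: invalid start byte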

A quick look at the file on the server in VIM shows the following:

PARC OC<90>ANIE

And if I check the encoding using the Linux file command, I get:

$ file /tmp/monitoring_site_locations.dat
/tmp/monitoring_site_locations.dat: Non-ISO extended-ASCII English text, with CRLF line terminators

No help at all. So what is this file encoded as?

This actually required some guesswork. At first, I guessed that it was iso-8859-1, since that seems like a fairly standard encoding. And, at first, that seemed to work. I was able to process the file and everything seemed fine, at least until I checked out the data in the database. Here is what I found:

PARC OC\u0090ANIE

Wonderful - what now?
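In hindsight, the reason is that ISO-8859-1 maps every byte directly to the Unicode code point with the same numeric value, so the stray byte 0x90 decodes without complaint into the invisible control character U+0090 instead of raising an error:

>>> 'PARC OC\x90ANIE'.decode('iso-8859-1')
u'PARC OC\x90ANIE'

The decode "succeeds", which is exactly why the load ran cleanly while the database ended up with garbage.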

I was viewing the source file in VIM, so I became curious about how to set the character encoding in VIM itself so I could see the correct characters. Luckily, I stumbled on this post that gave some tips but also alerted me to a different encoding: CP850.

To set an encoding in VIM, the following command can be used:

:e ++enc=[encoding]

After playing around with the set encoding command in VIM, I set it to CP850 and the following character appeared:

--

Almost there, right? What I was actually looking for was this:

--

Unsure whether this was simply an artifact of how the text was displayed in my terminal, I re-ran the load script, adding decoding to the file processing:

content = []
with open(filepath) as f:
    for line in f:
        # Decode from CP850, then re-encode as UTF-8 for storage
        content.append(line.decode('cp850').encode('utf-8'))

I then pulled up the website, with the UTF-8 encoding correctly set, and saw the correct string in the browser.
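The fix can also be verified directly in the interpreter (the printed output assumes a UTF-8 terminal):

>>> 'PARC OC\x90ANIE'.decode('cp850')
u'PARC OC\xc9ANIE'
>>> print 'PARC OC\x90ANIE'.decode('cp850')
PARC OCÉANIE
>>> 'PARC OC\x90ANIE'.decode('cp850').encode('utf-8')
'PARC OC\xc3\x89ANIE'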

[TO BE CONTINUED]

