Illinois in the ICE Age

Illinois in the ICE Age Screenshot

 

Illinois in the ICE Age is a data visualization project that I made along with Jimmie Glover, Ruth Lopez and @taratc for Chicago MigraHack. The visualization is based on a data set provided by the Transactional Records Access Clearinghouse that describes the trajectory of 915 migrants detained in Illinois from the time they were taken into custody until they left ICE custody in November and December 2012.

We used Google Spreadsheets’ pivot tables feature to explore the data and develop questions and insights. This also provided a good way to double-check programmatically computed statistics. The map animation is built using D3. I sucked the data set into a local SQLite database using the Peewee ORM and wrote some Python scripts to transform the data and export it as JSON to make it easier to visualize (especially the day-by-day updates).
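The transform step might have looked something like this. This is only a sketch: the field names and facilities below are made up (the real TRAC data set uses different columns), but it shows how detention stays can be expanded into the day-by-day frames a D3 animation steps through:

```python
import json
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical stay records; the actual data set's fields differ.
records = [
    {"detainee_id": 1, "facility": "Broadview", "start": date(2012, 11, 1), "end": date(2012, 11, 3)},
    {"detainee_id": 2, "facility": "Kankakee", "start": date(2012, 11, 2), "end": date(2012, 11, 4)},
]

def daily_counts(records):
    """Expand stay records into per-day, per-facility counts (one animation frame per day)."""
    counts = defaultdict(lambda: defaultdict(int))
    for rec in records:
        day = rec["start"]
        while day <= rec["end"]:
            counts[day.isoformat()][rec["facility"]] += 1
            day += timedelta(days=1)
    # Sort the days so the front end can step through frames in order.
    return {day: dict(facilities) for day, facilities in sorted(counts.items())}

frames = daily_counts(records)
json_for_d3 = json.dumps(frames)
```

The resulting JSON maps each ISO date to the facilities holding people that day, which is a convenient shape for keying D3 updates by date.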

The project won the “Best data visualization team project” and “Audience Favorite” awards.

Floodlight

Floodlight is a web-based platform for telling community stories from people, places and organizations around the Denver metropolitan area.  The project was originally supported by The Piton Foundation and funded in part by a grant received as part of the Knight Foundation Community Information Challenge.

Role

Lead developer

Details

Since many of the stories told on Floodlight are about work being done in particular Denver, Colorado neighborhoods, authors needed to be able to tag their stories with boundary geographies (cities, neighborhoods, zip codes) or provide addresses of specific places, which are automatically geocoded and associated with boundary geographies.

For this project, I relied on the excellent spatial data support of the PostGIS database and the abstractions offered by the Django framework’s ORM to implement a data model and create scripts to easily load new boundary geographies from shapefiles and define their relationships with other boundaries.
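In production the containment test happens inside PostGIS, but the underlying idea, deciding whether a geocoded point falls inside a boundary polygon, can be illustrated with a simple ray-casting sketch (a toy version; PostGIS handles edge cases, projections and spatial indexing far more robustly):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a ray from (x, y) to the right and count edge crossings.

    An odd number of crossings means the point is inside. This is the kind of
    containment test PostGIS performs when associating a geocoded address
    with a boundary geography.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A rough square "neighborhood" boundary for illustration.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
```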

Users searching for stories in their community can use a faceted browsing interface that filters not only on address and boundary geography, but also on other taxonomies.  In order to integrate these different types of filters, I used the Solr search server.  On the front end, the Backbone framework provides an interface to the faceted browsing, and the Leaflet mapping library displays boundary geographies and story markers based on the user’s filters.

The application is powered by a RESTful API that is consumed by a Backbone-based story builder.

In addition to the web platform, the project also included a great capacity-building component.  Team members conducted “story raising” events that trained community members in different digital storytelling skills and a “story navigator” worked with community groups to discover and tell stories using the platform.


Screenshots

Floodlight home page (2014-01-15)

Floodlight “Explore” view (2014-01-15)

Story page: “Why Stories Matter: A Reflection on the Denver Camping Ban and the Conversation around Homelessness” (2014-01-15)

Making sure South migrations get run when using Django’s create_test_db()

I’ve been experimenting with using Lettuce for a project.  When not using Django’s test runner, you can use the framework’s test database hooks by calling create_test_db() (see the Django docs for create_test_db()) from a method in your terrain.  Django Full Stack Testing and BDD with Lettuce and Splinter is a great resource for seeing how to get up and running.  But I was having a terrible time: create_test_db() was throwing an exception because it tried to run the flush management command on a table that hadn’t yet been created.  According to South’s documentation, “South’s syncdb command will also apply migrations if it’s run in non-interactive mode, which includes when you’re running tests.”

While South’s syncdb command was getting executed by create_test_db(), the option that tells the command to run migrations after syncdb wasn’t getting properly set.  It turns out there is a (not very well documented) workaround: you have to call south.management.commands.patch_for_test_db_setup() before your call to create_test_db().

So, your terrain.py might look something like this:

from lettuce import before, after, world
from django.db import connection
from django.test.utils import setup_test_environment, teardown_test_environment
from south.management.commands import patch_for_test_db_setup
from splinter.browser import Browser

@before.runserver
def setup(server):
    # Force running migrations after syncdb.
    # syncdb gets run automatically by create_test_db(), and
    # South's syncdb (which runs migrations after the default
    # syncdb) normally gets called in a test environment, but
    # apparently not when calling create_test_db().
    # So, we have to use this monkey-patched version.
    patch_for_test_db_setup()
    connection.creation.create_test_db()
    setup_test_environment()
    world.browser = Browser('webdriver.firefox')

# ...

Installing numpy into a virtualenv

I ran into some problems installing numpy in a virtualenv on Ubuntu 10.10.  I’m not sure what the root cause of the problem was, but my environment is a little weird in that I have a number of different Python versions installed and virtualenvs using different versions of Python.  The setup for numpy wasn’t finding global environment configuration variables from the call to sysconfig.get_config_vars().  I ended up fixing my issues by copying the global Makefile and pyconfig.h into the virtualenv:

$ mkdir -p /home/ghing/.virtualenvs/foodgenius-analytics/local/lib/python2.7/config/
$ cp /usr/lib/python2.7/config/Makefile /home/ghing/.virtualenvs/foodgenius-analytics/local/lib/python2.7/config/
$ mkdir -p /home/ghing/.virtualenvs/foodgenius-analytics/local/include/python2.7/
$ cp /usr/include/python2.7/pyconfig.h /home/ghing/.virtualenvs/foodgenius-analytics/local/include/python2.7/

Twitter interface to CTA bus tracker

CTA Bus
CTA Bus photo by seizethedave via Flickr.

About

This is a Twitter (and hopefully, later, a plain old SMS) interface to the CTA Bus Tracker so those of us with simple mobile phones can find out information about our busses.

Ever since I moved to Chicago, I’ve been riding the Chicago Transit Authority (CTA) busses almost every day. I’ve missed my bus more than a few times too. CTA has a Bus Tracker website that works from computers and mobile devices with web browsers (there are iPhone apps as well), but my phone only does SMS.

This is a work in progress as I’m coding it mostly on the bus on the way to and from work.

Usage

In order to do anything, you have to follow @ctabt.

The best way to show how to use the system is through some examples:

Get help

d ctabt help

Get all stop names and IDs for Eastbound route 77

d ctabt 77 east stops

Get the name for Eastbound route 77 stops at Sheffield

d ctabt 77 east stops shef

Get upcoming busses for a particular stop by name

d ctabt 77 east shef

Get upcoming busses for a particular stop by ID

d ctabt 77 east 9288

You can abbreviate most parts of the commands

d ctabt 77 e s shef
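As a rough illustration of how the abbreviation handling could work (this is a hypothetical sketch, not the actual @ctabt code), each token can be expanded against a vocabulary by unique prefix match:

```python
DIRECTIONS = ["north", "south", "east", "west"]
KEYWORDS = ["stops", "help"]

def expand(token, vocabulary):
    """Expand an abbreviated token to a full word when the prefix match is unique."""
    matches = [word for word in vocabulary if word.startswith(token.lower())]
    return matches[0] if len(matches) == 1 else token

def parse_command(text):
    """Parse a message like '77 e s shef' into route, direction, and arguments."""
    tokens = text.split()
    route = tokens[0]
    direction = expand(tokens[1], DIRECTIONS)
    # Remaining tokens may be keywords ('stops') or stop name fragments ('shef').
    rest = [expand(tok, KEYWORDS) for tok in tokens[2:]]
    return route, direction, rest
```

So "77 e s shef" and "77 east stops shef" would parse to the same command.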

Related Work

  • Implementing this was made possible by Harper Reed’s awesome opening of the Bus Tracker API.
  • CTA Tweet Feed has Twitter/RSS feeds of authoritative and user-generated updates about riding the CTA.


Really, It’s Worth It

This was originally posted on the Local Fourth blog as part of my participation in a community media innovation project at the Medill School of Journalism.

You’re in the middle of a big project with tight deadlines. Parts of your infrastructure are a little, well, janky. Do you take the time to make things cleaner and more coherent, or do you focus on coding, hoping that your stack holds together until you have time to clean it up? Will a new tool pay off, or is it just a distraction from more tedious, but more crucial, work that needs to be done?

This was my situation early this week when I started looking at our procedure for deploying what we develop on our workstations to our public webhost. I was working on a bug where things that worked on our development machines weren’t working on the webhost. Instead of just setting up a new instance for testing on our webhost, I decided to invest the time in exploring a new-to-me tool called Fabric.

Fabric is a Python library and tool for scripting commands to be run on a remote server over SSH. You can store a fabfile in the root of your Django or other Python project and run tasks on the remote server. For instance, to build up a new staging instance on our webserver, I type fab staging mkinstance at my shell.  In this example, staging and mkinstance are just tasks defined as Python functions. staging is a task that sets context variables about this particular instance, such as the home directory where the other tasks will execute on the remote server. mkinstance just calls other tasks to create a new virtualenv, install required Python packages, download the most recent version of the source from git and more. I may go into the details of our Fabric configuration in the future, but, more importantly, I want to explain why I think exploring this tool to improve our deployment was a good way to spend my time early this week.
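The shape of that setup, a context-setting task plus a task that composes smaller tasks, can be sketched in plain Python. To keep this self-contained, the run() below just records the shell commands it would execute; real Fabric provides its own env object and run()/sudo() helpers, and the paths and task names here are illustrative, not our actual fabfile:

```python
# Stand-in for Fabric's pattern: env holds per-instance context,
# and tasks record the shell commands they would run remotely.
env = {"home": None, "commands": []}

def run(command):
    """Stand-in for Fabric's run(): record the remote command instead of SSHing."""
    env["commands"].append(command)

def staging():
    """Context task: point later tasks at the staging instance."""
    env["home"] = "/home/ghing/webapps/staging"

def setup_virtualenv():
    run("virtualenv %s/env" % env["home"])

def install_requirements():
    run("%s/env/bin/pip install -r requirements.txt" % env["home"])

def checkout_source():
    run("git clone <repo-url> %s/src" % env["home"])

def mkinstance():
    """Composite task: build a fresh instance by calling the smaller tasks."""
    setup_virtualenv()
    install_requirements()
    checkout_source()

# Equivalent of `fab staging mkinstance`: run the tasks in order.
staging()
mkinstance()
```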

Avoid rm -rf . “Oh #$%!(*&!!”

I would say that even savvy system administrators who know better, who will lecture their junior admins at great length about what not to do, sometimes hop in a shell to run a quick command only to carelessly wreak disaster. I recently caught myself typing commands into shells on the wrong host. There had to be a better way. Fabric modules compel me to explicitly type the action that I intend to take and the instance I want to take it on. fab staging destroyeverything has a different resonance than ssh-ing into somehost.com and typing cd /home/ghing/webapps/ && rm -rf myproject.

A deployment reasoning tool

Automating your deployment process in a script helps you break down the deployment process. What are the steps that need to be taken? What order should they be run in? What are the dependencies between the steps? Writing these things out in code helps identify redundancies or potential problems in the process.

Code as documentation

An artist who taught game development to high school students told me that she introduced students to the idea of programming by having them work in pairs and take turns writing pseudocode to instruct their partner about how to move around in a space. I’d imagine one of the big takeaways is the difference between code and natural language. The latter can make for some beautiful journalism, but it can make describing a process convoluted. After I finished our Fabric module, I wondered how I would have written free-form paragraphs to describe our deploy process, or explained it out loud. Even with few comments, it’s clear what’s happening in the deploy process. In about the same time it would take to write a step-by-step description of the process, I have one in code that can also do all the steps for me.

Further Reading

Refactoring fabfile.py for fast, robust Django deployment by Christopher Groskopf got me stoked about using Fabric, and it’s a good place to look for good practices for deploying Django in the cloud. It was, however, a little daunting as an entry point to the different tools.

Tools of the Modern Python Hacker: Virtualenv, Fabric and Pip does a good job of describing the connection between these tools and how they can be successfully used together for deployment.

Deploying Django with Fabric helped get me started in writing my Fabric file and understanding the conventions of the tool.  Some of the information on this page may be a little different with more recent versions of Fabric.

Deploying with Django’s Sites Framework on Webfaction (virtualenv, mod_wsgi, git) helped me figure out some of the non-Fabric stuff to make deployment consistent on our webhost.

Paying It Forward

This was originally posted on the Local Fourth blog as part of my participation in a community media innovation project at the Medill School of Journalism.

I’m finding the word community increasingly confusing, especially when navigating the world of hyperlocal publishing.  When someone says community, do they mean community like the city of Evanston, or the city’s West Side neighborhood, or a block club or church?  Or do they mean the community of users of a particular site? When do these groups intersect, and when are they too disparate?  The 2010 Knight News Challenge goes as far as defining a specific Community category for entries:

Community: Seeks groundbreaking technologies that support news and information specifically within defined geographic areas. This is designed to jump-start work on technologies and approaches that haven’t arrived yet. Unlike the first three categories, submissions in this area must be tested in a geographically designated community.

But, in a Sept. 20 post announcing the 2010 challenge, the poster wrote “I think of this as our io9 category,” referring to a Gawker Media-run science-fiction and popular culture site.  Perhaps the poster was referring to the future-focused voice of the site, but it also surfaces the possibility that people may increasingly identify with communities and person-to-person interactions that aren’t geographically bound.

In looking at strong, geographically disparate online communities, groups of people engaging around free/libre/open source software (FLOSS) projects are among the most compelling. While they can exhibit the same segregation or bickering as physical communities, they can also be a model for people coming together to build something that serves a clear need. The way many projects stay firmly grounded in utility, and the way similar projects sustain themselves not by competing but by understanding how their software does a job different from other projects’, is a lesson that media organizations, particularly in the hyperlocal space, would do well to learn.

FLOSS projects also complicate traditional notions of sustainability. While many projects have found ways to sustain themselves financially, through donations, sponsorship or the incredible use of volunteer time for coding, documenting and providing help and training, FLOSS projects tend to put utility ahead of commercial viability. Making technology that serves a need and remains relevant and responsive to changing needs and to feedback from users is as important to the sustainability of a project as the dollars and cents.

We make use of a lot of FLOSS in implementing the technological part of the innovation project. While there are lots of ways that we could give back for this technology that is so useful to us as developers (Palantir, a Chicago-based web development shop that specializes in sites based on the Drupal content management system, for instance, contributes code that it develops for client features back to the larger Drupal community), the tight time constraints of graduate school and a rapid project mean that dollars are the best way that I can give back to these projects.

Even though most of the tools that I use to make technology are available free of cost, paying something for them helps me think of how I value the tools for this project. I’ve decided to donate the amount of money that I spend each week on a common indulgence during this project, going out for lunch with other team members, to some of the FLOSS tools that I’ve used the most in the last few weeks.

Python – most of the code in this project is written in this language. It’s flexible, easy to learn, has a large number of useful contributed libraries and is very readable, making it easy to understand someone else’s code. Donate to the Python Software Foundation.

jQuery – If the back end of the project is written in Python, the front end is highly dependent on the jQuery JavaScript framework. jQuery makes it easier to implement some of the rich user interactions that people have come to expect on the web. Donate to the jQuery project.

Django – Django is a Python web framework that has its roots in the newsroom. The first time I used the framework, I was amazed at how it streamlined the most tedious aspects of web development. When I’m curious about how to do something in the framework, I often discover that there’s an elegant approach provided in the framework along with clear documentation. Donate to the Django Software Foundation.

Vim – I was compelled to learn to use this editor when I started at my first tech job at a regional Internet service provider. The network administrator said that it was important to learn vi (Vim, an enhanced version of the classic UNIX editor, stands for vi improved) because you could be assured that it would be available on any UNIX system that you found yourself poking around. While the navigation of the program, which is keystroke heavy, seemed unintuitive at first, once I got used to it, the lightweight but highly customizable and extensible editor felt like it was designed just for me. Rather than asking for donations to sustain the project, Vim’s lead developer solicits donations for a charity that supports children in Uganda.

Firebug – I don’t know how I wrote programs for the web before Firebug. This Firefox extension helps me understand and tweak the HTML and CSS of a design and also see what is going on behind the scenes with Javascript errors and AJAX requests. Donate to the Firebug project.

Finding duplicate records in a books to prisoners database application

High on my list of neglected tech projects is the Testament books to prisoners database web application.  This is the database program that projects like the Midwest Pages to Prisoners Project use to track packages sent and returned and books requested, in the hope of avoiding delays in delivering books to incarcerated people and of providing the metrics that grant providers like to see.

One of the design challenges has to do with duplicate records.  Recipients of books are identified by their state/federal department of correction (DOC) number (if they’re in a state or federal prison – most jails don’t use ID numbers), their state of incarceration and their name.  I assume that the database was originally designed to minimize barriers for the book project volunteers, so both the name and DOC# are free text fields.  JavaScript is used to match existing records based on the DOC#, but there is still a large potential for duplicate records.

The reason for duplicate records is that both the person writing to request books and the volunteer may list the name and/or DOC# inconsistently.  For instance, the state may store the DOC# in its database as A-123456, but the incarcerated person may write it as A123456, A-123-456 or just 123456.  Volunteers who don’t know about this and aren’t careful may not check beforehand for an existing record.

This is probably preventable through more sophisticated validation, but we still need a way to find duplicates in the existing records.  As this application is written in the Django framework, I want to try to use the Django API to find matches.

At first thought, it seems like I will have to iterate through each inmate record and check if there is a duplicate record.  This seems pretty slow, but I can’t think of a better way to do this.  At this point, there aren’t so many records that this approach will fail, but it would be nice to do something slicker.

The other problem is how to match a duplicate.  One approach might be to build a regexp for the DOC# (for instance, match either the first character or omit it, allow dashes or spaces between all characters, …) and then use the iregex field lookup to try to find matches. One challenge with this is that the current Testament codebase is using Django 0.97 (I think), and iregex is only available starting in 1.0.  Maybe it’s time we updated our code anyway.
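Here is a sketch of that pattern-building idea in plain Python; with Django you would hand the same pattern to an iregex filter (something like Inmate.objects.filter(doc_number__iregex=pattern), where the model and field names are assumptions, not Testament’s actual schema):

```python
import re

def doc_number_pattern(doc_number):
    """Build a regex matching loose writings of a canonical DOC number.

    For "A-123456" the pattern matches "A123456", "A-123-456", and plain
    "123456": the leading letter is optional, and dashes or spaces may
    appear between any characters.
    """
    chars = [c for c in doc_number if c.isalnum()]
    sep = r"[-\s]*"  # allow any run of dashes/spaces between characters
    pattern = sep.join(re.escape(c) for c in chars[1:])
    if chars[0].isalpha():
        # The leading letter (and any separator after it) is optional.
        pattern = "(%s%s)?%s" % (re.escape(chars[0]), sep, pattern)
    else:
        pattern = re.escape(chars[0]) + sep + pattern
    return "^%s$" % pattern

pattern = doc_number_pattern("A-123456")
```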

There is also the Python difflib module that can compute deltas between strings.  However, it seems like this would slow things down even further because you would have to load each inmate object and then use difflib to compare the DOC#s.  I assume that the previous approach would be faster because the regexp matching happens at the database level.
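For comparison, a difflib-based check might look like the sketch below. SequenceMatcher.ratio() returns a similarity score between 0 and 1, so candidate duplicates are pairs scoring above some threshold (the 0.8 here is an arbitrary choice):

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

def likely_duplicates(doc_numbers, threshold=0.8):
    """Return pairs of DOC numbers similar enough to flag for human review."""
    return [
        (a, b)
        for a, b in combinations(doc_numbers, 2)
        if similarity(a, b) >= threshold
    ]

pairs = likely_duplicates(["A-123456", "A123456", "B-654321"])
```

This comparison happens in Python rather than in the database, which is why it would be slower than the regexp approach for a large table.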