Finding duplicate records in a books to prisoners database application

High on my list of neglected tech projects is the Testament books to prisoners database web application.  This is the database program that projects like the Midwest Pages to Prisoners Project use to track packages sent and returned and books requested, in the hopes of avoiding delays in delivering books to incarcerated people and providing the metrics that grant providers like.

One of the design challenges has to do with duplicate records.  Recipients of books are identified by their state/federal department of correction (DOC) number (if they’re in a state or federal prison – most jails don’t use ID numbers), their state of incarceration and their name.  I assume that the database was designed originally to minimize barriers for the book project volunteers, so both the name and DOC# are free text fields.  JavaScript is used to match existing records based on the DOC#, but there is still a large possibility for duplicate records.

The reason for duplicate records is that both the person writing to request books and the volunteer entering the request may write the name and/or DOC# inconsistently.  For instance, the state may store the DOC# in its database as A-123456, but the incarcerated person may write it as A123456, A-123-456, or just 123456.  Volunteers who don’t know about this and aren’t careful may not check beforehand for an existing record.

This is probably preventable through more sophisticated validation, but we still need a way to find duplicates in the existing records.  As this application is written in the Django framework, I want to try to use the Django API to find matches.

At first thought, it seems like I will have to iterate through each inmate record and check if there is a duplicate record.  This seems pretty slow, but I can’t think of a better way to do this.  At this point, there aren’t so many records that this approach will fail, but it would be nice to do something slicker.
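One possibly slicker option is to reduce each DOC# to a normalized key and group records by that key in a single pass, instead of comparing every record against every other.  Here is a minimal sketch in plain Python.  The record tuples and the function names are made up for illustration, and it assumes that the digits are what identify a person within a state and that prefixes, dashes, and spaces are just formatting noise, which may not hold for every state:

```python
import re
from collections import defaultdict


def normalize_doc_number(raw):
    """Reduce a free-text DOC# to its digits only.

    Assumption: the letter prefix, dashes, and spaces are formatting
    noise and the digits alone identify the person within a state.
    """
    return re.sub(r"\D", "", raw)


def group_possible_duplicates(records):
    """Group (record_id, state, doc_number) tuples by (state, digits).

    Returns only the groups with more than one record, i.e. the
    candidates that a volunteer should review by hand.
    """
    groups = defaultdict(list)
    for record_id, state, doc_number in records:
        key = (state, normalize_doc_number(doc_number))
        groups[key].append(record_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical records: (record_id, state, doc_number as typed).
records = [
    (1, "IN", "A-123456"),
    (2, "IN", "A123456"),
    (3, "IN", "A-123-456"),
    (4, "IN", "123456"),
    (5, "IN", "B-987654"),
    (6, "OH", "123456"),
]
candidates = group_possible_duplicates(records)
```

Records with no digits at all would end up sharing an empty key, so those would need to be handled separately (for example, by falling back to the name).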

The other problem is how to match a duplicate.  One approach might be to build a regexp for the DOC# (for instance, match either the first character or omit it, allow dashes or spaces between all characters, …) and then use the iregex field lookup to try to find matches.  One challenge with this is that the current Testament codebase is using Django 0.97 (I think) and iregex is only available starting in 1.0.  Maybe it’s time we updated our code anyway.
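A rough sketch of what building such a regexp could look like follows.  The function name is mine, and the rules it encodes (an optional leading letter, optional dashes or spaces between characters) are only the ones mentioned above, not a complete description of how DOC#s vary:

```python
import re


def doc_number_pattern(doc_number):
    """Build a regexp that matches loose spellings of one DOC#.

    Rules (assumptions taken from the examples in this post):
    - a leading letter may be present or omitted
    - dashes or spaces may appear between any two characters
    """
    chars = [c for c in doc_number if c.isalnum()]
    if not chars:
        raise ValueError("DOC# has no letters or digits")
    separator = "[- ]*"
    parts = [re.escape(c) for c in chars]
    if chars[0].isalpha():
        # Make the leading letter (and any separator after it) optional.
        head = "(?:" + parts[0] + separator + ")?"
        return "^" + head + separator.join(parts[1:]) + "$"
    return "^" + separator.join(parts) + "$"


pattern = doc_number_pattern("A-123456")

# With Django 1.0 or later, the pattern could be handed to the database.
# "Inmate" and "doc_number" are hypothetical model and field names:
#   Inmate.objects.filter(doc_number__iregex=pattern)
```

The pattern sticks to a plain character class (`[- ]`) rather than `\s` because regexp dialects differ between database backends.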

There is also the Python difflib module that can compute deltas between strings.  However, it seems like this would slow things down even further because you would have to load each inmate object and then use difflib to compare the DOC#s.  I assume that the previous approach would be faster because the regexp matching happens at the database level.
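For comparison, here is a small sketch of the difflib approach using `difflib.SequenceMatcher`.  The 0.8 threshold is an arbitrary assumption that would need tuning against real data, and the nested loop is exactly the slow every-pair comparison described above:

```python
from difflib import SequenceMatcher


def similarity(first, second):
    """Return a 0.0-1.0 similarity ratio, ignoring case."""
    return SequenceMatcher(None, first.lower(), second.lower()).ratio()


def similar_pairs(values, threshold=0.8):
    """Compare every pair of strings and keep the ones that look alike.

    The threshold is a guess, not a tested value.  The number of
    comparisons grows with the square of the number of records.
    """
    pairs = []
    for index, first in enumerate(values):
        for second in values[index + 1:]:
            score = similarity(first, second)
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


matches = similar_pairs(["A-123456", "A123456", "B-987654"])
```

Unlike the regexp, this would also catch typos such as transposed digits, but it can just as easily flag two different people whose numbers happen to be close.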

Importing relationships into CiviCRM

As part of my work at the Center for Research Libraries, I am investigating different Constituent Relationship Management (CRM) systems.  One of the options is CiviCRM, a popular FLOSS CRM.  As CRL is, in large part, a membership organization, I wanted to see if it was possible to represent the basic information that we keep about our member organizations in the CRM.  I found that data entry through the web interface was pretty slow, so I wanted to experiment with CiviCRM’s contact import capabilities.

CiviCRM lets you define multiple, arbitrary relationships between contacts. This is how we can connect individual contacts with their institution (for instance the Librarian Councillor or Purchase Proposal Representative) or organizational sub-units (a particular library branch) with the parent organization.

Here is an example of part of our paper member information form that shows the sort of information that we collect about a member institution:

Screenshot of CRL's member information form

CiviCRM also lets you import contact information and relationship information through comma separated value (CSV) files. However, there are a number of things that need to be configured in order to get this working properly.

Need to have contact types configured correctly for the relationship

This is configured at Administer > Options List > Relationship Types

When you create a new relationship type, it sets Contact Type A/Contact Type B to any contact type. This works fine if you are defining relationships within CiviCRM’s web interface, but doesn’t work well when importing contacts. This is because CiviCRM will not be able to correctly match the related contact if the contact type is not explicitly set.

In the case of our “Librarian Councillor of” relationship, Contact A is an Individual (the member organization librarian) and Contact B is an Organization (the member organization):

Configuring a relationship in CiviCRM

Need to update strict matching rules for individuals

CiviCRM has configurable matching criteria for identifying and merging existing duplicate contacts and for updating existing contacts based on import data. This feature is documented in the CiviCRM documentation page Find and Merge Duplicate Contacts.

The matching criteria can be configured at Administer > Manage > Find and Merge Duplicate Contacts. By default CiviCRM defines Strict and Fuzzy rules for each contact type. CiviCRM uses the strict rule when importing contact data. However, the default rules might not fit the data that you have. For instance, by default, the strict rule for matching individuals puts all the weight on e-mail address. For many of the contacts, however, there is not an e-mail address. So, I had to update the Strict rule for Individual contacts to also match on First Name, Last Name, and Phone Number. Note that I set the weight so that all three values must match for CiviCRM to consider the contact a duplicate:

Configuring the duplicate matching rules in CiviCRM

If you don’t configure these rules correctly, you will get duplicate entries when you try to import your contact relationships.

Need to only have one relationship per CSV import file

This is one of the most confusing aspects of the relationship import process. Initially, I tried to put all the relationships in the same CSV file that I used to import the individual contact:

First Name,Middle Name,Last Name,Job Title,Individual Prefix,Individual Suffix,Street Address,Supplemental Address 1,Supplemental Address 2,City,Postal Code Suffix,Postal Code,Address Name,County,State,Country,Phone,Email,Note(s),Employee Of, Librarian Councillor of
Jane,,Doe,Head Librarian,,,123 Fake St.,,,Springfield,,12345,,,Illinois,,123-456-7890,jane.doe@sample.edu,,Sample University, Sample University

That is, in the last 2 columns, I specify that the individual contact (Jane Doe) is an Employee of and the Librarian Councillor of Sample University.

This doesn’t work! I can only specify a single Individual -> Organization relationship in each CSV file. So, I need to break out the Librarian Councillor of relationship into a separate CSV file:

individual_import.csv:

First Name,Middle Name,Last Name,Job Title,Individual Prefix,Individual Suffix,Street Address,Supplemental Address 1,Supplemental Address 2,City,Postal Code Suffix,Postal Code,Address Name,County,State,Country,Phone,Email,Note(s),Employee Of
Jane,,Doe,Head Librarian,,,123 Fake St.,,,Springfield,,12345,,,Illinois,,123-456-7890,jane.doe@sample.edu,,Sample University

librarian_councillor_import.csv:

First Name,Middle Name,Last Name,E-mail,Phone,Librarian Councillor for
Jane,,Doe,jane.doe@sample.edu,,Sample University

I will first import the contact CSV (individual_import.csv), then the relationship CSV (librarian_councillor_import.csv).
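Producing the per-relationship files by hand gets tedious, so here is a minimal sketch of splitting one combined spreadsheet export into a separate CSV per relationship with Python’s csv module.  The function name, the choice of matching columns, and the second sample row (John Roe) are made up for illustration, and the column headers would need to agree with whatever field mappings you set up in the import:

```python
import csv
import io

# Columns kept in every output so the matching rule can find the
# existing contact (an assumption based on the rules described above).
MATCH_COLUMNS = ["First Name", "Last Name", "Email", "Phone"]


def split_relationships(master_csv_text, relationship_columns):
    """Return {relationship column: CSV text}, one relationship per file."""
    rows = list(csv.DictReader(io.StringIO(master_csv_text)))
    outputs = {}
    for column in relationship_columns:
        fieldnames = MATCH_COLUMNS + [column]
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        for row in rows:
            if row[column]:  # skip contacts without this relationship
                writer.writerow({name: row[name] for name in fieldnames})
        outputs[column] = buffer.getvalue()
    return outputs


# Made-up sample data in the shape of the combined file shown earlier.
master = (
    "First Name,Last Name,Email,Phone,Employee Of,Librarian Councillor of\n"
    "Jane,Doe,jane.doe@sample.edu,123-456-7890,Sample University,Sample University\n"
    "John,Roe,john.roe@sample.edu,,Sample University,\n"
)
files = split_relationships(master, ["Employee Of", "Librarian Councillor of"])
```

Each value in `files` could then be written to its own file and imported separately, after the contacts themselves have been imported.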

Need to include fields in CSV so that matching rules will work

Note that in the above example, I have to be sure to include enough information for our matching rules that I defined before to match Jane Doe to her existing database entry. So, I need to have either an e-mail address or First Name, Last Name, and Phone number.

Need to tell import process how to handle duplicate contacts

When importing the relationships, we will already have imported the individual contact information. So, we just want to update the existing individual contact record to reflect their relationship with their organization. So, we need to set the For Duplicate Contacts option of the import settings to Update.

Configuring CiviCRM import settings

Need to set up relationship import field mappings correctly

The field import mapping setting that I needed for the relationship import file (in this example librarian_councillor_import.csv) wasn’t immediately obvious to me. Here is a screenshot of the configuration that worked:

Configuring import field mappings in CiviCRM

Note that the Librarian Councillor for field in the CSV is mapped to the Librarian Councillor of relationship (that I defined at Administer > Options List > Relationship Types) and that the option of this mapping is set to Organization Name so that it will try to relate the imported contact to the existing organization contact record with the name specified in the CSV file.

Summary

So, it is possible to import both individual and organizational contacts into CiviCRM as well as the relationships between them. However, this could be tedious because each relationship type must be imported in a separate file. One possible solution would be to have a master spreadsheet that is used to input contact and relationship data. Then the spreadsheet program’s filters/macros could be used to export appropriate CSV files for importing the contacts and relationships into CiviCRM. The import process is still somewhat complicated, so it seems best to have systems staff assist with an initial mass import and then have future contacts input manually through the web interface.

nonlinear digital music narratives

As the start of my graduate program grows nearer, I feel like I need to talk and think about what I want to do with journalism more concretely.  Last night in conversation, I mentioned that I was interested in exploring how the web and other new media could tell stories outside of the linear narrative structure of a news article or a video documentary.  How does the producer’s or audience’s bias get subverted when the audience can pick multiple paths through the narrative?  Unfortunately, I couldn’t think of any examples of this, but Josh referenced Hyperfiction as a way that this is done with creative writing and how it creates a different, intensely immersive experience for the reader.

Working on a mix tape lately, and getting my record player working again, I’ve been thinking about the linear path through which albums or mixes are constructed.  Sometimes this can be narrative like a concept album, for example Prince Paul’s A Prince Among Thieves, or it could be more subtle in a record like Springsteen’s Nebraska.  While you could certainly play songs on an LP or CD in a different order, digital audio files make this even easier.  Unfortunately, it seems like much of the focus on the benefits of digital audio has been with regards to distribution instead of the possibilities for constructing sets of connected songs with multiple paths through them.  I often read something years later that makes me re-think a Defiance, Ohio song or the songs in relation to each other.  Also, the Allied Media Conference’s recent call for track proposals has made me think about grouping and connecting information.  I also think of the recommended EQ diagram in the In Utero liner notes.  I think it would be pretty cool to release an “album” of songs digitally, with separate recommended orders of the songs and liner notes that describe the different paths through the songs.

Photo by Great Beyond via Flickr.

Is the web suburban?

I’ve been reading Suburban Nation (thanks Sherri) and it made me look at new media ecologies with a city planning eye.  I wonder, is the web suburban?  Do we have memorable, open commons or digital cul-de-sacs?  Certainly, on many government sites like the Illinois Tollway site, I quickly feel like I’m getting lost or running into dead ends.  Are online retailers generally becoming more like mom and pop stores or big boxes?  While the architecture of the web certainly allows for multiple routes through and across sites (akin to traditional city streets), typical navigation structures tend to be tree-like, which seems more like the disorienting and disconnected street patterns of subdivisions.  Are big infrastructure centers like Google or Amazon like suburban collector roads?  Is this city planning model even a capable metaphor for thinking about information ecologies?

Subdivision photo by futureatlas.com via Flickr.

Masculinity and Sexual Assault Awareness Month

This is a first draft of an op-ed for a group called ManUp! that I’m working with in Bloomington.  I’d appreciate any comments or feedback:

April, Sexual Assault Awareness Month, makes me tired.  I am tired of seeing women that I respect and care about exhausted as they do the important but extremely difficult work of supporting survivors of rape, sexual assault, and domestic violence and working to raise consciousness which might prevent future violence.  Many of these remarkable friends have experienced violence in their lives and started doing the work that they do because of the lack of support that they experienced.  Their efforts are remarkable and brave yet ultimately they shoulder the weight of their pasts as well as the weight of the survivors that they support and the confused, indifferent, or even hostile voices they encounter doing prevention work.

I am tired of feeling trapped in the same tired discussion in the rare cases that men’s violence and men’s violence against women comes to the surface, whether it is in the lives of celebrities such as Chris Brown or the lives of men in my social circles.  I can try to excuse the violence, weakly dismissing it as stress or substance abuse or as an isolated incident.  Or, I can pat myself on the back, satisfied that at least I am not one of “those men” who chooses to be violent, to harass and intimidate those passing by on the street, who touches someone’s body without their permission, who pressures someone to consume too much alcohol or drugs in the hope of getting lucky, or who seeks to belittle and control intimate partners. In either case, I can’t find the imagination to think of a world where perpetrating and experiencing violence is not a part of manhood – mine, my friends’, or Chris Brown’s.

I am tired of a man’s strength being defined by his ability to suppress painful experiences and to downplay the experiences of others rather than crying out and reaching out and working in the hopes that others might be spared those painful experiences.   I am tired of the gentleman’s agreement that we will not speak of our fear of violence from other men or the fear of the violence we have committed or might commit.

Finally, I am tired of the myth that violence against women doesn’t matter to men and that it is not men’s work to end this violence.  It is a myth that I have found comfort in because it excuses my own inaction.  If this myth rings true to me or other men, I fear it is only because we have spent so much energy convincing ourselves that it is true.   When I think of all the effort spent changing the subject to avoid seeming vulnerable, laughing along or remaining silent when a friend tells a cruel, demeaning joke, or convincing myself that it’s not my place to say or do something when I witness or hear about violence, it seems like such a waste.  All that energy could have gone into dealing with the violence that men have witnessed or experienced in our lives to make sure that we don’t repeat it.  It could go to defining manhood by our best, most noble qualities instead of the worst of our choices.  It could go towards working earnestly as allies with women to prevent violence that hurts us all.  Sexual Assault Awareness Month is not just a chance to be aware that violence is terrible, that it happens too frequently, or even that it hurts both women and men.  It is also an opportunity to be aware that we can make a different, less violent world.

Graffiti Panic

This is a letter to the editor that I just submitted in response to an editorial in today’s H-T, Graffiti not art; it is vandalism:

I was disappointed by today’s editorial condemning graffiti.  Rather than fostering a nuanced and frank dialog about complicated issues like the state of public and private spaces in Bloomington, the editorial’s intention seemed only to attempt to induce panic.  Why even mention the specter of gang violence when the police department confirms that graffiti in Bloomington has no relation to such violence?  Furthermore, I am disappointed by the brief mention of the “broken windows theory”  and other studies outside of the context of a broader body of research.  This theory, like many sociological theories, is still being widely debated.  For instance, one study by researchers Robert J. Sampson of Harvard University and Stephen W. Raudenbush of the University of Michigan suggests that rather than being inherently problematic to the well-being of a neighborhood, graffiti (among other things) invokes deep-rooted anxieties and prejudices that people have about changing class and race dynamics of a community.  Ultimately, I am far more concerned about the high costs of renting spaces, barriers to starting businesses, and difficulty finding employment in Bloomington.  If we do not address these factors, graffiti may be the only way that many can participate in Bloomington’s downtown.

Register Now for the Allied Media Conference!

Friends!

I’m working on the How-To Track of the Allied Media Conference this year.
There are already a lot of exciting things in the works for it.   I’ve been working on a hands-on project to refurbish and build media workstations using Free/Libre/Open Source software during the conference.  I’ve also been talking about developing a session to use mobile technologies to mobilize people to act quickly to do things like respond to foreclosure evictions.  Finally, I’m always excited about how the AMC respects and prioritizes even the youngest participants with the kids track.

This year’s AMC is going to blow your mind.
I’m writing to ask you to register early, at the $100 level, and help us organize the best conference possible.
I also encourage you to propose a session, whether or not you can lead it.  What do you want to see happen at this year’s AMC?

Register here: http://alliedmediaconference.org/register
Propose a session here: http://alliedmediaconference.org/propose
Read the vision statement for the 2009 AMC, We Are Ready Now, here: http://alliedmediaconference.org/about/mission_vision

I hope to see you in Detroit,

Geoff

Holla at your representatives that you don’t want to fund abstinence-only sex ed.

Act at http://capwiz.com/advofy/utr/2/?a=12162006&i=92217564&c=

This is what I wrote:

I am very concerned about the safety, health, and happiness of youth in Indiana and across the nation.  So, I am writing to ask you to end funding for ineffective abstinence-only-until-marriage education programs including:

* Title V Abstinence Education program, Section 510 of the Social Security Act – (state formula grants), funded at $50 million

* Community-Based Abstinence Education under Title XI of the Social Security Act – (direct grants), funded at $116 million

* Adolescent Family Life Act (Title XX of the Public Health Service Act) abstinence-only grants, funded at $13 million

GRAND TOTAL: $179 million per year

It is my hope that by de-funding programs that don’t work, we can provide support that will help youth in Indiana, and across the U.S., be safer, healthier, and equipped to make the best choices in their lives.

I know that my local school district has an abstinence-based curriculum and not an abstinence-only sex education curriculum.  However, the pressure of funding programs that do not fully discuss contraception and STI prevention or acknowledge the reality that youth in Indiana (and around the US) are sexually active, regardless of whether this is the best choice or not, means that many youth in my community do not have the information they need to be safe and healthy and to encourage their peers to make safe, healthy life choices.

I have first-hand experience working as a volunteer doing presentations about healthy relationships, sexual assault, and domestic violence in Bloomington-area middle schools and high schools.  I have found that, because of the local school district’s and Indiana’s emphasis on abstinence and reluctance to talk about even the biological mechanics of sex, many youth lack the basic information they need to participate in a comprehensive discussion about preventing sexual assault and relationship violence.

This is just one local and personal example of how non-comprehensive, abstinence-only-until-marriage sex education is failing to make Hoosier youth safe and healthy.  However, there is ample additional evidence of the dangerous shortcomings of such approaches.

Here are the facts:

• In spite of their receiving over 1.5 billion dollars in federal funds since 1996, not a single, sound study has shown these programs to have a beneficial impact on young people’s behavior.

• Recent studies show these programs can create harm by undermining contraceptive use when young people in abstinence-only-until-marriage education become sexually active.  In one study, abstinence-only-until-marriage program participants were one-third less likely to use contraception when they did have sex compared to students not receiving the restrictive abstinence-only education. Nationally, over 60% of young people will have had sex before graduating from high school.

•  Over 135 national organizations, including the country’s major medical organizations like the American Medical Association and the American Academy of Pediatrics, belong to the National Coalition to Support Sexuality Education and strongly believe in teaching young people both abstinence and contraception.

I know that issues around sex and youth can be controversial, but I believe that I stand with the majority of Americans who want comprehensive sex education for their young people.  A 2004 survey by National Public Radio/Kaiser Family Foundation /Harvard University Kennedy School of Government found that 86% of voters want young people to receive a comprehensive approach to sex education that includes teaching about both abstinence and contraception.

By voting to end the 179 million dollars per year in funding for these failed programs, you will be sending a clear message that you support science and common sense.

Both fiscally and in terms of public health, we cannot afford to continue funding this unproven, dangerous approach. Young people’s health and lives are at risk.  We urge you to side with public health, with the medical community, with parents, young people and teachers and oppose any new funding for the abstinence-only-until-marriage programs.

Core technologies/concepts for community organizing

Last summer at the AMC, I presented a session about Web 2.0 and social movements.  Because I inherited the session from someone else, I kept the session proposer’s rubric of introducing technologies/services by name  (Twitter, Jott, del.icio.us) so that people would be able to link the name/buzz with an idea of what it could do.  If I had it to do all over again, I would start with core concepts and technologies that I see as being really helpful with my own use of tech in organizing.   These would be things that underlie a lot of Web 2.0 services and also make technology more fluid for users of all levels of technological familiarity. I’m starting a list here.  What core concepts/technologies do you all use?

RSS Feeds/Aggregation

One of the biggest frustrations that I (and other users I would suspect) have with the multitude of useful sites is having to have a bunch of logins and remember which information lives where.  One has to choose between using the right tool for the job and making it easy to locate and access information.  E-mail is one convergence point, but that doesn’t necessarily mesh with every service that people might use.  Services from del.icio.us to Twitter to Google Calendar to most blogging platforms all allow you to publish RSS feeds.  I would explain what a feed does, show what a feed looks like in various services, and then show how to aggregate and organize feeds with a web-based aggregator and a desktop app.

Feeds are so important because understanding them is crucial for mashing up services or making them easier for collaborations.  Examples:

  • Blog to twitter using Twitterfeed
  • Twitter “mailing” list using #hashtags and RSS Feed for http://search.twitter.com/
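To make “what a feed looks like” a little more concrete, here is a minimal sketch that reads item titles and links out of an RSS 2.0 document using only Python’s standard library.  The sample feed and its example.org addresses are invented for illustration, and a real aggregator would fetch the feeds over the network and cope with other formats such as Atom:

```python
import xml.etree.ElementTree as ET

# An invented, stripped-down RSS 2.0 feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example project blog</title>
    <link>http://example.org/</link>
    <item>
      <title>Packing party this Sunday</title>
      <link>http://example.org/packing-party</link>
    </item>
    <item>
      <title>Book donations needed</title>
      <link>http://example.org/donations</link>
    </item>
  </channel>
</rss>"""


def feed_items(feed_xml):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


def aggregate(feeds):
    """Combine the items of several feeds into one list."""
    combined = []
    for feed_xml in feeds:
        combined.extend(feed_items(feed_xml))
    return combined


items = aggregate([SAMPLE_FEED])
```

The point is that every service that publishes a feed produces roughly this same structure, which is what lets one tool pull them all into a single place.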

Email Filters

People are often overwhelmed by mailing lists, but few know that you can pretty easily set up filters that do the familiar inbox triage for you, sorting all the different kinds of e-mail that you get.  I think having imapfilter or Thunderbird sort my mail into folders is super-useful, if only to evaluate the actual importance of data.  If I never click on a particular folder where some of my mail is auto-sorted, do I really need to be on that mailing list anyway?

Human URLs (TinyURL or similar services)

Things like Google Docs often generate long, difficult to remember addresses for important information.  If people have to first dig through an e-mail with a link to a shared resource (and do this every time they want to access it), they’re going to be less likely to use it.  If they can just remember it (or enough of it that it is found in their browser’s location history) I think these online resources will get more use.

Mailing Lists

I think we all take these for granted, but there are ways to use these that make them more or less effective.  What strategies do you use to handle list management and message moderation?  How do you not flood people’s mailboxes?  How do you make it easy for people to (un)subscribe to lists?  This is more a discussion of usage than particular technologies.

Chat

Electronically mediated communication can often be ambiguous.  I find that I often spend extra time trying to disambiguate something in e-mail when it would have been way, way faster to call and let someone ask questions.  Still, a lot of collaboration that I do involves looking at text or files together.  Chat is really crucial for these kinds of tasks.  I use it every day at work.

SMS

I don’t have a texting plan and I share my mobile phone, so I’m not the hugest txter, but I like that it’s more pervasive than e-mail but less intrusive than a phone call because it lets people get the information first before deciding their timeline or content for response.  It’s also better than a call or voicemail for information that you might have to look up again (a phone number or address for instance).

Skype/Conferencing

For the times when you want to be more personal than chat, voice/video conferencing is perfect.  We have a fancy system at work for having meetings that span Indy and Bton but I think folks can achieve much of the same functionality with Skype, cheap webcams, and projectors.

Paypal

Cash rules everything around me … There are probably better alternatives, especially if the organization seeking cash is a 501(c)(3), but Paypal is definitely the easiest to use.  The awesome Pledgie service helps you use Paypal to organize campaigns.