Finding duplicate records in a books to prisoners database application

High on my list of neglected tech. projects is the Testament books to prisoners database web application.  This is the database program that projects like the Midwest Pages to Prisoners Project use to track packages sent and returned and books requested in the hopes avoiding delays in delivering books to incarcerated people and to provide metrics that grant providers like.

One of the design challenges has to do with duplicate records.  Recipients of books are identified by their state/federal department of correction (DOC) number (if they’re in a state or federal prison – most jails don’t use ID numbers), their state of incarceration and their name.  I assume that the database was designed originally to minimize barriers for the book project volunteers so both the name and DOC# are free text fields.  Javascript is used to match existing records based on the DOC#, but there is still a large possibility for duplicate records.

The reason for duplicate records is that both the person writing to request books and the volunteer may list their name and/or DOC# inconsistently.  For instance, the state may store the DOC# in their database as A-123456 but the incarcerated person may write it as A123456 A-123-456 or just 123456.  Volunteers who don’t know about this and aren’t careful may not check beforehand for an existing record.

This is probably preventable through more sophisticated validation, but we still need a way to find duplicates in the existing records.  As this application is written in the Django framework, I want to try to use the Django API to find matches.

At first thought, it seems like I will have to iterate through each inmate record and check if there is a duplicate record.  This seems pretty slow, but I can’t think of a better way to do this.  At this point, there aren’t so many records that this approach will fail, but it would be nice to do something slicker.

The other problem is how to match a duplicate.  One approach might be to build a regexp for the DOC# (for instance, match either the first character or omit it, allow dashes or spaces between all characters, …) and then use the iregexp field lookup to try to find matches. One challenge with this is that the current Testament codebase is using Django 0.97 (I think) and iregexp is only available starting in 1.0.  Maybe it’s time we updated our code anyway.

There is also the Python difflib module that can compute deltas between strings.  However, it seems like this would slow things down even further because you would have to load each inmate object and then use difflib to compare the DOC#s.  I assume that the previous approach would be faster because the regexp matching happens at the database level.

String mangling in Excel/VBA

Often, I find myself needing to do string mangling in spreadsheets. For instance, in an Excel dump of an MS Access database, the names of institutions were sometimes written as “Foo, University of” or “Foo U”. I wanted these to be formatted as “University of Foo” and “Foo University” respectively. I may be misinformed but there doesn’t seem to be regexp style substitution in VBA. I could do this directly in the spreadsheet, but conditional use of functions gets to be hard to understand and difficult to edit. So, I wrote the following custom function in VBA.

Function FormatInstitutionName(institutionName As Variant) As String
'Formats an institution name for import into CiviCRM:
'"Foo, University of" => "University of Foo"
'"Foo U" => "Foo University"
    Dim newInstitutionName As String

    newInstitutionName = institutionName

    If InStr(newInstitutionName, ", University of") Then
        newInstitutionName = "University of " & Replace(newInstitutionName, ", University of", "")
    End If

    If InStr(Len(newInstitutionName) - 1, newInstitutionName, "U") Then
        newInstitutionName = Left(newInstitutionName, Len(newInstitutionName) - 2) & " University"
    End If

    FormatInstitutionName = newInstitutionName
End Function

Add this to your workbook by following the instructions at http://www.vertex42.com/ExcelArticles/user-defined-functions.html

Then call it in your spreadsheet cells like like:

=FormatInstitutionName(C2)

Django: Querying data from the Python shell

I needed to get some stats for some research that we’re doing and was happy to see that you can use Django and the python shell to query testament data in a way that’s database independent.  It’s a little unintuitive if you’re thinking in SQL mode, but it is usable and super-helpful.  I wanted to share it with ya’ll in case you needed to quickly pull stats or examine info.

Helpful reference Django docs:

  • http://docs.djangoproject.com/en/dev/topics/db/queries/#making-queries
  • http://docs.djangoproject.com/en/dev/ref/models/querysets/#queryset-api-reference

Print the prison name, city, and state of all prisons that received a package sent by the Midwest Pages to Prisones Project from 2009-01-01 to 2009-03-22

geoff@btp:/var/www/testament/testament_trunk/btp$ python manage.py shell
>>> import datetime
>>> from core.models import Prison, Package
>>> start_date = datetime.date(2009, 1, 1)
>>> end_date = datetime.date(2009, 3, 22)
>>> prisons = Prison.objects.filter(package__sent_on__range=(start_date, end_date), package__group__username__exact='mwpp').distinct()
>>> for prison in prisons:
>>>    print "%s %s, %s" % (prison.name, prison.city, prison.state)

Struggling Toward Diversity

The ethnic makeup of my old high school, from greatschools.com.
The ethnic makeup of my old high school, from greatschools.com.

Next week is spring break for IU and public school students in Bloomington.  I have heard that many students’ spring break trips to Mexico have been cancelled because of parental fears related to reports of drug-trade related violence in Mexico.  Bz came over last night for dinner and brought some mole sauce she got on her somewhat-recent trip there.  My own cultural tourism happens across the street from my job at an international market that is run by an ambiguously related group of South Asians and Latino People.  Eating good, exotic food is a wonderful, exciting experience, but it’s a far too cheap and easy (but frequent) way to think of diversity.  Since diversity is something that comes up a lot when I talk about thinking of moving, I feel like I need to have more of a definition for what I want.

This is hard because, as I read Sundown Towns and find so much of the geography of racial segregation (the West Shore of the Susquehanna; Cuyahoga Falls, Ohio; Crossville, Tennessee) is familiar to me, and see school statistics realizing the racial makeup of my high school, it is something that remains unreal, idealized, even mythic to me.  Hopefully this can be a start.

Diversity is:

  • Not celebrated in atomic festivals, media campaigns, concerts, or commemorative months or days.  It is a constant, pervasive, and unforgettable sense of how my life and the lives of those around me is mediated by race, ethnicity, culture, gender, body, and sexual orientation, among other things.
  • An understanding that confusion and conflict between people is often grounded in cultural context and differing experiences.
  • A search for those shared experiences and values that transcend different experiences and histories without the expectations that we can or should share all these things.

hfs+ on linux

I got a new MacBook from work and need to migrate files from my old Dell notebook running Xubuntu Linux.  Luckily, I had recovered a drive from a bricked machine that was donated to pages that I could use to transfer the files.

I don’t like the Fat32 file system, so I formatted the external drive as hfs+.  My workstation, running Debian, mounted the drive fine, but I couldn’t write.  I found that I had to disable journaling on the drive before I could write it in Linux:

$ diskutil disableJournal /Volumes/ghingexternal

Note: that command has to be run on the Mac.

Once I did this, I could write to the disk, but only as root.  Permissions of hfsplus partition, a thread on the Ubuntu message boards, provides this insight which is likely the case:

I got a new MacBook from work and need to migrate files from my old Dell notebook running Xubuntu Linux.  Luckily, I had recovered a drive from a bricked machine that was donated to pages that I could use to transfer the files.

I don’t like the Fat32 file system, so I formatted the external drive as hfs+.  My workstation, running Debian, mounted the drive fine, but I couldn’t write.  I found that I had to disable journaling on the drive before I could write it in Linux:

$ diskutil disableJournal /Volumes/ghingexternal

Note: that command has to be run on the Mac.

Once I did this, I could write to the disk, but only as root.  Permissions of hfsplus partition, a thread on the Ubuntu message boards, provides this insight which is likely part of the problem (since the mountpoint of the hfs+ formatted drive has uid:gid 99:99 on my Linux box):

I have to preface my entry with the warning that I am a complete newbie. I was having the same problem with accessing my files on my hfs+ partition. What I discovered is that by default OSX doesn’t allow any access for the gid for files and folders in your User’s folders. I don’t know if this is the wisest thing, but I went into the Finder, did a “Get Info” on all the files/folders I wanted to access in Ubuntu, I then went under permissions and switched the Group ID to something I could use in Ubuntu. I then made sure that the line in the fstab that mounts my hfs+ partition had a “gid=XXX” statement that matched what I set in OSX. I also made sure that the user I was using in Ubuntu was part of the group mentioned above. If this doesn’t make sense, let me know and I will clarify. Also, if you need help with OSX permissions, here is a link to an Apple KB article: http://docs.info.apple.com/article.html?artnum=107039

Imapfilter certs

I was having trouble SSHing into my workstation.  It would just hang at the login.  I was worried that I had gotten 0wn3d.  I logged in at the console and ran top and saw that there were a bunch of runaway imapfilter proccesses from my cron runs.

I ran imapfilter from the command line and got the following error:

ATTENTION: SSL/TLS certificate fingerprint mismatch.
Proceed with the connection (y/n)? y

In order to get rid of this error and make it cronable again, I had to delete the contents of ~/.imapfilter/certificates and re-run imapfilter from the command line, telling the program to accept the cert permanently.

This mailing list post was very helpful.

I’m still going to lock down my box anyway.


	

Setting up Embarq DSL with Wifi using Microsoft MN-700 wireless access point

My parents and brother use Embarq DSL and they just got a used Wireless router. Like most folks, they either had the DSL set up for them, or did the initial configuration without understanding or remembering what they did, so I’m writing this as a reference for them, for other family tech support people working with the same hardware or services, or as a general framework for how I figure these things out.

The DSL Service

The first thing I did was to try to figure out how their DSL connection worked. In most cases I’ve seen, there is either a DSL modem and the computer handles the PPPoE connection or the modem is also a router.

My parents’ systems are running Windows XP, so I went to Start->Connect To->Show All Connections. There were only the connections under the heading LAN or High Speed Internet. If the computer was doing PPPoE, it would be listed under the heading Broadband. When I double-clicked on the Local Area Networ€k connection icon and the Support tab, I saw that the computer was getting an IP address of 192.168.2.2. So, it looks like their DSL modem is also doing routing. The much easier way to tell this would have just been to flip over their modem and see that it was labeled ADSL router, although I’m guessing that most ADSL modems/routers can disable the routing, so maybe it’s a good to check and see how the computer is actually connecting.

The ADSL hardware

My family is using an Embarq EQ-66OR ADSL Router

The Wifi hardware

My brother acquired a Microsoft MN-700 Wireless Base Station. Since they got the hardware second hand, it didn’t come with any documentation. Luckily, I found it here: MN-700 Base Station Configuration Guide

Determining hardware capabilities

Since the DSL modem is doing the routing too, I think the easiest thing to do would be to configure the wireless base station to just act as a base station and not do any routing. Then I don’t have to figure out how to configure the modem or what my family’s login credentials are for the DSL service.

I looked at the table of contents for the configuration guide and saw that there were sections for Local Area Connection, Wide Area Connection, and Wireless Mode. What I want to do is to configure the access point in what is often called bridge mode. I started looking at the wireless mode section and realized that it was just choosing between 802.11b and 802.11g. I looked at the Wide Area Connection and Local Area Connection section and it seemed like you had to set a DHCP range for the local network. This wasn’t looking good. I searched for the word bridge and nothing came up. I quickly scanned through the documentation and found the section Base Station Mode. This is what I wanted! This illustrates the point that its always good to (quickly) Read The Friendly Manual.

Planning the network

Now that I know that the hardware can do what I want, and I know which hardware will do what, I can plan the network.

The ADSL modem (EQ-66OR) will do the PPPoE connection and the routing.

The wireless base station (MN-700) will just pass through wifi connections to the router.

The modem and the router will be connected using a standard ethernet cable going from the yellow port labeled Ethernet on the back of the DSL modem to the port labeled To Modem on the side of the wireless base station. Computers will either connect to the network wirelessly or through an Ethernet cable connected to the ports labeled 1-4 on the wireless base station.

So, for the basic connection, I’ll need:

  • The wireless base station
  • The ADSL modem/router
  • An Ethernet cable
  • Space in a power strip for the wall warts (AC adapters) for both the DSL modem and the wireless access point

Configuring the hardware

I should, hopefully, be able to leave the DSL modem configured the way it is, so all I’ll have to configure is the wireless access point.

I connected the DSL router from the computer and connected it to the wireless access point as I specified above.  I powered on the wireless access point and saw that a wireless network connection MSHOME was available.  I connected to this network.  From the docs, I found that I need to connect to the address http://192.168.2.1 in my browser in order to configure the access point.  This didn’t work, so I went back to Start->Connect To->Show all connections and double clicked on the Wirleless Area Connection and the Support tab to see that the computer was getting an IP address of 192.168.0.2 with a gateway of 192.168.0.4, so I typed http://192.168.0.1 in my browser.  I guess I should change my earlier rule to Quickly Read The Friendly Manual But Be Flexible (QRTFMBBF)  This still didn’t get me to an adminstration page. Sometimes access to the administrative pages is disabled for a wireless connection, so I disconnected from the wireless network and connected the computer to the access point with another ethernet cable.  I find that when doing these kinds of network installs its always good to have an extra network cable or two at hand.

When I connected to the access point with the wired network, the computer got an address of 192.168.2.4 and I was able to access the administration pages with my browser at http://192.168.2.1.

The administration page asked me for a password.  My brother said that he thought he had reset the base station back to its factory settings, so I tried the default password admin, which worked.

Set a new administration password

For security, I went to General settings->Change password and asked my mom for a password that she would remember.  I also made a secure note of this for future reference.

Put the access point in Base Station Mode

I set this by following the links to Security->Base Station Mode and chose the Access Point Mode radio button.  I had to select an access point name and chose tenhopedrive, my parents address, because I like giving wifi nodes some geographic significance.  I would use this name in my browser http://tenhopedrive to connect to the admin pages.  This is exactly what I had to do, because after changing the above setting, it rebooted the access point, logging me out of the admin pages.

Set access point SSID

I did this by following the Wireless Settings link and typed tenhopedrive (for consistency) into the Wireless network name (SSID) field and clicked the apply button.  Once again this caused the access point to reboot, and I had to reconnect to the wireless network under the new SSID.  I had to refresh the list of available wireless networks a few times in order to see the new name.

Testing the network

There were a few things that I wanted to make sure worked right, because I wasn’t sure how they would work.

Wired Network

Putting the access point into access point mode, I wasn’t sure how this would affect the wired network.  I made sure the computer was still connected to the access point by the Ethernet cable, disconnected from all wireless networks, and made sure I could access google.com in my browser.

Wireless Network Connection

I then disconnected the Ethernet connection and connected to the wireless connection and made sure that I could still connect to google.com in my browser.

Multiple Computers

The last thing I wanted to test, which is probably something I should have investigated initially, is whether or not multiple computers could connect through the DSL router. I wasn’t sure if they had set up some kind of MAC address filtering or something to only allow my mom’s laptop to connect.  I fired up my laptop and connected to the wireless network I just created to see if I could connect to google.com in my browser.  This worked just fine.

Afterthought: Port Fowarding

You know what they say about the best laid plans of mice and men.  I forgot that my brother might want to use port forwarding for P2P filesharing or chat apps.  With the way I configured it, I would have to configure this on the ADSL modem/router.  So, I tried to access this by pointing my browser at the gateway IP reported by Start->Connect To->Show All Connections->Wireless Area Connection->Support.  I found that I got an admin page login, and that the password was still set to the default, 1234.  This was bad since I wanted to have the wireless network be open (to make it easier for my family to not have to remember and additional password or get locked out of their wifi if their computers forgot the stored password and for good-neighborness).  So, I reset the router admin password to the same admin password as the access point and made a note of this.

I confirmed that I could set up port forwarding on the router and that they also had dyndns settings available, something that wasn’t available, as far as I could tell on the access point, so it looks like I made the right choice in using the ADSL modem/router for handling routing.  I went ahead and set this up too, but that’s beyond the scope of this howto.

hasLayout

Doing some web design and making stuff work for IE is making me crazy.  This is largely in part to the hasLayout property in IE.  On having layout gives a super-helpful rundown. 

I’ve been using the IE Developer Toolbar to help me hunt down some of these problems, and this forum post provided a good clarification:

-1 means haslayout is true, 0 means it’s false

and if you check the box to “show default styles” it will show you the haslayout status regardless of whether it’s explicitly set or not This is proving very useful in finding the child selector override hacks that are breaking