Fuzzy-matching strategies

This is a list of strategies for doing quick fuzzy matches that I’m summarizing from a thread that started on June 9, 2014 on the NICAR-L mailing list.

Fuzzy Lookup Excel Add-on

This add-on created by Microsoft can be downloaded here.

It reportedly runs into trouble when trying to match ~3000 records with another ~3000 records.

Increasing the threshold from it’s default to a higher value might provide better performance.

Reconcile CSV

Reconcile CSV is a project of Open Knowledge labs that is described as

Reconcile-csv is a reconciliation service for OpenRefine running from a CSV file. It uses fuzzy matching to match entries in one dataset to entries in another dataset, helping to introduce unique IDs into the system – so they can be used to join your data painlessly.

MySQL’s Soundex() function

OpenRefine

Dan Nguyen provided this recipe for OpenRefine:

If you’re looking for non-Excel/database solutions…you can also do it by hand with OpenRefine.

  1. Combine both lists into one file with a single name column
  2. Import it into Refine
  3. Create a second column called “refined_name_key” that is a duplicate of the original name field
  4. Cluster and de-dupe using Refine’s text-clustering
  5. Export out (into something like a CSV)
  6. Import this table into your existing setup
  7. Join the name fields of the two original tables against the “refined_name_key”

Paxata

Home