This is a list of strategies for doing quick fuzzy matches that I’m summarizing from a thread that started on June 9, 2014 on the NICAR-L mailing list.
Fuzzy Lookup Excel Add-on
This add-on created by Microsoft can be downloaded here.
It reportedly runs into trouble when trying to match ~3000 records with another ~3000 records.
Increasing the threshold from it’s default to a higher value might provide better performance.
Reconcile CSV
Reconcile CSV is a project of Open Knowledge labs that is described as
Reconcile-csv is a reconciliation service for OpenRefine running from a CSV file. It uses fuzzy matching to match entries in one dataset to entries in another dataset, helping to introduce unique IDs into the system – so they can be used to join your data painlessly.
MySQL’s Soundex() function
OpenRefine
Dan Nguyen provided this recipe for OpenRefine:
If you’re looking for non-Excel/database solutions…you can also do it by hand with OpenRefine.
- Combine both lists into one file with a single name column
- Import it into Refine
- Create a second column called “refined_name_key” that is a duplicate of the original name field
- Cluster and de-dupe using Refine’s text-clustering
- Export out (into something like a CSV)
- Import this table into your existing setup
- Join the name fields of the two original tables against the “refined_name_key”