{"id":2746,"date":"2014-06-16T14:34:48","date_gmt":"2014-06-16T19:34:48","guid":{"rendered":"http:\/\/blogs.terrorware.com\/geoff\/?p=2746"},"modified":"2014-06-16T14:34:48","modified_gmt":"2014-06-16T19:34:48","slug":"fuzzy-matching-strategies","status":"publish","type":"post","link":"http:\/\/blogs.terrorware.com\/geoff\/2014\/06\/16\/fuzzy-matching-strategies\/","title":{"rendered":"Fuzzy-matching strategies"},"content":{"rendered":"<p>This is a list of strategies for doing quick fuzzy matches that I&#8217;m summarizing from a thread that started on June 9, 2014 on the NICAR-L mailing list.<\/p>\n<h3>Fuzzy Lookup Excel Add-on<\/h3>\n<p>This add-on created by Microsoft can be downloaded <a href=\"http:\/\/www.microsoft.com\/en-us\/download\/details.aspx?id=15011\">here<\/a>.<\/p>\n<p>It reportedly runs into trouble when trying to match ~3000 records with another ~3000 records.<\/p>\n<p>Increasing the threshold from its default to a higher value might provide better performance.<\/p>\n<h3>Reconcile CSV<\/h3>\n<p><a href=\"http:\/\/okfnlabs.org\/reconcile-csv\/\">Reconcile CSV<\/a> is a project of Open Knowledge Labs that is described as<\/p>\n<blockquote><p>\n  Reconcile-csv is a reconciliation service for OpenRefine running from a CSV file. 
It uses fuzzy matching to match entries in one dataset to entries in another dataset, helping to introduce unique IDs into the system &#8211; so they can be used to join your data painlessly.\n<\/p><\/blockquote>\n<h3>MySQL\u2019s Soundex() function<\/h3>\n<h3>OpenRefine<\/h3>\n<p>Dan Nguyen provided this recipe for OpenRefine:<\/p>\n<p>If you&#8217;re looking for non-Excel\/database solutions&#8230;you can also do it by hand with OpenRefine.<\/p>\n<blockquote>\n<ol>\n<li>Combine both lists into one file with a single name column<\/li>\n<li>Import it into Refine<\/li>\n<li>Create a second column called &#8220;refined_name_key&#8221; that is a duplicate of the original name field<\/li>\n<li>Cluster and de-dupe using Refine&#8217;s text-clustering <\/li>\n<li>Export out (into something like a CSV)<\/li>\n<li>Import this table into your existing setup<\/li>\n<li>Join the name fields of the two original tables against the &#8220;refined_name_key&#8221;<\/li>\n<\/ol>\n<\/blockquote>\n<h3>Paxata<\/h3>\n<blockquote class=\"wp-embedded-content\" data-secret=\"duhd9mvmoL\"><p><a href=\"https:\/\/www.paxata.com\/\">Home<\/a><\/p><\/blockquote>\n<p><iframe loading=\"lazy\" class=\"wp-embedded-content\" sandbox=\"allow-scripts\" security=\"restricted\" style=\"position: absolute; clip: rect(1px, 1px, 1px, 1px);\" title=\"&#8220;Home&#8221; &#8212; Paxata\" src=\"https:\/\/www.paxata.com\/embed\/#?secret=duhd9mvmoL\" data-secret=\"duhd9mvmoL\" width=\"600\" height=\"338\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"><\/iframe><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is a list of strategies for doing quick fuzzy matches that I&#8217;m summarizing from a thread that started on June 9, 2014 on the NICAR-L mailing list. Fuzzy Lookup Excel Add-on This add-on created by Microsoft can be downloaded here. 
It reportedly runs into trouble when trying to match ~3000 records with another ~3000&hellip; <a class=\"more-link\" href=\"http:\/\/blogs.terrorware.com\/geoff\/2014\/06\/16\/fuzzy-matching-strategies\/\">Continue reading <span class=\"screen-reader-text\">Fuzzy-matching strategies<\/span><\/a><\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-2746","post","type-post","status-publish","format-standard","hentry","category-uncategorized","entry"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p4wnIz-Ii","_links":{"self":[{"href":"http:\/\/blogs.terrorware.com\/geoff\/wp-json\/wp\/v2\/posts\/2746","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/blogs.terrorware.com\/geoff\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/blogs.terrorware.com\/geoff\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/blogs.terrorware.com\/geoff\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"http:\/\/blogs.terrorware.com\/geoff\/wp-json\/wp\/v2\/comments?post=2746"}],"version-history":[{"count":1,"href":"http:\/\/blogs.terrorware.com\/geoff\/wp-json\/wp\/v2\/posts\/2746\/revisions"}],"predecessor-version":[{"id":2747,"href":"http:\/\/blogs.terrorware.com\/geoff\/wp-json
\/wp\/v2\/posts\/2746\/revisions\/2747"}],"wp:attachment":[{"href":"http:\/\/blogs.terrorware.com\/geoff\/wp-json\/wp\/v2\/media?parent=2746"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/blogs.terrorware.com\/geoff\/wp-json\/wp\/v2\/categories?post=2746"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/blogs.terrorware.com\/geoff\/wp-json\/wp\/v2\/tags?post=2746"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}