Motivated by the growing need to link records across databases, we have developed a novel classifier based on a graphical model that uses a mixture of Poisson distributions with latent variables. Linking records helps create a single source of truth, such as a data lake, once duplicate entries have been corrected. The idea is to gain insight into each pair of records hypothesised to match by inferring its underlying latent error rate using Bayesian modelling techniques. Using gamma priors to learn the latent variables alongside supervised labels is a novel approach and allows for active learning. The naive assumption is not meant to dismiss the hierarchical dependencies that could be present in different scenarios, but rather to illustrate a broader theory. The classifier can work with sparse data and streaming data, and can behave as an active learning system that works with data dumps. Applied to record linkage, it addresses the challenges of sparsity, data streams and the varying nature of datasets. The accuracy obtained is much better than that of conventional fuzzy linking algorithms.
How it works
Gammalink takes two texts as input and uses an online probabilistic learning method to output the probability that they match. Because learning is online, retraining on newly acquired data is easier and faster. The model outputs a probability rather than a hard label, so a threshold is needed to classify each pair as a match or a non-match. This threshold is learnt from data: a precision-recall (PR) curve must be plotted and, depending on the business use case, an optimum threshold identified. Retraining on additional data is used to improve model accuracy.
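The threshold selection described above can be sketched as follows. This is a minimal illustration, not Gammalink's own code: the function names `pr_curve` and `pick_threshold`, the sample probabilities and labels, and the "meet a precision floor, then maximise recall" rule are all assumptions chosen for the example. A different business use case might call for a different rule.

```python
from typing import List, Tuple


def pr_curve(probs: List[float], labels: List[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for every distinct score."""
    total_pos = sum(labels)
    points = []
    for t in sorted(set(probs)):
        # A pair is predicted as a match when its probability reaches the threshold.
        predicted = [p >= t for p in probs]
        tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
        fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / total_pos if total_pos else 0.0
        points.append((t, precision, recall))
    return points


def pick_threshold(points: List[Tuple[float, float, float]], min_precision: float) -> float:
    """Among thresholds meeting the precision floor, pick the one with the highest recall."""
    eligible = [(t, p, r) for t, p, r in points if p >= min_precision]
    if not eligible:
        raise ValueError("no threshold reaches the requested precision")
    # Highest recall wins; ties are broken in favour of the lower threshold.
    return max(eligible, key=lambda x: (x[2], -x[0]))[0]


# Hypothetical match probabilities and ground-truth labels (1 = match).
probs = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0]

points = pr_curve(probs, labels)
threshold = pick_threshold(points, min_precision=0.75)
decisions = [p >= threshold for p in probs]  # match / non-match per pair
```

The `points` list is what would be plotted as the PR curve. The precision floor stands in for whatever business constraint applies.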
If the texts are addresses, their components must be aligned before comparison. For example, "5 Chrysler rd. Natick, MA 01760" must match "5 Chrysler Road Natick Massachusetts 01760" and not "5 Chrysler road, MA Natick", where the city and state are out of order and the ZIP code is missing.
When the model is used primarily for addresses, identifying the parts of each address is critical. This means we need to know which part is the road, which is the city, which is the state, and so on, before linking records.
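A minimal sketch of the normalisation step that would precede alignment is shown below. The lookup tables and the `normalize_address` helper are hypothetical and deliberately tiny. The sketch only expands abbreviations and strips punctuation; it does not label tokens as street, city or state, which would need a proper address parser.

```python
import re

# Hypothetical, deliberately small lookup tables; real data needs far more entries.
STREET_ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}
STATE_ABBREVIATIONS = {"ma": "massachusetts"}


def normalize_address(text: str) -> list:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    expanded = []
    for token in tokens:
        token = STREET_ABBREVIATIONS.get(token, token)
        token = STATE_ABBREVIATIONS.get(token, token)
        expanded.append(token)
    return expanded


a = normalize_address("5 Chrysler rd. Natick, MA 01760")
b = normalize_address("5 Chrysler Road Natick Massachusetts 01760")
c = normalize_address("5 Chrysler road, MA Natick")

same_after_normalizing = (a == b)   # the two aligned forms agree token by token
misaligned_differs = (a != c)       # the out-of-order form still differs
```

Comparing token lists position by position only makes sense once the components are in the same order, which is why the third address still differs after normalisation.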
Online learning lets the model learn from the additional data alone, which saves time and effort.
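Why learning from the additional data alone can be sufficient is easiest to see with the standard conjugate pairing of a gamma prior with Poisson counts. The sketch below is an illustration under that assumption, not Gammalink's actual update rule. The `GammaPosterior` class, the prior values and the error counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GammaPosterior:
    """Gamma(shape, rate) belief over a Poisson error rate."""
    shape: float
    rate: float

    def update(self, counts: Iterable[int]) -> "GammaPosterior":
        counts = list(counts)
        # Conjugate update: observed counts add to the shape,
        # the number of observations adds to the rate.
        return GammaPosterior(self.shape + sum(counts), self.rate + len(counts))

    @property
    def mean(self) -> float:
        """Posterior mean of the error rate."""
        return self.shape / self.rate


prior = GammaPosterior(shape=1.0, rate=1.0)  # hypothetical weak prior
first_batch = [0, 1, 0, 2]                   # hypothetical error counts per record pair
new_batch = [1, 0, 0]                        # data acquired later

# Updating with only the new batch gives the same posterior as refitting on everything.
incremental = prior.update(first_batch).update(new_batch)
from_scratch = prior.update(first_batch + new_batch)
```

Because the posterior after the first batch summarises everything seen so far, only the new counts need to be processed when more data arrives.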
Generic applications involve creating a data lake from several data pools, which can help derive insights from the data or build predictive models on it. The accuracy achieved was 88.501% on the Fodors and Zagat restaurants dataset and 97.788% on the ACM scholars dataset. We compared this to the Python Record Linkage Toolkit, which uses the popular [Christen, 2012] algorithms.
The Python Record Linkage Toolkit is a library for linking records within or between data sources. Example applications of record linkage include creating up-to-date employee records across the organization, and creating an up-to-date client database across supply chain, service, sales and finance functions, which can help sales decide whom to target for campaigns. Other examples include:
Linking two addresses such as: John Littleton, 8 Chrysler Rd, Natick, MA 01760 & J. Littleton, 8 Chrysler Road, Natick, Massachusetts, USA.
Linking two databases that use names as IDs, such as: J. Krishnamurthy & Jiddu Krishnamoorthy
Fixing pharmaceutical dictation errors such as: Mucinex & Muse nex
Fixing natural language processing errors such as: "This is a fable" vs "This is a table"
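To make the idea of an error count feeding a Poisson mixture concrete, the pairs above can be scored with the rough sketch below. Everything in it is an assumption made for illustration: using character edit distance as the error count, the two fixed rates, and the equal prior are not taken from Gammalink, which learns its latent rates from data.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution, or free if equal
            ))
        previous = current
    return previous[-1]


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of observing k events under a Poisson with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def match_probability(errors: int, match_rate: float = 0.5,
                      nonmatch_rate: float = 5.0, prior_match: float = 0.5) -> float:
    """Posterior probability of a match under a two-component Poisson mixture.

    The rates and prior are hypothetical placeholders, not learnt values.
    """
    match = prior_match * poisson_pmf(errors, match_rate)
    nonmatch = (1.0 - prior_match) * poisson_pmf(errors, nonmatch_rate)
    return match / (match + nonmatch)


dictation_errors = edit_distance("Mucinex", "Muse nex")
nlp_errors = edit_distance("This is a fable", "This is a table")

p_dictation = match_probability(dictation_errors)
p_nlp = match_probability(nlp_errors)
```

With these placeholder rates, a pair with fewer errors receives a higher match probability. In practice the rates would be inferred per dataset rather than fixed by hand.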