An open-source probabilistic record linkage process for records with family-level information: Simulation study and applied analysis

Abstract

Research with administrative records involves the challenge of limited information in any single data source to answer policy-related questions. Record linkage provides researchers with a tool to supplement administrative datasets with other information about the same people when identified in separate sources as matched pairs. Several solutions are available for undertaking record linkage, producing linkage keys for merging data sources for positively matched pairs of records. In the current manuscript, we demonstrate a new application of the Python RecordLinkage package to family-based record linkages with machine learning algorithms for probability scoring, which we call probabilistic record linkage for families (PRLF). First, a simulation of administrative records identifies PRLF accuracy with variations in match and data degradation percentages. Accuracy is largely influenced by degradation (e.g., missing data fields, mismatched values) compared to the percentage of simulated matches. Second, an application of data linkage is presented to compare regression model estimate performance across three record linkage solutions (PRLF, ChoiceMaker, and Link Plus). Our findings indicate that all three solutions, when optimized, provide similar results for researchers. Strengths of our process, such as the use of ensemble methods, to improve match accuracy are discussed. We then identify caveats of record linkage in the context of administrative data

    Similar works