TY - JOUR
T1 - Automated linking of historical data
AU - Abramitzky, Ran
AU - Boustan, Leah
AU - Eriksson, Katherine
AU - Feigenbaum, James
AU - Pérez, Santiago
N1 - Publisher Copyright: © 2021 American Economic Association. All rights reserved.
PY - 2021
Y1 - 2021/09//
AB - The recent digitization of complete count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5 percent) false positive rates. The automated methods trace out a frontier illustrating the trade-off between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations but at a cost of higher rates of false positives. Second, when human linkers and algorithms use the same linking variables, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods.
UR - http://www.scopus.com/inward/record.url?scp=85115956762&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115956762&partnerID=8YFLogxK
U2 - 10.1257/JEL.20201599
DO - 10.1257/JEL.20201599
M3 - Article
AN - SCOPUS:85115956762
SN - 0022-0515
VL - 59
SP - 865
EP - 918
JO - Journal of Economic Literature
JF - Journal of Economic Literature
IS - 3
ER -