
    Automated linking of historical data

    The recent digitization of complete count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5%) false positive rates. The automated methods trace out a frontier illustrating the tradeoff between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations but at a cost of higher rates of false positives. Second, when human linkers and algorithms use the same linking variables, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods.
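    The paper's Stata implementations are not reproduced in the abstract, but the flavour of a conservative automated linking rule can be sketched. Below is a minimal Python illustration, assuming records reduce to (name, birth year) pairs; the link helper and its year_band, threshold and margin parameters are invented for illustration, not the paper's calibrated method. Requiring a unique, clear winner is what pushes the false positive rate down at the cost of a lower match rate, the tradeoff the frontier above traces.

```python
# A minimal sketch of one conservative automated linking rule, assuming
# two census-style lists of (name, birth_year) records. Thresholds are
# illustrative, not the paper's calibrated values.
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] (stdlib stand-in for Jaro-Winkler)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(records_a, records_b, year_band=2, threshold=0.9, margin=0.05):
    """Link each record in A to at most one record in B.

    A candidate must (1) fall within `year_band` of the birth year,
    (2) score at least `threshold`, and (3) beat the runner-up by
    `margin` -- demanding a clear winner keeps false positives low
    at the cost of a lower match rate.
    """
    links = []
    for i, (name_a, year_a) in enumerate(records_a):
        scored = [
            (name_similarity(name_a, name_b), j)
            for j, (name_b, year_b) in enumerate(records_b)
            if abs(year_a - year_b) <= year_band
        ]
        scored.sort(reverse=True)
        if scored and scored[0][0] >= threshold and (
            len(scored) == 1 or scored[0][0] - scored[1][0] >= margin
        ):
            links.append((i, scored[0][1]))
    return links

census_1900 = [("John Smith", 1870), ("Jon Smyth", 1871), ("Mary Jones", 1880)]
census_1910 = [("John Smith", 1870), ("Mary Janes", 1880)]
print(link(census_1900, census_1910))  # [(0, 0), (2, 1)] with these settings
```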

    SLIM: Scalable Linkage of Mobility Data

    We present a scalable solution to link entities across mobility datasets using their spatio-temporal information. This is a fundamental problem in many applications, such as linking user identities for security, understanding the privacy limitations of location-based services, or producing a unified dataset from multiple sources for urban planning. Such integrated datasets are also essential for service providers to optimise their services and improve business intelligence. In this paper, we first propose a mobility-based representation and similarity computation for entities. An efficient matching process is then developed to identify the final linked pairs, with an automated mechanism to decide when to stop the linkage. We scale the process with a locality-sensitive hashing (LSH) based approach that significantly reduces the candidate pairs for matching. To put these techniques into practice, we combine them in an algorithm called SLIM. In the experimental evaluation, SLIM outperforms two existing state-of-the-art approaches in terms of precision and recall. Moreover, the LSH-based approach brings a two-to-four order-of-magnitude speedup.
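    As a hedged sketch of the LSH idea the abstract credits with the speedup, the Python below buckets entities by banded MinHash signatures over discretized (grid-cell, time-bucket) tokens, so that only colliding pairs proceed to exact similarity computation. This is not the SLIM algorithm itself, whose representation and matching are more elaborate; the token discretization, hash count and band size are illustrative assumptions.

```python
# Sketch of LSH-based candidate generation for mobility data, assuming
# each entity reduces to a set of (grid-cell, time-bucket) tokens.
import random
from collections import defaultdict

def tokens(trajectory, cell=0.01, bucket=3600):
    """Discretize (lat, lon, unix_time) points into coarse tokens."""
    return {(round(lat / cell), round(lon / cell), t // bucket)
            for lat, lon, t in trajectory}

def minhash_signature(token_set, seeds):
    """One min-hash value per seed; similar sets share many values."""
    return [min(hash((seed, tok)) for tok in token_set) for seed in seeds]

def candidate_pairs(entities, num_hashes=12, band_size=3, rng_seed=42):
    """Band the signatures: entities colliding in any band are candidates."""
    rng = random.Random(rng_seed)
    seeds = [rng.getrandbits(32) for _ in range(num_hashes)]
    buckets = defaultdict(list)
    for eid, traj in entities.items():
        sig = minhash_signature(tokens(traj), seeds)
        for b in range(0, num_hashes, band_size):
            buckets[(b, tuple(sig[b:b + band_size]))].append(eid)
    pairs = set()
    for ids in buckets.values():
        pairs.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return pairs  # only these pairs go on to exact similarity computation
```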

    Approaches to canine health surveillance

    Effective canine health surveillance systems can be used to monitor disease in the general population, to prioritise disorders for strategic control, to focus clinical research and to evaluate the success of these measures. The key attributes of optimal data collection systems that support canine disease surveillance are representativeness of the general population, validity of disorder data and sustainability; limitations in these areas present as selection bias, misclassification bias and discontinuation of the system, respectively. Canine health data sources are reviewed to identify their strengths and weaknesses for supporting effective canine health surveillance. Insurance data benefit from large and well-defined denominator populations but are limited by selection bias relating to the clinical events claimed and the animals covered. Veterinary referral clinical data offer good reliability for diagnoses but are limited by referral bias for the disorders and animals included. Primary-care practice data have the advantage of excellent representation of the general dog population and recording at the point of care by veterinary professionals, but may encounter misclassification problems and technical difficulties related to the management and analysis of large datasets. Questionnaire surveys offer speed and low cost but may suffer from low response rates, poor data validation, recall bias and ill-defined denominator population information. Canine health scheme data benefit from well-characterised disorder and animal data but reflect selection bias during the voluntary submission process. Formal UK passive surveillance systems are limited by chronic under-reporting and selection bias. It is concluded that active collection systems using secondary health data provide the optimal resource for canine health surveillance.

    VetCompass Australia: A National Big Data Collection System for Veterinary Science

    VetCompass Australia is a veterinary medical records-based research program, coordinated with the global VetCompass endeavor to maximize its quality and effectiveness for Australian companion animals (cats, dogs, and horses). Bringing together all seven Australian veterinary schools, it is the first nationwide surveillance system collating clinical records on companion-animal diseases and treatments. The VetCompass data service collects and aggregates real-time clinical records for researchers to interrogate, delivering sustainable and cost-effective access to data from hundreds of veterinary practitioners nationwide. Analysis of these clinical records will reveal geographical and temporal trends in the prevalence of inherited and acquired diseases, identify frequently prescribed treatments, revolutionize clinical auditing, help the veterinary profession to rank research priorities, and assure evidence-based companion-animal curricula in veterinary schools. VetCompass Australia will progress in three phases: (1) roll-out of the VetCompass platform to harvest Australian veterinary clinical record data; (2) development and enrichment of the coding (data-presentation) platform; and (3) creation of a world-first, real-time surveillance interface with natural language processing (NLP) technology. The first of these three phases is described in the current article. Advances in the collection and sharing of records from numerous practices will enable veterinary professionals to deliver a vastly improved level of care for companion animals and so improve their quality of life.

    Big data and data repurposing – using existing data to answer new questions in vascular dementia research

    Introduction: Traditional approaches to clinical research have, as yet, failed to provide effective treatments for vascular dementia (VaD). Novel approaches to the collation and synthesis of data may allow for time- and cost-efficient hypothesis generation and testing. These approaches may have particular utility in helping us understand and treat a complex condition such as VaD.
    Methods: We present an overview of new uses for existing data to progress VaD research. The overview is the result of consultation with various stakeholders, focused literature review and learning from the group's experience of successful approaches to data repurposing. In particular, we benefitted from the expert discussion and input of delegates at the 9th International Congress on Vascular Dementia (Ljubljana, 16-18th October 2015).
    Results: We agreed on key areas that could be of relevance to VaD research: systematic review of existing studies; individual patient-level analyses of existing trials and cohorts; and linking electronic health record data to other datasets. We illustrated each theme with a case study of an existing project that has utilised this approach.
    Conclusions: There are many opportunities for the VaD research community to make better use of existing data. The volume of potentially available data is increasing, and the opportunities for using these resources to progress the VaD research agenda are exciting. Of course, these approaches come with inherent limitations and biases: bigger datasets are not necessarily better datasets, and maintaining rigour and critical analysis will be key to optimising data use.
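    To make the first of these themes concrete, the sketch below pools effect estimates from existing trials with an inverse-variance fixed-effect meta-analysis, one standard way of reusing published results. The trial estimates and standard errors are invented for illustration and do not come from any VaD study.

```python
# A minimal sketch of pooling effect estimates from existing trials with
# an inverse-variance fixed-effect meta-analysis. All numbers are invented.
from math import sqrt

def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    return pooled, sqrt(1.0 / sum(weights))

# Three hypothetical trial effect sizes (e.g., cognitive score differences).
est, se = fixed_effect_pool([0.30, 0.10, 0.22], [0.15, 0.10, 0.12])
print(f"pooled estimate {est:.3f} +/- {1.96 * se:.3f} (95% CI half-width)")
```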

    Automated SNP genotype clustering algorithm to improve data completeness in high-throughput SNP genotyping datasets from custom arrays

    High-throughput SNP genotyping platforms use automated genotype calling algorithms to assign genotypes. While these algorithms work efficiently for individual platforms, they are not compatible with other platforms and have individual biases that result in missed genotype calls. Here we present data on the use of a second, complementary SNP genotype clustering algorithm. The algorithm was originally designed for individual fluorescent SNP genotyping assays and has been optimized to permit the clustering of large datasets generated from custom-designed Affymetrix SNP panels. In an analysis of data from a 3K array genotyped on 1,560 samples, the additional analysis increased the overall number of genotype calls by over 45,000, significantly improving the completeness of the experimental data. This analysis suggests that the use of multiple genotype calling algorithms may be advisable in high-throughput SNP genotyping experiments. The software is written in Perl and is available from the corresponding author.
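    The Perl software itself is not shown here, but the underlying task can be illustrated. The Python sketch below clusters two-channel (allele A, allele B) intensities into AA/AB/BB groups with a plain 1-D k-means on the signal angle, assuming well-separated clusters; the function name, starting centres and sample intensities are all hypothetical, and production callers model the intensity distributions far more carefully.

```python
# A hedged sketch of genotype clustering on two-channel intensities,
# assuming each sample yields (allele-A signal, allele-B signal).
from math import atan2, pi

def call_genotypes(intensities, iterations=20):
    """Cluster samples into AA/AB/BB by the angle of the (A, B) signal."""
    thetas = [atan2(b, a) / (pi / 2) for a, b in intensities]  # 0..1
    centers = [0.1, 0.5, 0.9]            # AA near 0, AB middle, BB near 1
    for _ in range(iterations):          # plain 1-D k-means
        groups = [[], [], []]
        for t in thetas:
            k = min(range(3), key=lambda k: abs(t - centers[k]))
            groups[k].append(t)
        centers = [sum(g) / len(g) if g else centers[k]
                   for k, g in enumerate(groups)]
    labels = ("AA", "AB", "BB")
    return [labels[min(range(3), key=lambda k: abs(t - centers[k]))]
            for t in thetas]

samples = [(2.0, 0.1), (1.9, 0.2), (1.0, 1.1), (0.2, 2.1), (0.1, 1.8)]
print(call_genotypes(samples))  # ['AA', 'AA', 'AB', 'BB', 'BB']
```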