
2 research outputs found

Imputation and quality control steps for combining multiple genome-wide datasets

Author: Amber Burt
Bahram Namjou
Dana Crawford
David Crosslin
Elizabeth Pugh
Gail P Jarvik
Gerard Tromp
Gretta D Armstrong
Helena Kuivaniemi
Jonathan L Haines
Kimberly Derr
Leah Claire Kottyan
Mariza de Andrade
Marylyn D Ritchie
Rongling Li
Shefali S Verma
Shubhabrata Mukherjee
Yuki Bradford
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2014

The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 52,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
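To make the allelic R² metric in the abstract concrete, the following is a minimal sketch, not the eMERGE pipeline itself. It assumes true genotypes are available for comparison (for example, genotypes masked before imputation), whereas imputation software typically estimates this quantity without access to the truth. The function names `allelic_r2` and `minor_allele_frequency` are hypothetical and use only the Python standard library.

```python
def allelic_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true allele counts (0, 1, 2)
    and imputed dosages (0.0 to 2.0) for one variant across samples."""
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        # Monomorphic variant: the correlation is undefined.
        return float("nan")
    return cov * cov / (var_t * var_d)


def minor_allele_frequency(genotypes):
    """Frequency of the less common allele, from allele counts per sample."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


# Hypothetical toy data for a single variant in five samples.
truth = [0, 1, 2, 1, 0]
dosage = [0.1, 0.9, 1.8, 1.2, 0.0]
r2 = allelic_r2(truth, dosage)
maf = minor_allele_frequency(truth)
```

Computing `(maf, r2)` pairs for many variants and plotting them is one way to inspect the relationship between allelic R² and minor allele frequency that the abstract mentions.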

Directory of Open Access Journals

Frontiers - Publisher Connector

PubMed Central

Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to Electronic Health Records

Author: Amber Burt
Anastasia M. Lucas
Dana C. Crawford
Daniel Seung Kim
David Russell Crosslin
Gail P. Jarvik
Gerard Tromp
Helena Kuivaniemi
John A. Heit
M. Geoffrey Hayes
Mariza de Andrade
Marylyn D Ritchie
Sebastian M. Armasu
Shefali S Verma
Yuki Bradford
Publication venue: 'Frontiers Media SA'
Publication date: 01/11/2014

Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of multi-ethnic cohorts with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach of generating principal components controlled for site and platform bias, in addition to ancestry differences, with the advantage of fewer covariates and degrees of freedom.

Keywords: principal component analysis, ancestry, biobank, loadings, genetic association stud
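The idea of deriving components from loadings computed on a reference sample can be illustrated with a short sketch. The abstract does not give the authors' exact procedure, so this is a generic projection approach under stated assumptions: loadings are estimated from a reference genotype matrix only, study samples are centered with the reference means, and only the first component is computed (by power iteration). All function names and the toy data are hypothetical, and steps such as variant selection or standardization are omitted.

```python
def column_means(matrix):
    """Per-variant mean allele count across the rows (samples) of a matrix."""
    n_rows = len(matrix)
    return [sum(row[j] for row in matrix) / n_rows for j in range(len(matrix[0]))]


def center(matrix, means):
    """Subtract the given per-variant means from every sample."""
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def leading_loading(centered, iterations=200):
    """Power iteration for the top eigenvector of X^T X (first PC loadings)."""
    n_cols = len(centered[0])
    v = [1.0 / (j + 1) for j in range(n_cols)]  # deterministic starting vector
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]   # X v
        v = [sum(row[j] * s for row, s in zip(centered, scores))            # X^T (X v)
             for j in range(n_cols)]
        norm = sum(w * w for w in v) ** 0.5
        if norm == 0:
            raise ValueError("reference matrix has no variance")
        v = [w / norm for w in v]
    return v


def project(samples, means, loading):
    """Component scores for new samples using reference means and loadings."""
    return [sum(x * w for x, w in zip(row, loading)) for row in center(samples, means)]


# Hypothetical reference panel: 4 samples x 2 variants (allele counts).
reference = [[0, 0], [0, 0], [2, 2], [2, 2]]
ref_means = column_means(reference)
loading = leading_loading(center(reference, ref_means))

# Hypothetical study samples, projected without re-estimating the loadings.
study = [[0, 0], [2, 2], [1, 1]]
study_scores = project(study, ref_means, loading)
```

Because the loadings are fixed by the reference sample, features specific to the study data (such as site or platform differences) do not shape the component axes in this sketch. Whether that matches the authors' method in detail is not stated in the abstract.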

Directory of Open Access Journals

PubMed Central