GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis
Inferring subject ancestry using genetic data is an important step in genetic association studies, required for dealing with population stratification. It has become more challenging to infer subject ancestry quickly and accurately since large amounts of genotype data, collected from millions of subjects by thousands of studies using different methods, are accessible to researchers from repositories such as the database of Genotypes and Phenotypes (dbGaP) at the National Center for Biotechnology Information (NCBI). Study-reported populations submitted to dbGaP are often not harmonized across studies or may be missing. Widely used methods for ancestry prediction assume that most markers are genotyped in all subjects, but this assumption is unrealistic if one wants to combine studies that used different genotyping platforms. To provide ancestry inference and visualization across studies, we developed GRAF-pop, a new ancestry prediction method that is robust to missing genotypes and allows researchers to visualize predicted population structure in color and in three dimensions. When genotypes are dense, GRAF-pop is comparable in quality and running time to the existing ancestry inference methods EIGENSTRAT, FastPCA, and FlashPCA2, all of which rely on principal components analysis (PCA). When genotypes are not dense, GRAF-pop gives much better ancestry predictions than the PCA-based methods. GRAF-pop employs basic geometric and probabilistic methods; the visualized ancestry predictions have a natural geometric interpretation, which is lacking in PCA-based methods. Since February 2018, GRAF-pop has been successfully incorporated into the dbGaP quality control process to identify inconsistencies between study-reported and computationally predicted populations and to provide harmonized population values, based on marker genotypes, in all new dbGaP submissions amenable to population prediction.
Plots of summary population predictions produced by GRAF-pop are available on dbGaP study pages, and the software is available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi
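The abstract does not spell out the GRAF-pop algorithm, so the following is only a minimal sketch of the general idea that a distance-based comparison can tolerate missing genotypes; it is not the published method. The population names, the allele frequencies, and the functions `distance_to_population` and `nearest_population` are all hypothetical and invented for illustration. Each subject is compared with per-population reference allele frequencies using only the markers that were actually genotyped, and the distance is averaged over those markers so that subjects with different amounts of missing data remain comparable.

```python
from math import sqrt

# Hypothetical reference frequencies of the alternate allele at five markers.
# These numbers are invented for illustration and come from no real dataset.
REF_FREQS = {
    "pop_A": [0.10, 0.80, 0.30, 0.60, 0.20],
    "pop_B": [0.70, 0.20, 0.50, 0.10, 0.90],
    "pop_C": [0.40, 0.50, 0.90, 0.40, 0.50],
}


def distance_to_population(genotypes, freqs):
    """Root-mean-square gap between observed and expected allele dosage.

    genotypes: alternate-allele counts (0, 1 or 2), with None where missing.
    freqs: reference alternate-allele frequencies for one population.
    Only genotyped markers contribute, and the sum is divided by their count.
    """
    squared = [
        (g - 2.0 * p) ** 2  # expected dosage under the population is 2 * p
        for g, p in zip(genotypes, freqs)
        if g is not None
    ]
    if not squared:
        raise ValueError("subject has no genotyped markers")
    return sqrt(sum(squared) / len(squared))


def nearest_population(genotypes, ref_freqs=REF_FREQS):
    """Return the closest reference population and all computed distances."""
    distances = {
        name: distance_to_population(genotypes, freqs)
        for name, freqs in ref_freqs.items()
    }
    best = min(distances, key=distances.get)
    return best, distances


# A subject genotyped at three of the five markers.
best, distances = nearest_population([0, 2, None, 1, None])
print(best, {name: round(d, 3) for name, d in distances.items()})
```

Because each subject's distance is computed from its own genotyped markers, no subject or marker has to be dropped when studies used different genotyping platforms. This is the property the abstract contrasts with PCA-based methods, which assume most markers are genotyped in all subjects.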
Points to consider for sharing variant-level information from clinical genetic testing with ClinVar
Data sharing between laboratories, clinicians, researchers, and patients is essential for improvements and standardization in genomic medicine; encouraging genomic data sharing (GDS) is a key activity of the National Institutes of Health (NIH)-funded Clinical Genome Resource (ClinGen). The ClinGen initiative is dedicated to evaluating the clinical relevance of genes and variants for use in precision medicine and research. Currently, data originating from each of the aforementioned stakeholder groups are represented in ClinVar, a publicly available repository of genomic variation and its relationship to human health, hosted by the National Center for Biotechnology Information at the NIH. Although policies such as the 2014 NIH GDS policy are clear regarding the mandate for informed consent for broad data sharing from research participants, no clear guidance exists on the level of consent appropriate for the sharing of information obtained through clinical testing to advance knowledge. ClinGen has collaborated with ClinVar and the National Human Genome Research Institute to develop points to consider for clinical laboratories on sharing de-identified variant-level data in light of both the NIH GDS policy and the recent updates to the Common Rule. We propose specific data elements from interpreted genomic variants that are appropriate for submission to ClinVar when direct patient consent was not sought, and we describe situations in which obtaining informed consent is recommended.