Quickly identifying identical and closely related subjects in large databases using genotype data

Author: Alejandro A. Schäffer (110833)
Michael Feolo (264871)
Stephen T. Sherry (264870)
Yumi Jin (4107445)
Publication date: 13/06/2017

Genome-wide association studies (GWAS) usually rely on the assumption that different samples are not from closely related individuals. Detection of duplicates and close relatives becomes more difficult, both statistically and computationally, when one wants to combine datasets that may have been genotyped on different platforms. The dbGaP repository at the National Center for Biotechnology Information (NCBI) contains datasets from hundreds of studies with over one million samples. There are many duplicates and closely related individuals both within and across studies from different submitters. Relationships between studies cannot always be identified by the submitters of individual datasets. To aid in curation of dbGaP, we developed a rapid statistical method called Genetic Relationship and Fingerprinting (GRAF) to detect duplicates and closely related samples, even when the sets of genotyped markers differ and the DNA strand orientations are unknown. GRAF extracts genotypes of 10,000 informative and independent SNPs from genotype datasets obtained using different methods, and implements quick algorithms that enable it to find all of the duplicate pairs from more than 880,000 samples within and across dbGaP studies in less than two hours. In addition, GRAF uses two statistical metrics called All Genotype Mismatch Rate (AGMR) and Homozygous Genotype Mismatch Rate (HGMR) to determine subject relationships directly from the observed genotypes, without estimating probabilities of identity by descent (IBD) or kinship coefficients, and compares the predicted relationships with those reported in the pedigree files. We implemented GRAF in a freely available C++ program of the same name. In this paper, we describe the methods in GRAF and validate the usage of GRAF on samples from the dbGaP repository. Other scientists can use GRAF on their own samples and in combination with samples downloaded from dbGaP.
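
The two mismatch metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not define AGMR and HGMR, so the definitions used here are assumptions: AGMR is taken as the fraction of SNPs genotyped in both samples at which the genotypes differ, and HGMR as the fraction of SNPs homozygous in both samples at which the two homozygous genotypes differ. The 0/1/2 allele-count encoding, the `None` marker for missing calls, and the function name `mismatch_rates` are hypothetical, and this is not the GRAF implementation.

```python
from typing import Optional, Sequence, Tuple

# Assumed encoding: 0, 1, or 2 copies of one allele; None marks a missing call.
Genotype = Optional[int]


def mismatch_rates(
    a: Sequence[Genotype], b: Sequence[Genotype]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (AGMR, HGMR) for two samples under the assumed definitions."""
    if len(a) != len(b):
        raise ValueError("samples must be genotyped over the same SNP list")
    both_called = both_hom = all_mismatch = hom_mismatch = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # skip SNPs missing in either sample
        both_called += 1
        if ga != gb:
            all_mismatch += 1
        if ga != 1 and gb != 1:  # both genotypes are homozygous
            both_hom += 1
            if ga != gb:
                hom_mismatch += 1
    # A rate is undefined (None) when no SNP qualifies for its denominator.
    agmr = all_mismatch / both_called if both_called else None
    hgmr = hom_mismatch / both_hom if both_hom else None
    return agmr, hgmr


# Toy genotypes, invented for illustration only.
parent = [0, 1, 2, 2, None, 0, 1, 2]
child = [0, 0, 2, 1, 1, 1, 1, 2]
agmr, hgmr = mismatch_rates(parent, child)
```

In this toy pair the genotypes differ at three of the seven SNPs called in both samples, while none of the three SNPs that are homozygous in both samples mismatch. Under the assumed definitions that gives a nonzero AGMR with an HGMR of zero, the pattern one would expect for a parent and offspring in the absence of genotyping errors, since they share at least one allele at every autosomal site.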

Directory of Open Access Journals

FigShare

Probabilities of different IBD and IBS states for different relationships.

Author: Alejandro A. Schäffer (110833)
Michael Feolo (264871)
Stephen T. Sherry (264870)
Yumi Jin (4107445)

FigShare

Running times and prediction accuracies of the sub-quadratic algorithm tested with datasets of different sample sizes and genotype missing rates.

Author: Alejandro A. Schäffer (110833)
Michael Feolo (264871)
Stephen T. Sherry (264870)
Yumi Jin (4107445)

FigShare

Comparison of the performances of the GRAF quadratic algorithm and KING 2.0 on finding identical pairs with different numbers of SNPs with genotypes.

Author: Alejandro A. Schäffer (110833)
Michael Feolo (264871)
Stephen T. Sherry (264870)
Yumi Jin (4107445)

FigShare

Mean HGMR and AGMR values and correlation coefficients between HGMR and AGMR of all related subjects reported in the data files submitted to dbGaP.

Author: Alejandro A. Schäffer (110833)
Michael Feolo (264871)
Stephen T. Sherry (264870)
Yumi Jin (4107445)

FigShare

Comparison of the performances of the GRAF quadratic algorithm and KING 2.0 on finding identical pairs with different AGMR ranges.

Author: Alejandro A. Schäffer (110833)
Michael Feolo (264871)
Stephen T. Sherry (264870)
Yumi Jin (4107445)

FigShare

An example of GRAF results displayed on dbGaP website for curators and data submitters to find discrepancies between genotypes and pedigree files submitted to dbGaP.

Author: Alejandro A. Schäffer (110833)
Michael Feolo (264871)
Stephen T. Sherry (264870)
Yumi Jin (4107445)

The graphs show GRAF results for one dbGaP study before the errors were corrected by the submitter. Relationships reported by the submitter are color coded. (A) Distribution of AGMR values. Coral red: duplicate samples; Purple: monozygotic twins; Blue: first, second, or third degree relative; Gray: no relationship reported by submitter. (B) Distribution of HGMR values. Red: parent/offspring; Blue: full sibling; Green: second degree relative; Yellow: third degree relative; Gray: no relationship reported by submitter. (C) Distribution of both HGMR and AGMR values of pairs of related samples, excluding those from the same subjects or monozygotic twins. Red: parent/offspring; Blue: full sibling; Green: second degree relative; Yellow: third degree relative; Gray: no relationship reported by submitter.

FigShare

Comparison of GRAF and KING on determining subject relationships for four dbGaP studies.

Author: Alejandro A. Schäffer (110833)
Michael Feolo (264871)
Stephen T. Sherry (264870)
Yumi Jin (4107445)

Relationships self-reported in the pedigree files are color coded: red = parent-offspring; blue = full sibling; green = second degree; deep yellow = third degree. Cyan lines show the cutoff values to separate different types of relationships from one another.

FigShare

Probabilities of shared alleles in pairwise sample comparisons for autosomal bi-allelic markers are derived from the list of genotype outcomes.

Author: Alejandro A. Schäffer (110833)
Michael Feolo (264871)
Stephen T. Sherry (264870)
Yumi Jin (4107445)

When no alleles are shared by descent (panel A, Z = 0, where Z is the number of alleles shared by descent), the chance of seeing any specific combination of alleles is the product of the respective allele frequencies. When one allele (panel B, Z = 1) or both alleles (panel C, Z = 2) are shared by descent, the number of possible genotype outcomes is reduced. The number of alleles identical by state (I) can be zero (panel A, lavender), one (all panels, no highlight), or two (all panels, green).
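
The enumeration described in this caption can be sketched in a few lines. The following is a minimal illustration, not the authors' calculation: it assumes a single bi-allelic marker in Hardy-Weinberg equilibrium with allele frequency `p`, draws every allele that is not shared by descent independently, and counts identity by state on unphased genotypes as I = 2 − |g1 − g2|, where g is the number of copies of allele A. That counting convention is an assumption and may differ from the one used in the figure. The function name `ibs_given_ibd` is hypothetical.

```python
from itertools import product
from typing import Dict


def ibs_given_ibd(p: float, z: int) -> Dict[int, float]:
    """P(I = i | Z = z) for one bi-allelic marker with allele-A frequency p."""
    if z not in (0, 1, 2):
        raise ValueError("z must be 0, 1, or 2")
    freq = {1: p, 0: 1.0 - p}  # 1 = allele A, 0 = allele B
    n_draws = 4 - z  # independent allele draws: 4, 3, or 2
    probs = {0: 0.0, 1: 0.0, 2: 0.0}
    for alleles in product((0, 1), repeat=n_draws):
        # Probability of this outcome: product of the allele frequencies.
        weight = 1.0
        for allele in alleles:
            weight *= freq[allele]
        if z == 0:  # four unrelated alleles
            g1, g2 = alleles[0] + alleles[1], alleles[2] + alleles[3]
        elif z == 1:  # alleles[0] is the allele shared by descent
            g1, g2 = alleles[0] + alleles[1], alleles[0] + alleles[2]
        else:  # both alleles shared, so the genotypes are identical
            g1 = g2 = alleles[0] + alleles[1]
        # Assumed IBS convention for unphased genotypes.
        probs[2 - abs(g1 - g2)] += weight
    return probs


table = {z: ibs_given_ibd(0.3, z) for z in (0, 1, 2)}
```

Enumerating outcomes rather than hard-coding closed-form expressions keeps the sketch close to the "list of genotype outcomes" the caption refers to, and the three probabilities for each value of Z sum to one by construction.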

FigShare

Calculation of P(I|Z) value for each marker.

Author: Alejandro A. Schäffer (110833)
Michael Feolo (264871)
Stephen T. Sherry (264870)
Yumi Jin (4107445)

FigShare