41 research outputs found

    Quickly identifying identical and closely related subjects in large databases using genotype data

    No full text
    <div><p>Genome-wide association studies (GWAS) usually rely on the assumption that different samples are not from closely related individuals. Detecting duplicates and close relatives becomes more difficult, both statistically and computationally, when datasets that may have been genotyped on different platforms are combined. The dbGaP repository at the National Center for Biotechnology Information (NCBI) contains datasets from hundreds of studies with over one million samples. There are many duplicates and closely related individuals both within and across studies from different submitters, and the submitters of individual datasets cannot always identify relationships between studies. To aid in the curation of dbGaP, we developed a rapid statistical method called Genetic Relationship and Fingerprinting (GRAF) to detect duplicates and closely related samples, even when the sets of genotyped markers differ and the DNA strand orientations are unknown. GRAF extracts genotypes of 10,000 informative and independent SNPs from genotype datasets obtained using different methods, and implements fast algorithms that enable it to find all of the duplicate pairs among more than 880,000 samples within and across dbGaP studies in less than two hours. In addition, GRAF uses two statistical metrics, All Genotype Mismatch Rate (AGMR) and Homozygous Genotype Mismatch Rate (HGMR), to determine subject relationships directly from the observed genotypes, without estimating probabilities of identity by descent (IBD) or kinship coefficients, and compares the predicted relationships with those reported in the pedigree files. We implemented GRAF in a freely available C++ program of the same name. In this paper, we describe the methods in GRAF and validate its use on samples from the dbGaP repository. Other scientists can use GRAF on their own samples and in combination with samples downloaded from dbGaP.</p></div>
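The abstract names the two mismatch metrics but does not define them. As an illustration only, the sketch below assumes AGMR is the fraction of mismatched genotypes among SNPs called in both samples, and HGMR is the fraction of mismatches among SNPs where both samples are homozygous. The function name `mismatch_rates` and the 0/1/2 genotype coding (with `None` for a missing call) are hypothetical, and the sketch ignores strand orientation, which GRAF itself handles.

```python
from typing import Optional, Sequence, Tuple

# Assumed coding: number of copies of one allele (0, 1, or 2); None = missing call.
Genotype = Optional[int]


def mismatch_rates(
    a: Sequence[Genotype], b: Sequence[Genotype]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (AGMR, HGMR) for two samples genotyped at the same SNPs.

    Assumed definitions (not taken from the abstract):
      AGMR = mismatches / SNPs with a call in both samples
      HGMR = mismatches / SNPs where both samples are homozygous
    A rate is None when its denominator is zero.
    """
    both_called = any_mismatch = both_hom = hom_mismatch = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # skip SNPs missing in either sample
        both_called += 1
        if ga != gb:
            any_mismatch += 1
        if ga != 1 and gb != 1:  # both homozygous (0 or 2 copies)
            both_hom += 1
            if ga != gb:
                hom_mismatch += 1
    agmr = any_mismatch / both_called if both_called else None
    hgmr = hom_mismatch / both_hom if both_hom else None
    return agmr, hgmr


sample_a = [0, 1, 2, 2, None, 0]
sample_b = [0, 2, 0, 2, 1, 0]
agmr, hgmr = mismatch_rates(sample_a, sample_b)
```

Under these assumed definitions, duplicate samples would give an AGMR near zero (limited by genotyping error), while a parent and child would give an HGMR near zero because they always share at least one allele at each marker.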

    Probabilities of different IBD and IBS states for different relationships.

    No full text

    Running times and prediction accuracies of the sub-quadratic algorithm tested with datasets of different sample sizes and genotype missing rates.

    No full text

    Comparison of the performances of the GRAF quadratic algorithm and KING 2.0 on finding identical pairs with different numbers of SNPs with genotypes.

    No full text

    Mean HGMR and AGMR values and correlation coefficients between HGMR and AGMR of all related subjects reported in the data files submitted to dbGaP.

    No full text

    Comparison of the performances of the GRAF quadratic algorithm and KING 2.0 on finding identical pairs with different AGMR ranges.

    No full text

    An example of GRAF results displayed on dbGaP website for curators and data submitters to find discrepancies between genotypes and pedigree files submitted to dbGaP.

    No full text
    <p>The graphs show GRAF results for one dbGaP study before the submitter corrected the errors. Relationships reported by the submitter are color coded. (A) Distribution of AGMR values. Coral red: duplicate samples; Purple: monozygotic twins; Blue: first, second, or third degree relative; Gray: no relationship reported by submitter. (B) Distribution of HGMR values. Red: parent/offspring; Blue: full sibling; Green: second degree relative; Yellow: third degree relative; Gray: no relationship reported by submitter. (C) Joint distribution of HGMR and AGMR values for pairs of related samples, excluding pairs from the same subject or from monozygotic twins. Red: parent/offspring; Blue: full sibling; Green: second degree relative; Yellow: third degree relative; Gray: no relationship reported by submitter.</p>

    Comparison of GRAF and KING on determining subject relationships for four dbGaP studies.

    No full text
    <p>Relationships self-reported in the pedigree files are color coded: red = parent-offspring; blue = full sibling; green = second degree; deep yellow = third degree. Cyan lines show the cutoff values that separate the different types of relationships from one another.</p>

    Probabilities of shared alleles in pairwise sample comparisons for autosomal bi-allelic markers are derived from the list of genotype outcomes.

    No full text
    <p>Let <i>Z</i> be the number of alleles shared by descent. When no alleles are shared by descent (panel A, <i>Z</i> = 0), the chance of seeing any specific combination of alleles is the product of the respective allele frequencies. When one allele (panel B, <i>Z</i> = 1) or both alleles (panel C, <i>Z</i> = 2) are shared by descent, the number of possible genotype outcomes is reduced. The number of alleles identical by state (<i>I</i>) can be zero (panel A, lavender), one (all panels, no highlight), or two (all panels, green).</p>
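The caption's reasoning can be reproduced by listing the genotype outcomes directly. The sketch below is a minimal illustration, assuming a single bi-allelic marker with alleles labelled A and B, alleles drawn independently at their population frequencies, and I counted as 2 minus the difference in A-allele counts between the two genotypes. The function name `ibs_given_ibd` is hypothetical and is not part of GRAF.

```python
from itertools import product
from typing import Dict


def ibs_given_ibd(p: float, z: int) -> Dict[int, float]:
    """Return {I: P(I | Z = z)} for one bi-allelic marker.

    p is the frequency of allele A; z is the number of alleles shared
    by descent (0, 1, or 2). Assumes independently drawn alleles.
    """
    if z not in (0, 1, 2):
        raise ValueError("z must be 0, 1, or 2")
    freq = {"A": p, "B": 1.0 - p}
    probs = {0: 0.0, 1: 0.0, 2: 0.0}
    # Two genotypes hold four alleles; z of them are the same copy,
    # so only 4 - z alleles are drawn independently.
    for alleles in product("AB", repeat=4 - z):
        weight = 1.0
        for allele in alleles:
            weight *= freq[allele]
        shared, rest = alleles[:z], alleles[z:]
        half = len(rest) // 2
        g1 = shared + rest[:half]  # genotype of the first sample
        g2 = shared + rest[half:]  # genotype of the second sample
        ibs = 2 - abs(g1.count("A") - g2.count("A"))
        probs[ibs] += weight
    return probs


unrelated = ibs_given_ibd(0.5, 0)
one_shared = ibs_given_ibd(0.5, 1)
both_shared = ibs_given_ibd(0.5, 2)
```

With Z = 1 or Z = 2 the enumeration never produces I = 0, which matches the caption's statement that zero alleles identical by state appears only in panel A.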