24 research outputs found

    Inference of Relationships in Population Data Using Identity-by-Descent and Identity-by-State

    It is an assumption of large, population-based datasets that samples are annotated accurately whether they correspond to known relationships or unrelated individuals. These annotations are key for a broad range of genetics applications. While many methods are available to assess relatedness that involve estimates of identity-by-descent (IBD) and/or identity-by-state (IBS) allele-sharing proportions, we developed a novel approach that estimates IBD0, 1, and 2 based on observed IBS within windows. When combined with genome-wide IBS information, it provides an intuitive and practical graphical approach with the capacity to analyze datasets with thousands of samples without prior information about relatedness between individuals or haplotypes. We applied the method to a commonly used Human Variation Panel consisting of 400 nominally unrelated individuals. Surprisingly, we identified identical, parent-child, and full-sibling relationships and reconstructed pedigrees. In two instances non-sibling pairs of individuals in these pedigrees had unexpected IBD2 levels, as well as multiple regions of homozygosity, implying inbreeding. This combined method allowed us to distinguish related individuals from those having atypical heterozygosity rates and determine which individuals were outliers with respect to their designated population. Additionally, it becomes increasingly difficult to identify distant relatedness using genome-wide IBS methods alone. However, our IBD method further identified distant relatedness between individuals within populations, supported by the presence of megabase-scale regions lacking IBS0 across individual chromosomes. We benchmarked our approach against the hidden Markov model of a leading software package (PLINK), showing improved calling of distantly related individuals, and we validated it using a known pedigree from a clinical study. The application of this approach could improve genome-wide association, linkage, heterozygosity, and other population genomics studies that rely on SNP genotype data.
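
The abstract above describes estimating IBD from IBS observed within windows but does not spell out the procedure. As a rough illustration of the first step only, the following Python sketch tallies IBS0, IBS1, and IBS2 calls per window for one pair of individuals. The genotype coding (0 = AA, 1 = AB, 2 = BB), the non-overlapping fixed-size windows, and the function names are assumptions made for this example, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def ibs_state(g1: int, g2: int) -> int:
    """Number of alleles shared identical-by-state at one SNP.

    Genotypes are coded as the count of B alleles (0 = AA, 1 = AB, 2 = BB),
    which is an assumed encoding for this sketch.
    """
    return 2 - abs(g1 - g2)


def windowed_ibs_counts(
    geno_a: Sequence[int], geno_b: Sequence[int], window_size: int
) -> List[Tuple[int, int, int]]:
    """Return (IBS0, IBS1, IBS2) counts for each non-overlapping window."""
    if len(geno_a) != len(geno_b):
        raise ValueError("genotype vectors must have the same length")
    counts = []
    for start in range(0, len(geno_a), window_size):
        tally = [0, 0, 0]
        pairs = zip(geno_a[start:start + window_size],
                    geno_b[start:start + window_size])
        for ga, gb in pairs:
            tally[ibs_state(ga, gb)] += 1
        counts.append((tally[0], tally[1], tally[2]))
    return counts


if __name__ == "__main__":
    a = [0, 1, 2, 2, 0, 1]
    b = [2, 1, 2, 1, 0, 0]
    for ibs0, ibs1, ibs2 in windowed_ibs_counts(a, b, window_size=3):
        # A window with no IBS0 calls is compatible with IBD1 or IBD2 sharing.
        print(ibs0, ibs1, ibs2, "no IBS0" if ibs0 == 0 else "has IBS0")
```

A window in which IBS0 never occurs is only compatible with, not proof of, IBD sharing; turning such tallies into IBD0/1/2 estimates requires the statistical step that the abstract attributes to the published method.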

    Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples.

    The adoption of whole-genome sequencing within the public health realm for molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne Salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use one must be able to quickly interrogate them to accurately determine the genetic distances among a set of samples. Being able to do so is challenging due to both biological (evolutionarily diverse samples) and computational (petabytes of sequence data) issues. We evaluated seven measures of genetic distance, which were estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). When analyzing empirical data (whole-genome sequence data from 18,997 Salmonella isolates) there are features (e.g., genomic, assembly, and contamination) that cause distances inferred from k-mer profiles, which treat absent data as informative, to fail to accurately capture the distance between samples when compared to distances inferred from differences in nucleotide sites. Thus, site-based distances, like NUCmer and extended MLST, are superior in performance, but accessing the computing resources necessary to perform them may be challenging when analyzing large databases.
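
To make the k-mer profile distances named above concrete, here is a minimal Python sketch of Jaccard, Manhattan, and Euclidean distances computed on exact k-mer counts, plus the Mash distance formula applied to a Jaccard similarity. It is an illustration under stated assumptions: the function names are hypothetical, exact k-mer sets stand in for the MinHash sketches that Mash actually uses, and returning 1.0 for a Jaccard similarity of zero is a convention chosen for this example.

```python
import math
from collections import Counter


def kmer_profile(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distance(p: Counter, q: Counter) -> float:
    """1 - |A and B| / |A or B| over the sets of k-mers present."""
    a, b = set(p), set(q)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def manhattan_distance(p: Counter, q: Counter) -> int:
    """Sum of absolute count differences; absent k-mers count as zero."""
    return sum(abs(p[m] - q[m]) for m in set(p) | set(q))


def euclidean_distance(p: Counter, q: Counter) -> float:
    """Square root of summed squared count differences."""
    return math.sqrt(sum((p[m] - q[m]) ** 2 for m in set(p) | set(q)))


def mash_distance(jaccard_similarity: float, k: int) -> float:
    """Mash distance D = -(1/k) * ln(2j / (1 + j)) for Jaccard similarity j."""
    if jaccard_similarity <= 0.0:
        return 1.0  # assumed convention for this sketch
    return -math.log(2 * jaccard_similarity / (1 + jaccard_similarity)) / k


if __name__ == "__main__":
    full = kmer_profile("ACGTACGT", k=3)
    truncated = kmer_profile("ACGT", k=3)  # same sequence, half missing
    print(jaccard_distance(full, truncated))
    print(manhattan_distance(full, truncated))
    print(euclidean_distance(full, truncated))
```

In the usage example the second sequence is simply a truncated copy of the first, yet every distance is nonzero. This mirrors the abstract's point that k-mer profiles treat absent data as informative, so incomplete assemblies can inflate the apparent distance.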

    Unexpected Relationships and Inbreeding in HapMap Phase III Populations

    Correct annotation of the genetic relationships between samples is essential for population genomic studies, which could be biased by errors or omissions. To this end, we used identity-by-state (IBS) and identity-by-descent (IBD) methods to assess genetic relatedness of individuals within HapMap phase III data. We analyzed data from 1,397 individuals across 11 ethnic populations. Our results support previous studies (Pemberton et al., 2010; Kyriazopoulou-Panagiotopoulou et al., 2011) assessing unknown relatedness present within this population. Additionally, we present evidence for 1,657 novel pairwise relationships across 9 populations. Surprisingly, significant Cotterman's coefficients of relatedness K1 (IBD1) values were detected between pairs of known parents. Furthermore, significant K2 (IBD2) values were detected in 32 previously annotated parent-child relationships. Consistent with a hypothesis of inbreeding, regions of homozygosity (ROH) were identified in the offspring of related parents, of which a subset overlapped those reported in previous studies (Gibson et al. 2010; Johnson et al. 2011). In total, we inferred 28 inbred individuals with ROH that overlapped areas of relatedness between the parents and/or IBD2 sharing at a different genomic locus between a child and a parent. Finally, 8 previously annotated parent-child relationships had unexpected K0 (IBD0) values (resulting from a chromosomal abnormality or genotype error), and 10 previously annotated second-degree relationships along with 38 other novel pairwise relationships had unexpected IBD2 (indicating two separate paths of recent ancestry). These newly described types of relatedness may impact the outcome of previous studies and should inform the design of future studies relying on the HapMap Phase III resource.
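
For readers unfamiliar with Cotterman coefficients, the sketch below lists the textbook expected (K0, K1, K2) values for common outbred relationships, computes the coefficient of relatedness r = K1/2 + K2, and assigns an observed triple to the nearest expected one. The nearest-neighbour rule and the function names are assumptions made purely for illustration; the abstract does not describe how relationships were called, and this is not presented as the authors' procedure.

```python
from typing import Dict, Tuple

# Expected (K0, K1, K2) for outbred pairs; standard textbook values.
EXPECTED: Dict[str, Tuple[float, float, float]] = {
    "identical": (0.0, 0.0, 1.0),
    "parent-child": (0.0, 1.0, 0.0),
    "full-sibling": (0.25, 0.5, 0.25),
    "second-degree": (0.5, 0.5, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}


def relatedness(k1: float, k2: float) -> float:
    """Coefficient of relatedness r = K1/2 + K2."""
    return 0.5 * k1 + k2


def nearest_relationship(k0: float, k1: float, k2: float) -> str:
    """Label with the smallest squared distance to the observed triple.

    A deliberately naive rule (an assumption of this sketch); it cannot
    represent 'no relationship called' or distant relatedness.
    """
    observed = (k0, k1, k2)
    return min(
        EXPECTED,
        key=lambda name: sum((o - e) ** 2 for o, e in zip(observed, EXPECTED[name])),
    )


if __name__ == "__main__":
    # A nominal parent-child pair with a small amount of IBD2, the kind of
    # deviation the abstract interprets as a sign of inbreeding.
    k0, k1, k2 = 0.0, 0.96, 0.04
    print(nearest_relationship(k0, k1, k2), relatedness(k1, k2))
```

Note that parent-child and full-sibling pairs both have r = 0.5, which is why the full (K0, K1, K2) triple rather than r alone is needed to tell them apart.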

    IBD estimates of previously annotated and novel relatedness in phase III HapMap.

    Each circle represents a pair of individuals with estimated Cotterman coefficients of relatedness K0, K1, and K2 (percent of the genome shared IBD0, IBD1, and IBD2). (A) Previously annotated relationships given by the International HapMap Consortium [2], Pemberton et al. [19], and Kyriazopoulou-Panagiotopoulou et al. [20] were plotted by group (x-axis) and K1 values (y-axis) and labeled by their degree of relationship. Arrow 1 corresponds to identical samples NA21737/NA21344. (B) Unexpected K2 values (y-axis) in previously annotated parent-child and second-degree relatedness for each group (x-axis). Only K2 values greater than 0.001 are shown. Arrow 2 corresponds to NA21362/NA21438. (C) Estimated IBD0 (y-axis) in previously annotated parent-child relationships for each group (x-axis). Only K0 values greater than 0.001 are shown. Arrows 3–5 highlight NA12874/NA12865, NA12889/NA12877, and NA10863/NA12234, respectively. Only K2 values greater than 0.001 are shown. (D) Novel relatedness between pairs of individuals separated by group (x-axis) and estimated K1 (y-axis). Only K1 values greater than 0.025 are shown. (E) Novel relatedness between pairs of individuals previously identified in Panel B for MKK and MXL (x-axis) with unexpected K2 (y-axis). (F) Inferred degrees of relationship (including those unable to be called; x-axis) plotted as a function of K1. All 2260 pairwise comparisons inferred to be related from any study (including this one) are shown, excluding identical samples. Note the overlap between percent of genome shared IBD1 and degree of relationship. Abbreviation: NC, no relationship called; r value, relatedness value.

    Reconstruction of a partial pedigree from the MKK group.

    We analyzed MKK genotype data using IBD analysis and inferred the familial relationships of 61 individuals with 46 being related to at least 1 other person. This graph contains relationships constructed from second-degree, full-sibling, parent-child, and identical relationships (with the exception of NA21352 and NA21351 who are inferred to be first cousins based on their second-degree relationship to NA21414; see top left of figure). All indicated relationships are based on previous analysis (siblings: thick green lines), previous annotation (family trios; family ID), and inferred analyses (sibling relationships, thick blue lines; corrected parent-child orientation, thick red lines; corrections made to annotated relationships, thick yellow lines; other familial relationships, thin black lines). Dashed rectangles indicate family units annotated by the HapMap project at the Coriell website. F indicates family identifier (e.g. F2654). Individual identifiers are shown as the last three digits of NA21xxx (e.g. 353 at the upper left of the figure corresponds to individual NA21353). All IBD information is given in Table S1. Note that several individuals who are part of MKK (e.g. NA12310 in family 2566) and for whom cell lines were created did not have SNP data as part of the HapMap Phase III release.

    Evidence for consanguinity in HapMap Phase III individuals.

    Pairwise comparisons of IBS were plotted across a chromosome by position for pairs of individuals that had unexpected IBD1 and IBD2 for their relationship type. (A) IBS observations for two parents (YRI father/mother NA18504/NA18505) are shown for chromosome 4. Note region 1 which indicates an absence of IBS0 calls and inferred IBD1 status. (B) IBS measurements between father and son (NA18504/NA18503) are plotted for chromosome 4. Note region 2 in which there are few IBS0 and IBS1 calls thus implying IBD2 status. (C) Genotypes of the son (NA18503) are shown for chromosome 4. Note region 3 in which there is a lack of AB calls, aligning with region 1, thus indicating autozygosity. (D) Ideogram for chromosome 4. (E) IBS observations between two YRI parents (father/mother NA19121/NA19122) are plotted along chromosome 20. Note region 1 in which there is a lack of IBS0 calls indicating an IBD1 region. (F) Genotypes of the son (NA19123) are shown for chromosome 20. Note region 1 in which there are zero AB calls in the same region of IBD1 between the parents implying autozygosity in the child. (G) Ideogram for chromosome 20.
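
The regions highlighted in this caption are stretches in which one kind of call is absent: no IBS0 between the parents, or no AB genotypes in the child. A minimal sketch of how such stretches might be located is shown below. The function name, the half-open index intervals, and the `min_length` threshold in SNPs are hypothetical choices for this example; a practical scan would work in physical positions and tolerate occasional genotyping errors.

```python
from typing import List, Sequence, Tuple


def runs_without(
    calls: Sequence[str], forbidden: str, min_length: int
) -> List[Tuple[int, int]]:
    """Find maximal runs of consecutive calls that never equal `forbidden`.

    Returns half-open index intervals (start, end) whose length is at
    least `min_length` calls.
    """
    runs = []
    start = None  # index where the current run began, if any
    for i, call in enumerate(calls):
        if call == forbidden:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
        elif start is None:
            start = i
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(calls) - start >= min_length:
        runs.append((start, len(calls)))
    return runs


if __name__ == "__main__":
    child = ["AB", "AA", "BB", "AA", "AA", "AB", "BB", "AB"]
    # Candidate region of homozygosity: at least 3 consecutive non-AB calls.
    print(runs_without(child, forbidden="AB", min_length=3))
```

The same function could be applied to a sequence of per-SNP IBS labels with the IBS0 label as the forbidden value to flag candidate IBD regions between two individuals.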