
    Novel variation and <i>de novo</i> mutation rates in population-wide <i>de novo</i> assembled Danish trios

    Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50×). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts, showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e−8 and 1.5e−9 per nucleotide per generation for SNVs and indels, respectively.

    A 454 multiplex sequencing method for rapid and reliable genotyping of highly polymorphic genes in large-scale studies

    Background: High-throughput sequencing technologies offer new perspectives for biomedical, agronomical and evolutionary research. Promising progress now concerns the application of these technologies to large-scale studies of genetic variation. Such studies require the genotyping of high numbers of samples. This is theoretically possible using 454 pyrosequencing, which generates billions of base pairs of sequence data. However, several challenges arise: first, in the attribution of each read produced to its original sample, and second, in bioinformatic analyses to distinguish true from artifactual sequence variation. This pilot study proposes a new application for the 454 GS FLX platform, allowing the individual genotyping of thousands of samples in one run. A probabilistic model has been developed to demonstrate the reliability of this method. Results: DNA amplicons from 1,710 rodent samples were individually barcoded using a combination of tags located in forward and reverse primers. Amplicons consisted of 222 bp fragments corresponding to DRB exon 2, a highly polymorphic gene in mammals. A total of 221,789 reads were obtained, of which 153,349 were finally assigned to original samples. Rules based on a probabilistic model and a four-step procedure were developed to validate sequences and provide a confidence level for each genotype. The method gave promising results, with the genotyping of DRB exon 2 sequences for 1,407 samples from 24 different rodent species and the sequencing of 392 variants in one half of a 454 run. Using replicates, we estimated that the reproducibility of genotyping reached 95%. Conclusions: This new approach is a promising alternative to classical methods involving electrophoresis-based techniques for variant separation and cloning-sequencing for sequence determination. The 454 system is less costly and less time-consuming and may enhance the reliability of genotypes obtained when high numbers of samples are studied. It opens up new perspectives for the study of evolutionary and functional genetics of highly polymorphic genes like major histocompatibility complex genes in vertebrates or loci regulating self-compatibility in plants. Important applications in biomedical research will include the detection of individual variation in disease susceptibility. Similarly, agronomy will benefit from this approach, through the study of genes implicated in productivity or disease susceptibility traits.
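The read-to-sample attribution step described in this abstract (each sample identified by a combination of a forward-primer tag and a reverse-primer tag) can be illustrated with a minimal sketch. The tag sequences, tag length, and sample names below are hypothetical, and the sketch assumes exact tag matches with the reverse tag read as-is at the 3' end; it is not the authors' pipeline and omits their probabilistic validation rules.

```python
from collections import defaultdict

# Hypothetical tags and sample names, for illustration only.
FORWARD_TAGS = {"ACGTAC": "F1", "TGCATG": "F2"}
REVERSE_TAGS = {"GATCGA": "R1", "CTAGCT": "R2"}
SAMPLE_BY_TAG_PAIR = {
    ("F1", "R1"): "sample_A",
    ("F1", "R2"): "sample_B",
    ("F2", "R1"): "sample_C",
}
TAG_LENGTH = 6  # assumed fixed tag length


def assign_reads(reads):
    """Assign each read to a sample via its (forward, reverse) tag pair.

    Returns (reads grouped by sample with tags trimmed, unassigned reads).
    """
    by_sample = defaultdict(list)
    unassigned = []
    for read in reads:
        fwd = FORWARD_TAGS.get(read[:TAG_LENGTH])
        rev = REVERSE_TAGS.get(read[-TAG_LENGTH:])
        sample = SAMPLE_BY_TAG_PAIR.get((fwd, rev))
        if sample is None:
            # Unknown tag, or a tag pair that maps to no sample.
            unassigned.append(read)
        else:
            by_sample[sample].append(read[TAG_LENGTH:-TAG_LENGTH])
    return dict(by_sample), unassigned


reads = [
    "ACGTAC" + "TTTT" + "GATCGA",  # F1 + R1 -> sample_A
    "ACGTAC" + "GGGG" + "CTAGCT",  # F1 + R2 -> sample_B
    "TGCATG" + "CCCC" + "CTAGCT",  # F2 + R2 -> no such sample
    "NNNNNN" + "AAAA" + "GATCGA",  # unreadable forward tag
]
by_sample, unassigned = assign_reads(reads)
```

Using tag pairs rather than one tag per sample keeps the number of distinct primers small. With the figures reported above, 153,349 of 221,789 reads (roughly 69%) were assigned to a sample.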

    SNPest: a probabilistic graphical model for estimating genotypes

    BACKGROUND: As the use of next-generation sequencing technologies is becoming more widespread, the need for robust software to help with the analysis is growing as well. A key challenge when analyzing sequencing data is the prediction of genotypes from the reads, i.e. correct inference of the underlying DNA sequences that gave rise to the sequenced fragments. For diploid organisms, the genotyper should be able to predict both alleles in the individual. Variations between the individual and the population can then be analyzed by looking for SNPs (single nucleotide polymorphisms) in order to investigate diseases or phenotypic features. To perform robust and high-confidence genotyping and SNP calling, methods are needed that take the technology-specific limitations into account and can model different sources of error. As an example, ancient DNA poses special challenges as the data is often shallow and subject to errors induced by post mortem damage. FINDINGS: We present a novel approach to the genotyping problem where a probabilistic framework describing the process from sampling to sequencing is implemented as a graphical model. This makes it possible to model technology-specific errors and other sources of variation that can affect the result. The inferred genotype is given a posterior probability to signify the confidence in the result. SNPest has already been used for genotyping in large-scale projects such as the first ancient human genome published in 2010. CONCLUSIONS: We compare the performance of SNPest to a number of other widely used genotypers on both real and simulated data, covering both haploid and diploid genomes. We investigate the effects of read depth, of removing adapters before mapping and genotyping, of using different mapping tools, and of using the correct model in the genotyping process. We show that the performance of SNPest is comparable to existing methods, and we also illustrate cases where SNPest has an advantage over other methods, e.g. when dealing with simulated ancient DNA. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1756-0500-7-698) contains supplementary material, which is available to authorized users.
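To make the idea of a genotype posterior probability concrete, here is a deliberately simplified sketch for a single diploid site. It assumes a uniform prior over the ten unordered genotypes and one symmetric per-base error rate; the function name and defaults are illustrative. It is not SNPest's graphical model, which the abstract says also covers technology-specific errors and other sources of variation.

```python
from itertools import combinations_with_replacement
from math import exp, log

BASES = "ACGT"


def genotype_posteriors(observed_bases, error_rate=0.01):
    """Posterior probability of each diploid genotype at one site.

    Assumptions (illustrative only): uniform genotype prior, independent
    reads, each read sampled from either allele with probability 0.5, and
    a miscalled base equally likely to be any of the three other bases.
    """
    log_likelihoods = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        total = 0.0
        for base in observed_bases:
            p1 = 1 - error_rate if base == a1 else error_rate / 3
            p2 = 1 - error_rate if base == a2 else error_rate / 3
            total += log(0.5 * p1 + 0.5 * p2)
        log_likelihoods[a1 + a2] = total
    # With a uniform prior the posterior is the normalized likelihood.
    # Subtract the maximum before exponentiating for numerical stability.
    max_ll = max(log_likelihoods.values())
    weights = {g: exp(ll - max_ll) for g, ll in log_likelihoods.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


posteriors = genotype_posteriors("AAAAGGGG")
best_genotype = max(posteriors, key=posteriors.get)
```

The posterior of the best genotype serves as a confidence value: at shallow depth, as with ancient DNA, it stays well below 1 because several genotypes remain plausible.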

    Exploiting the great potential of Sequence Capture data by a new tool, SUPER-CAP

    The recent development of Sequence Capture methodology represents a powerful strategy for enhancing data generation to assess genetic variation of targeted genomic regions. Here, we present SUPER-CAP, a bioinformatics web tool aimed at handling Sequence Capture data, accurately calculating the allele frequency of variants and building genotype-specific sequences of captured genes. The dataset used to develop this in silico strategy consists of 378 loci and related regulatory regions in a collection of 44 tomato landraces. About 14,000 high-quality variants were identified. The high depth of coverage (>40×) and the adoption of correct filtering criteria allowed identification of about 4,000 rare variants and 10 genes showing copy number variation. We also show that the tool is capable of reconstructing genotype-specific sequences for each genotype by using the detected variants. This allows evaluating the combined effect of multiple variants in the same protein. The architecture and functionality of SUPER-CAP make the software appropriate for a broad set of analyses including SNP discovery and mining. Its functionality, together with the capability to process large data sets and efficient detection of sequence variation, makes SUPER-CAP a valuable bioinformatics tool for genomics and breeding purposes.

    Global spatiotemporal and genetic footprint o

    Mining for Structural Variations in Next-Generation Sequencing Data

    Genomic structural variations (SVs) are genetic alterations that result in duplications, insertions, deletions, inversions, and translocations of segments of DNA covering 50 or more base pairs. By changing the organization of DNA, SVs can contribute to phenotypic variation or cause pathological consequences such as neurobehavioral disorders, autoimmune diseases, obesity, and cancers. SVs were first examined using classic cytogenetic methods, revealing changes down to 3 Mb. Later techniques for SV detection were based on array comparative genome hybridization (aCGH) and single-nucleotide polymorphism (SNP) arrays. Next-generation sequencing (NGS) approaches enabled precise characterization of breakpoints of SVs of various types and sizes at a genome-wide scale. Dissecting SVs from NGS data presents a substantial challenge due to the relatively short sequence reads and the large volume of the data. Benign variants and reference errors in the genome add another dimension of complexity. Even though a wide range of tools is available, the usage of SV callers in routine molecular diagnostics is still limited. SV detection algorithms rely on different properties of the underlying data and vary in accuracy and sensitivity; therefore, the SV detection process usually utilizes multiple variant callers. This chapter summarizes strengths and limitations of different tools in effective NGS SV calling.

    Graph pangenome captures missing heritability and empowers tomato breeding

    Missing heritability in genome-wide association studies defines a major problem in genetic analyses of complex biological traits(1,2). The solution to this problem is to identify all causal genetic variants and to measure their individual contributions(3,4). Here we report a graph pangenome of tomato constructed by precisely cataloguing more than 19 million variants from 838 genomes, including 32 new reference-level genome assemblies. This graph pangenome was used for genome-wide association study analyses and heritability estimation of 20,323 gene-expression and metabolite traits. The average estimated trait heritability is 0.41 compared with 0.33 when using the single linear reference genome. This 24% increase in estimated heritability is largely due to resolving incomplete linkage disequilibrium through the inclusion of additional causal structural variants identified using the graph pangenome. Moreover, by resolving allelic and locus heterogeneity, structural variants improve the power to identify genetic factors underlying agronomically important traits leading to, for example, the identification of two new genes potentially contributing to soluble solid content. The newly identified structural variants will facilitate genetic improvement of tomato through both marker-assisted selection and genomic selection. Our study advances the understanding of the heritability of complex traits and demonstrates the power of the graph pangenome in crop breeding.
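The 24% figure in this abstract is a relative increase, not a difference in heritability points (0.41 minus 0.33 is 0.08). A short check, using only the two averages quoted above and a hypothetical helper name:

```python
def relative_increase(new, old):
    """Fractional increase of `new` over `old`."""
    return (new - old) / old


graph_h2 = 0.41   # average estimated heritability with the graph pangenome
linear_h2 = 0.33  # average estimated heritability with the linear reference

print(f"{relative_increase(graph_h2, linear_h2):.1%}")  # prints 24.2%
```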

    MALVA: Genotyping by Mapping-free ALlele Detection of Known VAriants

    The amount of genetic variation discovered in human populations is growing rapidly, leading to challenging computational tasks, such as variant calling. Standard methods for addressing this problem include read mapping, a computationally expensive procedure; thus, mapping-free tools have been proposed in recent years. These tools focus on isolated, biallelic SNPs, providing limited support for multi-allelic SNPs and short insertions and deletions of nucleotides (indels). Here we introduce MALVA, a mapping-free method to genotype an individual from a sample of reads. MALVA is the first mapping-free tool able to genotype multi-allelic SNPs and indels, even in high-density genomic regions, and to effectively handle a huge number of variants. MALVA requires one order of magnitude less time to genotype a donor than alignment-based pipelines, providing similar accuracy. Remarkably, on indels, MALVA provides even better results than the most widely adopted variant discovery tools. Subject areas: Biological Sciences; Genetics; Genomics; Bioinformatics
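The general idea behind mapping-free genotyping of known variants can be sketched as follows: count k-mers directly in the reads, then compare the support for k-mers that are specific to each allele. This toy version handles one biallelic variant with a fixed threshold. The function names, the value of k, and the threshold are assumptions for illustration; this is not MALVA's actual algorithm or data structures.

```python
from collections import Counter


def kmers(seq, k):
    """All k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def allele_signature(left, allele, right, k):
    """Set of k-mers that overlap the allele within left + allele + right."""
    seq = left + allele + right
    start, end = len(left), len(left) + len(allele)
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if i < end and i + k > start
    }


def mean_count(counts, signature):
    """Average read count of the k-mers in a signature (0 if empty)."""
    if not signature:
        return 0.0
    return sum(counts[km] for km in signature) / len(signature)


def genotype(reads, left, ref, alt, right, k=5, min_fraction=0.2):
    """Call 0/0, 0/1, 1/1 or ./. for one known variant without mapping."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    ref_sig = allele_signature(left, ref, right, k)
    alt_sig = allele_signature(left, alt, right, k)
    # Only k-mers unique to one allele are informative.
    ref_support = mean_count(counts, ref_sig - alt_sig)
    alt_support = mean_count(counts, alt_sig - ref_sig)
    total = ref_support + alt_support
    if total == 0:
        return "./."  # no evidence for either allele
    alt_fraction = alt_support / total
    if alt_fraction < min_fraction:
        return "0/0"
    if alt_fraction > 1 - min_fraction:
        return "1/1"
    return "0/1"


LEFT, RIGHT = "ACGTTGCA", "GGATCCTA"  # hypothetical flanking sequence
ref_read = LEFT + "A" + RIGHT
alt_read = LEFT + "T" + RIGHT
call = genotype([ref_read] * 3 + [alt_read] * 3, LEFT, "A", "T", RIGHT)
```

Averaging over signature k-mers, rather than summing, keeps the comparison fair when the two alleles differ in length, as with indels. A real tool also has to deal with signature k-mers that occur elsewhere in the genome, which this sketch ignores.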

    GCAT|Panel, a comprehensive structural variant haplotype map of the Iberian population from high-coverage whole-genome sequencing

    The combined analysis of haplotype panels with phenotype clinical cohorts is a common approach to explore the genetic architecture of human diseases. However, genetic studies are mainly based on single nucleotide variants (SNVs) and small insertions and deletions (indels). Here, we help fill this gap by generating a dense haplotype map focused on the identification, characterization, and phasing of structural variants (SVs). By integrating multiple variant identification methods and Logistic Regression Models (LRMs), we present a catalogue of 35 431 441 variants, including 89 178 SVs (≥50 bp), 30 325 064 SNVs and 5 017 199 indels, across 785 Illumina high-coverage (30×) whole genomes from the Iberian GCAT Cohort, containing a median of 3.52M SNVs, 606 336 indels and 6393 SVs per individual. The haplotype panel is able to impute up to 14 360 728 SNVs/indels and 23 179 SVs, showing a 2.7-fold increase for SVs compared with available genetic variation panels. The value of this panel for SV analysis is shown through an imputed rare Alu element located in a new locus associated with Mononeuritis of lower limb, a rare neuromuscular disease. This study represents the first deep characterization of genetic variation within the Iberian population and the first operational haplotype panel to systematically include SVs in genome-wide genetic studies.
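The catalogue counts in this abstract use spaces as thousands separators, so they need normalizing before any arithmetic. The sketch below, with a hypothetical helper name, parses the quoted figures and checks that the three variant classes add up to the stated total.

```python
def parse_grouped_int(text):
    """Parse an integer written with whitespace as thousands separator."""
    return int("".join(text.split()))


total = parse_grouped_int("35 431 441")
parts = {
    "SVs": parse_grouped_int("89 178"),
    "SNVs": parse_grouped_int("30 325 064"),
    "indels": parse_grouped_int("5 017 199"),
}

# The three classes reported in the abstract sum exactly to the total.
assert sum(parts.values()) == total
```

The same helper works for the per-individual medians and imputation counts quoted above.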