
    A Poisson hierarchical modelling approach to detecting copy number variation in sequence coverage data.

    BACKGROUND: The advent of next generation sequencing technology has accelerated efforts to map and catalogue copy number variation (CNV) in genomes of important micro-organisms for public health. A typical analysis of the sequence data involves mapping reads onto a reference genome, calculating the respective coverage, and detecting regions with too-low or too-high coverage (deletions and amplifications, respectively). Current CNV detection methods rely on statistical assumptions (e.g., a Poisson model) that may not hold in general, or require fine-tuning the underlying algorithms to detect known hits. We propose a new CNV detection methodology based on two Poisson hierarchical models, the Poisson-Gamma and Poisson-Lognormal, with the advantage of being sufficiently flexible to describe different data patterns, whilst robust against deviations from the often assumed Poisson model. RESULTS: Using sequence coverage data of 7 Plasmodium falciparum malaria genomes (3D7 reference strain, HB3, DD2, 7G8, GB4, OX005, and OX006), we showed that empirical coverage distributions are intrinsically asymmetric and overdispersed in relation to the Poisson model. We also demonstrated a low baseline false positive rate for the proposed methodology using 3D7 resequencing data and simulation. When applied to the non-reference isolate data, our approach detected known CNV hits, including an amplification of the PfMDR1 locus in DD2 and a large deletion in the CLAG3.2 gene in GB4, and putative novel CNV regions. When compared to the recently available FREEC and cn.MOPS approaches, our findings were more concordant with putative hits from the highest quality array data for the 7G8 and GB4 isolates. CONCLUSIONS: In summary, the proposed methodology brings an increase in flexibility, robustness, accuracy and statistical rigour to CNV detection using sequence coverage data
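The overdispersion described in this abstract can be illustrated with a small calculation. The following is a minimal sketch, not the authors' method: it assumes a Poisson-Gamma mixture (counts are Poisson with a Gamma-distributed rate) and estimates the Gamma parameters by simple moment matching. The function names and the example counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson data, above 1 when overdispersed."""
    return float(variance(counts)) / float(mean(counts))


def fit_poisson_gamma_moments(counts):
    """Moment-matching estimates (shape, rate) for a Poisson-Gamma mixture.

    Assumes X | lam ~ Poisson(lam) with lam ~ Gamma(shape, rate), so that
    E[X] = shape / rate and Var[X] = shape / rate + shape / rate**2.
    """
    m = float(mean(counts))
    v = float(variance(counts))
    if v <= m:
        raise ValueError("no overdispersion: variance does not exceed the mean")
    rate = m / (v - m)   # from Var[X] - E[X] = E[X] / rate
    shape = m * rate     # from E[X] = shape / rate
    return shape, rate


# Hypothetical per-window read counts, for illustration only.
coverage = [10, 20, 30, 40, 50]
shape, rate = fit_poisson_gamma_moments(coverage)
```

A dispersion index well above 1 is the kind of departure from the plain Poisson model that the abstract reports for empirical coverage distributions.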

    PolyTB: a genomic variation map for Mycobacterium tuberculosis

    Tuberculosis (TB) caused by Mycobacterium tuberculosis (Mtb) is the second major cause of death from an infectious disease worldwide. Recent advances in DNA sequencing are leading to the ability to generate whole genome information in clinical isolates of M. tuberculosis complex (MTBC). The identification of informative genetic variants such as phylogenetic markers and those associated with drug resistance or virulence will help barcode Mtb in the context of epidemiological, diagnostic and clinical studies. Mtb genomic datasets are increasingly available as raw sequences, which are potentially difficult and computer intensive to process, and compare across studies. Here we have processed the raw sequence data (>1500 isolates, eight studies) to compile a catalogue of SNPs (n = 74,039, 63% non-synonymous, 51.1% in more than one isolate, i.e. non-private), small indels (n = 4810) and larger structural variants (n = 800). We have developed the PolyTB web-based tool (http://pathogenseq.lshtm.ac.uk/polytb) to visualise the resulting variation and important meta-data (e.g. in silico inferred strain-types, location) within geographical map and phylogenetic views. This resource will allow researchers to identify polymorphisms within candidate genes of interest, as well as examine the genomic diversity and distribution of strains. PolyTB source code is freely available to researchers wishing to develop similar tools for their pathogen of interest

    CSGM Designer: a platform for designing cross-species intron-spanning genic markers linked with genome information of legumes.

    Background: Genetic markers are tools that can facilitate molecular breeding, even in species lacking genomic resources. An important class of genetic markers is those based on orthologous genes, because they can guide hypotheses about conserved gene function, a situation that is well documented for a number of agronomic traits. For under-studied species a key bottleneck in gene-based marker development is the need to develop molecular tools (e.g., oligonucleotide primers) that reliably access genes with orthology to the genomes of well-characterized reference species. Results: Here we report an efficient platform for the design of cross-species gene-derived markers in legumes. The automated platform, named CSGM Designer (URL: http://tgil.donga.ac.kr/CSGMdesigner), facilitates rapid and systematic design of cross-species genic markers. The underlying database is composed of genome data from five legume species whose genomes are substantially characterized. Use of CSGM is enhanced by graphical displays of query results, which we describe as "circular viewer" and "search-within-results" functions. CSGM provides a virtual PCR representation (eHT-PCR) that predicts the specificity of each primer pair simultaneously in multiple genomes. CSGM Designer output was experimentally validated for the amplification of orthologous genes using 16 genotypes representing 12 crop and model legume species, distributed among the galegoid and phaseoloid clades. Successful cross-species amplification was obtained for 85.3% of PCR primer combinations. Conclusion: CSGM Designer spans the divide between well-characterized crop and model legume species and their less well-characterized relatives. The outcome is PCR primers that target highly conserved genes for polymorphism discovery, enabling functional inferences and ultimately facilitating trait-associated molecular breeding

    MODBASE, a database of annotated comparative protein structure models and associated resources.

    MODBASE (http://salilab.org/modbase) is a database of annotated comparative protein structure models. The models are calculated by MODPIPE, an automated modeling pipeline that relies primarily on MODELLER for fold assignment, sequence-structure alignment, model building and model assessment (http://salilab.org/modeller). MODBASE currently contains 5,152,695 reliable models for domains in 1,593,209 unique protein sequences; only models based on statistically significant alignments and/or models assessed to have the correct fold are included. MODBASE also allows users to calculate comparative models on demand, through an interface to the MODWEB modeling server (http://salilab.org/modweb). Other resources integrated with MODBASE include databases of multiple protein structure alignments (DBAli), structurally defined ligand binding sites (LIGBASE), predicted ligand binding sites (AnnoLyze), structurally defined binary domain interfaces (PIBASE) and annotated single nucleotide polymorphisms and somatic mutations found in human proteins (LS-SNP, LS-Mut). MODBASE models are also available through the Protein Model Portal (http://www.proteinmodelportal.org/)

    Examining the efficacy of a genotyping-by-sequencing technique for population genetic analysis of the mushroom Laccaria bicolor and evaluating whether a reference genome is necessary to assess homology

    Despite the diversity and ecological importance of Fungi, there is a lack of population genetic research on these organisms. The reason for this can be explained in part by their cryptic nature and difficulty in identifying genets. In addition, the difficulty (relative to plants and animals) in developing molecular markers for fungal population genetics contributes to the lack of research in this area. This study examines the ability of restriction-site associated DNA (RAD) sequencing to generate SNPs in Laccaria bicolor. Eighteen samples of morphologically identified L. bicolor from the United States and Europe were selected for this project. The RAD sequencing method produced anywhere from 290 000 to more than 3 000 000 reads. Mapping these reads to the genome of L. bicolor resulted in 84 000-940 000 unique reads from individual samples. Results indicate that incorporation of non-L. bicolor taxa into the analysis resulted in a precipitous drop in shared loci among samples, suggesting the potential of these methods to identify cryptic species. F-statistics were easily calculated, although observable "noise" was detected when using the "All Loci" treatment versus filtering loci to those present in at least 50% of the individuals. The data were analyzed with tests of Hardy-Weinberg equilibrium, population genetic statistics (FIS and FST), and population structure analysis using the program Structure. The results provide encouraging feedback regarding the potential utility of these methods and their data for population genetic analysis. We were unable to draw conclusions of life history of L. bicolor populations from this dataset, given the small sample size. The results of this study indicate the potential of these methods to address population genetics and general life history questions in the Agaricales. Further research is necessary to explore the specific application of these methods in the Agaricales or other fungal groups
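The Hardy-Weinberg test and the FIS statistic named in this abstract can be sketched for a single biallelic SNP. This is a minimal illustration under stated assumptions, not the study's analysis pipeline: it uses the standard chi-square goodness-of-fit test with one degree of freedom, and the function name and genotype labels are hypothetical.

```python
from math import erfc, sqrt


def hwe_test(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions at one biallelic locus.

    Arguments are observed counts of genotypes AA, AB and BB.
    Returns (chi2, p_value, f_is).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if not 0.0 < p < 1.0:
        raise ValueError("locus is monomorphic; the test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    # F_IS: heterozygote deficit relative to the Hardy-Weinberg expectation.
    f_is = 1.0 - (n_ab / n) / (2 * p * q)
    return chi2, p_value, f_is
```

For example, `hwe_test(25, 50, 25)` describes a sample in exact Hardy-Weinberg proportions, while `hwe_test(50, 0, 50)` describes a sample with no heterozygotes at all.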

    Polymorphism identification and improved genome annotation of Brassica rapa through Deep RNA sequencing.

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes-R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety)-using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/

    A fast algorithm for detecting gene-gene interactions in genome-wide association studies

    With the recent advent of high-throughput genotyping techniques, genetic data for genome-wide association studies (GWAS) have become increasingly available, which entails the development of efficient and effective statistical approaches. Although many such approaches have been developed and used to identify single-nucleotide polymorphisms (SNPs) that are associated with complex traits or diseases, few are able to detect gene-gene interactions among different SNPs. Genetic interactions, also known as epistasis, have been recognized to play a pivotal role in contributing to the genetic variation of phenotypic traits. However, because of an extremely large number of SNP-SNP combinations in GWAS, the model dimensionality can quickly become so overwhelming that no prevailing variable selection methods are capable of handling this problem. In this paper, we present a statistical framework for characterizing main genetic effects and epistatic interactions in a GWAS study. Specifically, we first propose a two-stage sure independence screening (TS-SIS) procedure and generate a pool of candidate SNPs and interactions, which serve as predictors to explain and predict the phenotypes of a complex trait. We also propose a rates adjusted thresholding estimation (RATE) approach to determine the size of the reduced model selected by an independence screening. Regularization regression methods, such as LASSO or SCAD, are then applied to further identify important genetic effects. Simulation studies show that the TS-SIS procedure is computationally efficient and has an outstanding finite sample performance in selecting potential SNPs as well as gene-gene interactions. We apply the proposed framework to analyze an ultrahigh-dimensional GWAS data set from the Framingham Heart Study, and select 23 active SNPs and 24 active epistatic interactions for the body mass index variation. 
It shows the capability of our procedure to resolve the complexity of genetic control. Comment: Published at http://dx.doi.org/10.1214/14-AOAS771 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
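The idea of screening main effects first and interactions second can be sketched as follows. This is a simplified illustration, not the TS-SIS procedure as published: ranking by absolute Pearson correlation and restricting candidate interactions to pairs of SNPs retained in the first stage are assumptions made here, and all function names are hypothetical. The RATE thresholding and the LASSO or SCAD step are not shown.

```python
from itertools import combinations
from math import sqrt


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)


def top_k(scores, k):
    """Keys of the k largest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def two_stage_screen(snps, y, n_main, n_inter):
    """Screen main effects, then pairwise products of the retained SNPs.

    snps maps a SNP name to its genotype codes (0, 1, 2); y is the phenotype.
    Returns (retained SNP names, retained interaction pairs).
    """
    # Stage 1: rank each SNP by its marginal correlation with the phenotype.
    main_scores = {name: abs_corr(col, y) for name, col in snps.items()}
    main = top_k(main_scores, n_main)
    # Stage 2: rank products of retained SNP pairs the same way.
    inter_scores = {}
    for a, b in combinations(sorted(main), 2):
        product = [u * v for u, v in zip(snps[a], snps[b])]
        inter_scores[(a, b)] = abs_corr(product, y)
    inter = top_k(inter_scores, n_inter)
    return main, inter
```

Screening each predictor on its own keeps the cost linear in the number of SNPs in the first stage, so only a small pool of pairs needs to be examined in the second.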

    Population genomic structure of the gelatinous zooplankton species Mnemiopsis leidyi in its nonindigenous range in the North Sea

    Nonindigenous species pose a major threat for coastal and estuarine ecosystems. Risk management requires genetic information to establish appropriate management units and infer introduction and dispersal routes. We investigated one of the most successful marine invaders, the ctenophore Mnemiopsis leidyi, and used genotyping-by-sequencing (GBS) to explore the spatial population structure in its nonindigenous range in the North Sea. We analyzed 140 specimens collected in different environments, including coastal and estuarine areas, and ports along the coast. Single nucleotide polymorphisms (SNPs) were called in approximately 40 k GBS loci. Population structure based on the neutral SNP panel was significant (FST = .02; p < .01), and a distinct genetic cluster was identified in a port along the Belgian coast (Ostend port; pairwise FST = .02-.04; p < .01). Remarkably, no population structure was detected between geographically distant regions in the North Sea (the Southern part of the North Sea vs. the Kattegat/Skagerrak region), which indicates substantial gene flow at this geographical scale and recent population expansion of nonindigenous M. leidyi. Additionally, seven specimens collected at one location in the indigenous range (Chesapeake Bay, USA) were highly differentiated from the North Sea populations (pairwise FST = .36-.39; p < .01). This study demonstrates the utility of GBS to investigate fine-scale population structure of gelatinous zooplankton species and shows high population connectivity among nonindigenous populations of this recently introduced species in the North Sea. Open Research Badges: This article has earned an Open Data Badge for making publicly available the digitally-shareable data necessary to reproduce the reported results.
Data availability: The DNA sequences generated for this study are deposited in the NCBI Sequence Read Archive under SRA accession numbers -, and will be made publicly available upon publication of this manuscript

    ConservedPrimers 2.0: A high-throughput pipeline for comparative genome referenced intron-flanking PCR primer design and its application in wheat SNP discovery

    Background: In some genomic applications it is necessary to design large numbers of PCR primers in exons flanking one or several introns on the basis of orthologous gene sequences in related species. The primer pairs designed by this target gene approach are called "intron-flanking primers" or, because they are located in exonic sequences which are usually conserved between related species, "conserved primers". They are useful for large-scale single nucleotide polymorphism (SNP) discovery and marker development, especially in species, such as wheat, for which a large number of ESTs are available but for which genome sequences and intron/exon boundaries are not available. To date, no suitable high-throughput tool is available for this purpose. Results: We have developed the ConservedPrimers 2.0 pipeline for designing intron-flanking primers for large-scale SNP discovery and marker development, and demonstrated its utility in wheat. This tool uses non-redundant wheat EST sequences, such as wheat contigs and singleton ESTs, and related genomic sequences, such as those of rice, as inputs. It aligns the ESTs to the genomic sequences to identify unique colinear exon blocks and predicts intron lengths. Intron-flanking primers are then designed based on the intron/exon information using the Primer3 core program or BatchPrimer3. Finally, a tab-delimited file containing intron-flanking primer pair sequences and their primer properties is generated for primer ordering and their PCR applications. Using this tool, 1,922 bin-mapped wheat ESTs (31.8% of the 6,045 in total) were found to have unique colinear exon blocks suitable for primer design and 1,821 primer pairs were designed from these single- or low-copy genes for PCR amplification and SNP discovery. With these primers and subsequently designed genome-specific primers, a total of 1,527 loci were found to contain one or more genome-specific SNPs. Conclusion: The ConservedPrimers 2.0 pipeline for designing intron-flanking primers was developed and its utility demonstrated. The tool can be used for SNP discovery, genetic variation assays and marker development for any target genome that has abundant ESTs and a related reference genome that has been fully sequenced. The ConservedPrimers 2.0 pipeline has been implemented as a command-line tool as well as a web application. Both versions are freely available at http://wheat.pw.usda.gov/demos/ConservedPrimers/.

    A comparison of SNPs and microsatellites as linkage mapping markers: lessons from the zebra finch (Taeniopygia guttata)

    Background: Genetic linkage maps are essential tools when searching for quantitative trait loci (QTL). To maximize genome coverage and provide an evenly spaced marker distribution a combination of different types of genetic marker are sometimes used. In this study we created linkage maps of four zebra finch (Taeniopygia guttata) chromosomes (1, 1A, 2 and 9) using two types of marker, Single Nucleotide Polymorphisms (SNPs) and microsatellites. To assess the effectiveness and accuracy of each kind of marker we compared maps built with each marker type separately and with both types of marker combined. Linkage map marker order was validated by making comparisons to the assembled zebra finch genome sequence. Results: We showed that marker order was less reliable and linkage map lengths were inflated for microsatellite maps relative to SNP maps, apparently due to differing error rates between the two types of marker. Guidelines on how to minimise the effects of error are provided. In particular, we show that when combining both types of marker the conventional process of building linkage maps, whereby the most informative markers are added to the map first, has to be modified in order to improve map accuracy. Conclusions: When using multiple types and large numbers of markers to create dense linkage maps, the least error prone loci (SNPs) rather than the most informative should be used to create framework maps before the addition of other potentially more error prone markers (microsatellites). This raises questions about the accuracy of marker order and predicted recombination rates in previous microsatellite linkage maps which were created using the conventional building process, however, provided suitable error detection strategies are followed microsatellite-based maps can continue to be regarded as reasonably reliable
