
    A new approach to in silico SNP detection and some new SNPs in the Bacillus anthracis genome

    Background: Bacillus anthracis is one of the most monomorphic pathogens known. Identification of polymorphisms in its genome is essential for taxonomic classification, for determination of recent evolutionary changes, and for evaluation of pathogenic potency. Findings: In this work the genomes of three Bacillus anthracis strains are compared and previously unpublished single nucleotide polymorphisms (SNPs) are revealed. Moreover, it is shown that, despite the highly monomorphic nature of Bacillus anthracis, the SNPs are (1) abundant in the genome and (2) distributed relatively uniformly across the sequence. Conclusions: The findings support the proposition that SNPs, together with indels and variable number tandem repeats (VNTRs), can be used effectively not only for the differentiation of perfect strain data, but also for the comparison of moderately incomplete, noisy and, in some cases, unknown Bacillus anthracis strains. When the data is of still lower quality, a new DNA sequence fingerprinting approach can be used, based on recently introduced combinatorial-analytic markers called cyclic difference sets.
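
    At its core, in silico SNP detection amounts to comparing aligned genome sequences column by column and flagging positions where the strains disagree. A minimal sketch of that idea (the sequences below are toy fragments standing in for the three compared strains, and gap handling is deliberately simplified):

```python
# Minimal sketch of in silico SNP detection across aligned sequences.
# Real work would use whole-genome alignments of B. anthracis strains;
# the fragments here are invented stand-ins.

def find_snps(*aligned):
    """Return (position, variants) for columns where equal-length aligned sequences differ."""
    length = len(aligned[0])
    assert all(len(s) == length for s in aligned)
    snps = []
    for i in range(length):
        bases = {s[i] for s in aligned}
        bases.discard('-')          # alignment gaps are indels, not SNPs
        if len(bases) > 1:
            snps.append((i, sorted(bases)))
    return snps

# Toy aligned fragments standing in for three strains
strain_a = "ATGCATTACG"
strain_b = "ATGCGTTACG"
strain_c = "ATGCATTATG"

print(find_snps(strain_a, strain_b, strain_c))  # [(4, ['A', 'G']), (8, ['C', 'T'])]
```

    A full pipeline would also filter out columns with low alignment quality before calling a SNP, since noisy or incomplete strain data is exactly the regime the abstract highlights.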

    Robust Algorithms for Detecting Hidden Structure in Biological Data

    Biological data, such as molecular abundance measurements and protein sequences, harbors complex hidden structure that reflects its underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot of the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins' function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data. Clustering is an unsupervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuron cell. In the second part of this thesis I describe identifying hidden specificity determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs) and apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor. This algorithm addresses the complexity of such reconstruction due to indeterminate or erroneous homology assignments made by sequence alignment algorithms and to the vast prevalence of divergence through speciation over divergence through gene duplication in protein evolution.
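
    The robust-clustering idea can be illustrated with a consensus matrix: cluster the data many times and record how often each pair of objects lands in the same cluster. Pairs that co-cluster in nearly every run reflect real structure; pairs that co-cluster only sometimes are likely noise. The sketch below is a hypothetical toy version built on a deliberately simple 1-D k-means, not the thesis's actual method:

```python
import random

def kmeans_1d(points, k, iters=20, rng=random):
    """A deliberately simple 1-D k-means (stand-in for a real clustering step)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: abs(p - centers[c]))].append(p)
        centers = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
    return [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]

def consensus_matrix(points, k, runs=50, seed=0):
    """Fraction of clustering runs in which each pair of points co-clusters."""
    rng = random.Random(seed)
    n = len(points)
    co = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_1d(points, k, rng=rng)
        for i in range(n):
            for j in range(n):
                co[i][j] += labels[i] == labels[j]
    return [[c / runs for c in row] for row in co]

# Two well-separated noisy groups: within-group pairs co-cluster in every run,
# cross-group pairs in none.
data = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9]
co = consensus_matrix(data, k=2)
print(co[0][1], co[0][3])
```

    With genuinely noisy data the interesting entries are the intermediate ones: a co-clustering frequency near 0.5 signals an association that a single clustering run would report as confidently as a robust one.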

    A practical guide to design and assess a phylogenomic study

    Over the last decade, molecular systematics has undergone a change of paradigm as high-throughput sequencing now makes it possible to reconstruct evolutionary relationships using genome-scale datasets. The advent of 'big data' molecular phylogenetics provided a battery of new tools for biologists but simultaneously brought new methodological challenges. The increase in analytical complexity comes at the price of highly specific training in computational biology and molecular phylogenetics, resulting very often in a polarized accumulation of knowledge (technical on one side and biological on the other). Interpreting the robustness of genome-scale phylogenetic studies is not straightforward, particularly as new methodological developments have consistently shown that the general belief of 'more genes, more robustness' often does not apply, and because there is a range of systematic errors that plague phylogenomic investigations. This is particularly problematic because phylogenomic studies are highly heterogeneous in their methodology, and best practices are often not clearly defined. The main aim of this article is to present what I consider as the ten most important points to take into consideration when planning a well-thought-out phylogenomic study and while evaluating the quality of published papers. The goal is to provide a practical step-by-step guide that can be easily followed by nonexperts and phylogenomic novices in order to assess the technical robustness of phylogenomic studies or improve the experimental design of a project.

    The impact of sequencing depth on the inferred taxonomic composition and AMR gene content of metagenomic samples

    Shotgun metagenomics is increasingly used to characterise microbial communities, particularly for the investigation of antimicrobial resistance (AMR) in different animal and environmental contexts. There are many different approaches for inferring the taxonomic composition and AMR gene content of complex community samples from shotgun metagenomic data, but there has been little work establishing the optimum sequencing depth, data processing and analysis methods for these samples. In this study we used shotgun metagenomics and sequencing of cultured isolates from the same samples to address these issues. We sampled three potential environmental AMR gene reservoirs (pig caeca, river sediment, effluent) and sequenced samples with shotgun metagenomics at high depth (~200 million reads per sample). Alongside this, we cultured single-colony isolates of Enterobacteriaceae from the same samples and used hybrid sequencing (short- and long-reads) to create high-quality assemblies for comparison to the metagenomic data. To automate data processing, we developed an open-source software pipeline, 'ResPipe'.
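
    The depth question can be explored by rarefaction: repeatedly subsample reads at a given depth and count how many distinct taxa or AMR genes are still detected. The sketch below uses invented read labels and is not ResPipe's actual procedure; it only illustrates why rare genes drop out of shallow samples:

```python
import random

def rarefaction(read_labels, depths, reps=10, seed=0):
    """Mean number of distinct labels (taxa / AMR genes) detected at each depth."""
    rng = random.Random(seed)
    curve = {}
    for d in depths:
        hits = [len(set(rng.sample(read_labels, d))) for _ in range(reps)]
        curve[d] = sum(hits) / reps
    return curve

# Toy metagenome: each read carries the label of the taxon/gene it maps to,
# with a few abundant labels and a long tail of rare ones (labels invented).
reads = ["16S_Ecoli"] * 500 + ["blaTEM"] * 50 + ["tetW"] * 5 + ["mcr-1"]
curve = rarefaction(reads, depths=[10, 100, len(reads)])
print(curve)  # rare labels only appear reliably at high depth
```

    The shape of such a curve, i.e. whether it has plateaued at the chosen depth, is one simple way to judge whether a sample was sequenced deeply enough for the question at hand.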

    Identification of unique neoantigen qualities in long-term survivors of pancreatic cancer

    Pancreatic ductal adenocarcinoma is a lethal cancer with fewer than 7% of patients surviving past 5 years. T-cell immunity has been linked to the exceptional outcome of the few long-term survivors [1,2], yet the relevant antigens remain unknown. Here we use genetic, immunohistochemical and transcriptional immunoprofiling, computational biophysics, and functional assays to identify T-cell antigens in long-term survivors of pancreatic cancer. Using whole-exome sequencing and in silico neoantigen prediction, we found that tumours with both the highest neoantigen number and the most abundant CD8+ T-cell infiltrates, but neither alone, stratified patients with the longest survival. Investigating the specific neoantigen qualities promoting T-cell activation in long-term survivors, we discovered that these individuals were enriched in neoantigen qualities defined by a fitness model, and neoantigens in the tumour antigen MUC16 (also known as CA125). A neoantigen quality fitness model conferring greater immunogenicity to neoantigens with differential presentation and homology to infectious disease-derived peptides identified long-term survivors in two independent datasets, whereas a neoantigen quantity model ascribing greater immunogenicity to increasing neoantigen number alone did not. We detected intratumoural and lasting circulating T-cell reactivity to both high-quality and MUC16 neoantigens in long-term survivors of pancreatic cancer, including clones with specificity to both high-quality neoantigens and predicted cross-reactive microbial epitopes, consistent with neoantigen molecular mimicry. Notably, we observed selective loss of high-quality and MUC16 neoantigenic clones on metastatic progression, suggesting neoantigen immunoediting. Our results identify neoantigens with unique qualities as T-cell targets in pancreatic ductal adenocarcinoma.
More broadly, we identify neoantigen quality as a biomarker for immunogenic tumours that may guide the application of immunotherapies
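
    The quality-versus-quantity contrast can be caricatured in a few lines: quality-style models score each neoantigen by combining differential MHC presentation with homology to infectious-disease-derived peptides, rather than simply counting neoantigens. The sketch below is loosely patterned on that idea; every constant, binding value and alignment score is an illustrative placeholder, not the published fit:

```python
import math

def amplitude(kd_wildtype, kd_mutant):
    """Differential presentation: how much better the mutant peptide binds MHC
    than its wild-type counterpart (dissociation constants in nM, invented)."""
    return kd_wildtype / kd_mutant

def recognition(align_scores, a=26.0, k=4.87):
    """Logistic recognition probability from alignment scores against a panel of
    infectious-disease-derived epitopes (constants illustrative)."""
    z = sum(math.exp(-k * (a - s)) for s in align_scores)
    return z / (1 + z)

def neoantigen_quality(kd_wt, kd_mut, align_scores):
    """Quality-style score: presentation term times recognition term."""
    return amplitude(kd_wt, kd_mut) * recognition(align_scores)

# A neoantigen that binds MHC far better than wild type and closely resembles
# a microbial epitope scores much higher than one with neither property.
high = neoantigen_quality(500.0, 50.0, [30.0])
low = neoantigen_quality(50.0, 500.0, [10.0])
print(high > low)  # True
```

    A quantity model, by contrast, would reduce to `len(neoantigens)`, which is exactly the comparison the abstract reports as failing to stratify survivors.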

    Integrating genomics with the fossil record to explore the evolutionary history of Echinoidea

    Echinoidea constitutes one of five major clades of living echinoderms, marine animals uniquely characterized by a pentaradial symmetry. Approximately 1,000 living and 10,000 extinct species have been described, including many commonly known as sea urchins, heart urchins and sand dollars. Today, echinoids are ubiquitous in benthic marine environments, where they strongly affect the functioning of biodiverse communities such as coral reefs and kelp forests. Given the quality of their fossil record, their remarkable morphological complexity and our thorough understanding of their development, echinoids provide unparalleled opportunities to explore evolutionary questions in deep-time, providing access to the developmental and morphological underpinnings of evolutionary innovation. These questions cannot be addressed without first resolving the phylogenetic relationships among living and extinct lineages. The goal of this dissertation is to advance our understanding of echinoid relationships and evolutionary history, as well as to explore more broadly the integration of phylogenomic, morphological and paleontological data in phylogenetic reconstruction and macroevolutionary inference. In Chapter 1, I report the results of the first phylogenomic analysis of echinoids based on the sequencing of 17 novel echinoid transcriptomes. Phylogenetic analyses of this data resolve the position of several clades—including the sand dollars—in disagreement with traditional morphological hypotheses. I demonstrate the presence of a strong phylogenetic signal for these novel resolutions, and explore scenarios to reconcile these findings with morphological evidence. In Chapter 7, I extend this approach with a more thorough taxon sampling, resulting in a robust topology with a near-complete sampling of major echinoid lineages.
This effort reveals that apatopygids, a clade of three species with previously unclear affinities, represent the only living descendants of a once diverse Mesozoic clade. I also perform a thorough time calibration analysis, quantifying the relative effects of choosing among alternative models of molecular evolution, gene samples and clock priors. I introduce the concept of a chronospace and use it to reveal that only the last among the aforementioned choices affects significantly our understanding of echinoid diversification. Molecular clocks unambiguously support late Permian and late Cretaceous origins for crown group echinoids and sand dollars, respectively, implying long ghost ranges for both. Fossils have been shown to improve the accuracy of phylogenetic comparative methods, warranting their inclusion alongside extant terminals when exploring evolutionary processes across deep timescales. However, their impact on topological inference remains controversial. I explore this topic in Chapter 3 with the use of simulations, which show that morphological phylogenies are more accurate when fossil taxa are incorporated. I also show that tip-dated Bayesian inference, which takes stratigraphic information from fossils into account, outperforms uncalibrated methods. This approach is complemented in Chapter 2 with the analysis of empirical datasets, confirming that incorporating fossils reshapes phylogenies in a manner that is entirely distinct from increased sampling of extant taxa, a result largely attributable to the occurrence of distinctive character combinations among fossils. Even though phylogenomic and paleontological data are complementary resources for unraveling the relationships and divergence times of lineages, few studies have attempted to fully integrate them. Chapter 4 revisits the phylogeny of crown group Echinoidea using a total-evidence dating approach combining phylogenomic, morphological and stratigraphic information. 
To this end, I develop a method (genesortR) for subsampling molecular datasets that selects loci with high phylogenetic signal and low systematic biases. The results demonstrate that combining different data sources increases topological accuracy and helps resolve phylogenetic conflicts. Notably, I present a new hypothesis for the origin and early morphological evolution of the sand dollars and close allies. In Chapter 6, I compare the behavior of genesortR against alternative subsampling strategies across a sample of phylogenomic matrices. I find this method to systematically outperform random loci selection, unlike commonly-used approaches that target specific evolutionary rates or minimize sources of systematic error. I conclude that these methods should not be used indiscriminately, and that multivariate methods of phylogenomic subsampling should be favored. Finally, in Chapter 5, I explore the macroevolutionary dynamics of echinoid body size across 270 million years using data for more than 5,000 specimens in a phylogenetically explicit context. I also develop a method (extendedSurface) for parameterizing adaptive landscapes that overcomes issues with existing approaches and finds better fitting models. While echinoid body size has been largely constrained to evolve within a single adaptive peak, the disparity of the clade was generated by regime shifts driving the repeated evolution of miniaturized and gigantic forms. Most innovations occurred during the latter half of the Mesozoic, and were followed by a drastic slowdown in the aftermath of the Cretaceous-Paleogene mass extinction
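
    The locus-subsampling idea behind genesortR can be reduced to its essentials: summarise each locus with properties that proxy phylogenetic signal and systematic bias, then rank loci by a combination of those properties rather than picking loci at random. genesortR itself combines several per-locus properties through a principal component analysis; the toy version below replaces that with a fixed linear combination over invented property names:

```python
def rank_loci(loci, weights):
    """Rank loci by a weighted combination of per-locus properties.
    A crude stand-in for genesortR's multivariate (PCA-based) ranking."""
    def score(props):
        return sum(weights[k] * props[k] for k in weights)
    return sorted(loci, key=lambda item: score(item[1]), reverse=True)

# Hypothetical per-locus summaries: higher signal is good, higher bias is bad.
loci = [
    ("locus_a", {"phylo_signal": 0.9, "comp_bias": 0.1}),
    ("locus_b", {"phylo_signal": 0.4, "comp_bias": 0.8}),
    ("locus_c", {"phylo_signal": 0.7, "comp_bias": 0.3}),
]
weights = {"phylo_signal": 1.0, "comp_bias": -1.0}
best = rank_loci(loci, weights)
print([name for name, _ in best])  # ['locus_a', 'locus_c', 'locus_b']
```

    The dissertation's point in Chapter 6 is precisely that ranking on a single property (e.g. evolutionary rate alone) can underperform random selection, whereas a multivariate combination like the one sketched here does not.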

    An analysis of single amino acid repeats as use case for application specific background models

    Background Sequence analysis aims to identify biologically relevant signals against a backdrop of functionally meaningless variation. Increasingly, it is recognized that the quality of the background model directly affects the performance of analyses. State-of-the-art approaches rely on classical sequence models that are adapted to the studied dataset. Although performing well in the analysis of globular protein domains, these models break down in regions of stronger compositional bias or low complexity. While these regions are typically filtered, there is increasing anecdotal evidence of functional roles. This motivates an exploration of more complex sequence models and application-specific approaches for the investigation of biased regions. Results Traditional Markov chains and application-specific regression models are compared using the example of predicting runs of single amino acids, a particularly simple class of biased regions. Cross-validation experiments reveal that the alternative regression models capture the multi-variate trends well, despite their low dimensionality and in contrast even to higher-order Markov predictors. We show how the significance of unusual observations can be computed for such empirical models. The power of a dedicated model in the detection of biologically interesting signals is then demonstrated in an analysis identifying the unexpected enrichment of contiguous leucine-repeats in signal-peptides. Considering different reference sets, we show how the question examined actually defines what constitutes the 'background'. Results can thus be highly sensitive to the choice of appropriate model training sets. Conversely, the choice of reference data determines the questions that can be investigated in an analysis.
Conclusions Using a specific case of studying biased regions as an example, we have demonstrated that the construction of application-specific background models is both necessary and feasible in a challenging sequence analysis situation
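
    The role of the background model can be made concrete for the simplest case. Under an i.i.d. (order-0) model with residue frequency p, a window of length k is a pure run with probability p^k, and a union bound over windows gives a quick, conservative significance estimate for an observed run. This is a deliberately crude stand-in for the Markov-chain and regression models compared in the paper; the function names and toy sequence are invented:

```python
from collections import Counter

def longest_run(seq, aa):
    """Length of the longest contiguous run of residue `aa` in `seq`."""
    best = cur = 0
    for c in seq:
        cur = cur + 1 if c == aa else 0
        best = max(best, cur)
    return best

def run_pvalue_order0(seq, aa, k):
    """Approximate P(some run of `aa` of length >= k) under an i.i.d. (order-0)
    background with the sequence's own composition: union bound over windows."""
    p = Counter(seq)[aa] / len(seq)
    n_windows = len(seq) - k + 1
    return min(1.0, n_windows * p ** k)

seq = "MKLLLLLAVLLLSGSGSAL"  # toy signal-peptide-like sequence
print(longest_run(seq, "L"), run_pvalue_order0(seq, "L", 5))
```

    The paper's point is that this background is too naive for biased regions: an order-0 model already "expects" leucine runs in leucine-rich sequences, so detecting genuine enrichment (such as leucine repeats in signal peptides) requires a reference set matched to the question being asked.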