53 research outputs found

    Assembly, quantification, and downstream analysis for high throughput sequencing data

    Next Generation Sequencing is a set of relatively recent but already well-established technologies with a wide range of applications in the life sciences. Although these technologies are constantly being improved, multiple challenging problems remain in the analysis of high throughput sequencing data. In particular, genome assembly still suffers from the inability of the technologies to overcome issues related to structural properties of genomes such as single nucleotide polymorphisms and repeats, not to mention drawbacks of the technologies themselves, such as sequencing errors, which also hinder the reconstruction of true reference genomes. Other issues arise in transcriptome quantification and differential gene expression analysis. Processing millions of reads requires sophisticated algorithms that can compute gene expression with high precision and in a reasonable amount of time. In the subsequent downstream analysis, the ultimate computational task is to infer the activity of biological pathways (e.g., metabolic). With many overlapping pathways, the challenge is to infer the role of each gene in the activity of a given pathway. Assigning the products of a gene to the wrong pathway may result in misleading differential activity analysis and thus in wrong scientific conclusions. In this dissertation I present several algorithmic solutions to some of the problems enumerated above. In particular, I designed a scaffolding algorithm for genome assembly and created new tools for differential gene and biological pathway expression analysis.

    The inference of gene trees with species trees

    Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can co-exist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. In this article we review the various models that have been used to describe the relationship between gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution. Comment: Review article in relation to the "Mathematical and Computational Evolutionary Biology" conference, Montpellier, 201

    Weighted Minimum-Length Rearrangement Scenarios

    We present the first known model of genome rearrangement with an arbitrary real-valued weight function on the rearrangements. It is based on the dominant model for the mathematical and algorithmic study of genome rearrangement, Double Cut and Join (DCJ). Our objective function is the sum or product of the weights of the DCJs in an evolutionary scenario, and the function can be minimized or maximized. If the likelihood of observing an independent DCJ was estimated based on biological conditions, for example, then this objective function could be the likelihood of observing the independent DCJs together in a scenario. We present an O(n^4)-time dynamic programming algorithm solving the Minimum Cost Parsimonious Scenario (MCPS) problem for co-tailed genomes with n genes (or syntenic blocks). Combining this with our previous work on MCPS yields a polynomial-time algorithm for general genomes. The key theoretical contribution is a novel link between the parsimonious DCJ (or 2-break) scenarios and quadrangulations of a regular polygon. To demonstrate that our algorithm is fast enough to treat biological data, we run it on syntenic blocks constructed for Human paired with Chimpanzee, Gibbon, Mouse, and Chicken. We argue that the Human and Gibbon pair is a particularly interesting model for the study of weighted genome rearrangements
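The objective described in this abstract (the sum or product of the weights of the DCJs in a scenario) can be illustrated with a minimal sketch. The function names below are hypothetical and do not come from the paper; the sketch only evaluates the objective for a given list of weights and does not implement the O(n^4) dynamic programming algorithm. It also shows the standard observation that, when weights are probabilities of independent events, maximizing their product is equivalent to maximizing the sum of their logarithms.

```python
import math
from typing import Sequence


def scenario_weight_sum(weights: Sequence[float]) -> float:
    """Additive objective: total weight of the DCJs in one scenario."""
    return sum(weights)


def scenario_weight_product(weights: Sequence[float]) -> float:
    """Multiplicative objective, e.g. the joint likelihood of independent DCJs."""
    return math.prod(weights)


def scenario_log_likelihood(probabilities: Sequence[float]) -> float:
    """Sum of log-probabilities; numerically safer than a long product.

    Assumes every probability is strictly positive.
    """
    return sum(math.log(p) for p in probabilities)


if __name__ == "__main__":
    # Hypothetical per-DCJ likelihoods for a three-step scenario.
    likelihoods = [0.5, 0.25, 0.5]
    print(scenario_weight_sum(likelihoods))
    print(scenario_weight_product(likelihoods))
    print(scenario_log_likelihood(likelihoods))
```

Working in log space turns the product objective into a sum, so a single additive optimization routine could in principle serve both forms of the objective.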

    Phylogenetic assembly of paleogenomes integrating ancient DNA data

    Luhmann N. Phylogenetic assembly of paleogenomes integrating ancient DNA data. Bielefeld: Universität Bielefeld; 2017. In comparative genomics, reconstructing the genomes of ancestral species in a given phylogeny is an important problem for analyzing genome evolution over time. The diversity of present-day genomes in terms of local mutations and genome rearrangements makes it possible to shed light on the dynamics of the evolutionary processes that led from a common ancestor to a set of extant genomes. This speciation history is depicted in a phylogenetic tree. Comparative genome reconstruction methods aim to infer genomic features such as an order of markers (e.g. genes) for extinct species at internal nodes of the tree by applying different evolutionary models, relying only on the information available for the extant genomes at the leaves of the phylogenetic tree. Recently, the steady progress in sequencing technologies has led to the emergence of the field of paleogenomics, where the study of ancient DNA (aDNA) found in conserved organic material is moving rapidly towards the sequencing and analysis of complete paleogenomes. Such "genetic time travel" allows direct insight into specific phases of the evolution of specific genomes that are not only implicitly inferred from extant DNA sequences. However, as DNA is naturally degraded over time after the death of an organism and environmental conditions interfere with the conservation of DNA material, an assembly of these paleogenomes is usually fragmented, preventing a detailed analysis of genome rearrangements along the branches of the phylogenetic tree. In this thesis, we aim to combine the study of aDNA and comparative ancestral reconstruction in a phylogenetic framework.
The comparison with extant related genomes can naturally assist in scaffolding a fragmented aDNA assembly, while the aDNA sequencing data can be included as an additional source of information for comparative reconstruction methods to improve the reconstructions of all related genomes in the phylogenetic tree. Our first focus is on integrative methods to reconstruct marker orders globally in a phylogeny under the assumption of parsimony. An underlying rearrangement model can describe the evolutionary operations that occurred along the edges of the tree. However, as much as complex rearrangement scenarios can give insights into underlying biological mechanisms during evolution, from a computational point of view the ancestral reconstruction problem under rearrangement distances is NP-hard. One exception is the Single-Cut-or-Join (SCJ) distance, which uses a marker order-based representation of the involved genomes to model the cut and join of marker adjacencies as evolutionary operations. We build upon this rearrangement model and describe parsimony-based reconstruction methods aiming to minimize the SCJ distance in the tree. In addition, we require the reconstructed solutions to be consistent, such that they represent linear or circular regions of the ancestral genome. Our first polynomial-time method is based on the Sankoff-Rousseau algorithm and directly includes an aDNA assembly graph at one internal node of the tree. We show that including branch lengths in the underlying tree can avoid ambiguity in practice. Our second approach follows a more general strategy and includes the aDNA sequencing data as local weights for adjacencies next to the SCJ distance in the objective. We describe a fixed-parameter-tractable algorithm that also allows sampling of co-optimal solutions.
Finally, we describe an approach to fill gaps between potentially adjacent markers with aDNA data in order to reconstruct the complete genome sequence of a paleogenome, guided by the related extant genome sequences. In addition, this approach enables us to select the adjacencies that are supported by the sequencing information from sets of conflicting adjacencies. We evaluate our proposed models and algorithms on simulated and biological data. In particular, we integrate two aDNA sequencing data sets for ancient strains of the pathogen Yersinia pestis, which is understood to be the cause of several pandemics in medieval times. We show that the combination of aDNA sequencing reads and a parsimonious reconstruction in the phylogenetic tree substantially reduces the fragmentation of an initial aDNA assembly, and we explore alternative reconstructions to emphasize reliably reconstructed regions of the ancient genomes.
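The Single-Cut-or-Join distance mentioned in this abstract can be sketched as follows. In the standard formulation, a genome is a set of adjacencies between marker extremities, and the SCJ distance between two genomes on the same markers is the number of adjacencies present in one genome but not in the other. The representation chosen here (signed integers for oriented markers, `"h"`/`"t"` for head and tail extremities) and all function names are assumptions made for illustration, not the thesis's implementation. The sketch covers only the pairwise distance, not the tree-wide reconstruction.

```python
from typing import FrozenSet, List, Set, Tuple

Extremity = Tuple[int, str]           # (marker id, "h" for head or "t" for tail)
Adjacency = FrozenSet[Extremity]      # unordered pair of neighbouring extremities


def _left(marker: int) -> Extremity:
    """Extremity met first when reading a signed marker left to right."""
    return (abs(marker), "t" if marker > 0 else "h")


def _right(marker: int) -> Extremity:
    """Extremity met last when reading a signed marker left to right."""
    return (abs(marker), "h" if marker > 0 else "t")


def adjacencies(chromosome: List[int], circular: bool = False) -> Set[Adjacency]:
    """Adjacency set of one chromosome given as a list of signed markers."""
    pairs = list(zip(chromosome, chromosome[1:]))
    if circular and chromosome:
        # Close the circle by joining the last marker to the first.
        pairs.append((chromosome[-1], chromosome[0]))
    return {frozenset((_right(a), _left(b))) for a, b in pairs}


def genome_adjacencies(genome: List[List[int]]) -> Set[Adjacency]:
    """Union of the adjacency sets of all (linear) chromosomes of a genome."""
    result: Set[Adjacency] = set()
    for chromosome in genome:
        result |= adjacencies(chromosome)
    return result


def scj_distance(genome_a: List[List[int]], genome_b: List[List[int]]) -> int:
    """Number of cuts plus joins: size of the symmetric difference."""
    return len(genome_adjacencies(genome_a) ^ genome_adjacencies(genome_b))


if __name__ == "__main__":
    ancestor = [[1, 2, 3]]
    inverted = [[1, -2, 3]]           # marker 2 inverted in place
    print(scj_distance(ancestor, inverted))        # 4: two cuts and two joins
    print(scj_distance(ancestor, [[-3, -2, -1]]))  # 0: same chromosome read backwards
```

Because the distance reduces to set operations on adjacencies, each adjacency can be treated as an independent binary character, which is what makes parsimony-style reconstruction under SCJ tractable where other rearrangement distances are not.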

    The domain architecture of large guanine nucleotide exchange factors for the small GTP-binding protein Arf

    BACKGROUND: Small G proteins, which are essential regulators of multiple cellular functions, are activated by guanine nucleotide exchange factors (GEFs) that stimulate the exchange of the tightly bound GDP nucleotide by GTP. The catalytic domain responsible for nucleotide exchange is in general associated with non-catalytic domains that define the spatio-temporal conditions of activation. In the case of small G proteins of the Arf subfamily, which are major regulators of membrane trafficking, GEFs form a heterogeneous family whose only common characteristic is the well-characterized Sec7 catalytic domain. In contrast, the function of non-catalytic domains and how they regulate/cooperate with the catalytic domain is essentially unknown. RESULTS: Based on Sec7-containing sequences from fully-annotated eukaryotic genomes, including our annotation of these sequences from Paramecium, we have investigated the domain architecture of large ArfGEFs of the BIG and GBF subfamilies, which are involved in Golgi traffic. Multiple sequence alignments combined with the analysis of predicted secondary structures, non-structured regions and splicing patterns identify five novel non-catalytic structural domains which are common to both subfamilies, revealing that they share a conserved modular organization. We also report a novel ArfGEF subfamily with a domain organization so far unique to alveolates, which we name TBS (TBC-Sec7). CONCLUSION: Our analysis unifies the BIG and GBF subfamilies into a higher order subfamily, which, together with their being the only subfamilies common to all eukaryotes, suggests that they descend from a common ancestor from which species-specific ArfGEFs have subsequently evolved. Our identification of a conserved modular architecture provides a background for future functional investigation of non-catalytic domains

    The inference of gene trees with species trees.

    This article reviews the various models that have been used to describe the relationships between gene trees and species trees. Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can coexist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a more reliable basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution

    Conservative route to genome compaction in a miniature annelid

    The causes and consequences of genome reduction in animals are unclear because our understanding of this process mostly relies on lineages with often exceptionally high rates of evolution. Here, we decode the compact 73.8-megabase genome of Dimorphilus gyrociliatus, a meiobenthic segmented worm. The D. gyrociliatus genome retains traits classically associated with larger and slower-evolving genomes, such as an ordered, intact Hox cluster, a generally conserved developmental toolkit and traces of ancestral bilaterian linkage. Unlike some other animals with small genomes, the analysis of the D. gyrociliatus epigenome revealed canonical features of genome regulation, excluding the presence of operons and trans-splicing. Instead, the gene-dense D. gyrociliatus genome presents a divergent Myc pathway, a key physiological regulator of growth, proliferation and genome stability in animals. Altogether, our results uncover a conservative route to genome compaction in annelids, reminiscent of that observed in the vertebrate Takifugu rubripes