195 research outputs found

    Using Glocal Event Alignment for Comparing Sequences of Significantly Different Lengths

    This work takes place in the context of conversion rate optimization by enhancing the user experience during navigation on e-commerce web sites. The requirement is to be able to segment visitors into meaningful clusters, which can then be targeted with specific calls to action, in order to increase the web site turnover. This paper presents an original approach, which equally combines global- and local-alignment techniques (Needleman-Wunsch and Smith-Waterman) in order to automatically segment visitors according to the sequence of visited pages. Experimental results on synthetic datasets show that our approach outperforms other typically used alignment metrics, such as hybrid approaches or Dynamic Time Warping.
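
To make the global/local distinction in the abstract above concrete, here is a minimal sketch of the two classical scoring recurrences it names. It is not the paper's glocal method, whose combination rule the abstract does not spell out. The function names and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning all of a with all of b."""
    # prev[j] holds the best score for a[:i-1] versus b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Clamping at 0 lets an alignment restart anywhere.
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


# A short session embedded in a much longer one (one letter per visited page).
short, long_ = "ABC", "XXXXABCXXXX"
print(global_score(short, long_))  # -5: three matches minus eight forced gaps
print(local_score(short, long_))   # 3: the shared run "ABC" is found intact
```

With sequences of very different lengths, the global score is dominated by the gaps forced by the length difference, while the local score ignores everything outside the best shared run. That tension is what motivates combining the two.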

    Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution

    Evolutionary processes can be found in almost any historical, i.e. evolving, system that erroneously copies from the past. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that would bind together studies of such evolving systems is still elusive. This thesis is an attempt to narrow this gap to some extent. An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts, we concern ourselves with some theoretical as well as data-driven approaches. Given a well-chosen set of characters describing a system of different entities such as homologous genes, i.e. genes of the same origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of the same origin within a species, usually located close together, such as the well-known HOX cluster. These are formed by stepwise duplication of their members, often involving unequal crossing over that forms hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. The expansion of gene clusters through unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still helps to extract phylogenetic information from the data. Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster. 
This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms. While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems such as language evolution. This is due to a lack of detectable mechanisms that drive the evolutionary processes in other fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing. Yet, this does not hinder studies of language evolution using automated tools. As the number of large digital corpora available has increased, so have the possibilities to study them automatically. The obvious parallels between historical linguistics and phylogenetics have led to many studies adapting bioinformatics tools to linguistic purposes. Here, we use jAlign to calculate bigram alignments, i.e. alignments that take the adjacency of letters into account. Its performance is tested in different cognate recognition tasks. One major obstacle with pairwise alignments is their systematic errors, such as the underestimation and misplacement of gaps. 
Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and thus can overcome the problem of correct gap placement. They can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic programming solutions have become feasible in practice also for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal framework that gives rise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.
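
The additivity requirement mentioned in the abstract above can be checked with the standard four-point condition: for every four taxa, the two largest of the three pairwise sums must be equal. The sketch below is an illustration only. The thesis computes ζ by optimization, whereas here ζ is simply assumed to be the square root (a monotonic, subadditive function), and the toy distance matrix and the name `is_additive` are hypothetical.

```python
import math
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Four-point condition: D is a tree metric iff, for every quartet,
    the two largest of the three pairwise sums coincide."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Toy tree ((a,b),(c,d)) with leaf edges 1, 2, 3, 4 and an internal edge of 1.
tree = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]

# Assumed distortion: the observed distances are zeta(tree), with zeta = sqrt.
observed = [[math.sqrt(x) for x in row] for row in tree]

# Applying the inverse transformation (squaring) recovers an additive matrix.
restored = [[x * x for x in row] for row in observed]

print(is_additive(tree), is_additive(observed), is_additive(restored))
```

In this toy case the distorted matrix fails the check, while undoing the assumed distortion restores additivity up to floating-point tolerance.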

    Comparative genomics using Fugu reveals insights into regulatory subfunctionalization

    Fish-mammal genomic alignments were used to compare over 800 conserved non-coding elements that associate with genes that have undergone fish-specific duplication and retention, revealing a pattern of element retention and loss between paralogs indicative of subfunctionalization.

    Algorytmy i modele do analizy struktur białkowych (Algorithms and models for the analysis of protein structures)

    In this work we present several algorithmic approaches designed to help researchers in the study of various orders of protein structure. To facilitate the study of molecular sequence evolution we present an algorithm for multiple alignment of sequence profiles, describe a tool that can be used to study the relationship between residue co-evolution and structure, and describe a database of structures modeled using a co-evolutionary approach. On the structure side, a new algorithm for knot type assignment in biological molecules is introduced, a database of linked protein structures is described, and a method of fixing structure models in a topologically-conscious way is presented. Additionally, folding pathways of several newly discovered knotted proteins are proposed, and the influence of coevolution-based interactions on folding simulations is discussed. This doctoral dissertation discusses a range of methods applicable to the study of proteins at multiple levels. The first chapter introduces a new algorithm for determining the knot type in biomolecules. The second chapter is devoted to the evolution of molecular sequences. It first describes a new algorithm for the multiple alignment of sequence profiles and its application to studying the evolution of membrane proteins containing duplicated domains. It then presents a tool for studying the relationship between sequence coevolution (found using Direct Coupling Analysis) and molecular structure, as well as a database of structures modeled on the basis of sequence coevolution. Finally, the use of coevolution-derived interactions in protein folding simulations is presented. The last chapter is devoted to the study of topologically nontrivial protein structures, by means of a database of structures containing links and a method for repairing structure models while preserving the correct topology. It closes with proposed folding pathways for newly discovered knotted protein structures.

    Fine scale structural variants distinguish the genomes of Drosophila melanogaster and D. pseudoobscura

    BACKGROUND: A primary objective of comparative genomics is to identify genomic elements of functional significance that contribute to phenotypic diversity. Complex changes in genome structure (insertions, duplications, rearrangements, translocations) may be widespread, and have important effects on organismal diversity. Any survey of genomic variation is incomplete without an assessment of structural changes. RESULTS: We re-examine the genome sequences of the diverged species Drosophila melanogaster and D. pseudoobscura to identify fine-scale structural features that distinguish the genomes. We detect 95 large insertion/deletion events that occur within the introns of orthologous gene pairs, the majority of which represent insertion of transposable elements. We also identify 143 microinversions below 5 kb in size. These microinversions reside within introns or just upstream or downstream of genes, and invert conserved DNA sequence. The sequence conservation within microinversions suggests they may be enriched for functional genetic elements, and their position with respect to known genes implicates them in the regulation of gene expression. Although we found a distinct pattern of GC content across microinversions, this was indistinguishable from the pattern observed across blocks of conserved non-coding sequence. CONCLUSION: Drosophila has long been known as a genus harboring a variety of large inversions that disrupt chromosome colinearity. Here we demonstrate that microinversions, many of which are below 1 kb in length, located in/near genes may also be an important source of genetic variation in Drosophila. Further examination of other Drosophila genome sequences will likely identify an array of novel microinversion events.

    Paleovirological Analyses of Endogenous Retroviruses and Host Innate Immune Effectors

    About 8 and 10 percent of the human and mouse genomes, respectively, are composed of sequences of retroviral origin. Occasional infection of the germ line can lead to integrated retroviral genomes being vertically inherited as host alleles. Over thousands to millions of years, some of these sequences acquired inactivating mutations and were fixed in ancestral populations by genetic drift, while others became fixed by providing an evolutionary advantage to the host. Those inherited proviruses are termed endogenous retroviruses (ERVs) and have been identified in a variety of animal species, representing an extensive viral “fossil” record of past retroviral infections. With the advent of whole genome sequencing projects and high throughput sequencing platforms, the wide diversity of these sequences and the important role they have played in the evolution of their hosts became evident. In the present study we developed a computational framework to identify ERVs in primate and murine genomes. The results of these genome screenings were used to identify suitable candidate sequences in which to perform paleovirological analyses that led to the successful reconstruction of two ancient retroviruses. MuERV-L is an env-deficient, highly abundant, mouse-specific ERV that has undergone two amplification bursts, the more recent and prolific of which occurred ~2 million years ago (MYA), probably through entirely intracellular mechanisms. MuERV-L is transcriptionally active at the two-cell stage of the mouse embryo and recent studies have implicated the co-option of its LTR as a promoter for totipotency genes. In the present work, we describe the analysis and reconstruction of an infectious ancestral MuERV-L (ancML) sequence through paleovirological analyses of MuERV-L elements in the mouse genome. The resulting ancML sequence was infectious in CHO cells and its replication was dependent on reverse transcription. We found that IFN-α could reduce ancML replication by ~20-fold. 
Additionally, we found that the expression of mouse APOBEC3 was able to restrict the replication of ancML. However, inspection of endogenous MuERV-L sequences suggested that the impact of APOBEC3-mediated hypermutation on MuERV-L evolution was limited. We discuss the possibility that type I IFN responses (maybe through restriction factors) might inhibit MuERV-L replication at the two-cell stage of the mouse embryo and have kept MuERV-L copy numbers under control. Although no extant human gammaretroviruses have been identified, HERV-T is a low-copy primate ERV lineage that is closely related to the gammaretrovirus genus. Through phylogenetic and genomic analysis of HERV-T insertions we defined three distinct lineages. Two lineages (HERV-T1 and HERV-T2) entered the primate germline after the Old World monkey-ape split ~32-30 MYA, whereas the other (HERV-T3) entered before this divergence, ~40 MYA. Phylogenetic analysis of complete (LTR-gag-pol-env-LTR) proviral sequences showed that HERV-T2 was subjected to APOBEC3-mediated hypermutation, and subsequently expanded in apes, most likely through retrotransposon-like mechanisms. Phylogenetic and statistical analysis of HERV-T3 proviruses allowed us to estimate the sequence of their ~32 MY old ancestor, revealing that its unusually long leader sequence encoded an 855-nucleotide ORF separated from gag by 36 nucleotides. This pre-gag ORF of unknown function putatively codes for a protein that includes a transmembrane domain. Additional analysis of the HERV-T3 ancestral sequence allowed us to reconstruct the corresponding env sequence (ancHTenv). We found that a modern gammaretrovirus (MLV) could be pseudotyped with ancHTenv, enabling it to infect a wide variety of primate cell lines with titers that are similar to those of MLV particles carrying the amphotropic MLV envelope. A single HERV-T proviral insertion in the genome of all great apes contains an env gene with full coding potential. 
Proteins encoded by the extant human HERV-T envelope gene (HsaHTenv) and one estimated to be encoded by the hominid ancestor were not able to generate infectious MLV pseudotyped particles, probably because HsaHTenv is not correctly processed into its mature and functional form. Statistical and phylogenetic analyses indicate that the env gene in this locus is evolving more slowly than the rest of the proviral sequences, and that selective pressures have acted on this locus to conserve its envelope sequence. Remarkably, we found that expression of HsaHTenv was able to specifically block infection by MLV particles pseudotyped with ancHTenv, but not particles pseudotyped with the amphotropic MLV envelope. Additionally, we identified MOT1 as the receptor used by ancHTenv. Further experiments are needed in order to test the hypothesis that HsaHTenv served as a restriction factor through interference with the receptor once used by HERV-T. As paleovirology also studies the evolution of the host defense mechanisms that have been shaped by past retroviral infections, we investigated the origins and evolution of tetherin, an orphan antiviral protein with no known homologs. We found that tetherin function is encoded by genes that exhibit no sequence homology and share only a common architecture and location in modern jawed vertebrate genomes, indicating an origin ~450 MYA. Moreover, tetherin is part of a cluster of three potential sister genes that includes pv1 and a putative gene of unknown function, here referred to as tm-cc(at), which encode proteins of similar architecture. Some variants of these proteins exhibit antiviral activity while others can be endowed with antiviral activity following a simple modification. Only in a slowly evolving species (coelacanths) does Tetherin exhibit homology to TMCC(aT). 
We suggest that neofunctionalization, drift and positive selection drove a near complete loss of sequence similarity among modern tetherin genes, and between tetherin and its sister genes. Scenarios by which this orphan gene may have arisen and evolved exemplify how protein modularity, evolvability and robustness can create new functions and preserve them, despite sequence divergence due to genetic conflict with past and present viruses.

    A haplome alignment and reference sequence of the highly polymorphic Ciona savignyi genome

    The high degree of polymorphism in the genome of the sea squirt Ciona savignyi complicated the assembly of sequence contigs, but a new alignment method results in a much improved sequence.

    Genome Evolution in the Salicaceae: Genetic Novelty, Horizontal Gene Transfer, and Comparative Genomics

    Genome evolution is a powerful force that shapes genomes over time through processes like mutation, horizontal transfer, and sexual reproduction. Although the questions that explore genome evolution are broad, they are all addressed through the discovery and comparison of genetic variation. For example, genetic diversity may explain differences in phenotypes and the etiology of disease, and is essential for phylogenomic analysis. Recently, the democratization of next generation and third generation DNA sequencing technologies has allowed genomics to produce large amounts of sequence data. This has facilitated the capture of genetic variation at species and population scales. Populus and Salix are members of the Salicaceae family and are ecologically and economically important woody plants. Currently, there are multiple high-quality reference genomes available for these two genera. Two important sources of genome evolution that will be explored here are genetic novelty in the form of new genes and horizontal gene transfer from the organelle genomes. In the context of genome evolution, both processes have been shown to contribute to beneficial phenotypes as well as disease. The primary contributions of this dissertation research are to identify and assign putative functions to orphan and de novo genes in P. trichocarpa, identify and compare horizontal transfer from the organelle genomes to the nuclear genomes of P. trichocarpa and P. deltoides, and generate new organelle genome resources for six different Salix species.