10 research outputs found

    Computing Phylo-k-mers

    Full text link
    Phylogenetically informed k-mers, or phylo-k-mers for short, are k-mers that are predicted to appear within a given genomic region at predefined locations of a fixed phylogeny. Given a reference alignment for this genomic region and assuming a phylogenetic model of sequence evolution, we can compute a probability score for any given k-mer at any given tree node. The k-mers with sufficiently high probabilities can later be used to perform alignment-free phylogenetic classification of new sequences, a procedure recently proposed for the phylogenetic placement of metabarcoding reads and the detection of novel virus recombinants. While computing phylo-k-mers, we need to consider large numbers of k-mers at each tree node, which warrants the development of efficient enumeration algorithms. We consider a formal definition of the problem of phylo-k-mer computation: how can we efficiently find all k-mers whose probability lies above a user-defined threshold for a given tree node? We describe and analyze algorithms for this problem, relying on branch-and-bound and divide-and-conquer techniques. We exploit the redundancy of adjacent windows of the alignment and the structure of the probability matrix to save on computation. Besides computational complexity analyses, we provide an empirical evaluation of the relative performance of their implementations on real-world and simulated data. The divide-and-conquer algorithms, which to the best of our knowledge are novel, are found to be clear improvements over the branch-and-bound approach, especially when a large number of phylo-k-mers are found.
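
    The thresholded enumeration problem stated in this abstract can be illustrated with a minimal branch-and-bound sketch. It is not the authors' implementation. It assumes, purely for illustration, that a k-mer's score in a window is the product of independent per-column letter probabilities; the names `ALPHABET`, `enumerate_kmers`, and `window` are hypothetical.

```python
from typing import Iterator, List, Tuple

ALPHABET = "ACGT"  # assumed DNA alphabet; column a of each row matches ALPHABET[a]


def enumerate_kmers(
    window: List[List[float]], threshold: float
) -> Iterator[Tuple[str, float]]:
    """Yield (k-mer, score) for every k-mer whose score exceeds `threshold`.

    Assumption: window[j][a] is the probability of ALPHABET[a] at column j,
    and a k-mer's score is the product of its per-column probabilities.
    The window is expected to have at least one column.
    """
    k = len(window)
    # best_suffix[j] = largest product achievable over columns j..k-1.
    best_suffix = [1.0] * (k + 1)
    for j in range(k - 1, -1, -1):
        best_suffix[j] = max(window[j]) * best_suffix[j + 1]

    def extend(j: int, prefix: List[str], score: float):
        if j == k:
            yield "".join(prefix), score
            return
        for a, letter in enumerate(ALPHABET):
            new_score = score * window[j][a]
            # Bound: skip the branch if even the best completion of this
            # prefix cannot exceed the threshold.
            if new_score * best_suffix[j + 1] <= threshold:
                continue
            prefix.append(letter)
            yield from extend(j + 1, prefix, new_score)
            prefix.pop()

    yield from extend(0, [], 1.0)
```

    For the two-column window `[[0.7, 0.1, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]]` and a threshold of 0.1, only the k-mers AC and AG survive; every branch starting with C, G, or T is pruned after its first letter because its best possible score is 0.06.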

    Calcul de k-mers informatifs pour le placement phylogénétique

    No full text
    Given an evolutionary tree of species (a phylogeny) and the reference sequences used to build it, phylogenetic placement attempts to determine the branch of the phylogeny from which a query sequence originates. Its main application is species identification, an essential bioinformatics question arising in ecology, agronomy, and medicine. So-called alignment-free algorithmic methods offer a new approach that avoids aligning the query sequence to the reference sequences, a step that severely limits the scalability of phylogenetic placement in the era of high-throughput sequencing (HTS). RAPPAS, one of the alignment-free methods, introduces the concept of the phylogenetically informed k-mer, or phylo-k-mer for short. For an integer k, a phylo-k-mer is a sequence of length k (a k-mer) associated with probabilities of being observed on the branches of the phylogeny. For a k-mer taken from a query sequence, this makes it possible to estimate the probability that it comes from each branch of the phylogeny. RAPPAS preprocesses the phylogeny and its associated sequences to compute the phylo-k-mers and store them in an index. Once built, this index allows very large numbers of query sequences to be placed; however, building it remains costly in computation time and memory. This thesis studies the efficient computation and indexing of phylo-k-mers. Chapter 1 introduces the topic after a historical overview of the biology and bioinformatics concepts needed to understand it. It discusses the importance of species identification through sequencing and bioinformatics, as well as the challenge posed by HTS. Finally, it presents the state of the art in phylogenetic placement. Chapter 2 describes and analyzes the existing algorithm for the central step of phylo-k-mer computation: computing the phylo-k-mers for a window of length k of the reference alignment. It also proposes a new algorithm based on a divide-and-conquer strategy. This approach outperforms the existing algorithm both in theory and in practice. However, the memory occupied by the phylo-k-mer index, and the number of phylo-k-mers, can be problematic in some cases. Chapter 3 proposes selecting the most informative phylo-k-mers on the basis of mutual information. The proposed filtering algorithm considerably reduces the space required, with a negligible impact on placement accuracy. The chapter also examines the connection between RAPPAS and machine-learning methods for text classification based on the naive Bayes approach. Chapter 4 describes two new programs for computing and using phylo-k-mers: XPAS, for efficiently computing phylo-k-mer indexes and storing them on disk, and RAPPAS2, which reimplements the original phylogenetic placement algorithm of RAPPAS. Experimental results show that these two programs, which together replace the original RAPPAS, greatly reduce memory usage and substantially improve running times. Their efficiency comes from their implementation in modern C++ and from their optimization, and both programs are ready to use. Finally, the concluding chapter discusses research directions for the use of phylo-k-mers, the challenges ahead, and the outlook for phylogenetic placement.

    Computing informative k-mers for phylogenetic placement

    No full text
    Phylogenetic placement determines possible phylogenetic origins of unknown query DNA or protein sequences, given a fixed reference phylogeny. Its main application is species identification, an essential bioinformatics problem with applications in environmental ecology, microbial diversity studies, and medicine. Alignment-free methods for phylogenetic placement are a novel group of methods designed to eliminate the need to align query sequences to reference sequences, a current limit to the applicability of phylogenetic placement methods in the next-generation sequencing (NGS) era. One such method is RAPPAS. It introduced the concept of phylogenetically aware k-mers (phylo-k-mers): k-mers paired with relevant probabilistic information about the reference phylogeny. This information determines how probable it is to observe any k-mer in hypothetical sequences arising from different parts of the reference tree. RAPPAS preprocesses the reference phylogenetic tree and alignment, computing phylo-k-mers. This allows fast phylogenetic placement of vast amounts of query sequences; however, the computation of phylo-k-mers is expensive in both running time and memory. This thesis studies the problem of effective indexing of reference phylogenies with phylo-k-mers. Chapter 1 gently introduces the reader to the problem. Starting with a historical overview of biology and bioinformatics over the last decades, it discusses the importance of sequence identification in modern bioinformatics, which is overwhelmed by the amounts of sequencing data produced by NGS technologies. Then, it surveys existing methods of phylogenetic placement and discusses their limitations. Chapter 2 describes and analyzes the existing solution for the central algorithmic problem of phylo-k-mer computation: computing phylo-k-mers for one node in a k-sized window of the reference alignment. In addition, it describes a novel algorithm for this problem based on the divide-and-conquer approach. This algorithm improves on the existing solution both theoretically and in practice. Chapter 3 proposes a novel method of filtering phylo-k-mers based on mutual information. This method significantly reduces the memory consumption of phylogenetic placement with a negligible decrease in placement accuracy. It also describes how RAPPAS is connected to well-studied methods of text classification with naive Bayes. Finally, Chapter 4 presents two new phylo-k-mer-related tools: XPAS, for efficient computation of phylo-k-mers, and RAPPAS2, an effective reimplementation of RAPPAS. The experimental results show that XPAS and RAPPAS2 outperform RAPPAS in both running speed and memory consumption. Both tools are written in modern C++, optimized for efficiency, and ready to use. The final chapter discusses possible directions of future work on phylo-k-mer-related methods, the challenges that are yet to be overcome, and the future of phylogenetic placement.

    PEWO: a collection of workflows to benchmark phylogenetic placement

    Get PDF
    Motivation: Phylogenetic placement (PP) is a process of taxonomic identification for which several tools are now available. However, it remains difficult to assess which tool is better suited to particular genomic data or a particular reference taxonomy. We developed PEWO, the first benchmarking tool dedicated to PP assessment. Its automated workflows can evaluate PP at many levels, from parameter optimisation for a particular tool to the selection of the most appropriate genetic marker when PP-based species identifications are targeted. Our goal is that PEWO will become a community effort and a standard support for future developments and applications of PP. Availability: https://github.com/phylo42/PEW

    Inequalities for Shannon entropy and Kolmogorov complexity

    Get PDF
    It was mentioned by Kolmogorov in [5] that the properties of algorithmic complexity and Shannon entropy are similar. We investigate one aspect of this similarity. Namely, we are interested in linear inequalities that are valid for Shannon entropy and for Kolmogorov complexity. It turns ou

    Upper Semi-Lattice of Binary Strings With the Relation "x is Simple Conditional to Y"

    No full text
    We study the properties of the set of binary strings with the relation "the Kolmogorov complexity of x conditional to y is small". We prove that there are pairs of strings which have no greatest common lower bound with respect to this pre-order. We present several examples where the greatest common lower bound exists but its complexity is much less than the mutual information (extending the Gács and Körner result [2]). 1 Introduction. The family of Turing degrees is a well-studied object. Is there any finite analog of it? In the present paper we study such a finite analog: instead of subsets of N we take binary strings, and instead of Turing reducibility we take the relation "Kolmogorov complexity of x conditional to y is small". This structure has many properties in common with the set of Turing degrees. For example, it also forms an upper semi-lattice. It is also richer, since we can measure the complexity of finite strings. Of course, we should make precise what it means that the Kolmogorov com..

    Upper semi-lattice of binary strings with the relation "x is simple conditional to y"

    No full text
    In this paper we construct a structure R that is a “finite version” of the semi-lattice of Turing degrees. Its elements are strings (technically, sequences of strings) and x ≤ y means that K(x|y), the conditional Kolmogorov complexity of x relative to y, is small. We construct two elements in R that do not have a greatest lower bound. We give a series of examples that show how natural algebraic constructions give two elements that have lower bound 0 (minimal element) but significant mutual information. (A first example of that kind was constructed by Gács–Körner (Problems Control Inform. Theory 2 (1973) 149) using a completely different technique.) We define a notion of “complexity profile” of the pair of elements of R and give (exact) upper and lower bounds for it in a particular case.