Computing Phylo-k-mers

Author: Linard Benjamin
Pardi Fabio
Rivals Eric
Romashchenko Nikolai
Publication date: 19/09/2022

Phylogenetically informed k-mers, or phylo-k-mers for short, are k-mers that are predicted to appear within a given genomic region at predefined locations of a fixed phylogeny. Given a reference alignment for this genomic region and assuming a phylogenetic model of sequence evolution, we can compute a probability score for any given k-mer at any given tree node. The k-mers with sufficiently high probabilities can later be used to perform alignment-free phylogenetic classification of new sequences, a procedure recently proposed for the phylogenetic placement of metabarcoding reads and the detection of novel virus recombinants. While computing phylo-k-mers, we need to consider large numbers of k-mers at each tree node, which warrants the development of efficient enumeration algorithms. We consider a formal definition of the problem of phylo-k-mer computation: how can we efficiently find all k-mers whose probability lies above a user-defined threshold for a given tree node? We describe and analyze algorithms for this problem, relying on branch-and-bound and divide-and-conquer techniques. We exploit the redundancy of adjacent windows of the alignment and the structure of the probability matrix to save on computation. Besides computational complexity analyses, we provide an empirical evaluation of the relative performance of their implementations on real-world and simulated data. The divide-and-conquer algorithms, which to the best of our knowledge are novel, are found to be clear improvements over the branch-and-bound approach, especially when a large number of phylo-k-mers are found.
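
The enumeration problem stated in this abstract (find every k-mer whose probability exceeds a threshold) can be illustrated with a minimal branch-and-bound sketch in Python. This is not the authors' implementation. It assumes that a k-mer's score is the product of one probability per alignment column, taken from a k-column probability matrix; the abstract mentions such a matrix but does not spell out the scoring. The names `enumerate_kmers`, `matrix`, and `best_suffix` are hypothetical.

```python
from typing import Dict, List, Tuple

ALPHABET = "ACGT"  # assumption: DNA alphabet


def enumerate_kmers(
    matrix: List[Dict[str, float]], threshold: float
) -> List[Tuple[str, float]]:
    """Return every k-mer whose score is strictly above `threshold`.

    `matrix[j][base]` is the (assumed) probability of `base` at column j.
    The score of a k-mer is the product of its per-column probabilities.
    """
    k = len(matrix)

    # best_suffix[j] = largest product achievable over columns j..k-1.
    best_suffix = [1.0] * (k + 1)
    for j in range(k - 1, -1, -1):
        best_suffix[j] = best_suffix[j + 1] * max(matrix[j].values())

    results: List[Tuple[str, float]] = []

    def extend(prefix: str, score: float) -> None:
        j = len(prefix)
        if j == k:
            results.append((prefix, score))
            return
        for base in ALPHABET:
            new_score = score * matrix[j][base]
            # Bound: skip this branch if even the best possible
            # completion cannot exceed the threshold.
            if new_score * best_suffix[j + 1] <= threshold:
                continue
            extend(prefix + base, new_score)

    extend("", 1.0)
    return results


if __name__ == "__main__":
    # Toy 2-column matrix, made up for illustration only.
    example = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.5, "C": 0.3, "G": 0.1, "T": 0.1},
    ]
    for kmer, score in enumerate_kmers(example, 0.2):
        print(kmer, score)
```

The bound uses the best achievable suffix product, so a whole subtree of prefixes is discarded as soon as it cannot reach the threshold. The divide-and-conquer algorithms described in the abstract are a different technique and are not shown here.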

arXiv.org e-Print Archive

Calcul de k-mers informatifs pour le placement phylogénétique

Author: Romashchenko Nikolai
Publication venue: HAL CCSD
Publication date: 14/12/2021

Phylogenetic placement determines possible phylogenetic origins of unknown query DNA or protein sequences, given a fixed reference phylogeny. Its main application is species identification, an essential bioinformatics problem with applications in environmental ecology, microbial diversity studies, and medicine. Alignment-free methods for phylogenetic placement are a novel group of methods designed to eliminate the need to align query sequences with reference sequences, a step that currently limits the applicability of phylogenetic placement methods in the next-generation sequencing (NGS) era. One such method is RAPPAS. It introduced the concept of phylogenetically aware k-mers (phylo-k-mers): k-mers paired with relevant probabilistic information about the reference phylogeny. This information determines how probable it is to observe any k-mer in hypothetical sequences arising from different parts of the reference tree. RAPPAS preprocesses the reference phylogenetic tree and alignment, computing phylo-k-mers. This allows fast phylogenetic placement of vast amounts of query sequences; however, the computation of phylo-k-mers is expensive in both running time and memory. This thesis studies the problem of effective indexing of reference phylogenies with phylo-k-mers. Chapter 1 gently introduces the reader to the problem. Starting with a historical overview of biology and bioinformatics over the last decades, it discusses the importance of sequence identification in modern bioinformatics, which is overwhelmed by the amounts of sequencing data produced by NGS technologies. Then, it surveys existing methods of phylogenetic placement and discusses their limitations. Chapter 2 describes and analyzes the existing solution for the central algorithmic problem of phylo-k-mer computation: computing phylo-k-mers for one node in a k-sized window of the reference alignment. In addition, it describes a novel algorithm for this problem based on the divide-and-conquer approach. This algorithm improves on the existing solution both theoretically and in practice. Chapter 3 proposes a novel method of filtering phylo-k-mers based on mutual information. This method significantly reduces the memory consumption of phylogenetic placement with a negligible decrease in placement accuracy. It also describes how RAPPAS is connected to well-studied methods of text classification with Naive Bayes. Finally, Chapter 4 presents two new phylo-k-mer-related tools: XPAS for efficient computation of phylo-k-mers and RAPPAS2, an effective reimplementation of RAPPAS. The experimental results show that XPAS and RAPPAS2 outperform RAPPAS in both running speed and memory consumption. Both tools are written in modern C++, optimized for efficiency, and ready to use. The final chapter discusses possible directions of future work on phylo-k-mer-related methods, the challenges that are yet to be overcome, and the future of phylogenetic placement.

Thèses en Ligne

Hal-Diderot

Computing informative k-mers for phylogenetic placement

Author: Romashchenko Nikolai
Publication date: 14/12/2021

Given a tree describing the evolution of species (a phylogeny) and the reference sequences used to build it, phylogenetic placement attempts to determine the branch of origin of a query sequence in the phylogeny. The main application of phylogenetic placement is species identification, an essential bioinformatics question with uses in ecology, agronomy, and medicine. So-called "alignment-free" algorithmic methods offer a new approach that avoids aligning the query sequence with the reference sequences, a step that severely limits the scalability of phylogenetic placement in the era of high-throughput sequencing (HTS) technologies. RAPPAS, an alignment-free method, introduced the concept of the phylogenetically informed k-mer, or phylo-k-mer for short. For an integer k, this is a sequence of length k (a k-mer) associated with probabilities of observation on the branches of the phylogeny. For a k-mer taken from a query sequence, this makes it possible to estimate the probability that it originates from each branch of the phylogeny. RAPPAS preprocesses the phylogeny and the associated sequences to compute the phylo-k-mers and store them in an index. Once computed, this index makes it possible to place huge numbers of query sequences; however, building it remains costly in computation time and memory. This thesis studies the efficient computation and indexing of phylo-k-mers. Chapter 1 introduces the topic after a historical overview of the notions of biology and bioinformatics needed to understand it. It discusses the importance of species identification through sequencing and bioinformatics, as well as the challenge posed by HTS. Finally, it presents the bioinformatics state of the art in phylogenetic placement. Chapter 2 describes and analyzes the existing algorithm for the central step of phylo-k-mer computation: computing the phylo-k-mers for a window of length k of the reference alignment. In addition, it proposes a new algorithm using a "divide-and-conquer" strategy. This approach outperforms the existing algorithm both in theory and in practice. However, the memory occupied by the phylo-k-mer index, and the number of phylo-k-mers, can be problematic in some cases. Chapter 3 proposes selecting the most informative phylo-k-mers on the basis of mutual information. The proposed filtering algorithm considerably reduces the space required while affecting placement accuracy only negligibly. Finally, it examines the connection between RAPPAS and machine-learning methods for text classification based on the so-called "naive Bayes" approach. Chapter 4 describes two new programs for computing and using phylo-k-mers: XPAS, for efficiently computing phylo-k-mer indexes and storing them on disk, and RAPPAS2, which reimplements the original phylogenetic placement algorithm of RAPPAS. The experimental results show that these two programs, which together replace the original RAPPAS, greatly reduce memory usage and substantially improve computation times. Their efficiency stems from their implementation in modern C++ and from their optimization, which makes them programs that are already usable. Finally, the concluding chapter discusses research directions for the use of phylo-k-mers, upcoming challenges, and the outlook for phylogenetic placement.

Theses.fr

PEWO: a collection of workflows to benchmark phylogenetic placement

Author: Linard Benjamin
Pardi Fabio
Rivals Eric
Romashchenko Nikolai
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2020

Motivation: Phylogenetic placement (PP) is a process of taxonomic identification for which several tools are now available. However, it remains difficult to assess which tool is best suited to particular genomic data or a particular reference taxonomy. We developed PEWO, the first benchmarking tool dedicated to PP assessment. Its automated workflows can evaluate PP at many levels, from parameter optimisation for a particular tool to the selection of the most appropriate genetic marker when PP-based species identifications are targeted. Our goal is that PEWO will become a community effort and a standard support for future developments and applications of PP. Availability: https://github.com/phylo42/PEW

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

HAL-CEA

Inequalities for Shannon entropy and Kolmogorov complexity

Author: Alexander Shen
Andrei Romashchenko
Daniel Hammer
Nikolai Vereshchagin

It was mentioned by Kolmogorov in [5] that the properties of algorithmic complexity and Shannon entropy are similar. We investigate one aspect of this similarity. Namely, we are interested in linear inequalities that are valid for Shannon entropy and for Kolmogorov complexity. It turns ou

CiteSeerX

Elsevier - Publisher Connector

Upper Semi-Lattice of Binary Strings With the Relation "x is Simple Conditional to Y"

Author: Alexander Shen
Andrei Muchnik
Andrei Romashchenko
Nikolai Vereshchagin

We study the properties of the set of binary strings with the relation "the Kolmogorov complexity of x conditional to y is small". We prove that there are pairs of strings which have no greatest common lower bound with respect to this pre-order. We present several examples where the greatest common lower bound exists but its complexity is much less than the mutual information (extending the Gács and Körner result [2]).
1 Introduction
The family of Turing degrees is a well-studied object. Is there any finite analog of it? In the present paper we study such a finite analog: instead of subsets of N we take binary strings, and instead of Turing reducibility we take the relation "Kolmogorov complexity of x conditional to y is small". This structure has many properties in common with the set of Turing degrees. For example, it also forms an upper semi-lattice. It is also richer, since we can measure the complexity of finite strings. Of course, we should make precise what it means that the Kolmogorov com..

CiteSeerX

Upper semi-lattice of binary strings with the relation “x is simple conditional to y”

Author: Alexander Shen
Alexei Chernov
Andrei Romashchenko
Andrej Muchnik
Nikolai Vereshchagin
Publication venue: 'Elsevier BV'

Crossref

Upper semi-lattice of binary strings with the relation "x is simple conditional to y"

Author: Chernov Alexey
Muchnik Andrej
Romashchenko Andrei
Shen Alexander
Vereshchagin Nikolai
Publication venue: 'Elsevier BV'
Publication date: 28/01/2002

In this paper we construct a structure R that is a “finite version” of the semi-lattice of Turing degrees. Its elements are strings (technically, sequences of strings) and x⩽y means that K(x|y) (the conditional Kolmogorov complexity of x relative to y) is small. We construct two elements in R that do not have a greatest lower bound. We give a series of examples that show how natural algebraic constructions give two elements that have lower bound 0 (the minimal element) but significant mutual information. (A first example of that kind was constructed by Gács–Körner (Problems Control Inform. Theory 2 (1973) 149) using a completely different technique.) We define a notion of “complexity profile” of a pair of elements of R and give (exact) upper and lower bounds for it in a particular case.

Elsevier - Publisher Connector

University of Brighton Research Portal