411 research outputs found

String Inference from Longest-Common-Prefix Array

Author: Kärkkäinen Juha
Piątkowski Marcin
Puglisi Simon J.
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2017

Peer reviewed

Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

A representation of a compressed de Bruijn graph for pan-genome analysis that enables search

Author: Beller Timo
Ohlebusch Enno
Publication date: 01/01/2016

Recently, Marcus et al. (Bioinformatics 2014) proposed to use a compressed de Bruijn graph to describe the relationship between the genomes of many individuals/strains of the same or closely related species. They devised an $O(n \log g)$ time algorithm called splitMEM that constructs this graph directly (i.e., without using the uncompressed de Bruijn graph) based on a suffix tree, where $n$ is the total length of the genomes and $g$ is the length of the longest genome. In this paper, we present a construction algorithm that outperforms their algorithm in theory and in practice. Moreover, we propose a new space-efficient representation of the compressed de Bruijn graph that adds the possibility to search for a pattern (e.g. an allele - a variant form of a gene) within the pan-genome. Comment: Submitted to Algorithmica special issue of CPM201

arXiv.org e-Print Archive

Springer - Publisher Connector

Enhanced suffix arrays as language models: Virtual k-testable languages

Author: Stehouwer H.
van Zaanen M.
Publication date: 01/01/2010

In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with practically unlimited size n. This approach, which is used with synchronous back-off, allows us to distinguish between alternative sequences using large contexts. We also show that we can build this kind of model with additional information for each symbol, such as part-of-speech tags and dependency information. The approach can also be viewed as a collection of virtual k-testable automata. Once built, we can directly access the results of any k-testable automaton generated from the input training data. Synchronous back-off automatically identifies the k-testable automaton with the largest feasible k. We have used this approach in several classification tasks.

MPG.PuRe
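
The suffix-array idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the suffix array is built naively by sorting, and `longest_seen_context` is a simplified stand-in for back-off rather than the synchronous back-off the article describes. The point it shows is that one sorted array of suffixes answers count queries for n-grams of any length n.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens`, in sorted order.

    Naive construction by sorting whole suffixes; adequate for a small
    illustration only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, ngram, upper):
    """Binary search over suffixes truncated to len(ngram) tokens.

    Returns the first index whose truncated suffix is >= ngram
    (or > ngram when `upper` is True).
    """
    k = len(ngram)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + k]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(tokens, sa, ngram):
    """Count occurrences of `ngram` (a list of tokens of any length)."""
    ngram = list(ngram)
    return _bound(tokens, sa, ngram, True) - _bound(tokens, sa, ngram, False)


def longest_seen_context(tokens, sa, context):
    """Return the longest suffix of `context` that occurs in the training data.

    Simplified back-off (an assumption for illustration): drop the oldest
    token until the remaining context has a non-zero count.
    """
    context = list(context)
    for start in range(len(context)):
        if count_ngram(tokens, sa, context[start:]) > 0:
            return context[start:]
    return []
```

For example, with `tokens = "the cat sat on the mat the cat ran".split()` and `sa = build_suffix_array(tokens)`, the call `count_ngram(tokens, sa, ["the", "cat"])` counts the bigram without any table that fixes n in advance.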

Towards Distance-Based Phylogenetic Inference in Average-Case Linear-Time

Author: Crochemore Maxime
Francisco Alexandre P.
Pissis Solon P.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)
Publication date: 01/01/2017

Computing genetic evolution distances among a set of taxa dominates the running time of many phylogenetic inference methods. Most genetic evolution distance definitions rely, even if indirectly, on computing the pairwise Hamming distance among sequences or profiles. We propose here an average-case linear-time algorithm to compute pairwise Hamming distances among a set of taxa under a given Hamming distance threshold. This article includes both a theoretical analysis and extensive experimental results concerning the proposed algorithm. We further show how this algorithm can be successfully integrated into a well-known phylogenetic inference method.

Dagstuhl Research Online Publication Server
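
To make the problem statement in the abstract above concrete, here is a baseline sketch of computing pairwise Hamming distances under a threshold. It is the straightforward quadratic comparison with early termination, not the average-case linear-time algorithm the paper proposes; the function names and the dictionary output format are assumptions made for illustration.

```python
from itertools import combinations


def hamming_within(a, b, threshold):
    """Return the Hamming distance of `a` and `b` if it is <= threshold, else None.

    Stops scanning as soon as the number of mismatches exceeds the threshold.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > threshold:
                return None
    return mismatches


def pairwise_within(seqs, threshold):
    """Map each index pair (i, j), i < j, to its distance when within threshold."""
    result = {}
    for (i, a), (j, b) in combinations(enumerate(seqs), 2):
        d = hamming_within(a, b, threshold)
        if d is not None:
            result[(i, j)] = d
    return result
```

This baseline compares every pair of sequences, which is the cost the abstract's algorithm is designed to avoid on average.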

Speeding up tandem mass spectrometry-based database searching by longest common prefix

Author: Chi Hao
Fu Yan
He Si-Min
Li You
Sun Rui-Xiang
Wang Le-Heng
Wu Yan-Jie
Zhou Chen
Publication venue: BioMed Central
Publication date: 01/01/2010

Crossref

Springer - Publisher Connector

PubMed Central

phyBWT: Alignment-Free Phylogeny via eBWT Positional Clustering

Author: Conte Alessio
Grossi Roberto
Guerrini Veronica
Liti Gianni
Rosone Giovanna
Tattini Lorenzo
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)
Publication date: 01/01/2022

Molecular phylogenetics is a fundamental branch of biology. It studies the evolutionary relationships among the individuals of a population through their biological sequences, and may provide insights about the origin and the evolution of viral diseases, or highlight complex evolutionary trajectories. In this paper we develop a method called phyBWT, describing how to use the extended Burrows-Wheeler Transform (eBWT) for a collection of DNA sequences to directly reconstruct phylogeny, bypassing the alignment against a reference genome or de novo assembly. Our phyBWT hinges on the combinatorial properties of the eBWT positional clustering framework. We employ eBWT to detect relevant blocks of the longest shared substrings of varying length (unlike the k-mer-based approaches that need to fix the length k a priori), and build a suitable decomposition leading to a phylogenetic tree, step by step. As a result, phyBWT is a new alignment-, assembly-, and reference-free method that builds a partition tree without relying on the pairwise comparison of sequences, thus avoiding the use of a distance matrix to infer phylogeny. The preliminary experimental results on sequencing data show that our method can handle datasets of different types (short reads, contigs, or entire genomes), producing trees of quality comparable to that found in the benchmark phylogeny.

INRIA a CCSD electronic archive server

Dagstuhl Research Online Publication Server

Tight Upper and Lower Bounds on Suffix Tree Breadth

Author: Badkobeh Golnaz
Gawrychowski Pawel
Kärkkäinen Juha
Puglisi Simon
Zhukova Bella
Publication date: 01/01/2021

The suffix tree - the compacted trie of all the suffixes of a string - is the most important and widely-used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string $S$ of length $n$, how many nodes $\nu_S(d)$ can there be at (string) depth $d$ in its suffix tree? We prove $\nu(n, d) = \max_{S \in \Sigma^n} \nu_S(d)$ is $O((n/d) \log(n/d))$, and show that this bound is asymptotically tight, describing strings for which $\nu_S(d)$ is $\Omega((n/d) \log(n/d))$. (C) 2020 Elsevier B.V. All rights reserved. Peer reviewed

Goldsmiths Research Online

Helsingin yliopiston digitaalinen arkisto
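
The quantity studied in the suffix tree breadth abstract above, the number of suffix-tree nodes at a given string depth d, can be explored on small inputs without building the tree. The sketch below rests on stated assumptions: it counts only internal (branching) nodes, it appends a unique sentinel character to the string, and it uses the standard fact that a substring is an internal node exactly when it is followed by at least two distinct characters. Whether the paper's definition also counts leaves is not stated in the abstract, and the function name is hypothetical.

```python
from collections import defaultdict


def branching_nodes_at_depth(s, d, sentinel="$"):
    """Count internal suffix-tree nodes of `s + sentinel` at string depth `d`.

    Assumes `sentinel` does not occur in `s`. Leaves are not counted.
    """
    text = s + sentinel
    followers = defaultdict(set)
    # Record, for every length-d substring, the characters that follow it.
    for i in range(len(text) - d):
        followers[text[i:i + d]].add(text[i + d])
    # A substring is a branching node iff at least two characters follow it.
    return sum(1 for nxt in followers.values() if len(nxt) >= 2)
```

For the string `banana` the internal nodes are the root and the substrings `a`, `na` and `ana`, so the function reports one node at each of the depths 0 through 3 and none at depth 4.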