Alignment-free Genomic Analysis via a Big Data Spark Platform

Author: Cattaneo Giuseppe
Giancarlo Raffaele
Palini Francesco
Petrillo Umberto Ferraro
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2021

Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignment for many genomic, metagenomic and epigenomic tasks. Because these applications are data-intensive, computing AF functions is a Big Data problem, and the recent literature identifies the development of fast and scalable algorithms for computing them as a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions identified by a recent hallmark benchmarking study. FADE's development and potential impact comprise several novel aspects of interest, namely: (a) a considerable effort in the design of distributed algorithms, the most tangible result being a much faster execution time for reference methods such as MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. On the last point, we provide a novel and much-needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
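
To make the notion of an AF function concrete, here is a minimal single-machine sketch in Python. It is not FADE's code and does not use Spark; the function names and the choice of measures (Euclidean distance on k-mer counts, and the exact k-mer Jaccard index, which MinHash-based tools estimate from sketches) are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_af_distance(a, b, k):
    """Euclidean distance between the k-mer count vectors of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for missing words, so the union of keys is enough.
    return sqrt(sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys()))


def jaccard_similarity(a, b, k):
    """Exact Jaccard index of the two k-mer sets (no sketching)."""
    sa, sb = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    union = sa | sb
    # Two sequences with no k-mers at all are treated as identical.
    return len(sa & sb) / len(union) if union else 1.0
```

Both functions compare sequences through their word content alone, with no alignment step. In a distributed setting the expensive part is the k-mer counting, which is the kind of work a platform such as the one described above would spread across a cluster.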

arXiv.org e-Print Archive

Archivio della ricerca- Università di Roma La Sapienza

The distribution of word matches between Markovian sequences with periodic boundary conditions

Author: Burden Conrad J
Foret Sylvain
Leopardi Paul
Publication venue: 'Mary Ann Liebert Inc'
Publication date: 01/01/2014

Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of independent and identically distributed letters have been studied extensively, but no comprehensive study of the D2 distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome.
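
As a small illustration of the statistic itself, the following Python sketch counts exact word matches between two sequences. The function names and the `periodic` flag are assumptions introduced here; the flag treats each sequence as circular, which is one plausible reading of the periodic boundary conditions mentioned in the abstract. The sketch does not reproduce the paper's mean and variance formulas.

```python
from collections import Counter


def word_counts(seq, k, periodic=False):
    """Count words of length k in seq (assumes 1 <= k <= len(seq)).

    With periodic=True the sequence wraps around, so it contains
    exactly len(seq) words instead of len(seq) - k + 1.
    """
    ext = seq + seq[:k - 1] if periodic else seq
    return Counter(ext[i:i + k] for i in range(len(ext) - k + 1))


def d2(a, b, k, periodic=False):
    """Number of pairs (i, j) such that the word at position i of a
    equals the word at position j of b."""
    ca = word_counts(a, k, periodic)
    cb = word_counts(b, k, periodic)
    # Each shared word w contributes count_a(w) * count_b(w) matching pairs.
    return sum(n * cb[w] for w, n in ca.items())
```

For example, `d2("ACGT", "ACGT", 2)` counts the three shared words AC, CG and GT once each, while the periodic variant also sees the wrap-around word TA.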

The Australian National University

Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source

Author: Aho
Allauzen
Antzoulakos
Beaudoing
Boeva
Brazma
Chang
Cormen
Cowan
Crochemore
Denise
El Karoui
Erhardsson
Fiduccia
Frith
Fu
Geske
Godbole
Gregory Nuel
Hampson
Hopcroft
Jean-Guillaume Dumas
Kaltofen
Karlin
Kleffe
Knuth
Le Maout
Lladser
Mariño-Ramírez
Nicodème
Nuel
Pevzner
Prum
Reinert
Ribeca
Régnier
Stefanov
Storjohann
van Helden
Publication venue: 'Elsevier BV'
Publication date: 01/01/2012

We present two novel approaches for the computation of the exact distribution of a pattern in a long sequence. Both approaches take into account the sparse structure of the problem and are two-part algorithms. The first approach relies on a partial recursion after a fast computation of the second largest eigenvalue of the transition matrix of a Markov chain embedding. The second approach uses fast Taylor expansions of an exact bivariate rational reconstruction of the distribution. We illustrate the interest of both approaches on a simple toy example and two biological applications: the transcription factors of the Human Chromosome 5 and the PROSITE signatures of functional motifs in proteins. On these examples, our methods demonstrate their complementarity and their ability to extend the domain of feasibility for exact computations in pattern problems to a new level.
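
For readers unfamiliar with the underlying problem, the sketch below computes the exact distribution of the number of (overlapping) occurrences of a single word by a plain dynamic program over a pattern-matching automaton. It is a baseline illustration only: it assumes independent letters rather than a general Markov source, costs time proportional to the sequence length, and implements neither of the two sparse approaches described in the abstract. All names are hypothetical.

```python
from collections import defaultdict


def build_automaton(pattern, alphabet):
    """delta[state][letter] -> next state, where a state is the length of
    the longest prefix of pattern that is a suffix of the text read so far."""
    m = len(pattern)
    delta = []
    for state in range(m + 1):
        row = {}
        for letter in alphabet:
            s = pattern[:state] + letter
            t = min(m, len(s))
            # Shrink t until the last t letters of s form a pattern prefix.
            while t > 0 and s[len(s) - t:] != pattern[:t]:
                t -= 1
            row[letter] = t
        delta.append(row)
    return delta


def pattern_count_distribution(pattern, probs, n):
    """Return {count: probability} for occurrences of pattern in a sequence
    of n independent letters drawn with probabilities probs."""
    m = len(pattern)
    delta = build_automaton(pattern, list(probs))
    dist = {(0, 0): 1.0}  # (automaton state, occurrences so far) -> prob
    for _ in range(n):
        nxt = defaultdict(float)
        for (state, count), p in dist.items():
            for letter, p_letter in probs.items():
                t = delta[state][letter]
                # Reaching state m means one more occurrence just ended.
                nxt[(t, count + (1 if t == m else 0))] += p * p_letter
        dist = nxt
    out = defaultdict(float)
    for (_, count), p in dist.items():
        out[count] += p
    return dict(out)
```

The state space here grows with the number of occurrences tracked, and the loop runs once per sequence position, which is what makes exact computation costly for long sequences and motivates faster methods.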

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

Hal-Diderot