An Extensible, Scalable Spark Platform for Alignment-free Genomic Analysis -- Version 2
Motivation: Alignment-free distance and similarity functions (AF functions,
for short) are a computationally convenient alternative to pairwise and multiple
sequence alignment for many genomic, metagenomic and epigenomic tasks. Yet,
their use is still at the proof-of-principle stage: only recently has a
benchmarking study coherently evaluated a handful of the functions proposed
over the years, identifying a pool of well-performing ones. However, more is
needed to make this pool usable on a day-to-day basis. In particular,
quantifying the statistical significance of the output of a given
function would greatly help when no reference point is available. For most
functions, such an analysis is bound to be based on Monte Carlo Hypothesis Test
simulations, whose dramatic increase in computational time turns the task
into a Big Data problem. Surprisingly, this aspect has hardly been considered,
despite the increasing popularity of Big Data Technologies in Computational
Biology. Results: We fill this important gap by providing the first
user-friendly, extensible, efficient Spark platform for Alignment-free genomic
analysis. Thanks to its scalability, Monte Carlo Hypothesis Test simulations on
the output of AF functions become affordable for both small and huge
collections of sequences. Thus, we are able, for the first time, to compare
AF functions with respect to the statistical significance of their
output. This novel analysis allows us to reduce the pool of well-performing
functions coming from the benchmarking study to a handful of them
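
To make the cost argument concrete, the following is a minimal single-machine sketch of a Monte Carlo Hypothesis Test on the output of an AF function. It is not the platform's implementation. The choice of AF function (Euclidean distance between k-mer count vectors), the null model (shuffling one sequence, which preserves its base composition), and all names (`kmer_counts`, `euclidean_af`, `monte_carlo_p_value`) are assumptions made for illustration only.

```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_af(a, b, k=3):
    """Illustrative AF function: Euclidean distance between k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb)) ** 0.5


def monte_carlo_p_value(a, b, k=3, trials=200, seed=0):
    """Estimate how often a random sequence is at least as close to `a` as `b` is.

    Assumed null model: `b` is replaced by a random shuffle of its own symbols.
    Returns the observed distance and the estimated p-value.
    """
    rng = random.Random(seed)
    observed = euclidean_af(a, b, k)
    hits = 0
    for _ in range(trials):
        shuffled = list(b)
        rng.shuffle(shuffled)
        # One full AF evaluation per simulated sequence: this is the cost
        # that grows with the number of trials and of sequence pairs.
        if euclidean_af(a, "".join(shuffled), k) <= observed:
            hits += 1
    # Add-one correction: the observed value counts as one sample of the null.
    return observed, (hits + 1) / (trials + 1)
```

Each sequence pair requires `trials` additional evaluations of the AF function, so the total work scales as the number of pairs times the number of trials. This multiplication is what motivates distributing the computation.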