BlastGraph: intensive approximate pattern matching in string graphs and de-Bruijn graphs
Many de novo assembly tools have been created in the last few years to assemble short reads generated by high-throughput sequencing platforms. The core of almost all these assemblers is a string graph data structure that links reads together. This motivates our work: BlastGraph, a new algorithm performing intensive approximate string matching between a set of query sequences and a string graph. Our approach is similar to BLAST-like algorithms, with additional specifics that arise from matching on the graph data structure. Our results show that BlastGraph's performance permits its use on large graphs in reasonable time. We propose a Cytoscape plug-in for visualizing results as well as a command line program. These programs are available at http://alcovna.genouest.org/blastree/
Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index
Finding approximate occurrences of a pattern in a text using a full-text
index is a central problem in bioinformatics and has been extensively
researched. Bidirectional indices have opened new possibilities in this regard
allowing the search to start from anywhere within the pattern and extend in
both directions. In particular, use of search schemes (partitioning the pattern
and searching the pieces in certain orders with given bounds on errors) can
yield significant speed-ups. However, finding optimal search schemes is a
difficult combinatorial optimization problem.
Here, for the first time, we propose a mixed integer program (MIP) capable of
solving this optimization problem for Hamming distance with a given number of
pieces. Our experiments show that the optimal search schemes found by our MIP
significantly improve the performance of search in a bidirectional FM-index over
previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina
reads (with two errors) becomes 35 times faster than standard backtracking.
Moreover, despite being performed purely in the index, the running time of
search using our optimal schemes (for up to two errors) is comparable to the
best state-of-the-art aligners, which benefit from combining search in index
with in-text verification using dynamic programming. As a result, we anticipate
that a full-fledged aligner that intelligently combines search in the
bidirectional FM-index using our optimal search schemes with in-text
verification using dynamic programming will outperform today's best aligners.
The development of such an aligner, called FAMOUS (Fast Approximate string
Matching using OptimUm search Schemes), is ongoing.
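The idea of partitioning a pattern and bounding errors per piece can be illustrated with the much simpler pigeonhole filter: with at most k mismatches, at least one of k+1 pieces must match exactly. The sketch below is not the MIP-derived search schemes of the abstract and uses plain substring search instead of an FM-index; the function names are hypothetical and chosen only for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def pigeonhole_search(text, pattern, k):
    """Return sorted start positions where pattern occurs in text with at
    most k mismatches (Hamming distance).

    Assumption: the pattern is split into k + 1 nearly equal pieces; any
    occurrence with <= k mismatches contains at least one exact piece.
    """
    m = len(pattern)
    pieces = k + 1
    if m < pieces:
        raise ValueError("pattern must have at least k + 1 characters")
    # Piece p covers pattern[bounds[p]:bounds[p + 1]].
    bounds = [p * m // pieces for p in range(pieces + 1)]
    hits = set()
    for p in range(pieces):
        lo, hi = bounds[p], bounds[p + 1]
        piece = pattern[lo:hi]
        start = text.find(piece)
        while start != -1:
            cand = start - lo  # implied start of the whole pattern
            if 0 <= cand <= len(text) - m:
                # In-text verification of the full candidate window.
                if hamming(text[cand:cand + m], pattern) <= k:
                    hits.add(cand)
            start = text.find(piece, start + 1)
    return sorted(hits)


print(pigeonhole_search("ACGTACGTTACG", "ACGA", 1))  # [0, 4]
```

Each exact piece hit proposes one candidate window, which is then verified in the text. Search schemes generalize this by searching pieces in several orders with lower and upper error bounds, entirely inside the index.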
Do Read Errors Matter for Genome Assembly?
While most current high-throughput DNA sequencing technologies generate short
reads with low error rates, emerging sequencing technologies generate long
reads with high error rates. A basic question of interest is the tradeoff
between read length and error rate in terms of the information needed for the
perfect assembly of the genome. Using an adversarial erasure error model, we
make progress on this problem by establishing a critical read length, as a
function of the genome and the error rate, above which perfect assembly is
guaranteed. For several real genomes, including those from the GAGE dataset, we
verify that this critical read length is not significantly greater than the
read length required for perfect assembly from reads without errors. Comment: Submitted to ISIT 201
Behavioral Learning of Aircraft Landing Sequencing Using a Society of Probabilistic Finite State Machines
Air Traffic Control (ATC) is a complex, safety-critical environment. A tower
controller makes many decisions in real time to sequence aircraft.
While some optimization tools exist to help the controller in some airports,
even in these situations, the real sequence of the aircraft adopted by the
controller is significantly different from the one proposed by the optimization
algorithm. This is due to the very dynamic nature of the environment. The
objective of this paper is to test the hypothesis that one can learn from the
sequence adopted by the controller some strategies that can act as heuristics
in decision support tools for aircraft sequencing. This aim is tested in this
paper by attempting to learn sequences generated from a well-known sequencing
method that is being used in the real world. The approach relies on a genetic
algorithm (GA) to learn these sequences using a society of Probabilistic
Finite-state Machines (PFSMs). Each PFSM learns a different sub-space, thus
decomposing the learning problem into a group of agents that need to work
together to learn the overall problem. Three sequence metrics (Levenshtein,
Hamming and Position distances) are compared as fitness functions in the GA.
The results suggest that it is possible to learn the behavior of the
algorithm/heuristic that generated the original sequence from very limited
information.
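The three sequence metrics named above can be sketched as follows, for two landing orders over the same set of aircraft. The Levenshtein and Hamming definitions are standard. The abstract does not define the position distance, so the version here (sum of absolute displacements of each aircraft) is an assumption, and the aircraft identifiers are made up.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def hamming(a, b):
    """Number of slots holding different items; sequences must be equally long."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def position_distance(a, b):
    """Assumed definition: total absolute shift of each item between a and b.

    Requires a and b to be permutations of the same distinct items.
    """
    index_in_b = {item: i for i, item in enumerate(b)}
    return sum(abs(i - index_in_b[item]) for i, item in enumerate(a))


target = ["A1", "A2", "A3", "A4"]    # hypothetical reference landing order
learned = ["A2", "A3", "A4", "A1"]   # hypothetical learned landing order

print(levenshtein(learned, target))        # 2
print(hamming(learned, target))            # 4
print(position_distance(learned, target))  # 6
```

The example shows why the choice of fitness function matters: moving one aircraft from first to last is cheap under Levenshtein distance but maximally penalized under Hamming distance.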
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology. Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
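The coarse-then-fine idea behind searching over covering hyperspheres can be sketched as below. This is a toy illustration under stated assumptions, not the implementation of Ammolite, MICA or esFragBag: points are Euclidean tuples, the cover is built greedily, and the names `build_cover` and `search` are hypothetical.

```python
import math


def build_cover(points, rc):
    """Greedily cover the points with spheres of radius rc.

    Returns a list of (center, members) pairs; every point lies within rc
    of the center of the cluster it is assigned to.
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if math.dist(center, p) <= rc:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))  # p becomes a new center
    return clusters


def search(clusters, query, r, rc):
    """Return all points within distance r of query.

    Coarse step: keep only clusters whose center is within r + rc of the
    query. By the triangle inequality no true hit is discarded.
    Fine step: check the members of the surviving clusters directly.
    """
    hits = []
    for center, members in clusters:
        if math.dist(center, query) <= r + rc:
            hits.extend(m for m in members if math.dist(m, query) <= r)
    return hits


points = [(0, 0), (0.5, 0), (1, 1), (5, 5), (5.2, 5.1), (9, 0)]
clusters = build_cover(points, rc=1.0)
print(len(clusters))                                   # 4
print(search(clusters, (4.8, 5.0), r=0.5, rc=1.0))     # [(5, 5), (5.2, 5.1)]
```

The coarse step touches one center per cluster, so its cost grows with the number of covering spheres rather than with the number of points, which is the intuition behind scaling with metric entropy.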