Pattern Matching and Consensus Problems on Weighted Sequences and Profiles
We study pattern matching problems on two major representations of uncertain
sequences used in molecular biology: weighted sequences (also known as position
weight matrices, PWM) and profiles (i.e., scoring matrices). In the simple
version, in which only the pattern or only the text is uncertain, we obtain
efficient algorithms with theoretically provable running times using a
variation of the lookahead scoring technique. We also consider a general
variant of the pattern matching problems in which both the pattern and the text
are uncertain. Central to our solution is a special case where the sequences
have equal length, called the consensus problem. We propose algorithms for the
consensus problem parameterized by the number of strings that match one of the
sequences. As our basic approach, a careful adaptation of the classic
meet-in-the-middle algorithm for the knapsack problem is used. On the lower
bound side, we prove that our dependence on the parameter is optimal up to
lower-order terms conditioned on the optimality of the original algorithm for
the knapsack problem. Comment: 22 pages
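As a point of reference for the meet-in-the-middle idea mentioned above, here is a minimal Python sketch of the classic textbook algorithm for 0/1 knapsack. It is not the paper's adaptation to the consensus problem; the function names (`subset_totals`, `knapsack_mitm`) are hypothetical, and non-negative integer weights and values are assumed.

```python
from bisect import bisect_right


def subset_totals(items):
    """Return (total_weight, total_value) for every subset of items."""
    totals = [(0, 0)]
    for weight, value in items:
        # Each existing subset spawns a copy that also contains this item.
        totals += [(w + weight, v + value) for w, v in totals]
    return totals


def knapsack_mitm(items, capacity):
    """Best total value of a subset whose weight is at most capacity.

    Assumes non-negative weights and values. Enumerates about 2^(n/2)
    subsets per half instead of 2^n subsets overall.
    """
    half = len(items) // 2
    left = subset_totals(items[:half])
    right = sorted(subset_totals(items[half:]))

    # best[i] = largest value among the i+1 lightest right-half subsets.
    weights = [w for w, _ in right]
    best = []
    running = 0
    for _, v in right:
        running = max(running, v)
        best.append(running)

    answer = 0
    for w, v in left:
        if w > capacity:
            continue
        # Heaviest right-half subset that still fits next to this left subset.
        idx = bisect_right(weights, capacity - w) - 1
        answer = max(answer, v + best[idx])
    return answer
```

Each half contributes roughly 2^(n/2) subsets, so the sort and the binary searches dominate the cost; this is the trade-off that the abstract's lower bound is conditioned on.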
Linear-time Computation of Minimal Absent Words Using Suffix Array
An absent word of a word y of length n is a word that does not occur in y. It
is a minimal absent word if all its proper factors occur in y. Minimal absent
words have been computed in genomes of organisms from all domains of life;
their computation provides a fast alternative for measuring approximation in
sequence comparison. There exists an O(n)-time and O(n)-space algorithm for
computing all minimal absent words on a fixed-sized alphabet based on the
construction of suffix automata (Crochemore et al., 1998). No implementation of
this algorithm is publicly available. There also exists an O(n^2)-time and
O(n)-space algorithm for the same problem based on the construction of suffix
arrays (Pinho et al., 2009). An implementation of this algorithm was also
provided by the authors and is currently the fastest available. In this
article, we bridge this unpleasant gap by presenting an O(n)-time and
O(n)-space algorithm for computing all minimal absent words based on the
construction of suffix arrays. Experimental results using real and synthetic
data show that the respective implementation outperforms the one by Pinho et
al.
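To make the definition of a minimal absent word concrete, the following brute-force Python sketch enumerates all of them for a short word. It is only an illustration of the definition, taking cubic time or worse, and is neither the suffix-array algorithm of the article nor either of the earlier algorithms; the function name and the explicit `alphabet` parameter are assumptions.

```python
def minimal_absent_words(y, alphabet):
    """Return the set of minimal absent words of y over the given alphabet.

    A word a+u+b (a, b letters) is a minimal absent word exactly when
    a+u and u+b both occur in y but a+u+b does not. A single letter is a
    minimal absent word when it does not occur in y at all.
    """
    n = len(y)
    # All factors (substrings) of y, including the empty word.
    factors = {y[i:j] for i in range(n + 1) for j in range(i, n + 1)}

    # Length-1 case: letters of the alphabet that never occur in y.
    maws = {a for a in alphabet if a not in factors}

    # Length >= 2 case: extend every factor u by one letter on each side.
    for u in factors:
        for a in alphabet:
            if a + u not in factors:
                continue
            for b in alphabet:
                if u + b in factors and a + u + b not in factors:
                    maws.add(a + u + b)
    return maws
```

For example, `minimal_absent_words("abaab", "ab")` yields the four words `aaa`, `aaba`, `bab` and `bb`.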
Direct laser printing of thin-film polyaniline devices
We report the fabrication of electrically functional polyaniline thin-film
microdevices. Polyaniline films were printed in the solid phase by Laser
Induced Forward Transfer directly between Au electrodes on a Si/SiO2 substrate.
To apply solid-phase deposition, aniline was in situ polymerized on quartz
substrates. Laser deposition preserves the morphology of the films and delivers
sharp features with controllable dimensions. The electrical characteristics of
printed polyaniline present ohmic behavior, allowing for electroactive
applications. Results on gas sensing of ammonia are presented. Comment: In Press
Reverse-Safe Data Structures for Text Indexing
We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method into data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model.
Linear-Time Superbubble Identification Algorithm for Genome Assembly
DNA sequencing is the process of determining the exact order of the
nucleotide bases of an individual's genome in order to catalogue sequence
variation and understand its biological implications. Whole-genome sequencing
techniques produce masses of data in the form of short sequences known as
reads. Assembling these reads into a whole genome constitutes a major
algorithmic challenge. Most assembly algorithms utilize de Bruijn graphs
constructed from reads for this purpose. A critical step of these algorithms is
to detect typical motif structures in the graph caused by sequencing errors and
genome repeats, and filter them out; one such complex subgraph class is a
so-called superbubble. In this paper, we propose an O(n+m)-time algorithm to
detect all superbubbles in a directed acyclic graph with n nodes and m
(directed) edges, improving the best-known O(m log m)-time algorithm by Sung et
al.
Towards Distance-Based Phylogenetic Inference in Average-Case Linear-Time
Computing genetic evolution distances among a set of taxa dominates the running time of many phylogenetic inference methods. Most genetic evolution distance definitions rely, even if indirectly, on computing the pairwise Hamming distance among sequences or profiles. We propose here an average-case linear-time algorithm to compute pairwise Hamming distances among a set of taxa under a given Hamming distance threshold. This article includes both a theoretical analysis and extensive experimental results concerning the proposed algorithm. We further show how this algorithm can be successfully integrated into a well-known phylogenetic inference method.
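The problem statement above (pairwise Hamming distances under a threshold) can be illustrated with a simple baseline in Python. This is a quadratic-in-the-number-of-taxa sketch with early termination, not the average-case linear-time algorithm the article proposes; the function names are hypothetical and equal-length sequences are assumed.

```python
from itertools import combinations


def bounded_hamming(s, t, threshold):
    """Return the Hamming distance of s and t if it is <= threshold, else None."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    mismatches = 0
    for a, b in zip(s, t):
        if a != b:
            mismatches += 1
            if mismatches > threshold:
                # Stop scanning as soon as the threshold is exceeded.
                return None
    return mismatches


def pairwise_within_threshold(seqs, threshold):
    """Map each index pair (i, j), i < j, to its distance when within threshold."""
    result = {}
    for (i, s), (j, t) in combinations(enumerate(seqs), 2):
        d = bounded_hamming(s, t, threshold)
        if d is not None:
            result[(i, j)] = d
    return result
```

The early exit bounds the work per pair by the position of the (threshold+1)-th mismatch, which is what makes small thresholds cheap on dissimilar sequences.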
Efficient Computation of Sequence Mappability
Sequence mappability is an important task in genome re-sequencing. In the
(k,m)-mappability problem, for a given sequence T of length n, our goal
is to compute a table whose i-th entry is the number of indices j ≠ i such
that the length-m substrings of T starting at positions i and j have at
most k mismatches. Previous works on this problem focused on heuristic
approaches to compute a rough approximation of the result or on the case of
k = 1. We present several efficient algorithms for the general case of the
problem. Our main result is an algorithm that works in
O(n min{m^k, log^{k+1} n}) time and O(n) space for k = O(1). It requires a
careful adaptation of the technique of Cole et al. [STOC 2004] to avoid
multiple counting of pairs of substrings. We also show O(n^2)-time algorithms
to compute all results for a fixed m and all k = 0, ..., m or a fixed k and
all m = k, ..., n-1. Finally we show that the (k,m)-mappability problem
cannot be solved in strongly subquadratic time for k, m = Θ(log n) unless the
Strong Exponential Time Hypothesis
fails. Comment: Accepted to SPIRE 201
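A direct reading of the mappability definition can be sketched in Python as follows, writing k for the allowed number of mismatches and m for the substring length (names assumed here), and using 0-based positions. This brute-force version takes time proportional to n^2 · m and only illustrates what the table contains; it is not one of the efficient algorithms of the paper.

```python
def mappability(text, k, m):
    """Return a table whose i-th entry counts the positions j != i such that
    text[i:i+m] and text[j:j+m] differ in at most k positions."""
    starts = range(len(text) - m + 1)  # valid start positions of length-m substrings
    table = []
    for i in starts:
        count = 0
        for j in starts:
            if j == i:
                continue
            # Hamming distance between the two length-m substrings.
            mismatches = sum(
                1 for a, b in zip(text[i:i + m], text[j:j + m]) if a != b
            )
            if mismatches <= k:
                count += 1
        table.append(count)
    return table
```

With `text = "abab"` and `m = 2`, the substrings are `ab`, `ba`, `ab`, so `mappability("abab", 0, 2)` returns `[1, 0, 1]`.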
Mode Coupling relaxation scenario in a confined glass former
Molecular dynamics simulations of a Lennard-Jones binary mixture confined in
a disordered array of soft spheres are presented. The single particle dynamical
behavior of the glass former is examined upon supercooling. Predictions of mode
coupling theory are satisfied by the confined liquid. Estimates of the
crossover temperature T_C are obtained by power-law fits to the diffusion
coefficients and relaxation times of the late α region. The exponent b
of the von Schweidler law is also evaluated. Similarly to the bulk, different
values of the exponent γ are extracted from the power-law fits to the
diffusion coefficients and relaxation times. Comment: 5 pages, 4 figures, changes in the text, accepted for publication in
Europhysics Letters