256,525 research outputs found
Linear-Time Algorithms for Computing Maximum-Density Sequence Segments with Bioinformatics Applications
We study an abstract optimization problem arising from biomolecular sequence
analysis. For a sequence A of pairs (a_i,w_i) for i = 1,..,n and w_i>0, a
segment A(i,j) is a consecutive subsequence of A starting with index i and
ending with index j. The width of A(i,j) is w(i,j) = sum_{i <= k <= j} w_k, and
the density is (sum_{i<= k <= j} a_k)/ w(i,j). The maximum-density segment
problem takes A and two values L and U as input and asks for a segment of A
with the largest possible density among those of width at least L and at most
U. When U is unbounded, we provide a relatively simple, O(n)-time algorithm,
improving upon the O(n \log L)-time algorithm by Lin, Jiang and Chao. When both
L and U are specified, there are no previous nontrivial results. We solve the
problem in O(n) time if w_i=1 for all i, and more generally in
O(n+n\log(U-L+1)) time when w_i>=1 for all i.Comment: 23 pages, 13 figures. A significant portion of these results appeared
under the title, "Fast Algorithms for Finding Maximum-Density Segments of a
Sequence with Applications to Bioinformatics," in Proceedings of the Second
Workshop on Algorithms in Bioinformatics (WABI), volume 2452 of Lecture Notes
in Computer Science (Springer-Verlag, Berlin), R. Guigo and D. Gusfield
editors, 2002, pp. 157--17
Streaming Similarity Self-Join
We introduce and study the problem of computing the similarity self-join in a
streaming context (SSSJ), where the input is an unbounded stream of items
arriving continuously. The goal is to find all pairs of items in the stream
whose similarity is greater than a given threshold. The simplest formulation of
the problem requires unbounded memory, and thus, it is intractable. To make the
problem feasible, we introduce the notion of time-dependent similarity: the
similarity of two items decreases with the difference in their arrival time. By
leveraging the properties of this time-dependent similarity function, we design
two algorithmic frameworks to solve the sssj problem. The first one, MiniBatch
(MB), uses existing index-based filtering techniques for the static version of
the problem, and combines them in a pipeline. The second framework, Streaming
(STR), adds time filtering to the existing indexes, and integrates new
time-based bounds deeply in the working of the algorithms. We also introduce a
new indexing technique (L2), which is based on an existing state-of-the-art
indexing technique (L2AP), but is optimized for the streaming case. Extensive
experiments show that the STR algorithm, when instantiated with the L2 index,
is the most scalable option across a wide array of datasets and parameters
- …