Linear-Time Algorithms for Computing Maximum-Density Sequence Segments with Bioinformatics Applications
We study an abstract optimization problem arising from biomolecular sequence
analysis. For a sequence A of pairs (a_i, w_i), i = 1, ..., n, with w_i > 0, a
segment A(i,j) is a consecutive subsequence of A starting with index i and
ending with index j. The width of A(i,j) is w(i,j) = sum_{i <= k <= j} w_k, and
the density is (sum_{i<= k <= j} a_k)/ w(i,j). The maximum-density segment
problem takes A and two values L and U as input and asks for a segment of A
with the largest possible density among those of width at least L and at most
U. When U is unbounded, we provide a relatively simple, O(n)-time algorithm,
improving upon the O(n \log L)-time algorithm by Lin, Jiang and Chao. When both
L and U are specified, there are no previous nontrivial results. We solve the
problem in O(n) time if w_i=1 for all i, and more generally in
O(n+n\log(U-L+1)) time when w_i>=1 for all i.
Comment: 23 pages, 13 figures. A significant portion of these results appeared
under the title, "Fast Algorithms for Finding Maximum-Density Segments of a
Sequence with Applications to Bioinformatics," in Proceedings of the Second
Workshop on Algorithms in Bioinformatics (WABI), volume 2452 of Lecture Notes
in Computer Science (Springer-Verlag, Berlin), R. Guigo and D. Gusfield
editors, 2002, pp. 157--17
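To make the problem statement above concrete, here is a minimal brute-force sketch of the maximum-density segment problem. It is a quadratic-time reference for checking small cases, not the linear-time algorithm the abstract describes. The function name `max_density_segment_bruteforce` and the convention `U=None` for an unbounded upper width are assumptions made for this illustration, and the inputs are assumed to be integers so that densities can be compared exactly.

```python
from fractions import Fraction
from itertools import accumulate


def max_density_segment_bruteforce(pairs, L, U=None):
    """Return (density, i, j) for a densest segment A(i, j) with L <= w(i, j) <= U.

    pairs is a list of integer pairs (a_k, w_k) with w_k > 0. Indices i and j
    are 1-based and inclusive, as in the abstract. U=None means the width is
    unbounded above. Returns None if no segment satisfies the width bounds.
    """
    # Prefix sums give any segment's sum and width in O(1).
    a_prefix = [0] + list(accumulate(a for a, _ in pairs))
    w_prefix = [0] + list(accumulate(w for _, w in pairs))
    n = len(pairs)
    best = None
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            width = w_prefix[j] - w_prefix[i - 1]
            if width < L:
                continue
            if U is not None and width > U:
                break  # widths only grow with j because every w_k > 0
            density = Fraction(a_prefix[j] - a_prefix[i - 1], width)
            if best is None or density > best[0]:
                best = (density, i, j)
    return best
```

For example, with unit widths `[(1, 1), (4, 1), (-2, 1), (5, 1)]`, `L = 2` and `U = 3`, the densest admissible segment is A(1,2) with density 5/2.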
Algorithms for the Problems of Length-Constrained Heaviest Segments
We present algorithms for length-constrained maximum sum segment and maximum
density segment problems, in particular, and the problem of finding
length-constrained heaviest segments, in general, for a sequence of real
numbers. Given a sequence of n real numbers and two real parameters L and U (L
<= U), the maximum sum segment problem is to find a consecutive subsequence,
called a segment, of length at least L and at most U such that the sum of the
numbers in the subsequence is maximum. The maximum density segment problem is
to find a segment of length at least L and at most U such that the density of
the numbers in the subsequence is the maximum. For the first problem with
non-uniform width there is an algorithm with time and space complexities in
O(n). We present an algorithm with time complexity in O(n) and space complexity
in O(U). For the second problem with non-uniform width there is a combinatorial
solution with time complexity in O(n) and space complexity in O(U). We present
a simple geometric algorithm with the same time and space complexities.
We extend our algorithms to respectively solve the length-constrained k
maximum sum segments problem in O(n+k) time and O(max{U, k}) space, and the
length-constrained k maximum density segments problem in O(n min{k, U-L})
time and O(U+k) space. We present extensions of our algorithms to find all the
length-constrained segments having a user-specified sum and density in O(n+m) and
O(n log(U-L)+m) time respectively, where m is the size of the output.
Previously, no nontrivial algorithms were known for these
problems. We indicate how our algorithms extend to higher dimensions.
All the algorithms can be extended in a straightforward way to solve the
problems with non-uniform width and non-uniform weight.
Comment: 21 pages, 12 figure
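The length-constrained maximum sum segment problem with uniform width can be sketched with a standard technique: prefix sums plus a sliding-window minimum kept in a monotonic deque. This is a well-known O(n)-time approach and is not claimed to be the authors' algorithm. In particular it stores the full prefix array, so it uses O(n) space rather than the O(U) space stated in the abstract. The name `max_sum_segment` and the assumption that L and U are integers are choices made for this illustration.

```python
from collections import deque
from itertools import accumulate


def max_sum_segment(values, L, U):
    """Return (total, i, j) maximizing the sum of values[i..j] with L <= j-i+1 <= U.

    Indices are 1-based and inclusive. Returns None if len(values) < L.
    """
    if not 1 <= L <= U:
        raise ValueError("need 1 <= L <= U")
    prefix = [0] + list(accumulate(values))
    n = len(values)
    # Start offsets s (segment covers s+1..j), kept so that prefix[s] increases
    # from front to back; the front is the minimum over the current window.
    candidates = deque()
    best = None
    for j in range(L, n + 1):
        s = j - L  # newest start offset that keeps the length at least L
        while candidates and prefix[candidates[-1]] >= prefix[s]:
            candidates.pop()
        candidates.append(s)
        # Discard start offsets that would make the segment longer than U.
        while candidates[0] < j - U:
            candidates.popleft()
        total = prefix[j] - prefix[candidates[0]]
        if best is None or total > best[0]:
            best = (total, candidates[0] + 1, j)
    return best
```

Each offset enters and leaves the deque at most once, which is where the linear running time comes from.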
Locating regions in a sequence under density constraints
Several biological problems require the identification of regions in a
sequence where some feature occurs within a target density range: examples
include the location of GC-rich regions, identification of CpG islands, and
sequence matching. Mathematically, this corresponds to searching a string of 0s
and 1s for a substring whose relative proportion of 1s lies between given lower
and upper bounds. We consider the algorithmic problem of locating the longest
such substring, as well as other related problems (such as finding the shortest
substring or a maximal set of disjoint substrings). For locating the longest
such substring, we develop an algorithm that runs in O(n) time, improving upon
the previous best-known O(n log n) result. For the related problems we develop
O(n log log n) algorithms, again improving upon the best-known O(n log n)
results. Practical testing verifies that our new algorithms enjoy significantly
smaller time and memory footprints, and can process sequences that are orders
of magnitude longer as a result.
Comment: 17 pages, 8 figures; v2: minor revisions, additional explanations; to
appear in SIAM Journal on Computin
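The "longest substring within a density range" problem described above can be stated precisely with a small quadratic-time reference sketch. It is only meant for checking small inputs and is not the O(n) algorithm from the abstract. The name `longest_in_density_range`, the string-of-characters input, and the use of exact `Fraction` bounds are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import accumulate


def longest_in_density_range(bits, lo, hi):
    """Return (start, length) of a longest substring whose share of 1s is in [lo, hi].

    bits is a string of '0' and '1' characters; start is 0-based.
    Returns None if no non-empty substring qualifies.
    """
    ones = [0] + list(accumulate(int(b) for b in bits))
    n = len(bits)
    # Try lengths from longest to shortest, so the first hit is a longest one.
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            count = ones[start + length] - ones[start]
            # Cross-multiplied form of lo <= count / length <= hi.
            if lo * length <= count <= hi * length:
                return (start, length)
    return None
```

For instance, in `"0001100"` with bounds 1/2 and 1, the longest qualifying substring is `"0011"`, starting at position 1.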
An Optimal Algorithm for the Maximum-Density Segment Problem
We address a fundamental problem arising from analysis of biomolecular
sequences. The input consists of two numbers w_min and w_max and a
sequence S of n number pairs (a_i, w_i) with w_i > 0. Let segment S(i,j)
of S be the consecutive subsequence of S between indices i and
j. The density of S(i,j) is
d(i,j) = (a_i + a_{i+1} + ... + a_j)/(w_i + w_{i+1} + ... + w_j). The maximum-density
segment problem is to find a maximum-density segment over all segments
S(i,j) with w_min <= w_i + w_{i+1} + ... + w_j <= w_max. The best
previously known algorithm for the problem, due to Goldwasser, Kao, and Lu,
runs in O(n \log(w_max - w_min + 1)) time. In the present paper, we solve
the problem in O(n) time. Our approach bypasses the complicated right-skew
decomposition, introduced by Lin, Jiang, and Chao. As a result, our algorithm
has the capability to process the input sequence in an online manner, which is
an important feature for dealing with genome-scale sequences. Moreover, for a
type of input sequences S representable in O(m) space, we show how to
exploit the sparsity of S and solve the maximum-density segment problem for
S in O(m) time.
Comment: 15 pages, 12 figures, an early version of this paper was presented at
11th Annual European Symposium on Algorithms (ESA 2003), Budapest, Hungary,
September 15-20, 200
Searching a bitstream in linear time for the longest substring of any given density
Given an arbitrary bitstream, we consider the problem of finding the longest
substring whose ratio of ones to zeroes equals a given value. The central
result of this paper is an algorithm that solves this problem in linear time.
The method involves (i) reformulating the problem as a constrained walk through
a sparse matrix, and then (ii) developing a data structure for this sparse
matrix that allows us to perform each step of the walk in amortised constant
time. We also give a linear time algorithm to find the longest substring whose
ratio of ones to zeroes is bounded below by a given value. Both problems have
practical relevance to cryptography and bioinformatics.
Comment: 22 pages, 19 figures; v2: minor edits and enhancement
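For the lower-bound variant mentioned above, one familiar linear-time route is to weight each bit and look for the longest window with a non-negative weighted sum, using prefix sums and a monotonic stack. This is a sketch of that standard technique, not the sparse-matrix walk the abstract describes. The name `longest_ratio_at_least` and the encoding of the bound as an integer ratio p/q are assumptions made for this illustration. The test is written as `ones * q >= zeros * p`, so substrings with no zeroes are treated as satisfying the bound.

```python
def longest_ratio_at_least(bits, p, q):
    """Return (start, length) of a longest substring with ones * q >= zeros * p.

    bits is a string of '0' and '1'; p and q are positive integers encoding
    the bound ones/zeros >= p/q. start is 0-based. Returns None if no
    non-empty substring qualifies.
    """
    # Weight each 1 as +q and each 0 as -p; a window qualifies exactly when
    # its weighted sum is non-negative, i.e. prefix[j] >= prefix[i].
    prefix = [0]
    for b in bits:
        prefix.append(prefix[-1] + (q if b == "1" else -p))
    # Candidate left ends: positions where the prefix sum reaches a new
    # strict minimum. Any other left end is dominated by an earlier one.
    stack = []
    for i, value in enumerate(prefix):
        if not stack or value < prefix[stack[-1]]:
            stack.append(i)
    best = None
    # Scan right ends from the far right; the first right end that matches a
    # candidate left end gives that left end its longest window.
    for j in range(len(prefix) - 1, 0, -1):
        while stack and prefix[stack[-1]] <= prefix[j]:
            i = stack.pop()
            if j > i and (best is None or j - i > best[1]):
                best = (i, j - i)
    return best
```

Every prefix position is pushed and popped at most once, so the whole scan takes time linear in the length of `bits`.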
ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data
There are many different ways in which change point analysis can be
performed, from purely parametric methods to those that are distribution free.
The ecp package is designed to perform multiple change point analysis while
making as few assumptions as possible. While many other change point methods
are applicable only for univariate data, this R package is suitable for both
univariate and multivariate observations. Estimation can be based upon either a
hierarchical divisive or agglomerative algorithm. Divisive estimation
sequentially identifies change points via a bisection algorithm. The
agglomerative algorithm estimates change point locations by determining an
optimal segmentation. Both approaches are able to detect any type of
distributional change within the data. This provides an advantage over many
existing change point algorithms which are only able to detect changes within
the marginal distributions