14 research outputs found

    OPTYMALIZACJA PROBLEMU NAJWIĘKSZEJ PODTABLICY DLA SPECYFICZNYCH DANYCH (Optimization of the maximum subarray problem for specific data)

    The maximum subarray problem (MSP) is to find the maximum sum of a contiguous subarray of an array. This paper describes an optimization of Kadane's algorithm (the state of the art) for specific data (continuous sequences of zeros or negative real numbers). When the data are unfavourable, the modification causes an insignificant performance loss (less than a 1% decrease in performance). The modification does not improve the time complexity but reduces the number of elementary operations. Various experimental data sets have been used to evaluate the possible improvement in time efficiency. For the most favourable data sets, an increase in efficiency of 25% can be achieved.
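
    For orientation, the baseline that this abstract refers to can be sketched as follows. This is a minimal version of the standard Kadane's algorithm, not the paper's modified variant (the abstract does not describe the modification in enough detail to reproduce it). The function name `max_subarray_sum` is an illustrative assumption.

```python
from typing import Sequence


def max_subarray_sum(values: Sequence[float]) -> float:
    """Standard Kadane's algorithm (baseline only, not the paper's variant).

    Returns the largest sum over all non-empty contiguous subarrays.
    """
    if not values:
        raise ValueError("values must be non-empty")
    # `current` is the best sum of a subarray ending at the current element.
    best = current = values[0]
    for x in values[1:]:
        # Either extend the previous subarray or start a new one at x.
        current = max(x, current + x)
        best = max(best, current)
    return best
```

    The loop performs a constant number of operations per element, so the running time is linear; the optimization described in the abstract targets the constant factor, not this bound.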

    Locating regions in a sequence under density constraints

    Several biological problems require the identification of regions in a sequence where some feature occurs within a target density range: examples include the location of GC-rich regions, identification of CpG islands, and sequence matching. Mathematically, this corresponds to searching a string of 0s and 1s for a substring whose relative proportion of 1s lies between given lower and upper bounds. We consider the algorithmic problem of locating the longest such substring, as well as other related problems (such as finding the shortest substring or a maximal set of disjoint substrings). For locating the longest such substring, we develop an algorithm that runs in O(n) time, improving upon the previous best-known O(n log n) result. For the related problems we develop O(n log log n) algorithms, again improving upon the best-known O(n log n) results. Practical testing verifies that our new algorithms enjoy significantly smaller time and memory footprints, and can process sequences that are orders of magnitude longer as a result.
    Comment: 17 pages, 8 figures; v2: minor revisions, additional explanations; to appear in SIAM Journal on Computing
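
    To make the problem statement concrete, here is a deliberately naive quadratic-time reference sketch. It is not the O(n) algorithm from the paper; it only pins down what "longest substring whose proportion of 1s lies within bounds" means. The function name `longest_density_substring` and the half-open index convention are assumptions made for illustration.

```python
from itertools import accumulate
from typing import Optional, Tuple


def longest_density_substring(
    bits: str, lo: float, hi: float
) -> Optional[Tuple[int, int]]:
    """Brute-force reference (O(n^2)), not the paper's linear-time method.

    `bits` is assumed to contain only '0' and '1'. Returns half-open
    indices (start, end) of a longest non-empty substring whose fraction
    of '1' characters lies in [lo, hi], or None if none exists.
    """
    # prefix[k] = number of '1' characters among the first k characters.
    prefix = [0] + list(accumulate(int(c) for c in bits))
    n = len(bits)
    # Try lengths from longest to shortest; the first hit is a longest one.
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            ones = prefix[start + length] - prefix[start]
            # Compare counts rather than ratios to avoid a division.
            if lo * length <= ones <= hi * length:
                return start, start + length
    return None
```

    A slow reference like this is mainly useful for checking a faster implementation on small inputs.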

    An Optimal Algorithm for the Maximum-Density Segment Problem

    We address a fundamental problem arising from analysis of biomolecular sequences. The input consists of two numbers $w_{\min}$ and $w_{\max}$ and a sequence $S$ of $n$ number pairs $(a_i, w_i)$ with $w_i > 0$. Let segment $S(i,j)$ of $S$ be the consecutive subsequence of $S$ between indices $i$ and $j$. The density of $S(i,j)$ is $d(i,j) = (a_i + a_{i+1} + \cdots + a_j)/(w_i + w_{i+1} + \cdots + w_j)$. The maximum-density segment problem is to find a maximum-density segment over all segments $S(i,j)$ with $w_{\min} \leq w_i + w_{i+1} + \cdots + w_j \leq w_{\max}$. The best previously known algorithm for the problem, due to Goldwasser, Kao, and Lu, runs in $O(n\log(w_{\max}-w_{\min}+1))$ time. In the present paper, we solve the problem in $O(n)$ time. Our approach bypasses the complicated right-skew decomposition, introduced by Lin, Jiang, and Chao. As a result, our algorithm has the capability to process the input sequence in an online manner, which is an important feature for dealing with genome-scale sequences. Moreover, for a type of input sequences $S$ representable in $O(m)$ space, we show how to exploit the sparsity of $S$ and solve the maximum-density segment problem for $S$ in $O(m)$ time.
    Comment: 15 pages, 12 figures; an early version of this paper was presented at the 11th Annual European Symposium on Algorithms (ESA 2003), Budapest, Hungary, September 15-20, 2003

    Linear-Time Algorithms for Computing Maximum-Density Sequence Segments with Bioinformatics Applications

    We study an abstract optimization problem arising from biomolecular sequence analysis. For a sequence $A$ of pairs $(a_i, w_i)$ for $i = 1, \ldots, n$ and $w_i > 0$, a segment $A(i,j)$ is a consecutive subsequence of $A$ starting with index $i$ and ending with index $j$. The width of $A(i,j)$ is $w(i,j) = \sum_{i \leq k \leq j} w_k$, and the density is $(\sum_{i \leq k \leq j} a_k)/w(i,j)$. The maximum-density segment problem takes $A$ and two values $L$ and $U$ as input and asks for a segment of $A$ with the largest possible density among those of width at least $L$ and at most $U$. When $U$ is unbounded, we provide a relatively simple, $O(n)$-time algorithm, improving upon the $O(n \log L)$-time algorithm by Lin, Jiang and Chao. When both $L$ and $U$ are specified, there are no previous nontrivial results. We solve the problem in $O(n)$ time if $w_i = 1$ for all $i$, and more generally in $O(n + n\log(U-L+1))$ time when $w_i \geq 1$ for all $i$.
    Comment: 23 pages, 13 figures. A significant portion of these results appeared under the title, "Fast Algorithms for Finding Maximum-Density Segments of a Sequence with Applications to Bioinformatics," in Proceedings of the Second Workshop on Algorithms in Bioinformatics (WABI), volume 2452 of Lecture Notes in Computer Science (Springer-Verlag, Berlin), R. Guigo and D. Gusfield editors, 2002, pp. 157--17
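
    The definitions of width and density above can be illustrated with a small brute-force sketch. This is a quadratic-time reference under stated assumptions, not the linear-time algorithm the abstract announces: the name `max_density_segment` is hypothetical, indices are 0-based and inclusive (the abstract uses 1-based indices), and ties are resolved in favour of the first segment found.

```python
from itertools import accumulate
from typing import Optional, Sequence, Tuple


def max_density_segment(
    a: Sequence[float], w: Sequence[float], L: float, U: float
) -> Optional[Tuple[float, int, int]]:
    """Brute-force reference (O(n^2)) for the maximum-density segment problem.

    Returns (density, i, j) for a segment a[i..j] (0-based, inclusive)
    whose width sum(w[i..j]) lies in [L, U] and whose density
    sum(a[i..j]) / sum(w[i..j]) is maximal, or None if no segment fits.
    All widths w[k] are assumed to be strictly positive.
    """
    # Prefix sums let each segment's value and width be read in O(1).
    pa = [0.0] + list(accumulate(a))
    pw = [0.0] + list(accumulate(w))
    best: Optional[Tuple[float, int, int]] = None
    for i in range(len(a)):
        for j in range(i, len(a)):
            width = pw[j + 1] - pw[i]
            if width > U:
                break  # widths only grow with j because every w[k] > 0
            if width >= L:
                density = (pa[j + 1] - pa[i]) / width
                if best is None or density > best[0]:
                    best = (density, i, j)
    return best
```

    For example, with unit widths, values `[1, 5, 2, 8, 1]`, `L = 2` and `U = 3`, the densest admissible segment is `[5, 2, 8]` with density 5.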

    A method for finding single-nucleotide polymorphisms with allele frequencies in sequences of deep coverage

    BACKGROUND: The allele frequencies of single-nucleotide polymorphisms (SNPs) are needed to select an optimal subset of common SNPs for use in association studies. Sequence-based methods for finding SNPs with allele frequencies may need to handle thousands of sequences from the same genome location (sequences of deep coverage). RESULTS: We describe a computational method for finding common SNPs with allele frequencies in single-pass sequences of deep coverage. The method enhances a widely used program named PolyBayes in several aspects. We present results from our method and PolyBayes on eighteen data sets of human expressed sequence tags (ESTs) with deep coverage. The results indicate that our method used almost all single-pass sequences in computation of the allele frequencies of SNPs. CONCLUSION: The new method is able to handle single-pass sequences of deep coverage efficiently. Our work shows that it is possible to analyze sequences of deep coverage by using pairwise alignments of the sequences with the finished genome sequence, instead of multiple sequence alignments

    A comparative study of sequence analysis tools in computational biology

    A biomolecular object, such as a deoxyribonucleic acid (DNA), a ribonucleic acid (RNA) or a protein molecule, is made up of a long chain of subunits. A protein is represented as a sequence made from 20 different amino acids, each represented as a letter. There are a vast number of ways in which similar structural domains can be generated in proteins by different amino acid sequences. By contrast, the structure of DNA, made up of only four different nucleotide building blocks that occur in two pairs, is relatively simple, regular, and predictable. Biomolecular sequence alignment/string search is the most important and challenging task in many areas of science and information processing. It involves identifying one-to-one correspondences between subunits of different sequences. The efficiency of an algorithm or tool depends on several important factors, including scoring systems, alignment statistics, database redundancy, and sequence repetitiveness. Sequence motifs are derived from multiple alignments and can be used to examine individual sequences or an entire database for subtle patterns. With motifs, it is sometimes possible to detect distant relationships that may not be demonstrable based on comparisons of primary sequences alone. A more comprehensive approach to efficient string search is to build a small, representative set of motifs and use it as a screening database with automatic masking of matching query subsequences. This technology is still under development, but recent studies indicate that a representative set of only 1,000 - 3,000 sequences may suffice and that such a database can be searched in seconds

    Fast and Space-Efficient Location of Heavy or Dense Segments in Run-Length Encoded Sequences

    This paper considers several variations of an optimization problem with potential applications in such areas as biomolecular sequence analysis and image processing. Given a sequence of items, each with a weight and a length, the goal is to find a subsequence of consecutive items of optimal value, where value is either total weight or total weight divided by total length. There may also be a specified lower and/or upper bound on the acceptable length of subsequences. This paper shows that all the variations of the problem are solvable in linear time and space even with non-uniform item lengths and divisible items, implying that run-length encoded sequences can be handled in time and space linear in the number of runs. Furthermore, some problem variations can be solved in constant space. Also, these time and space bounds suffice for certain problem variations in which we call for reporting of many “good” subsequences
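
    One simple instance of the "time linear in the number of runs" idea can be sketched for the unconstrained heaviest-segment case only. This is Kadane's algorithm applied run by run, swapped in as an illustration; it is not the paper's method and does not handle length bounds or density. The name `max_sum_rle` and the `(value, count)` run representation are assumptions.

```python
from typing import Sequence, Tuple


def max_sum_rle(runs: Sequence[Tuple[float, int]]) -> float:
    """Maximum non-empty contiguous sum of a run-length encoded sequence.

    Each run is (value, count) with count >= 1 and stands for `count`
    consecutive copies of `value`. Runs in time linear in the number of
    runs, without expanding the sequence.
    """
    if not runs:
        raise ValueError("runs must be non-empty")
    max_value = max(value for value, _ in runs)
    if max_value <= 0:
        # With no positive element, the best non-empty segment is a
        # single copy of the largest element.
        return max_value
    # Otherwise an optimal segment can be aligned to run boundaries:
    # a positive run is always worth taking whole, and a non-positive
    # run is never worth taking partially at either end.
    best = current = 0
    for value, count in runs:
        current = max(0, current + value * count)
        best = max(best, current)
    return best
```

    For instance, the runs `[(2, 3), (-1, 4), (5, 1)]` encode `2, 2, 2, -1, -1, -1, -1, 5`, whose heaviest segment is the whole sequence with sum 7.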