49 research outputs found

    Efficient String Matching on Coded Texts

    Get PDF
    The so-called "four Russians technique" is often used to speed up algorithms by encoding several data items in a single memory cell. Given a sequence of n symbols over a constant-size alphabet, one can encode the sequence into O(n/λ) memory cells in O(log λ) time using n/log λ processors. This paper presents an efficient CRCW-PRAM string-matching algorithm for coded texts that takes O(log log(m/λ)) time while performing only O(n/λ) operations, an improvement by a factor of λ = O(log n) on the number of operations used in previous algorithms. Using this string-matching algorithm, one can test whether a string is square-free and find all palindromes in a string in O(log log n) time using n/log log n processors.
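
    A minimal sequential Python sketch of the packing step the abstract relies on (several symbols per memory cell); the function name, the bit layout, and the choice of λ below are illustrative assumptions, not the paper's parallel CRCW-PRAM construction.

        # Pack a text over a small alphabet into integers, lam symbols per cell,
        # so that a single cell comparison checks lam symbols at once.
        def pack(text, alphabet, lam):
            bits = max(1, (len(alphabet) - 1).bit_length())   # bits per symbol
            index = {c: i for i, c in enumerate(alphabet)}
            cells = []
            for start in range(0, len(text), lam):
                cell = 0
                for c in text[start:start + lam]:
                    cell = (cell << bits) | index[c]
                cells.append(cell)
            return cells

        # Two equally aligned coded texts are now compared cell by cell,
        # lam symbols per comparison, touching O(n/lam) cells instead of n positions.
        a = pack("acgtacgtacgt", "acgt", lam=4)
        b = pack("acgtacgtacga", "acgt", lam=4)
        print([x == y for x, y in zip(a, b)])   # [True, True, False]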

    Efficient Computation of Maximal Anti-Exponent in Palindrome-Free Strings

    Get PDF
    A palindrome is a string x = a_1 · · · a_n that is equal to its reversal a_n · · · a_1. We consider gapped palindromes, which are strings of the form uvu^R, where u and v are strings, |v| ≥ 2, and u^R is the reversal of u. Replicating the standard notion of string exponent, we define the anti-exponent of a gapped palindrome uvu^R as the quotient of |uvu^R| by |uv|. To compute the maximal anti-exponent of factors in a palindrome-free string efficiently, we apply techniques based on the suffix automaton and the reversed Lempel-Ziv factorisation. Our algorithm runs in O(n) time on a fixed-size alphabet, or O(n log σ) time on a large alphabet, which dramatically outperforms the naive cubic-time solution.
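
    To make the quantity being maximised concrete, here is a naive Python baseline computed directly from the definition above; it is cubic and serves only as a contrast to the paper's O(n) / O(n log σ) algorithm, and the function name and example string are assumptions.

        # Largest |u v u^R| / |u v| over factors of s of the form u v u^R with |v| >= 2.
        def max_anti_exponent(s):
            best = 0.0
            n = len(s)
            for i in range(n):
                for j in range(i + 4, n + 1):          # factor w = s[i:j], |w| >= 4
                    w = s[i:j]
                    limit = (len(w) - 2) // 2          # |u| <= (|w| - |v|) / 2 with |v| >= 2
                    u_len = 0
                    for k in range(1, limit + 1):      # longest u with w = u v reverse(u)
                        if w[:k] == w[-k:][::-1]:
                            u_len = k
                    if u_len > 0:
                        best = max(best, len(w) / (len(w) - u_len))
            return best

        print(max_anti_exponent("abxyzba"))   # u = "ab", v = "xyz": 7/5 = 1.4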

    Fast Parallel Lyndon Factorization With Applications

    Get PDF

    Pattern Matching with Variables: Fast Algorithms and New Hardness Results

    Get PDF
    A pattern (i.e., a string of variables and terminals) maps to a word if the word is obtained by uniformly replacing the variables by terminal words; deciding this is NP-complete. We present efficient algorithms that solve this problem for restricted classes of patterns. (The computational model we use is the standard unit-cost RAM with logarithmic word size, and all logarithms appearing in our time complexity evaluations are in base 2.) Furthermore, we show that it is NP-complete to decide, for a given number k and a word w, whether w can be factorised into k distinct factors; this shows that the injective version (i.e., different variables are replaced by different words) of the above matching problem is NP-complete even for very restricted cases.
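
    A brute-force Python sketch of the (non-injective) matching problem defined above: it searches over substitutions of variables (written as uppercase letters) by non-empty terminal words, so it is exponential in general, consistent with NP-completeness; the naming convention and the non-erasing restriction are assumptions made for this example.

        # Does some substitution of the variables turn `pattern` into `w`?
        def matches(pattern, w, subst=None, i=0, j=0):
            subst = {} if subst is None else subst
            if i == len(pattern):
                return j == len(w)
            sym = pattern[i]
            if sym.islower():                     # terminal: must match literally
                return j < len(w) and w[j] == sym and matches(pattern, w, subst, i + 1, j + 1)
            if sym in subst:                      # variable already bound: reuse its word
                val = subst[sym]
                return w.startswith(val, j) and matches(pattern, w, subst, i + 1, j + len(val))
            for end in range(j + 1, len(w) + 1):  # try every non-empty binding
                subst[sym] = w[j:end]
                if matches(pattern, w, subst, i + 1, end):
                    return True
                del subst[sym]
            return False

        print(matches("XaX", "abaab"))   # True: X -> "ab" gives "ab" + "a" + "ab"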

    An Optimal O(log log n) Time Parallel Algorithm for Detecting all Squares in a String

    Full text link

    Parallel and scalable combinatorial string algorithms on distributed memory systems

    Get PDF
    Methods for processing and analyzing DNA and genomic data are built upon combinatorial graph and string algorithms. The advent of high-throughput DNA sequencing is enabling the generation of billions of reads per experiment. Classical, sequential algorithms can no longer cope with these growing data sizes, which for the last ten years have greatly outpaced advances in processor speeds. Processing and analyzing state-of-the-art genomic data sets require the design of scalable and efficient parallel algorithms and the use of large computing clusters. Suffix arrays and suffix trees are fundamental string data structures that lie at the foundation of many string algorithms, with important applications in text processing, information retrieval, and computational biology. Consequently, the parallel construction of these indices is an actively studied problem. However, prior approaches lack good worst-case run-time guarantees and exhibit poor scaling and overall performance. In this work, we present distributed-memory parallel algorithms for indexing large datasets, including algorithms for the distributed construction of suffix arrays, LCP arrays, and suffix trees. We formulate a generalized version of the All-Nearest-Smaller-Values problem, provide an optimal distributed solution, and apply it to the distributed construction of suffix trees, yielding a work-optimal parallel algorithm. Our algorithms for distributed suffix array and suffix tree construction improve on the state of the art by simultaneously improving worst-case run-time bounds and achieving superior practical performance. Next, we introduce a novel distributed string index, the Distributed Enhanced Suffix Array (DESA), which builds on the suffix and LCP arrays together with additional distributed data structures. The DESA is designed to allow efficient pattern search queries in distributed memory while requiring at most O(n/p) memory per process. We present efficient distributed-memory parallel algorithms for querying this distributed index, as well as for its efficient construction. Finally, we present our work on distributed-memory algorithms for clustering de Bruijn graphs and its application to a grand-challenge metagenomic dataset. (Ph.D. thesis)
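
    The All-Nearest-Smaller-Values primitive mentioned above is shown here only in its textbook sequential form, a stack-based O(n) scan in Python; the thesis's generalisation and its distributed-memory solution are not reproduced, and the function name is an assumption.

        # For each position i, the nearest position to its left holding a smaller value.
        def nearest_smaller_to_left(a):
            result = [-1] * len(a)      # -1: no smaller value exists to the left
            stack = []                  # indices whose values are strictly increasing
            for i, x in enumerate(a):
                while stack and a[stack[-1]] >= x:
                    stack.pop()
                if stack:
                    result[i] = stack[-1]
                stack.append(i)
            return result

        print(nearest_smaller_to_left([2, 1, 4, 4, 0, 3]))   # [-1, -1, 1, 1, -1, 4]

    Applied to an LCP array, nearest-smaller values mark where lcp-intervals open and close, which is one reason this primitive shows up in suffix tree construction.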

    Discovery of Unconventional Patterns for Sequence Analysis: Theory and Algorithms

    Get PDF
    The biology community is collecting a large amount of raw data, such as genome sequences of organisms, microarray data, and interaction data such as gene-protein and protein-protein interactions. This amount is rapidly increasing, and the process of understanding the data is lagging behind the process of acquiring it. An inevitable first step towards making sense of the data is to study their regularities, focusing on the non-random structures that appear surprisingly often in the input sequences: patterns. In this thesis we discuss three incarnations of the pattern-discovery task, exploring three types of patterns that can model different regularities of the input dataset. While mask patterns have been designed to model short repeated biological sequences showing high conservation of their content at some specific positions, permutation patterns have been designed to detect repeated patterns whose parts maintain their physical adjacency but not their ordering across occurrences. Transposons, instead, model mobile sequences in the input dataset, which can be discovered by comparing different copies of the same input string and detecting large insertions and deletions in their alignment.
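
    As an illustration only, the sketch below takes one simple reading of a mask pattern, a fixed-length template with conserved positions and don't-care positions written as '.', and scans a string for its occurrences in Python; this naming and semantics are assumptions, not the thesis's formal definitions.

        # Naive scan for occurrences of a template with conserved and don't-care positions.
        def occurrences(mask, text):
            m = len(mask)
            return [i for i in range(len(text) - m + 1)
                    if all(c == '.' or c == text[i + k] for k, c in enumerate(mask))]

        print(occurrences("a.g", "aagacgatg"))   # [0, 3, 6]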