
    Exact Online String Matching Bibliography

    In this short note we present a comprehensive bibliography for the online exact string matching problem. The problem consists in finding all occurrences of a given pattern in a text. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Since 1970 more than 120 string matching algorithms have been proposed. In this note we present a comprehensive list of (almost) all string matching algorithms. The list is updated to May 2016.
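As a baseline for the problem all of these algorithms improve on, a naive scanner that reports every (possibly overlapping) occurrence of the pattern can be sketched in a few lines of Python:

```python
def find_occurrences(pattern: str, text: str) -> list[int]:
    """Return the starting indices of all occurrences of pattern in text,
    including overlapping ones. O(n*m) worst case; the surveyed algorithms
    improve on this bound."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    return [i for i in range(m - n + 1) if text[i:i + n] == pattern]

print(find_occurrences("ana", "bananas"))  # overlapping matches: [1, 3]
```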

    Fast Algorithms for Exact String Matching

    Given a pattern string P of length n and a query string T of length m, where the characters of P and T are drawn from an alphabet of size Δ, the exact string matching problem consists of finding all occurrences of P in T. For this problem, we present algorithms that in O(nΔ²) time pre-process P to essentially identify sparse(P), a rarely occurring substring of P, and then use it to find occurrences of P in T efficiently. Our algorithms require a worst case search time of O(m), and expected search time of O(m / min(|sparse(P)|, Δ)), where |sparse(P)| is at least δ (the number of distinct characters in P), and for most pattern strings it is observed to be Ω(n^(1/2)).
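The paper's sparse(P) construction is more involved, but the underlying idea, anchoring the scan on a rarely occurring part of the pattern and verifying full matches only around it, can be sketched as follows (using the pattern's least frequent character as a crude stand-in for sparse(P); this simplification is an assumption, not the paper's algorithm):

```python
from collections import Counter

def rarest_char(pattern: str) -> str:
    # Pick the pattern character expected to occur least often, using the
    # pattern's own character frequencies as a stand-in for sparse(P).
    counts = Counter(pattern)
    return min(counts, key=counts.get)

def anchored_search(pattern: str, text: str) -> list[int]:
    """Scan for the rare anchor character only, then verify full matches
    around each anchor hit."""
    c = rarest_char(pattern)
    offset = pattern.index(c)
    hits = []
    i = text.find(c)
    while i != -1:
        start = i - offset
        if start >= 0 and text[start:start + len(pattern)] == pattern:
            hits.append(start)
        i = text.find(c, i + 1)
    return hits
```

The expected speedup comes from skipping verification at every position where the anchor character does not occur.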

    A Fast Heuristic for Exact String Matching

    Given a pattern string P of length n consisting of δ distinct characters and a query string T of length m, where the characters of P and T are drawn from an alphabet Σ of size Δ, the exact string matching problem consists of finding all occurrences of P in T. For this problem, we present a randomized heuristic that in O(nδ) time preprocesses P to identify sparse(P), a rarely occurring substring of P, and then uses it to find all occurrences of P in T efficiently. This heuristic has an expected search time of O(m / min(|sparse(P)|, Δ)), where |sparse(P)| is at least δ. We also show that for a pattern string P whose characters are chosen uniformly at random from an alphabet of size Δ, E[|sparse(P)|] is Ω(Δ log(2Δ / (2Δ − δ))).

    On the Average-case Complexity of Pattern Matching with Wildcards

    Pattern matching with wildcards is the problem of finding all factors of a text t of length n that match a pattern x of length m, where wildcards (characters that match everything) may be present. In this paper we present a number of fast average-case algorithms for pattern matching where wildcards are restricted to either the pattern or the text; however, the results are easily adapted to the case where wildcards are allowed in both. We analyse the average-case complexity of these algorithms and show the first non-trivial time bounds. These are the first results on the average-case complexity of pattern matching with wildcards which, as a by-product, provide the first provable separation in complexity between exact pattern matching and pattern matching with wildcards in the word RAM model.
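A straightforward quadratic matcher for the restricted case where wildcards appear only in the pattern looks like this (the single-character wildcard symbol `?` is an assumption for illustration; the paper's algorithms beat this O(nm) bound on average):

```python
def wildcard_matches(pattern: str, text: str, wildcard: str = "?") -> list[int]:
    """Find all factors of text matching pattern, where the wildcard
    character in the pattern matches any single text character.
    Naive O(n*m) verification at every alignment."""
    m, n = len(pattern), len(text)
    return [
        i for i in range(n - m + 1)
        if all(p == wildcard or p == text[i + j]
               for j, p in enumerate(pattern))
    ]

print(wildcard_matches("a?a", "abacada"))  # [0, 2, 4]
```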

    New algorithms for binary jumbled pattern matching

    Given a pattern P and a text T, both strings over a binary alphabet, the binary jumbled string matching problem consists in telling whether any permutation of P occurs in T. The indexed version of this problem, i.e., preprocessing a string to efficiently answer such permutation queries, is hard and has been studied in the last few years. Currently the best bounds for this problem are O(n²/log² n) (with O(n) space and O(1) query time) and O(r² log r) (with O(|L|) space and O(log |L|) query time), where r is the length of the run-length encoding of T and |L| = O(n) is the size of the index. In this paper we present new results for this problem. Our first result is an alternative construction of the index by Badkobeh et al. that obtains a trade-off between the space and the time complexity. It has O(r² log k + n/k) complexity to build the index, O(log k) query time, and uses O(n/k + |L|) space, where k is a parameter. The second result is an O(n² log² w / w) algorithm (with O(n) space and O(1) query time), based on word-level parallelism, where w is the word size in bits.
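The non-indexed version of binary jumbled matching reduces to a sliding-window count: some permutation of P occurs in T if and only if some window of T of length |P| contains exactly as many 1s as P does. A linear-time sketch of that reduction:

```python
def jumbled_match(pattern: str, text: str) -> bool:
    """Binary jumbled matching without an index: slide a window of
    length |pattern| over text, maintaining its count of 1s, and
    compare against the pattern's count of 1s. O(n) time."""
    m = len(pattern)
    if m > len(text):
        return False
    target = pattern.count("1")
    ones = text[:m].count("1")
    if ones == target:
        return True
    for i in range(m, len(text)):
        # Update the count incrementally: add the entering character,
        # drop the leaving one.
        ones += (text[i] == "1") - (text[i - m] == "1")
        if ones == target:
            return True
    return False
```

The indexed problem studied in the paper is harder precisely because it must answer such queries for every pattern length without rescanning T.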

    A Hybrid Parallel Implementation of the Aho-Corasick and Wu-Manber Algorithms Using NVIDIA CUDA and MPI Evaluated on a Biological Sequence Database

    Multiple matching algorithms are used to locate the occurrences of patterns from a finite pattern set in a large input string. Aho-Corasick and Wu-Manber, two of the most well-known algorithms for multiple matching, require increased computing power, particularly in cases where large-size datasets must be processed, as is common in computational biology applications. Over the past years, Graphics Processing Units (GPUs) have evolved into powerful parallel processors, outperforming Central Processing Units (CPUs) in scientific calculations. Moreover, multiple GPUs can be used in parallel, forming hybrid computer cluster configurations to achieve an even higher processing throughput. This paper evaluates the speedup of the parallel implementation of the Aho-Corasick and Wu-Manber algorithms on a hybrid GPU cluster when used to process a snapshot of the Expressed Sequence Tags of the human genome, for different problem parameters.
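For reference, a compact sequential sketch of the Aho-Corasick automaton (a trie over the patterns plus BFS-computed failure links); the paper's contribution is parallelising such matching across GPUs and cluster nodes, not the automaton itself:

```python
from collections import deque

def build_automaton(patterns):
    """Build the Aho-Corasick automaton: goto is a trie as a list of
    dicts, fail holds failure links, out holds patterns ending at
    each state (including those inherited via failure links)."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(p)
    # BFS from the root's children to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out

def ac_search(patterns, text):
    """Report (position, pattern) for every occurrence of every pattern,
    in a single pass over the text."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)
```

The single-pass structure is what makes the algorithm attractive for GPU work partitioning: the text can be split into overlapping chunks scanned independently.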

    New Error Tolerant Method to Search Long Repeats in Symbol Sequences

    A new method to identify all sufficiently long repeating substrings in one or several symbol sequences is proposed. The method is based on a specific gauge applied to symbol sequences that guarantees identification of the repeating substrings. It allows matched substrings to contain a given level of errors. The gauge is based on the development of a heavily sparse dictionary of repeats, thus drastically accelerating the search procedure. Some genomic applications illustrate the method. This paper is the extended and detailed version of the presentation at the Third International Conference on Algorithms for Computational Biology, held in Trujillo, Spain, June 21-22, 2016.

    Multiple pattern matching revisited

    We consider the classical exact multiple string matching problem. Our solution is based on q-grams combined with pattern superimposition, bit-parallelism and alphabet size reduction. We discuss the pros and cons of the various alternatives for how to achieve the best combination. Our method is closely related to previous work by Salmela et al. (2006). The experimental results show that our method performs well on different alphabet sizes and that it scales to large pattern sets.
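The q-gram filtering idea can be illustrated without the bit-parallelism and superimposition refinements: index each pattern by its leading q-gram, then verify full matches only at text positions whose q-gram is indexed (this simplified scheme is an illustration, not the paper's method, and assumes every pattern has length at least q):

```python
def qgram_filter_search(patterns, text, q=3):
    """Multiple exact matching via q-gram filtering: a dict maps each
    pattern's first q characters to the patterns starting with them;
    the scan verifies candidates only where the text q-gram is indexed."""
    index = {}
    for p in patterns:
        index.setdefault(p[:q], []).append(p)
    hits = []
    for i in range(len(text) - q + 1):
        for p in index.get(text[i:i + q], []):
            if text[i:i + len(p)] == p:
                hits.append((i, p))
    return hits
```

Most text positions hit an empty dict bucket and are skipped, which is where the filtering speedup comes from on larger alphabets.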

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods, in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions.
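The indexing (blocking) step of an ER workflow can be illustrated with a minimal sketch: group entity descriptions under a cheap key so that only records in the same block are ever compared. The blocking key used here is a toy assumption; real systems use far more robust keys and learned matchers:

```python
def block_by_key(records, key):
    """Blocking: partition entity descriptions by a cheap key function,
    so the quadratic comparison step runs only within each block."""
    blocks = {}
    for r in records:
        blocks.setdefault(key(r), []).append(r)
    return blocks

def candidate_pairs(blocks):
    """Yield every within-block pair; these are the only pairs the
    matching step needs to examine."""
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                yield group[i], group[j]
```

With a good key, the number of candidate pairs drops from quadratic in the dataset size to roughly linear, which is what makes ER feasible at Big Data scale.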

    Parallel decompression of gzip-compressed files and random access to DNA sequences

    Decompressing a file made by the gzip program at an arbitrary location is in principle impossible, due to the nature of the DEFLATE compression algorithm. Consequently, no existing program can take advantage of parallelism to rapidly decompress large gzip-compressed files. This is an unsatisfactory bottleneck, especially for the analysis of large sequencing data experiments. Here we propose a parallel algorithm and an implementation, pugz, that performs fast and exact decompression of any text file. We show that pugz is an order of magnitude faster than gunzip, and 5x faster than a highly-optimized sequential implementation (libdeflate). We also study the related problem of random access to compressed data. We give simple models and experimental results that shed light on the structure of gzip-compressed files containing DNA sequences. Preliminary results show that random access to sequences within a gzip-compressed FASTQ file is almost always feasible at low compression levels, yet is approximate at higher compression levels.