    Dictionary matching in a stream

    We consider the problem of dictionary matching in a stream. Given a set of strings, known as a dictionary, and a stream of characters arriving one at a time, the task is to report each time some string in our dictionary occurs in the stream. We present a randomised algorithm which takes O(log log(k + m)) time per arriving character and uses O(k log m) words of space, where k is the number of strings in the dictionary and m is the length of the longest string in the dictionary.

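    Rancang Bangun Aplikasi Kamus Fisika Dasar Menggunakan Algoritma String Matching Brute Force (Design and Development of a Basic Physics Dictionary Application Using the Brute Force String Matching Algorithm)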

    A dictionary is a reference book that lists words alphabetically together with their meanings. Dictionaries are needed in education to look up the meaning of an unfamiliar word. A physics dictionary contains many terms and explanations; in a mobile application, searching it takes a long time because the device cannot display all terms at once. To make finding a word easier, the dictionary application is designed around a string matching algorithm, that is, an algorithm for matching one text against another. The string matching algorithm used is the brute force algorithm.
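
The brute force matching named in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `brute_force_find` and `lookup`, the case-insensitive comparison, and the sample terms are all assumptions made for the example.

```python
def brute_force_find(text: str, pattern: str) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        # Compare character by character, stopping at the first mismatch.
        offset = 0
        while offset < m and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == m:
            hits.append(start)
    return hits


def lookup(terms: dict[str, str], query: str) -> list[str]:
    """Hypothetical helper: list the terms that contain the query (case-insensitive)."""
    q = query.lower()
    return sorted(term for term in terms if brute_force_find(term.lower(), q))


if __name__ == "__main__":
    # Made-up sample entries, only for demonstration.
    physics = {
        "Momentum": "Product of mass and velocity.",
        "Moment of inertia": "Resistance to angular acceleration.",
        "Force": "Interaction that changes motion.",
    }
    print(lookup(physics, "moment"))
```

Each candidate start position costs up to m comparisons, so the worst case is O(n·m) character comparisons for a text of length n and a pattern of length m.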

    Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

    We study the approximate string matching and regular expression matching problems for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that leads to algorithms improving the previously known complexities for both problems. In particular, we significantly improve the space bounds, which in practical applications are likely to be a bottleneck.

    A practical index for approximate dictionary matching with few mismatches

    Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times in the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting in q-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low.
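
The Dirichlet (pigeonhole) idea behind a split index can be illustrated for a single mismatch: if a word is cut into two pieces and the query differs from it in at most one position, at least one piece must match exactly. The sketch below is an assumption-laden illustration of that principle only. The class name `SplitIndex`, the midpoint split, and the hash-map layout are hypothetical, and none of the paper's C++ data compaction is reproduced.

```python
from collections import defaultdict


def hamming_at_most(a: str, b: str, k: int) -> bool:
    """True if a and b have equal length and differ in at most k positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False
    return True


class SplitIndex:
    """Index each word under its first half and its second half."""

    def __init__(self, words):
        self.by_prefix = defaultdict(list)
        self.by_suffix = defaultdict(list)
        for w in set(words):
            half = len(w) // 2
            # The word length is part of the key because Hamming distance
            # is only defined between strings of equal length.
            self.by_prefix[(len(w), w[:half])].append(w)
            self.by_suffix[(len(w), w[half:])].append(w)

    def query(self, q: str) -> list[str]:
        """Return all indexed words within Hamming distance 1 of q."""
        half = len(q) // 2
        candidates = (
            self.by_prefix.get((len(q), q[:half]), [])
            + self.by_suffix.get((len(q), q[half:]), [])
        )
        # Verify each candidate; an exact match on one half is necessary, not sufficient.
        return sorted({w for w in candidates if hamming_at_most(w, q, 1)})


if __name__ == "__main__":
    index = SplitIndex(["house", "mouse", "horse", "hose"])
    print(index.query("louse"))
```

For k mismatches the same argument calls for k + 1 pieces, at the cost of more index entries per word.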

    Succinct Dictionary Matching With No Slowdown

    The problem of dictionary matching is a classical problem in string matching: given a set S of d strings of total length n characters over a (not necessarily constant) alphabet of size sigma, build a data structure so that we can match in any text T all occurrences of strings belonging to S. The classical solution for this problem is the Aho-Corasick automaton, which finds all occ occurrences in a text T in time O(|T| + occ) using a data structure that occupies O(m log m) bits of space, where m <= n + 1 is the number of states in the automaton. In this paper we show that the Aho-Corasick automaton can be represented in just m(log sigma + O(1)) + O(d log(n/d)) bits of space while still maintaining the ability to answer queries in O(|T| + occ) time. To the best of our knowledge, the currently fastest succinct data structure for the dictionary matching problem uses space O(n log sigma) while answering queries in O(|T| log log n + occ) time. In this paper we also show how the space occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)), where H0 is the empirical entropy of the characters appearing in the trie representation of the set S, provided that sigma < m^epsilon for any constant 0 < epsilon < 1. The query time remains unchanged.

    The complexity of the Multiple Pattern Matching Problem for random strings

    We generalise a multiple string pattern matching algorithm, recently proposed by Fredriksson and Grabowski [J. Discr. Alg. 7, 2009], to deal with arbitrary dictionaries on an alphabet of size $s$. If $r_m$ is the number of words of length $m$ in the dictionary, and $\phi(r) = \max_m \ln(s\, m\, r_m)/m$, the complexity rate for the string characters to be read by this algorithm is at most $\kappa_{\mathrm{UB}}\, \phi(r)$ for some constant $\kappa_{\mathrm{UB}}$. On the other side, we generalise the classical lower bound of Yao [SIAM J. Comput. 8, 1979], for the problem with a single pattern, to deal with arbitrary dictionaries, and determine it to be at least $\kappa_{\mathrm{LB}}\, \phi(r)$. This proves the optimality of the algorithm, improving and correcting previous claims.

    Pattern Masking for Dictionary Matching: Theory and Practice

    Data masking is a common technique for sanitizing sensitive data maintained in database systems which is becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem: given a dictionary D of d strings, each of length ℓ, a query string q of length ℓ, and a positive integer z, we are asked to compute a smallest set K ⊆ {1, …, ℓ}, so that if q[i] is replaced by a wildcard for all i ∈ K, then q matches at least z strings from D. Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for binary strings. We thus approach the problem from a more practical perspective. We show a combinatorial O((dℓ)^(|K|/3) + dℓ)-time and O(dℓ)-space algorithm for PMDM for |K| = O(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails (Abboud et al. in SIAM J Comput 47:2527–2555, 2018; Lincoln et al., in: 29th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018). Our combinatorial algorithm, executed with small |K|, is the backbone of a greedy heuristic that we propose. Our experiments on real-world and synthetic datasets show that our heuristic finds nearly-optimal solutions in practice and is also very efficient. We also generalize this algorithm for the problem of masking multiple query strings simultaneously so that every string has at least z matches in D. PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z = 1, one obtains the minimal number of mismatches of q with any string from D.
    The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time O(2^ℓ d). We present a data structure for PMDM that answers queries over D in time O(2^(ℓ/2) (2^(ℓ/2) + τ) ℓ) and requires space O(2^ℓ d^2/τ^2 + 2^(ℓ/2) d), for any parameter τ ∈ [1, d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017]. This gives a polynomial-time O(d^(1/4+ϵ))-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture. This is an extended version of a paper that was presented at International Symposium on Algorithms and Computation (ISAAC) 2021.
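
The PMDM definition can be made concrete with an exhaustive search over all masks. The sketch below is one plausible reading of a "simple exact algorithm", not necessarily the authors' procedure: it precomputes, for each dictionary string, the bitmask of positions where it differs from q, then tries masks K in order of increasing size. The function name `pmdm_exact` is hypothetical, and positions are 0-based here, whereas the abstract indexes from 1.

```python
from itertools import combinations


def pmdm_exact(dictionary, q, z):
    """Return a smallest set of 0-based positions K such that q, with the
    positions in K replaced by wildcards, matches at least z dictionary
    strings. Returns None if z exceeds the dictionary size."""
    ell = len(q)
    mismatch_masks = []
    for s in dictionary:
        if len(s) != ell:
            raise ValueError("all strings must have the same length as q")
        mask = 0
        for i in range(ell):
            if s[i] != q[i]:
                mask |= 1 << i
        mismatch_masks.append(mask)

    # Try masks by increasing size, so the first success is a smallest K.
    for size in range(ell + 1):
        for positions in combinations(range(ell), size):
            k_mask = sum(1 << i for i in positions)
            # s matches the masked q iff every mismatch position lies in K.
            matches = sum(1 for m in mismatch_masks if m & ~k_mask == 0)
            if matches >= z:
                return set(positions)
    return None


if __name__ == "__main__":
    D = ["0000", "0011", "1111", "0101"]
    print(pmdm_exact(D, "0001", 2))
```

With z = 1 the size of the returned set equals the minimum Hamming distance from q to any string in D, which mirrors the connection to dictionary matching with mismatches described in the abstract.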