    Internal Dictionary Matching

    We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary D in fragments of a given string T of length n. The dictionary is internal in the sense that each pattern in D is given as a fragment of T. This way, D takes space proportional to the number of patterns d=|D| rather than their total length, which could be Theta(n*d). In particular, we consider the following types of queries: reporting and counting all occurrences of patterns from D in a fragment T[i..j] (operations Report(i,j) and Count(i,j) below, as well as operation Exists(i,j) that returns true iff Count(i,j)>0) and reporting distinct patterns from D that occur in T[i..j] (operation ReportDistinct(i,j)). We show how to construct, in O((n+d) log^{O(1)} n) time, a data structure that answers each of these queries in O(log^{O(1)} n + |output|) time; see the table below for specific time and space complexities.

    Query               | Preprocessing time                     | Space          | Query time
    Exists(i,j)         | O(n+d)                                 | O(n)           | O(1)
    Report(i,j)         | O(n+d)                                 | O(n+d)         | O(1+|output|)
    ReportDistinct(i,j) | O(n log n + d)                         | O(n+d)         | O(log n + |output|)
    Count(i,j)          | O(n log n / log log n + d log^{3/2} n) | O(n + d log n) | O(log^2 n / log log n)

    The case of counting patterns is much more involved and needs a combination of a locally consistent parsing with orthogonal range searching. Reporting distinct patterns, on the other hand, uses the structure of maximal repetitions in strings. Finally, we provide tight (up to subpolynomial factors) upper and lower bounds for the case of a dynamic dictionary.
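
    As a point of reference for the query semantics above, here is a minimal brute-force sketch in Python: the class stores each dictionary pattern as a fragment of T and answers Exists/Count/Report/ReportDistinct by scanning the queried fragment directly, so its query time is nowhere near the polylogarithmic bounds in the table. All names and the example dictionary are illustrative only.

```python
class InternalDictionary:
    """Naive reference implementation of internal dictionary matching.

    Patterns are fragments of T, given as (start, end) index pairs
    (0-based, inclusive), so the dictionary itself needs only O(d) extra space.
    """

    def __init__(self, text, fragments):
        self.text = text
        self.patterns = [text[a:b + 1] for a, b in fragments]

    def report(self, i, j):
        """All occurrences (pattern index, start position) inside T[i..j]."""
        window = self.text[i:j + 1]
        out = []
        for idx, pat in enumerate(self.patterns):
            pos = window.find(pat)
            while pos != -1:
                out.append((idx, i + pos))
                pos = window.find(pat, pos + 1)
        return out

    def count(self, i, j):
        return len(self.report(i, j))

    def exists(self, i, j):
        return self.count(i, j) > 0

    def report_distinct(self, i, j):
        return sorted({idx for idx, _ in self.report(i, j)})


# T = "abaab" with dictionary D = {T[0..1] = "ab", T[2..3] = "aa"}.
D = InternalDictionary("abaab", [(0, 1), (2, 3)])
print(D.report(0, 4))           # [(0, 0), (0, 3), (1, 2)]
print(D.report_distinct(1, 4))  # [0, 1]
```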

    Counting Distinct Patterns in Internal Dictionary Matching

    We consider the problem of preprocessing a text T of length n and a dictionary D in order to be able to efficiently answer queries CountDistinct(i,j), that is, given i and j, return the number of patterns from D that occur in the fragment T[i..j]. The dictionary is internal in the sense that each pattern in D is given as a fragment of T. This way, the dictionary takes space proportional to the number of patterns d=|D| rather than their total length, which could be Theta(n*d). An O~(n+d)-size data structure that answers CountDistinct(i,j) queries O(log n)-approximately in O~(1) time was recently proposed in a work that introduced internal dictionary matching [ISAAC 2019]. Here we present an O~(n+d)-size data structure that answers CountDistinct(i,j) queries 2-approximately in O~(1) time. Using range queries, for any m, we give an O~(min(nd/m, n^2/m^2) + d)-size data structure that answers CountDistinct(i,j) queries exactly in O~(m) time. We also consider the special case when the dictionary consists of all square factors of the string. We design an O(n log^2 n)-size data structure that allows us to count distinct squares in a text fragment T[i..j] in O(log n) time. Comment: Accepted to CPM 2020.
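
    To make the square-factor special case concrete, the brute-force routine below counts the distinct squares (factors of the form XX) in a fragment by direct scanning. It only illustrates the query semantics, not the paper's O(n log^2 n)-size structure with O(log n) query time.

```python
def count_distinct_squares(text, i, j):
    """Number of distinct square factors XX occurring in text[i..j] (inclusive)."""
    fragment = text[i:j + 1]
    n = len(fragment)
    squares = set()
    for start in range(n):
        # Try every half-length that still fits inside the fragment.
        for half in range(1, (n - start) // 2 + 1):
            left = fragment[start:start + half]
            if fragment[start + half:start + 2 * half] == left:
                squares.add(left + left)
    return len(squares)


# "abaabaab" contains the distinct squares "aa", "abaaba", "baabaa", "aabaab".
print(count_distinct_squares("abaabaab", 0, 7))  # 4
```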

    Succinct Dictionary Matching With No Slowdown

    The problem of dictionary matching is a classical problem in string matching: given a set S of d strings of total length n characters over an alphabet of size sigma (not necessarily constant), build a data structure so that we can find in any text T all occurrences of strings belonging to S. The classical solution for this problem is the Aho-Corasick automaton, which finds all occ occurrences in a text T in O(|T| + occ) time using a data structure that occupies O(m log m) bits of space, where m <= n + 1 is the number of states in the automaton. In this paper we show that the Aho-Corasick automaton can be represented in just m(log sigma + O(1)) + O(d log(n/d)) bits of space while still maintaining the ability to answer queries in O(|T| + occ) time. To the best of our knowledge, the currently fastest succinct data structure for the dictionary matching problem uses space O(n log sigma) while answering queries in O(|T| log log n + occ) time. In this paper we also show how the space occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)), where H0 is the empirical entropy of the characters appearing in the trie representation of the set S, provided that sigma < m^epsilon for any constant 0 < epsilon < 1. The query time remains unchanged. Comment: Corrected typos and other minor errors.
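
    For comparison, the sketch below is a compact pointer-based Aho-Corasick matcher, i.e. the classical Theta(m log m)-bit representation that this paper compresses; it reports all occurrences in O(|T| + occ) automaton steps. The sample patterns and function names are placeholders, not taken from the paper.

```python
from collections import deque

def build_aho_corasick(patterns):
    """Return the goto/fail/output tables of the Aho-Corasick automaton."""
    goto, fail, output = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):          # build the trie
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(idx)
    queue = deque(goto[0].values())               # BFS to set failure links
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] += output[fail[nxt]]      # inherit matches of the failure state
    return goto, fail, output

def dictionary_match(text, patterns):
    """All (start position, pattern index) occurrences of patterns in text."""
    goto, fail, output = build_aho_corasick(patterns)
    state, occ = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in output[state]:
            occ.append((pos - len(patterns[idx]) + 1, idx))
    return occ

print(dictionary_match("ushers", ["he", "she", "his", "hers"]))
# [(1, 1), (2, 0), (2, 3)]  i.e. "she"@1, "he"@2, "hers"@2
```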

    Automated schema matching techniques: an exploratory study

    Manual schema matching is a problem for many database applications that use multiple data sources, including data warehousing and e-commerce applications. Current research attempts to address this problem by developing algorithms to automate aspects of the schema-matching task. In this paper, an approach using an external dictionary is proposed to facilitate automated discovery of the semantic meaning of database schema terms. An experimental study was conducted to evaluate the performance and accuracy of five schema-matching techniques with the proposed approach, called SemMA. The proposed approach and results are compared with two existing semi-automated schema-matching approaches, and suggestions for future research are made.
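
    The abstract does not spell out SemMA's algorithm, so the toy sketch below is purely hypothetical: it only illustrates the general idea of using an external dictionary to map schema terms to canonical concepts before comparing two schemas.

```python
# Hypothetical illustration of dictionary-assisted schema matching;
# the synonym table stands in for an external dictionary / thesaurus.
SYNONYMS = {
    "custname": "customer_name",
    "client": "customer_name",
    "zip": "postal_code",
    "zipcode": "postal_code",
    "postcode": "postal_code",
}

def canonical(term):
    """Normalize a schema term and map it to a canonical concept if known."""
    key = term.lower().replace("_", "")
    return SYNONYMS.get(key, key)

def match_schemas(schema_a, schema_b):
    """Pair columns whose canonical concepts coincide."""
    by_concept = {canonical(col): col for col in schema_b}
    return {col: by_concept[canonical(col)] for col in schema_a
            if canonical(col) in by_concept}

print(match_schemas(["CustName", "Zip", "OrderDate"],
                    ["client", "postcode", "ship_date"]))
# {'CustName': 'client', 'Zip': 'postcode'}
```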

    Online Pattern Matching for String Edit Distance with Moves

    Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinary editing operations to turn one string into the other. Although computing EDM exactly is intractable, it has many applications, especially in error detection. Edit sensitive parsing (ESP) is an efficient parsing algorithm that guarantees an upper bound on parsing discrepancies between different appearances of the same substrings in a string. ESP can be used to compute an approximate EDM as the L1 distance between characteristic vectors built from node labels in parsing trees. However, ESP is not applicable to streaming text data where the whole text is not known in advance. We present an online ESP (OESP) that enables online pattern matching for EDM. OESP builds a parse tree for a streaming text and computes the L1 distance between characteristic vectors in an online manner. For the space-efficient computation of EDM, OESP directly encodes the parse tree into a succinct representation by leveraging the idea behind recent results on dynamic succinct trees. We experimentally test OESP's ability to compute EDM in an online manner on benchmark datasets and show its efficiency. Comment: This paper has been accepted to the 21st edition of the International Symposium on String Processing and Information Retrieval (SPIRE 2014).
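
    The core idea of approximating EDM as an L1 distance between characteristic vectors can be sketched as follows. Purely for illustration, fixed-length q-grams stand in for the ESP parse-tree node labels the paper actually uses; this substitution loses ESP's guarantees.

```python
from collections import Counter

def characteristic_vector(s, q=2):
    """Frequency vector of the length-q substrings of s (q-grams as stand-in labels)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def l1_distance(s, t, q=2):
    """L1 distance between the characteristic vectors of s and t."""
    vs, vt = characteristic_vector(s, q), characteristic_vector(t, q)
    return sum(abs(vs[key] - vt[key]) for key in set(vs) | set(vt))

# Moving a block changes few q-grams, so the distance stays small:
print(l1_distance("abcdefg", "defgabc"))   # 2  -- block move
print(l1_distance("abcdefg", "gfedcba"))   # 12 -- reversal changes every q-gram
```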

    Feature detection using spikes: the greedy approach

    A goal of low-level neural processes is to build an efficient code extracting the relevant information from the sensory input. It is believed that this is implemented in cortical areas by elementary inferential computations dynamically extracting the most likely parameters corresponding to the sensory signal. We explore here a neuro-mimetic feed-forward model of the primary visual area (V1) solving this problem in the case where the signal may be described by a robust linear generative model. This model uses an over-complete dictionary of primitives which provides a distributed probabilistic representation of input features. Relying on an efficiency criterion, we derive an algorithm as an approximate solution which uses incremental greedy inference processes. This algorithm is similar to 'Matching Pursuit' and mimics the parallel architecture of neural computations. We propose here a simple implementation using a network of spiking integrate-and-fire neurons which communicate using lateral interactions. Numerical simulations show that this Sparse Spike Coding strategy provides an efficient model for representing visual data from a set of natural images. Even though it is simplistic, this transformation of spatial data into a spatio-temporal pattern of binary events provides an accurate description of some complex neural patterns observed in the spiking activity of biological neural networks. Comment: This work links Matching Pursuit with Bayesian inference by providing the underlying hypotheses (linear model, uniform prior, Gaussian noise model). A parallel with the parallel and event-based nature of neural computations is explored, and we show an application to modelling the primary visual cortex / image processing. http://incm.cnrs-mrs.fr/perrinet/dynn/LaurentPerrinet/Publications/Perrinet04tau
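
    A brief Matching Pursuit sketch (with NumPy) may clarify the greedy inference step that the spiking network above approximates: repeatedly pick the dictionary atom most correlated with the residual and subtract its contribution. The synthetic dictionary and signal are placeholders; this is not the paper's spiking implementation.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_iter=10):
    """Greedy sparse coding; dictionary is an (n_atoms, dim) array of unit-norm atoms."""
    residual = signal.astype(float).copy()
    coefficients = np.zeros(len(dictionary))
    for _ in range(n_iter):
        correlations = dictionary @ residual          # match every atom against the residual
        best = int(np.argmax(np.abs(correlations)))   # greedy choice of the best atom
        coefficients[best] += correlations[best]
        residual -= correlations[best] * dictionary[best]
    return coefficients, residual

rng = np.random.default_rng(0)
atoms = rng.normal(size=(64, 16))                      # over-complete: 64 atoms in dimension 16
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)  # unit-norm atoms
signal = 3.0 * atoms[5] - 2.0 * atoms[17]              # sparse ground truth
coeffs, res = matching_pursuit(signal, atoms, n_iter=20)
# Indices with large coefficients (should include 5 and 17) and a small residual norm:
print(np.flatnonzero(np.abs(coeffs) > 0.5), np.linalg.norm(res))
```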