Internal Dictionary Matching
We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary D in fragments of a given string T of length n. The dictionary is internal in the sense that each pattern in D is given as a fragment of T. This way, D takes space proportional to the number of patterns d=|D| rather than their total length, which could be Theta(n * d).
In particular, we consider the following types of queries: reporting and counting all occurrences of patterns from D in a fragment T[i..j] (operations Report(i,j) and Count(i,j) below, as well as operation Exists(i,j), which returns true iff Count(i,j)>0), and reporting distinct patterns from D that occur in T[i..j] (operation ReportDistinct(i,j)). We show how to construct, in O((n+d) log^{O(1)} n) time, a data structure that answers each of these queries in O(log^{O(1)} n + |output|) time; see the table below for the specific time and space complexities.
Query | Preprocessing time | Space | Query time
Exists(i,j) | O(n+d) | O(n) | O(1)
Report(i,j) | O(n+d) | O(n+d) | O(1+|output|)
ReportDistinct(i,j) | O(n log n+d) | O(n+d) | O(log n+|output|)
Count(i,j) | O(n log n / log log n + d log^{3/2} n) | O(n + d log n) | O(log^2 n / log log n)
The case of counting patterns is much more involved and requires a combination of locally consistent parsing with orthogonal range searching. Reporting distinct patterns, on the other hand, uses the structure of maximal repetitions in strings. Finally, we provide upper and lower bounds, tight up to subpolynomial factors, for the case of a dynamic dictionary.
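For concreteness, the semantics of the four queries can be pinned down with a naive reference implementation. This is a hypothetical sketch, not the paper's data structures: it performs no preprocessing and ignores all of the bounds in the table, serving only to make the query definitions precise (fragments are taken 0-based and inclusive).

```python
def occurrences(text, pattern, i, j):
    """Start positions of `pattern` occurring fully inside text[i..j] (inclusive)."""
    return [s for s in range(i, j - len(pattern) + 2)
            if text[s:s + len(pattern)] == pattern]

class InternalDictionary:
    """Naive internal dictionary: each pattern is a fragment (a, b) of the text."""

    def __init__(self, text, fragments):
        self.text = text
        self.patterns = [text[a:b + 1] for (a, b) in fragments]

    def report(self, i, j):
        # All occurrences of all patterns in T[i..j], as (pattern, start) pairs.
        return [(p, s) for p in self.patterns
                for s in occurrences(self.text, p, i, j)]

    def count(self, i, j):
        return len(self.report(i, j))

    def exists(self, i, j):
        return self.count(i, j) > 0

    def report_distinct(self, i, j):
        # Distinct patterns (as strings) with at least one occurrence in T[i..j].
        return {p for p in self.patterns if occurrences(self.text, p, i, j)}
```

Note how the dictionary is stored as d fragments of T rather than d separate strings, which is the point of the internal setting.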
Counting Distinct Patterns in Internal Dictionary Matching
We consider the problem of preprocessing a text T of length n and a
dictionary D in order to be able to efficiently answer queries
CountDistinct(i,j), that is, given i and j, return the number of distinct patterns
from D that occur in the fragment T[i..j]. The
dictionary is internal in the sense that each pattern in D is given
as a fragment of T. This way, the dictionary takes space proportional to the
number of patterns d=|D| rather than their total length, which
could be Theta(n * d). An O~(n)-size data structure
that answers CountDistinct(i,j) queries O(log n)-approximately
in O~(1) time was recently proposed in the work that
introduced internal dictionary matching [ISAAC 2019]. Here we present an
O~(n)-size data structure that answers CountDistinct(i,j)
queries 2-approximately in O~(1)
time. Using range queries, for any m <= n, we give an
O~(min(nd/m, n^2/m^2) + d)-size data structure that answers CountDistinct(i,j)
queries exactly in O~(m) time. We also
consider the special case when the dictionary consists of all square factors of
the string. We design an O~(n)-size data structure that
allows us to count distinct squares in a text fragment in O~(1) time. Comment: Accepted to CPM 2020
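The quantity being computed can be stated as brute force, again with 0-based inclusive fragments. This sketch only fixes the semantics of CountDistinct and of the all-squares special case; it is quadratic per query, whereas the paper's structures answer queries in polylogarithmic time after preprocessing.

```python
def count_distinct(text, patterns, i, j):
    """Number of distinct dictionary patterns occurring in text[i..j]."""
    fragment = text[i:j + 1]
    return sum(1 for p in set(patterns) if p in fragment)

def distinct_squares(text, i, j):
    """All distinct square factors ww of text[i..j], by brute force."""
    frag = text[i:j + 1]
    return {frag[a:a + 2 * h]
            for h in range(1, len(frag) // 2 + 1)       # half-length of the square
            for a in range(len(frag) - 2 * h + 1)       # start position
            if frag[a:a + h] == frag[a + h:a + 2 * h]}  # ww test
```

Counting |distinct_squares(T, i, j)| is exactly the square-counting query of the special case above.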
Succinct Dictionary Matching With No Slowdown
The problem of dictionary matching is a classical problem in string matching:
given a set S of d strings of total length n characters over an (not
necessarily constant) alphabet of size sigma, build a data structure so that we
can match in any text T all occurrences of strings belonging to S. The
classical solution for this problem is the Aho-Corasick automaton which finds
all occ occurrences in a text T in time O(|T| + occ) using a data structure
that occupies O(m log m) bits of space where m <= n + 1 is the number of states
in the automaton. In this paper we show that the Aho-Corasick automaton can be
represented in just m(log sigma + O(1)) + O(d log(n/d)) bits of space while
still maintaining the ability to answer queries in O(|T| + occ) time. To the
best of our knowledge, the currently fastest succinct data structure for the
dictionary matching problem uses space O(n log sigma) while answering queries
in O(|T|log log n + occ) time. In this paper we also show how the space
occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)) where H0 is the
empirical entropy of the characters appearing in the trie representation of the
set S, provided that sigma < m^epsilon for any constant 0 < epsilon < 1. The
query time remains unchanged. Comment: Corrected typos and other minor errors
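The baseline the paper compresses, the Aho-Corasick automaton, can be sketched in plain pointer-based form. This is a reference implementation of the classical O(|T| + occ) matching behaviour, not the succinct representation the paper constructs (which encodes the same automaton in m(log sigma + O(1)) + O(d log(n/d)) bits).

```python
from collections import deque

class AhoCorasick:
    """Classical Aho-Corasick automaton: trie + failure links + output sets."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-state transition maps (the trie edges)
        self.fail = [0]    # failure links
        self.out = [[]]    # indices of patterns recognized at each state
        for idx, p in enumerate(patterns):
            s = 0
            for c in p:
                if c not in self.goto[s]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[s][c] = len(self.goto) - 1
                s = self.goto[s][c]
            self.out[s].append(idx)
        # Breadth-first construction of failure links; a state's output set
        # inherits the output set of its failure target.
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for c, t in self.goto[s].items():
                queue.append(t)
                f = self.fail[s]
                while f and c not in self.goto[f]:
                    f = self.fail[f]
                self.fail[t] = self.goto[f].get(c, 0)
                self.out[t] += self.out[self.fail[t]]

    def find(self, text):
        """Yield (end_position, pattern_index) for every occurrence in text."""
        s = 0
        for pos, c in enumerate(text):
            while s and c not in self.goto[s]:
                s = self.fail[s]
            s = self.goto[s].get(c, 0)
            for idx in self.out[s]:
                yield pos, idx
```

Each state here stores a Python dict of outgoing edges, which is exactly the pointer-heavy O(m log m)-bit representation that the succinct encodings avoid.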
Automated schema matching techniques: an exploratory study
Manual schema matching is a problem for many database applications that use multiple data sources, including data warehousing and e-commerce applications. Current research attempts to address this problem by developing algorithms to automate aspects of the schema-matching task. In this paper, an approach using an external dictionary is proposed to facilitate automated discovery of the semantic meaning of database schema terms. An experimental study was conducted to evaluate the performance and accuracy of five schema-matching techniques with the proposed approach, called SemMA. The proposed approach and results are compared with two existing semi-automated schema-matching approaches, and suggestions for future research are made.
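The dictionary-assisted idea can be illustrated with a toy matcher. This is not the paper's SemMA system; the synonym dictionary, the normalization rule, and the schemas below are all hypothetical, and the sketch only shows the general mechanism of mapping attribute names to canonical concepts via an external dictionary before comparing them.

```python
# Hypothetical external dictionary: schema term -> canonical concept.
SYNONYMS = {
    "cust_no": "customer_id", "customerid": "customer_id",
    "addr": "address", "address": "address",
    "dob": "birth_date", "birthdate": "birth_date",
}

def canonical(term):
    """Map an attribute name to a canonical concept, if the dictionary knows it."""
    squeezed = term.lower().replace("_", "").replace(" ", "")
    return SYNONYMS.get(term.lower(), SYNONYMS.get(squeezed, squeezed))

def match_schemas(schema_a, schema_b):
    """Propose (a, b) pairs whose names resolve to the same concept."""
    return [(a, b) for a in schema_a for b in schema_b
            if canonical(a) == canonical(b)]
```

Real systems combine such dictionary lookups with structural and instance-level evidence; the point here is only the dictionary-lookup step.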
Online Pattern Matching for String Edit Distance with Moves
Edit distance with moves (EDM) is a string-to-string distance measure that
includes substring moves in addition to the ordinary edit operations used to turn
one string into the other. Although optimizing EDM is intractable, it has many
applications, especially in error detection. Edit sensitive parsing (ESP) is an
efficient parsing algorithm that guarantees an upper bound of parsing
discrepancies between different appearances of the same substrings in a string.
ESP can be used for computing an approximate EDM as the L1 distance between
characteristic vectors built by node labels in parsing trees. However, ESP is
not applicable to streaming text data, where the whole text is unknown in
advance. We present an online ESP (OESP) that enables online pattern
matching for EDM. OESP builds a parse tree for a streaming text and computes
the L1 distance between characteristic vectors in an online manner. For the
space-efficient computation of EDM, OESP directly encodes the parse tree into a
succinct representation by leveraging the idea behind recent results of a
dynamic succinct tree. We experimentally test OESP on the ability to compute
EDM in an online manner on benchmark datasets, and we show OESP's efficiency. Comment: This paper has been accepted to the 21st edition of the International
Symposium on String Processing and Information Retrieval (SPIRE 2014).
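The final step of the ESP approach, comparing two strings via the L1 distance between their characteristic vectors, can be sketched in isolation. As a hedged stand-in, the sketch below uses q-gram counts in place of ESP parse-tree node labels: it demonstrates the L1-vector computation only, and carries none of ESP's parsing-discrepancy guarantee.

```python
from collections import Counter

def l1_characteristic_distance(s, t, q=2):
    """L1 distance between count vectors of two strings.

    Stand-in sketch: keys are q-grams rather than ESP node labels, so this
    is the vector-comparison step of the method, not ESP itself.
    """
    vs = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    vt = Counter(t[i:i + q] for i in range(len(t) - q + 1))
    return sum(abs(vs[k] - vt[k]) for k in set(vs) | set(vt))
```

In OESP the vectors are built incrementally from the labels of a parse tree maintained over the stream, rather than recomputed from scratch as here.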
Feature detection using spikes: the greedy approach
A goal of low-level neural processes is to build an efficient code extracting
the relevant information from the sensory input. It is believed that this is
implemented in cortical areas by elementary inferential computations
dynamically extracting the most likely parameters corresponding to the sensory
signal. We explore here a neuro-mimetic feed-forward model of the primary
visual area (V1) solving this problem in the case where the signal may be
described by a robust linear generative model. This model uses an over-complete
dictionary of primitives which provides a distributed probabilistic
representation of input features. Relying on an efficiency criterion, we derive
an algorithm as an approximate solution which uses incremental greedy inference
processes. This algorithm is similar to 'Matching Pursuit' and mimics the
parallel architecture of neural computations. We propose here a simple
implementation using a network of spiking integrate-and-fire neurons which
communicate using lateral interactions. Numerical simulations show that this
Sparse Spike Coding strategy provides an efficient model for representing
visual data from a set of natural images. Even though it is simplistic, this
transformation of spatial data into a spatio-temporal pattern of binary events
provides an accurate description of some complex neural patterns observed in
the spiking activity of biological neural networks. Comment: This work links Matching Pursuit with Bayesian inference by providing
the underlying hypotheses (linear model, uniform prior, Gaussian noise
model). A parallel with the event-based nature of neural
computations is explored, and we show an application to modelling the Primary Visual
Cortex / image processing.
http://incm.cnrs-mrs.fr/perrinet/dynn/LaurentPerrinet/Publications/Perrinet04tau
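The greedy inference the abstract relates to spiking neurons can be sketched as plain Matching Pursuit. This is a minimal textbook version, assuming unit-norm dictionary atoms; it is not the paper's spiking implementation, only the underlying greedy loop: pick the atom most correlated with the residual, record its coefficient, subtract its projection, repeat.

```python
def matching_pursuit(signal, dictionary, n_iter=10, tol=1e-9):
    """Greedy Matching Pursuit over an over-complete dictionary.

    `dictionary` is a list of atoms (lists of floats), assumed unit-norm.
    Returns the accumulated coefficients and the final residual.
    """
    residual = list(signal)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_iter):
        # Inner product of the residual with every atom.
        dots = [sum(r * a for r, a in zip(residual, atom))
                for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda k: abs(dots[k]))
        if abs(dots[best]) < tol:  # nothing left to explain
            break
        coeffs[best] += dots[best]
        residual = [r - dots[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return coeffs, residual
```

In the spiking reading, each greedy selection corresponds to one neuron emitting a spike whose timing encodes the matched coefficient, with lateral interactions implementing the residual update.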