    Internal Dictionary Matching

    We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary D in fragments of a given string T of length n. The dictionary is internal in the sense that each pattern in D is given as a fragment of T. This way, D takes space proportional to the number of patterns d=|D| rather than their total length, which could be Theta(n*d). In particular, we consider the following types of queries: reporting and counting all occurrences of patterns from D in a fragment T[i..j] (operations Report(i,j) and Count(i,j) below, as well as operation Exists(i,j) that returns true iff Count(i,j)>0) and reporting distinct patterns from D that occur in T[i..j] (operation ReportDistinct(i,j)). We show how to construct, in O((n+d) log^{O(1)} n) time, a data structure that answers each of these queries in O(log^{O(1)} n + |output|) time; see the table below for specific time and space complexities.

    Query               | Preprocessing time                     | Space          | Query time
    Exists(i,j)         | O(n+d)                                 | O(n)           | O(1)
    Report(i,j)         | O(n+d)                                 | O(n+d)         | O(1+|output|)
    ReportDistinct(i,j) | O(n log n + d)                         | O(n+d)         | O(log n + |output|)
    Count(i,j)          | O(n log n / log log n + d log^{3/2} n) | O(n + d log n) | O(log^2 n / log log n)

    The case of counting patterns is much more involved and needs a combination of a locally consistent parsing with orthogonal range searching. Reporting distinct patterns, on the other hand, uses the structure of maximal repetitions in strings. Finally, we provide tight (up to subpolynomial factors) upper and lower bounds for the case of a dynamic dictionary.
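
    As a point of reference for the query semantics above, here is a minimal brute-force sketch in Python: the class stores each dictionary pattern as a fragment of T and answers Exists/Count/Report/ReportDistinct by scanning the queried fragment directly, so its query time is nowhere near the polylogarithmic bounds in the table. All names and the example dictionary are illustrative only.

```python
class InternalDictionary:
    """Naive reference implementation of internal dictionary matching.

    Patterns are fragments of T, given as (start, end) index pairs
    (0-based, inclusive), so the dictionary itself needs only O(d) extra space.
    """

    def __init__(self, text, fragments):
        self.text = text
        self.patterns = [text[a:b + 1] for a, b in fragments]

    def report(self, i, j):
        """All occurrences (pattern index, start position) inside T[i..j]."""
        window = self.text[i:j + 1]
        out = []
        for idx, pat in enumerate(self.patterns):
            pos = window.find(pat)
            while pos != -1:
                out.append((idx, i + pos))
                pos = window.find(pat, pos + 1)
        return out

    def count(self, i, j):
        return len(self.report(i, j))

    def exists(self, i, j):
        return self.count(i, j) > 0

    def report_distinct(self, i, j):
        return sorted({idx for idx, _ in self.report(i, j)})


# T = "abaab" with dictionary D = {T[0..1] = "ab", T[2..3] = "aa"}.
D = InternalDictionary("abaab", [(0, 1), (2, 3)])
print(D.report(0, 4))           # [(0, 0), (0, 3), (1, 2)]
print(D.report_distinct(1, 4))  # [0, 1]
```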

    Counting Distinct Patterns in Internal Dictionary Matching

    We consider the problem of preprocessing a text T of length n and a dictionary D in order to be able to efficiently answer queries CountDistinct(i,j), that is, given i and j, return the number of patterns from D that occur in the fragment T[i..j]. The dictionary is internal in the sense that each pattern in D is given as a fragment of T. This way, the dictionary takes space proportional to the number of patterns d=|D| rather than their total length, which could be Theta(n*d). An O~(n+d)-size data structure that answers CountDistinct(i,j) queries O(log n)-approximately in O~(1) time was recently proposed in a work that introduced internal dictionary matching [ISAAC 2019]. Here we present an O~(n+d)-size data structure that answers CountDistinct(i,j) queries 2-approximately in O~(1) time. Using range queries, for any m, we give an O~(min(nd/m, n^2/m^2) + d)-size data structure that answers CountDistinct(i,j) queries exactly in O~(m) time. We also consider the special case when the dictionary consists of all square factors of the string. We design an O(n log^2 n)-size data structure that allows us to count distinct squares in a text fragment T[i..j] in O(log n) time. Comment: Accepted to CPM 2020.
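
    To make the square-factor special case concrete, the brute-force routine below counts the distinct squares (factors of the form XX) in a fragment by direct scanning. It only illustrates the query semantics, not the paper's O(n log^2 n)-size structure with O(log n) query time.

```python
def count_distinct_squares(text, i, j):
    """Number of distinct square factors XX occurring in text[i..j] (inclusive)."""
    fragment = text[i:j + 1]
    n = len(fragment)
    squares = set()
    for start in range(n):
        # Try every half-length that still fits inside the fragment.
        for half in range(1, (n - start) // 2 + 1):
            left = fragment[start:start + half]
            if fragment[start + half:start + 2 * half] == left:
                squares.add(left + left)
    return len(squares)


# "abaabaab" contains the distinct squares "aa", "abaaba", "baabaa", "aabaab".
print(count_distinct_squares("abaabaab", 0, 7))  # 4
```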

    Succinct Dictionary Matching With No Slowdown

    The problem of dictionary matching is a classical problem in string matching: given a set S of d strings of total length n characters over an alphabet of size sigma (not necessarily constant), build a data structure so that we can find in any text T all occurrences of strings belonging to S. The classical solution for this problem is the Aho-Corasick automaton, which finds all occ occurrences in a text T in O(|T| + occ) time using a data structure that occupies O(m log m) bits of space, where m <= n + 1 is the number of states in the automaton. In this paper we show that the Aho-Corasick automaton can be represented in just m(log sigma + O(1)) + O(d log(n/d)) bits of space while still maintaining the ability to answer queries in O(|T| + occ) time. To the best of our knowledge, the currently fastest succinct data structure for the dictionary matching problem uses space O(n log sigma) while answering queries in O(|T| log log n + occ) time. In this paper we also show how the space occupancy can be reduced to m(H0 + O(1)) + O(d log(n/d)), where H0 is the empirical entropy of the characters appearing in the trie representation of the set S, provided that sigma < m^epsilon for any constant 0 < epsilon < 1. The query time remains unchanged. Comment: Corrected typos and other minor errors.
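
    For comparison, the sketch below is a compact pointer-based Aho-Corasick matcher, i.e. the classical Theta(m log m)-bit representation that this paper compresses; it reports all occurrences in O(|T| + occ) automaton steps. The sample patterns and function names are placeholders, not taken from the paper.

```python
from collections import deque

def build_aho_corasick(patterns):
    """Return the goto/fail/output tables of the Aho-Corasick automaton."""
    goto, fail, output = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):          # build the trie
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(idx)
    queue = deque(goto[0].values())               # BFS to set failure links
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] += output[fail[nxt]]      # inherit matches of the failure state
    return goto, fail, output

def dictionary_match(text, patterns):
    """All (start position, pattern index) occurrences of patterns in text."""
    goto, fail, output = build_aho_corasick(patterns)
    state, occ = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in output[state]:
            occ.append((pos - len(patterns[idx]) + 1, idx))
    return occ

print(dictionary_match("ushers", ["he", "she", "his", "hers"]))
# [(1, 1), (2, 0), (2, 3)]  i.e. "she"@1, "he"@2, "hers"@2
```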

    Automated schema matching techniques: an exploratory study

    Manual schema matching is a problem for many database applications that use multiple data sources, including data warehousing and e-commerce applications. Current research attempts to address this problem by developing algorithms to automate aspects of the schema-matching task. In this paper, an approach using an external dictionary is proposed to facilitate automated discovery of the semantic meaning of database schema terms. An experimental study was conducted to evaluate the performance and accuracy of five schema-matching techniques with the proposed approach, called SemMA. The proposed approach and results are compared with two existing semi-automated schema-matching approaches, and suggestions for future research are made.
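
    The abstract does not spell out SemMA's algorithm, so the toy sketch below is purely hypothetical: it only illustrates the general idea of using an external dictionary to map schema terms to canonical concepts before comparing two schemas.

```python
# Hypothetical illustration of dictionary-assisted schema matching;
# the synonym table stands in for an external dictionary / thesaurus.
SYNONYMS = {
    "custname": "customer_name",
    "client": "customer_name",
    "zip": "postal_code",
    "zipcode": "postal_code",
    "postcode": "postal_code",
}

def canonical(term):
    """Normalize a schema term and map it to a canonical concept if known."""
    key = term.lower().replace("_", "")
    return SYNONYMS.get(key, key)

def match_schemas(schema_a, schema_b):
    """Pair columns whose canonical concepts coincide."""
    by_concept = {canonical(col): col for col in schema_b}
    return {col: by_concept[canonical(col)] for col in schema_a
            if canonical(col) in by_concept}

print(match_schemas(["CustName", "Zip", "OrderDate"],
                    ["client", "postcode", "ship_date"]))
# {'CustName': 'client', 'Zip': 'postcode'}
```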

    Online Pattern Matching for String Edit Distance with Moves

    Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinary editing operations to turn one string into the other. Although computing EDM exactly is intractable, it has many applications, especially in error detection. Edit sensitive parsing (ESP) is an efficient parsing algorithm that guarantees an upper bound on parsing discrepancies between different appearances of the same substrings in a string. ESP can be used to compute an approximate EDM as the L1 distance between characteristic vectors built from node labels in parsing trees. However, ESP is not applicable to streaming text data where the whole text is not known in advance. We present an online ESP (OESP) that enables online pattern matching for EDM. OESP builds a parse tree for a streaming text and computes the L1 distance between characteristic vectors in an online manner. For the space-efficient computation of EDM, OESP directly encodes the parse tree into a succinct representation by leveraging the idea behind recent results on dynamic succinct trees. We experimentally test OESP's ability to compute EDM in an online manner on benchmark datasets and show its efficiency. Comment: This paper has been accepted to the 21st edition of the International Symposium on String Processing and Information Retrieval (SPIRE 2014).
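
    The core idea of approximating EDM as an L1 distance between characteristic vectors can be sketched as follows. Purely for illustration, fixed-length q-grams stand in for the ESP parse-tree node labels the paper actually uses; this substitution loses ESP's guarantees.

```python
from collections import Counter

def characteristic_vector(s, q=2):
    """Frequency vector of the length-q substrings of s (q-grams as stand-in labels)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def l1_distance(s, t, q=2):
    """L1 distance between the characteristic vectors of s and t."""
    vs, vt = characteristic_vector(s, q), characteristic_vector(t, q)
    return sum(abs(vs[key] - vt[key]) for key in set(vs) | set(vt))

# Moving a block changes few q-grams, so the distance stays small:
print(l1_distance("abcdefg", "defgabc"))   # 2  -- block move
print(l1_distance("abcdefg", "gfedcba"))   # 12 -- reversal changes every q-gram
```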

    Feature detection using spikes: the greedy approach

    A goal of low-level neural processes is to build an efficient code extracting the relevant information from the sensory input. It is believed that this is implemented in cortical areas by elementary inferential computations dynamically extracting the most likely parameters corresponding to the sensory signal. We explore here a neuro-mimetic feed-forward model of the primary visual area (V1) solving this problem in the case where the signal may be described by a robust linear generative model. This model uses an over-complete dictionary of primitives which provides a distributed probabilistic representation of input features. Relying on an efficiency criterion, we derive an algorithm as an approximate solution which uses incremental greedy inference processes. This algorithm is similar to 'Matching Pursuit' and mimics the parallel architecture of neural computations. We propose here a simple implementation using a network of spiking integrate-and-fire neurons which communicate using lateral interactions. Numerical simulations show that this Sparse Spike Coding strategy provides an efficient model for representing visual data from a set of natural images. Even though it is simplistic, this transformation of spatial data into a spatio-temporal pattern of binary events provides an accurate description of some complex neural patterns observed in the spiking activity of biological neural networks. Comment: This work links Matching Pursuit with Bayesian inference by providing the underlying hypotheses (linear model, uniform prior, Gaussian noise model). A parallel with the parallel and event-based nature of neural computations is explored, and we show an application to modelling the primary visual cortex / image processing. http://incm.cnrs-mrs.fr/perrinet/dynn/LaurentPerrinet/Publications/Perrinet04tau
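
    A brief Matching Pursuit sketch (with NumPy) may clarify the greedy inference step that the spiking network above approximates: repeatedly pick the dictionary atom most correlated with the residual and subtract its contribution. The synthetic dictionary and signal are placeholders; this is not the paper's spiking implementation.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_iter=10):
    """Greedy sparse coding; dictionary is an (n_atoms, dim) array of unit-norm atoms."""
    residual = signal.astype(float).copy()
    coefficients = np.zeros(len(dictionary))
    for _ in range(n_iter):
        correlations = dictionary @ residual          # match every atom against the residual
        best = int(np.argmax(np.abs(correlations)))   # greedy choice of the best atom
        coefficients[best] += correlations[best]
        residual -= correlations[best] * dictionary[best]
    return coefficients, residual

rng = np.random.default_rng(0)
atoms = rng.normal(size=(64, 16))                      # over-complete: 64 atoms in dimension 16
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)  # unit-norm atoms
signal = 3.0 * atoms[5] - 2.0 * atoms[17]              # sparse ground truth
coeffs, res = matching_pursuit(signal, atoms, n_iter=20)
# Indices with large coefficients (should include 5 and 17) and a small residual norm:
print(np.flatnonzero(np.abs(coeffs) > 0.5), np.linalg.norm(res))
```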