2,118 research outputs found
String Indexing for Patterns with Wildcards
We consider the problem of indexing a string of length to report the
occurrences of a query pattern containing characters and wildcards.
Let be the number of occurrences of in , and the size of
the alphabet. We obtain the following results.
- A linear space index with query time .
This significantly improves the previously best known linear space index by Lam
et al. [ISAAC 2007], which requires query time in the worst case.
- An index with query time using space , where is the maximum number of wildcards allowed in the pattern.
This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
We also show that these indexes can be generalized to allow variable length
gaps in the pattern. Our results are obtained using a novel combination of
well-known and new techniques, which could be of independent interest
MatchPy: A Pattern Matching Library
Pattern matching is a powerful tool for symbolic computations, based on the
well-defined theory of term rewriting systems. Application domains include
algebraic expressions, abstract syntax trees, and XML and JSON data.
Unfortunately, no lightweight implementation of pattern matching as general and
flexible as Mathematica exists for Python Mathics,MacroPy,patterns,PyPatt.
Therefore, we created the open source module MatchPy which offers similar
pattern matching functionality in Python using a novel algorithm which finds
matches for large pattern sets more efficiently by exploiting similarities
between patterns.Comment: arXiv admin note: substantial text overlap with arXiv:1710.0007
A Parameterized Study of Maximum Generalized Pattern Matching Problems
The generalized function matching (GFM) problem has been intensively studied
starting with [Ehrenfeucht and Rozenberg, 1979]. Given a pattern p and a text
t, the goal is to find a mapping from the letters of p to non-empty substrings
of t, such that applying the mapping to p results in t. Very recently, the
problem has been investigated within the framework of parameterized complexity
[Fernau, Schmid, and Villanger, 2013].
In this paper we study the parameterized complexity of the optimization
variant of GFM (called Max-GFM), which has been introduced in [Amir and Nor,
2007]. Here, one is allowed to replace some of the pattern letters with some
special symbols "?", termed wildcards or don't cares, which can be mapped to an
arbitrary substring of the text. The goal is to minimize the number of
wildcards used.
We give a complete classification of the parameterized complexity of Max-GFM
and its variants under a wide range of parameterizations, such as, the number
of occurrences of a letter in the text, the size of the text alphabet, the
number of occurrences of a letter in the pattern, the size of the pattern
alphabet, the maximum length of a string matched to any pattern letter, the
number of wildcards and the maximum size of a string that a wildcard can be
mapped to.Comment: to appear in Proc. IPEC'1
Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping
We propose Quootstrap, a method for extracting quotations, as well as the
names of the speakers who uttered them, from large news corpora. Whereas prior
work has addressed this problem primarily with supervised machine learning, our
approach follows a fully unsupervised bootstrapping paradigm. It leverages the
redundancy present in large news corpora, more precisely, the fact that the
same quotation often appears across multiple news articles in slightly
different contexts. Starting from a few seed patterns, such as ["Q", said S.],
our method extracts a set of quotation-speaker pairs (Q, S), which are in turn
used for discovering new patterns expressing the same quotations; the process
is then repeated with the larger pattern set. Our algorithm is highly scalable,
which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus.
Validating our results against a crowdsourced ground truth, we obtain 90%
precision at 40% recall using a single seed pattern, with significantly higher
recall values for more frequently reported (and thus likely more interesting)
quotations. Finally, we showcase the usefulness of our algorithm's output for
computational social science by analyzing the sentiment expressed in our
extracted quotations.Comment: Accepted at the 12th International Conference on Web and Social Media
(ICWSM), 201
Technology Mapping for Circuit Optimization Using Content-Addressable Memory
The growing complexity of Field Programmable Gate Arrays (FPGA's) is leading to architectures with high input cardinality look-up tables (LUT's). This thesis describes a methodology for area-minimizing technology mapping for combinational logic, specifically designed for such FPGA architectures. This methodology, called LURU, leverages the parallel search capabilities of Content-Addressable Memories (CAM's) to outperform traditional mapping algorithms in both execution time and quality of results. The LURU algorithm is fundamentally different from other techniques for technology mapping in that LURU uses textual string representations of circuit topology in order to efficiently store and search for circuit patterns in a CAM. A circuit is mapped to the target LUT technology using both exact and inexact string matching techniques. Common subcircuit expressions (CSE's) are also identified and used for architectural optimization---a small set of CSE's is shown to effectively cover an average of 96% of the test circuits. LURU was tested with the ISCAS'85 suite of combinational benchmark circuits and compared with the mapping algorithms FlowMap and CutMap. The area reduction shown by LURU is, on average, 20% better compared to FlowMap and CutMap. The asymptotic runtime complexity of LURU is shown to be better than that of both FlowMap and CutMap
- …