    On the average-case complexity of pattern matching with wildcards

    Pattern matching with wildcards is a string matching problem with the goal of finding all factors of a text $t$ of length $n$ that match a pattern $x$ of length $m$, where wildcards (characters that match everything) may be present. In this paper we present a number of complexity results and fast average-case algorithms for pattern matching where wildcards are allowed in the pattern; the results are easily adapted to the case where wildcards are allowed in the text as well. We analyse the average-case complexity of these algorithms and derive non-trivial time bounds. These are the first results on the average-case complexity of pattern matching with wildcards that provide a provable separation in time complexity between exact pattern matching and pattern matching with wildcards. We introduce the wc-period of a string, which is the period of the binary mask $x_b$ where $x_b[i] = a$ iff $x[i] \neq \phi$ and $x_b[i] = b$ otherwise. We denote the length of the wc-period of a string $x$ by $\textsc{wcp}(x)$. We show the following results for constant $0 < \epsilon < 1$ and a pattern $x$ of length $m$ with $g$ wildcards, where $\textsc{wcp}(x) = p$ and the prefix of length $p$ contains $g_p$ wildcards:
    - If $\lim_{m \to \infty} \frac{g_p}{p} = 0$, there is an optimal algorithm running in $\mathcal{O}(\frac{n \log_\sigma m}{m})$ time on average.
    - If $\lim_{m \to \infty} \frac{g_p}{p} = 1 - \epsilon$, there is an algorithm running in $\mathcal{O}(\frac{n \log_\sigma m \log_2 p}{m})$ time on average.
    - If $\lim_{m \to \infty} \frac{g}{m} = \lim_{m \to \infty} 1 - f(m) = 1$, any algorithm takes at least $\Omega(\frac{n \log_\sigma m}{f(m)})$ time on average.
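
    The wc-period is concrete enough to sketch. Below is a minimal Python sketch, under the assumption that wildcards are written as '?' (the abstract uses $\phi$): it builds the binary mask $x_b$ and returns its smallest period, i.e. $\textsc{wcp}(x)$.

```python
def wc_period(pattern, wildcard="?"):
    """Length of the wc-period of `pattern`: the smallest period of the
    binary mask x_b, where x_b[i] = 'a' if pattern[i] is not a wildcard
    and x_b[i] = 'b' otherwise. The wildcard symbol '?' is an assumption;
    the paper writes wildcards as phi."""
    mask = ["b" if c == wildcard else "a" for c in pattern]
    m = len(mask)
    # The period of a string s is the smallest p with s[i] == s[i + p]
    # for every valid i; p = m always qualifies.
    for p in range(1, m):
        if all(mask[i] == mask[i + p] for i in range(m - p)):
            return p
    return m

assert wc_period("ab?ab?ab") == 3   # mask "aabaabaa" has period 3
```

    Note that $\textsc{wcp}(x)$ depends only on where the wildcards sit, not on the alphabet symbols, which is why the bounds above are stated in terms of $g_p$, the number of wildcards in the prefix of length $p$.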

    Upper and lower bounds for dynamic data structures on strings

    We consider a range of simply stated dynamic data structure problems on strings. An update changes one symbol in the input and a query asks us to compute some function of the pattern of length $m$ and a substring of a longer text. We give both conditional and unconditional lower bounds for variants of exact matching with wildcards, inner product, and Hamming distance computation via a sequence of reductions. As an example, we show that there does not exist an $O(m^{1/2-\varepsilon})$ time algorithm for a large range of these problems unless the online Boolean matrix-vector multiplication conjecture is false. We also provide nearly matching upper bounds for most of the problems we consider.
    Comment: Accepted at STACS'1
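
    To make the setting concrete, here is a naive baseline in Python (not the paper's data structure) for one of these problems, dynamic Hamming distance: updates take O(1) but queries take O(m). The lower bound quoted above says that, conditioned on the OMv conjecture, neither operation schedule can be improved to $O(m^{1/2-\varepsilon})$ for both operations at once.

```python
class DynamicHamming:
    """Naive baseline for dynamic Hamming distance; illustrative only,
    not the paper's upper-bound construction."""

    def __init__(self, text, pattern):
        self.text = list(text)
        self.pattern = pattern

    def update(self, i, symbol):
        """Change one symbol of the text -- O(1)."""
        self.text[i] = symbol

    def query(self, j):
        """Hamming distance between the pattern and text[j : j + m] -- O(m)."""
        window = self.text[j : j + len(self.pattern)]
        return sum(a != b for a, b in zip(self.pattern, window))

ds = DynamicHamming("abacaba", "aba")
assert ds.query(0) == 0
ds.update(1, "x")
assert ds.query(0) == 1
```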

    Technology Mapping for Circuit Optimization Using Content-Addressable Memory

    The growing complexity of Field Programmable Gate Arrays (FPGAs) is leading to architectures with high input cardinality look-up tables (LUTs). This thesis describes a methodology for area-minimizing technology mapping for combinational logic, specifically designed for such FPGA architectures. This methodology, called LURU, leverages the parallel search capabilities of Content-Addressable Memories (CAMs) to outperform traditional mapping algorithms in both execution time and quality of results. The LURU algorithm is fundamentally different from other technology mapping techniques in that it uses textual string representations of circuit topology to efficiently store and search for circuit patterns in a CAM. A circuit is mapped to the target LUT technology using both exact and inexact string matching techniques. Common subcircuit expressions (CSEs) are also identified and used for architectural optimization; a small set of CSEs is shown to effectively cover an average of 96% of the test circuits. LURU was tested with the ISCAS'85 suite of combinational benchmark circuits and compared with the mapping algorithms FlowMap and CutMap. On average, LURU achieves 20% better area reduction than FlowMap and CutMap, and its asymptotic runtime complexity is shown to be better than that of both.
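
    The abstract does not spell out LURU's string encoding, so the following Python sketch is only an illustration of the general idea: serialise the cone of logic rooted at a gate into a canonical string, then look that string up in an associative store (a dict here, standing in for the CAM's parallel exact-match search). All names and the encoding itself are hypothetical.

```python
def topology_key(gate, netlist):
    """Hypothetical canonical string for the logic cone rooted at `gate`.
    `netlist` maps a gate name to (operator, list of input gate names);
    names absent from the netlist are primary inputs, anonymised as 'x'."""
    if gate not in netlist:
        return "x"
    op, inputs = netlist[gate]
    children = sorted(topology_key(i, netlist) for i in inputs)
    return op + "(" + ",".join(children) + ")"

# A dict standing in for the CAM: known topology strings -> LUT mappings.
cam = {"AND(x,x)": "LUT2_AND", "OR(AND(x,x),x)": "LUT3_AOI"}

netlist = {"g1": ("AND", ["a", "b"]), "g2": ("OR", ["g1", "c"])}
assert cam[topology_key("g2", netlist)] == "LUT3_AOI"
```

    Inexact matching, which the thesis also uses, would correspond to approximate string search over these keys rather than the exact lookup shown here.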

    Reverse-Safe Data Structures for Text Indexing

    We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method into data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model.
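
    As a toy illustration of the interface (not of the O(n^ω log d) construction itself), the sketch below indexes only the substrings of length at most d. Capping the query length at d is what leaves room for at least z distinct texts producing the same answer set, which is the z-reverse-safety guarantee.

```python
from collections import defaultdict

def build_truncated_index(text, d):
    """Map every substring of length <= d to its occurrence positions.
    Illustrative only: quadratic size in the worst case, unlike the
    O(n)-size structure proposed in the paper."""
    index = defaultdict(list)
    n = len(text)
    for i in range(n):
        for length in range(1, min(d, n - i) + 1):
            index[text[i : i + length]].append(i)
    return index

idx = build_truncated_index("banana", 2)
assert idx["an"] == [1, 3]
assert "ana" not in idx   # queries longer than d are simply unsupported
```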

    Cursive script recognition using wildcards and multiple experts

    Variability in handwriting styles suggests that many letter recognition engines cannot correctly identify some handwritten letters of poor quality at reasonable computational cost. Methods capable of searching the resulting sparse graph of letter candidates are therefore required. The method presented here employs ‘wildcards’ to represent missing letter candidates. Multiple experts are used to represent different aspects of handwriting. Each expert evaluates closeness of match and indicates its confidence. Explanation experts determine the degree to which the word alternative under consideration explains extraneous letter candidates. Schemata for normalisation and combination of scores are investigated and their performance compared. Hill climbing yields near-optimal combination weights that outperform comparable methods on identical dynamic handwriting data.
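
    The abstract gives no formulas, so the following Python sketch only illustrates the shape of such a scheme: normalised per-expert scores are combined with a weighted sum, and the weights are tuned by a simple hill-climbing loop of the kind the paper describes; every function and parameter name here is hypothetical.

```python
import random

def combined_score(expert_scores, weights):
    """Weighted combination of normalised per-expert scores in [0, 1]."""
    return sum(w * s for w, s in zip(weights, expert_scores))

def hill_climb(objective, weights, steps=1000, step_size=0.05, seed=0):
    """Greedy local search: perturb the weights and keep any improvement.
    `objective` would score a weight vector on labelled validation data."""
    rng = random.Random(seed)
    best = objective(weights)
    for _ in range(steps):
        candidate = [max(0.0, w + rng.uniform(-step_size, step_size))
                     for w in weights]
        value = objective(candidate)
        if value > best:
            weights, best = candidate, value
    return weights
```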

    Constraint-based sequence mining using constraint programming

    The goal of constraint-based sequence mining is to find sequences of symbols that are included in a large number of input sequences and that satisfy some constraints specified by the user. Many constraints have been proposed in the literature, but a general framework is still missing. We investigate the use of constraint programming as a general framework for this task. We first identify four categories of constraints that are applicable to sequence mining. We then propose two constraint programming formulations. The first formulation introduces a new global constraint called exists-embedding. This formulation is the most efficient but does not support one type of constraint. To support such constraints, we develop a second formulation that is more general but incurs more overhead. Both formulations can use the projected database technique used in specialised algorithms. Experiments demonstrate the flexibility towards constraint-based settings and compare the approach to existing methods.
    Comment: In Integration of AI and OR Techniques in Constraint Programming (CPAIOR), 201
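
    The semantics behind the exists-embedding global constraint can be sketched directly: a pattern is "included" in an input sequence when it embeds as a (not necessarily contiguous) subsequence. The paper's contribution is propagating this inside a CP solver; the Python below shows only the underlying feasibility check and the support count it induces.

```python
def embeds(pattern, sequence):
    """True iff `pattern` occurs in `sequence` as a subsequence."""
    it = iter(sequence)
    return all(symbol in it for symbol in pattern)  # consumes `it` left to right

def support(pattern, database):
    """Number of input sequences in which the pattern embeds."""
    return sum(embeds(pattern, seq) for seq in database)

database = ["abcb", "acbb", "bbb"]
assert embeds("ab", "acbb")      # a ... b
assert not embeds("ab", "bbb")   # no 'a' at all
assert support("ab", database) == 2
```

    A user constraint such as a minimum support threshold would then be stated on top of this count.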
    • 

    corecore