An Algorithm to Compute the Character Access Count Distribution for Pattern Matching Algorithms
We propose a framework for the exact probabilistic
analysis of window-based pattern matching algorithms, such as
Boyer--Moore, Horspool, Backward DAWG Matching, Backward Oracle
Matching, and more. In particular, we develop an algorithm that
efficiently computes the distribution of a pattern matching
algorithm's running time cost (such as the number of text character
accesses) for any given pattern in a random text model. Text models
range from simple uniform models to higher-order Markov models or
hidden Markov models (HMMs). Furthermore, we provide an algorithm to
compute the exact distribution of differences in running time
cost of two pattern matching algorithms. Methodologically, we use
extensions of finite automata which we call deterministic
arithmetic automata (DAAs) and probabilistic arithmetic
automata (PAAs) [Marschall2008]. Given an algorithm, a
pattern, and a text model, a PAA is constructed from which the
sought distributions can be derived using dynamic programming. To
our knowledge, this is the first time that substring- or
suffix-based pattern matching algorithms are analyzed exactly by
computing the whole distribution of running time cost.
Experimentally, we compare Horspool's algorithm, Backward DAWG
Matching, and Backward Oracle Matching on prototypical patterns of
short length and provide statistics on the size of minimal DAAs for
these computations.
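To make the cost measure in the abstract above concrete, here is a minimal sketch of Horspool's algorithm instrumented to count text character accesses. It is only an illustration of the quantity whose distribution the paper computes, not the DAA/PAA construction itself. The function name `horspool_access_count` and the convention that the shift lookup reuses a character already read are assumptions made for this sketch.

```python
def horspool_access_count(pattern, text):
    """Return (match positions, number of text character accesses)."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return [], 0
    # Shift table: distance from the last occurrence of c in pattern[:-1]
    # to the end of the pattern; characters not listed shift by m.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches, accesses = [], 0
    pos = 0  # left end of the current window
    while pos + m <= n:
        # Compare right to left, counting every text character that is read.
        j = m - 1
        while j >= 0:
            accesses += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            matches.append(pos)
        # Assumption: the last window character was already read in the
        # first comparison, so the shift lookup is not counted again.
        pos += shift.get(text[pos + m - 1], m)
    return matches, accesses
```

Running this over many random texts from a chosen text model would give an empirical estimate of the access count distribution, whereas the paper derives that distribution exactly.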
A type system for Continuation Calculus
Continuation Calculus (CC), introduced by Geron and Geuvers, is a simple
foundational model for functional computation. It is closely related to lambda
calculus and term rewriting, but it has no variable binding and no pattern
matching. It is Turing complete and evaluation is deterministic. Notions like
"call-by-value" and "call-by-name" computation are available by choosing
appropriate function definitions: e.g. there is a call-by-value and a
call-by-name addition function. In the present paper we extend CC with types,
to be able to define data types in a canonical way, and functions over these
data types, defined by iteration. Data type definitions follow the so-called
"Scott encoding" of data, as opposed to the more familiar "Church encoding".
The iteration scheme comes in two flavors: a call-by-value and a call-by-name
iteration scheme. The call-by-value variant is a double negation variant of
call-by-name iteration. The double negation translation allows one to move between
call-by-name and call-by-value. Comment: In Proceedings CL&C 2014, arXiv:1409.259
On the Comparison Complexity of the String Prefix-Matching Problem
In this paper we study the exact comparison complexity of the string prefix-matching problem in the deterministic sequential comparison model with equality tests. We derive almost tight lower and upper bounds on the number of symbol comparisons required in the worst case by on-line prefix-matching algorithms for any fixed pattern and variable text. Unlike previous results on the comparison complexity of string-matching and prefix-matching algorithms, our bounds are almost tight for any particular pattern. We also consider the special case where the pattern and the text are the same string. This problem, which we call the string self-prefix problem, is similar to the pattern preprocessing step of the Knuth-Morris-Pratt string-matching algorithm that is used in several comparison-efficient string-matching and prefix-matching algorithms, including in our new algorithm. We obtain roughly tight lower and upper bounds on the number of symbol comparisons required in the worst case by on-line self-prefix algorithms. Our algorithms can be implemented in linear time and space in the standard uniform-cost random-access-machine model.
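As a point of reference for the comparison model above, the following sketch shows the classical Knuth-Morris-Pratt preprocessing step with a counter for symbol equality tests. It is the textbook failure-function computation, not the paper's new algorithm, and the name `failure_function` is an assumption made for this sketch.

```python
def failure_function(s):
    """Return (fail, comparisons) for the string s.

    fail[i] is the length of the longest proper prefix of s[:i+1]
    that is also a suffix of it; comparisons counts symbol equality tests.
    """
    n = len(s)
    fail = [0] * n
    comparisons = 0
    k = 0  # length of the border currently being extended
    for i in range(1, n):
        while True:
            comparisons += 1  # one equality test between two symbols
            if s[i] == s[k]:
                k += 1
                break
            if k == 0:
                break
            k = fail[k - 1]  # fall back to the next shorter border
        fail[i] = k
    return fail, comparisons
```

For a string of length n this classical procedure uses fewer than 2n comparisons in the worst case; the paper asks how close to optimal such counts can be made for each particular pattern.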
Optimal public money
In most countries, the supply of paper money is controlled by a state institution. This paper provides an explanation for why such an arrangement is typically chosen. I use a deterministic matching model with a continuum of agents where enforcement is limited and where some agents produce public goods. Agents can also, at a cost, produce a distinguishable, intrinsically useless but perfectly durable good: notes. I call a note public if it is printed by an agent who produces public goods. In this framework, I prove that the socially optimal allocation is only implemented by a pattern of trade in which exchanges are effected using public notes. JEL Classification: D8, E5. Keywords: Limited Commitment, Money
Indexing Isodirectional Pointer Sequences
Many sequential and temporal data have dependency relationships among their elements, which can be represented as a sequence of pointers. In this paper, we introduce a new string matching problem with a particular type of strings, which we call isodirectional pointer sequences, in which each entry has a pointer to another entry. The proposed problem is not only a formalization of real-world dependency matching problems, but also a generalization of variants of the string matching problem such as parameterized pattern matching and Cartesian tree matching. We present a 2n lg σ + 2n + o(n)-bit index that preprocesses the text T[1:n] so as to count the number of occurrences of pattern P[1:m] in O(m lg σ) time, where σ is the number of distinct lengths of pointers in T. Our index is also easily implementable in practice because it consists of wavelet trees and a range maximum query index, which are widely used building blocks in many other compact data structures. By compressing the wavelet trees, the index can also be stored in 2nH^*(T) + 2n + o(n) bits, where H^*(T) is the 0-th order empirical entropy of the distribution of pointer lengths of T.
MRSI: A Fast Pattern Matching Algorithm for Anti-virus Applications
Anti-virus applications play an important role in today’s Internet communication security. Virus scanning is usually performed on email, web and file transfer traffic flows at intranet security gateways. The performance of popular anti-virus applications relies on the pattern matching algorithms implemented in these security devices. The growth of network bandwidth and the increase of virus signatures call for high speed and scalable pattern matching algorithms. Motivated by several observations of a real-life virus signature database from Clam-AV, a popular anti-virus application, a fast pattern matching algorithm named MRSI is proposed in this paper. Compared to the current algorithm implemented in Clam-AV, MRSI achieved an 80% to 100% faster virus scanning speed without excessive memory usage.
pBWT: Achieving succinct data structures for parameterized pattern matching and related problems
The fields of succinct data structures and compressed text indexing have seen quite a bit of progress over the last two decades. An important achievement, primarily using techniques based on the Burrows-Wheeler Transform (BWT), was obtaining the full functionality of the suffix tree in the optimal number of bits. A crucial property that allows the use of BWT for designing compressed indexes is order-preserving suffix links. Specifically, the relative order between two suffixes in the subtree of an internal node is the same as that of the suffixes obtained by truncating the first character of the two suffixes. Unfortunately, in many variants of the text-indexing problem, e.g., parameterized pattern matching, 2D pattern matching, and order-isomorphic pattern matching, this property does not hold. Consequently, the compressed indexes based on BWT do not directly apply. Furthermore, a compressed index for any of these variants has been elusive throughout the advancement of the field of succinct data structures. We achieve a positive breakthrough on one such problem, namely the Parameterized Pattern Matching problem. Let T be a text that contains n characters from an alphabet Σ, which is the union of two disjoint sets: Σ_s containing static characters (s-characters) and Σ_p containing parameterized characters (p-characters). A pattern P (also over Σ) matches an equal-length substring S of T iff the s-characters match exactly, and there exists a one-to-one function that renames the p-characters in S to that in P. The task is to find the starting positions (occurrences) of all such substrings S. Previous index [Baker, STOC 1993], known as Parameterized Suffix Tree, requires Θ(n log n) bits of space, and can find all occ occurrences in time O(|P| log σ + occ), where σ = |Σ|. We introduce an n log σ + O(n)-bit index with O(|P| log σ + occ · log n log σ) query time. At the core lies a new BWT-like transform, which we call the Parameterized Burrows-Wheeler Transform (pBWT). The techniques are extended to obtain a succinct index for the Parameterized Dictionary Matching problem of Idury and Schäffer [CPM, 1994].
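The matching rule defined above (s-characters match exactly, p-characters match up to a one-to-one renaming) can be checked naively with the well-known "prev" encoding, where each p-character is replaced by the distance to its previous occurrence. The sketch below is a brute-force matcher for illustration only; it is neither the pBWT index nor the Parameterized Suffix Tree, and the names `prev_encode` and `p_match_positions` are assumptions made here.

```python
def prev_encode(s, static):
    """Encode s: keep s-characters, replace each p-character by the
    distance to its previous occurrence in s (0 for a first occurrence)."""
    last = {}
    out = []
    for i, c in enumerate(s):
        if c in static:
            out.append(c)
        else:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
    return out

def p_match_positions(text, pattern, static):
    """Start positions where pattern parameterized-matches text.

    Each window is re-encoded from scratch, so this takes time
    proportional to len(text) * len(pattern); it is a reference
    check, not an index.
    """
    m = len(pattern)
    target = prev_encode(pattern, static)
    return [i for i in range(len(text) - m + 1)
            if prev_encode(text[i:i + m], static) == target]
```

For example, with static characters `{"a", "b"}`, the pattern `"xayx"` matches the text `"yaxyxazx"` at positions 0 and 4, since both windows encode to the same sequence as the pattern.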
Well-Founded Recursion over Contextual Objects
We present a core programming language that supports writing well-founded structurally recursive functions using simultaneous pattern matching on contextual LF objects and contexts. The main technical tool is a coverage checking algorithm that also generates valid recursive calls. To establish consistency, we define a call-by-value small-step semantics and prove that every well-typed program terminates using a reducibility semantics. Based on the presented methodology, we have implemented a totality checker as part of the programming and proof environment Beluga, where it can be used to establish that a total Beluga program corresponds to a proof.