
5 research outputs found

Occupancy distributions in Markov chains via Doeblin's ergodicity coefficient

Authors: Chestnut, Stephen; Lladser, Manuel
Publication date: 01/01/2010

We apply Doeblin's ergodicity coefficient as a computational tool to approximate the occupancy distribution of a set of states in a homogeneous but possibly non-stationary finite Markov chain. Our approximation is based on new properties satisfied by this coefficient, which allow us to approximate a chain of duration n by independent and short-lived realizations of an auxiliary homogeneous Markov chain of duration of order ln(n). Our approximation may be particularly useful when exact calculations via first-step methods or transfer matrices are impractical and asymptotic approximations may not yet be reliable. Our findings may find applications in pattern problems in Markovian and non-Markovian sequences that are treatable via embedding techniques.

Comment: 12 pages, 2 tables
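For readers unfamiliar with the coefficient named in this abstract, the sketch below assumes the commonly used definition of Doeblin's ergodicity coefficient of a stochastic matrix P, namely the sum over columns of the smallest entry in each column. The function name and the example matrix are illustrative assumptions and are not taken from the paper; the paper's approximation scheme itself is not reproduced here.

```python
def doeblin_coefficient(P):
    """Return sum over columns j of min over rows i of P[i][j].

    P is a row-stochastic matrix given as a list of equal-length rows.
    The value lies in [0, 1]; it equals 1 when all rows are identical.
    """
    n_cols = len(P[0])
    return sum(min(row[j] for row in P) for j in range(n_cols))


# Hypothetical 3-state transition matrix (each row sums to 1).
P = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]

# Column minima are 0.2, 0.3 and 0.2, so alpha is approximately 0.7.
alpha = doeblin_coefficient(P)
print(alpha)
```

A value close to 1 indicates that every state moves to a common distribution with high probability in one step, which is the kind of fast forgetting that makes short auxiliary chains plausible as an approximation.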

arXiv.org e-Print Archive

CiteSeerX

An Algorithm to Compute the Character Access Count Distribution for Pattern Matching Algorithms

Authors: Marschall, Tobias; Rahmann, Sven
Publication venue: MDPI AG
Publication date: 01/10/2011

We propose a framework for the exact probabilistic analysis of window-based pattern matching algorithms, such as Boyer-Moore, Horspool, Backward DAWG Matching, Backward Oracle Matching, and more. In particular, we develop an algorithm that efficiently computes the distribution of a pattern matching algorithm's running time cost (such as the number of text character accesses) for any given pattern in a random text model. Text models range from simple uniform models to higher-order Markov models or hidden Markov models (HMMs). Furthermore, we provide an algorithm to compute the exact distribution of differences in running time cost of two pattern matching algorithms. Methodologically, we use extensions of finite automata which we call deterministic arithmetic automata (DAAs) and probabilistic arithmetic automata (PAAs) [Marschall2008]. Given an algorithm, a pattern, and a text model, a PAA is constructed from which the sought distributions can be derived using dynamic programming. To our knowledge, this is the first time that substring- or suffix-based pattern matching algorithms have been analyzed exactly by computing the whole distribution of running time cost. Experimentally, we compare Horspool's algorithm, Backward DAWG Matching, and Backward Oracle Matching on prototypical patterns of short length and provide statistics on the size of minimal DAAs for these computations.

CWI's Institutional Repository

Regexpcount, a Symbolic Package for Counting Problems on Regular Expressions and Words

Author: Pierre Nicodème

In previous work [10], we considered algorithms related to the statistics of matches with words and regular expressions in texts generated by Bernoulli or Markov sources. In this work these algorithms are extended for two purposes: to determine the statistics of simultaneous counting of different motifs, and to compute the waiting time for the first match with a motif in a model which may be constrained. This extension also handles matches with errors. The package is fully implemented and gives access to high- and low-level commands. We also consider an example corresponding to a practical biological problem: getting the statistics for the number of matches of words of size 8 in a genome (a Markovian sequence), knowing that an (overrepresented DNA protecting) pattern named Chi occurs a given number of times.
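To make the counting problem in this abstract concrete, the sketch below computes the exact distribution of the number of (possibly overlapping) occurrences of a single word in a text generated by a Bernoulli source. It uses a pattern-matching automaton combined with dynamic programming, which is a swapped-in illustrative technique and not the symbolic method of the Regexpcount package. All names (`kmp_automaton`, `match_count_distribution`) and the example parameters are assumptions introduced here.

```python
from collections import defaultdict


def kmp_automaton(word, alphabet):
    """Transition table delta[state][char] of the matching automaton.

    State k means: the longest suffix of the text read so far that is
    also a prefix of `word` has length k. State len(word) is a match.
    """
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for c in alphabet:
            s = word[:state] + c
            k = min(m, len(s))
            # Shrink k until word[:k] is a suffix of s (k = 0 always works).
            while k > 0 and s[-k:] != word[:k]:
                k -= 1
            delta[state][c] = k
    return delta


def match_count_distribution(word, probs, n):
    """Distribution of the occurrence count of `word` in an i.i.d. text.

    probs maps each character to its probability; n is the text length.
    Returns a dict {count: probability}.
    """
    m = len(word)
    delta = kmp_automaton(word, list(probs))
    dist = {(0, 0): 1.0}  # (automaton state, matches so far) -> probability
    for _ in range(n):
        nxt = defaultdict(float)
        for (state, count), p in dist.items():
            for c, pc in probs.items():
                t = delta[state][c]
                nxt[(t, count + (1 if t == m else 0))] += p * pc
        dist = nxt
    out = defaultdict(float)
    for (_, count), p in dist.items():
        out[count] += p
    return dict(out)


# Hypothetical example: word "aa" in a uniform binary text of length 3.
# Of the 8 texts, "aaa" has 2 matches, "aab" and "baa" have 1, the rest 0.
result = match_count_distribution("aa", {"a": 0.5, "b": 0.5}, 3)
print(result)
```

Replacing the per-character probabilities with transition probabilities that depend on the previous character would extend the same dynamic program to a first-order Markov source, at the cost of tracking the last character in the state.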

CiteSeerX

Regexpcount, a Symbolic Package for Counting Problems on Regular Expressions and Words

Author: Pierre Nicodème

In previous work (Nicodème et al., 1999), we considered algorithms related to the statistics of word occurrences and regular expression occurrences in texts generated by Bernoulli or Markov sources. In this work these algorithms are extended for two purposes: to determine the statistics of simultaneous counting of different motifs, and to compute the waiting time for the first match with a motif in a model which may be constrained. This extension also handles matches with errors. The package is fully implemented and gives access to high- and low-level commands. We also consider an example corresponding to a practical biological problem: getting the statistics for the number of matches of words of size 8 in a genome (a Markovian sequence), knowing that an (overrepresented DNA protecting) Chi pattern occurs a given number of times.

CiteSeerX