The asymptotic number of prefix normal words
We show that the number of prefix normal binary words of length $n$ is
$2^{n-\Theta((\log n)^2)}$. We also show that the maximum number of binary
words of length $n$ with a given fixed prefix normal form is
$2^{n-O(\sqrt{n\log n})}$. Comment: 9 pages
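As a small illustration of the second quantity, the sketch below groups all binary words of length $n$ by their prefix normal form and reports the class sizes. The abstract does not define the prefix normal form; the code assumes the usual definition, in which the form of a word $w$ is the word whose prefix of length $k$ contains as many 1s as the richest length-$k$ substring of $w$. The function names are hypothetical, and the brute-force enumeration is only practical for small $n$.

```python
from collections import Counter
from itertools import product


def max_ones(word, k):
    """Largest number of 1s in any substring of `word` of length k."""
    return max(word[i:i + k].count("1") for i in range(len(word) - k + 1))


def prefix_normal_form(word):
    """Prefix normal form under the assumed standard definition.

    The k-th letter is 1 exactly when the maximum number of 1s over
    length-k substrings exceeds the maximum over length-(k-1) substrings.
    """
    f = [0] + [max_ones(word, k) for k in range(1, len(word) + 1)]
    return "".join(str(f[k] - f[k - 1]) for k in range(1, len(word) + 1))


def class_sizes(n):
    """Map each prefix normal form of length n to the number of words having it."""
    words = ("".join(bits) for bits in product("01", repeat=n))
    return Counter(prefix_normal_form(w) for w in words)


if __name__ == "__main__":
    for n in range(1, 9):
        sizes = class_sizes(n)
        print(n, len(sizes), max(sizes.values()))
```

The number of distinct keys is the number of prefix normal words of length $n$, and the largest value is the maximum class size that the second bound concerns.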
Algorithms and Data Structures for Coding, Indexing, and Mining of Sequential Data
In recent years, the production of sequential data has been increasing rapidly. This requires solving challenging problems about how to represent information, how to retrieve information, and how to extract knowledge from sequential data. These questions belong to the areas of coding, indexing, and mining, respectively. In this thesis, we investigate problems from those three areas. Coding refers to the way in which information is represented. Coding aims at generating optimal codes, that is, codes with minimum expected length. Codes can be generated for different purposes, from data compression to error detection and correction. The Lempel-Ziv 77 parsing produces an asymptotically optimal code in terms of compression. We study algorithms to efficiently decompress strings from the Lempel-Ziv 77 parsing, using memory proportional to the size of the parsing itself. We provide the first implementation of an algorithm by Bille et al., the only work we are aware of on this problem. We present a practical evaluation of this approach and several optimizations which improve the performance on all datasets we tested. Through the Ulam-Rényi game, it is possible to provide optimal adaptive error-correcting codes. The game consists of discovering an unknown $m$-bit number by asking membership questions, the answers to which can be erroneous. Questions are formulated knowing the answers to all previous ones. We want to find an optimal strategy, i.e., a strategy that can identify any $m$-bit number using the theoretical minimum number of questions. We studied the case where questions are a union of up to a fixed number of intervals, and up to three answers can be erroneous. We first show that for any sufficiently large $m$, there exists a strategy to identify an initially unknown $m$-bit number which uses at most four intervals per question.
We further refine our main tool to turn the above asymptotic result into a complete characterization of those instances of the Ulam-Rényi game that admit optimal strategies. Indexing refers to the way in which information is retrieved. An index for texts permits finding all occurrences of any substring without traversing the whole text. Many applications require searching for approximate substrings. One of these is the problem of jumbled pattern matching, where two strings match if one is a permutation of the other. We study combinatorial aspects of prefix normal words, a class of binary words introduced in this context. These words can be used as indices for the Indexed Binary Jumbled Pattern Matching problem. We present a new recursive generation algorithm for prefix normal words that is competitive with the previous one but allows listing all prefix normal words sharing the same prefix. This yields new insights that may help solve the problem of counting the prefix normal words of a given length. We then introduce infinite prefix normal words, and we show that one of the operations used by the algorithm, when repeatedly applied to extend a word, produces an infinite prefix normal word. This motivates the search for other operations that produce infinite prefix normal words. We found that one of these operations establishes a connection between prefix normal words and Sturmian words. We also explored the relationship between prefix normal words and Abelian complexity, as well as between prefix normal words and lexicographic order. Mining refers to the way in which information is converted into knowledge. The process of knowledge discovery covers several processing steps, including knowledge extraction. We analyze the problem of mining assertions for an embedded system from its simulation traces. This problem can be modeled as a pattern discovery problem on colored strings.
We present two problems of pattern discovery on colored strings: patterns for one color only, or for all colors at the same time. We present two suffix tree-based algorithms. The first algorithm solves both the one color problem and the all colors problem. We then introduce modifications which improve the performance of the algorithm on both synthetic and real data. We implemented and evaluated the proposed approaches, highlighting time trade-offs that can be obtained. A different way of knowledge extraction is based on the information-theoretic perspective of Pearl's model of causality. It has been postulated that the true causality direction between two phenomena A and B is related to the problem of finding the minimum entropy joint distribution between A and B. This problem is known to be NP-hard, and greedy algorithms have recently been proposed. We provide a novel analysis of one of the proposed heuristics, showing that this algorithm guarantees an additive approximation of 1 bit. We then provide a general criterion for guaranteeing an additive approximation factor of 1. This criterion may be of independent interest in other contexts where couplings are used.
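To make the Lempel-Ziv 77 decompression task from the coding part concrete, here is a minimal sketch of a naive decoder. It is not the algorithm by Bille et al. that the thesis implements: it keeps the whole output in memory, so it does not meet the goal of using space proportional to the parsing. The phrase encoding (a literal string, or a hypothetical `(source_position, length)` pair that copies from already decoded text) is an assumption chosen for illustration.

```python
def lz77_decode(phrases):
    """Decode a list of LZ77-style phrases into a string.

    Each phrase is either a literal string or a pair (source_position, length)
    that copies `length` characters starting at `source_position` in the text
    decoded so far. Copies may overlap the text they are producing.
    """
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.extend(phrase)
        else:
            source, length = phrase
            # Copy one character at a time so that self-overlapping
            # phrases (source + length > len(out)) work correctly.
            for i in range(length):
                out.append(out[source + i])
    return "".join(out)


if __name__ == "__main__":
    print(lz77_decode(["a", "b", (0, 2)]))
    print(lz77_decode(["a", (0, 5)]))
```

The character-by-character copy is what allows a phrase to refer to text that it is itself still producing, as in the second call.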
Canonical Trees, Compact Prefix-free Codes and Sums of Unit Fractions: A Probabilistic Analysis
For fixed $t\ge 2$, we consider the class of representations of $1$ as sum of
unit fractions whose denominators are powers of $t$ or equivalently the class
of canonical compact $t$-ary Huffman codes or equivalently rooted $t$-ary plane
"canonical" trees. We study the probabilistic behaviour of the height (limit
distribution is shown to be normal), the number of distinct summands (normal
distribution), the path length (normal distribution), the width (main term of
the expectation and concentration property) and the number of leaves at maximum
distance from the root (discrete distribution).
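The equivalence above can be checked on small examples with exact arithmetic. The sketch below takes a list of exponents (equivalently, code word lengths or leaf depths), tests whether the corresponding unit fractions sum to exactly 1, and reads off some of the parameters the abstract studies. The function names are hypothetical, and "path length" is taken here to mean the sum of the leaf depths, which is an assumption and may differ from the paper's exact definition.

```python
from fractions import Fraction


def is_compact(depths, t):
    """True if the sum of 1 / t**d over all d in `depths` equals exactly 1."""
    return sum(Fraction(1, t ** d) for d in depths) == 1


def parameters(depths):
    """Parameters of the representation, read off from the exponents."""
    height = max(depths)
    return {
        "height": height,
        "distinct_summands": len(set(depths)),
        "path_length": sum(depths),  # assumed: sum of leaf depths
        "leaves_at_max_depth": depths.count(height),
    }


if __name__ == "__main__":
    binary = [1, 2, 3, 3]      # 1/2 + 1/4 + 1/8 + 1/8
    ternary = [1, 1, 2, 2, 2]  # 1/3 + 1/3 + 1/9 + 1/9 + 1/9
    print(is_compact(binary, 2), parameters(binary))
    print(is_compact(ternary, 3), parameters(ternary))
```

Using `Fraction` avoids floating-point rounding, so the equality test is exact.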
Normal, Abby Normal, Prefix Normal
A prefix normal word is a binary word with the property that no substring has
more 1s than the prefix of the same length. This class of words is important in
the context of binary jumbled pattern matching. In this paper we present
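results about the number $pnw(n)$ of prefix normal words of length $n$, showing
that $pnw(n) = \Omega\left(2^{n - c\sqrt{n\ln n}}\right)$ for some $c$ and
$pnw(n) = O\left(\frac{2^n (\ln n)^2}{n}\right)$. We introduce efficient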
algorithms for testing the prefix normal property and a "mechanical algorithm"
for computing prefix normal forms. We also include games which can be played
with prefix normal words. In these games Alice wishes to stay normal but Bob
wants to drive her "abnormal" -- we discuss which parameter settings allow
Alice to succeed. Comment: Accepted at FUN '1
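The definition in the first sentence translates directly into a test. The sketch below is a straightforward quadratic-time check based on prefix sums, together with a brute-force counter for small lengths. It is not one of the efficient algorithms the paper introduces, and the function names are hypothetical.

```python
from itertools import product


def is_prefix_normal(word):
    """True if no substring of `word` has more 1s than the prefix of the same length."""
    n = len(word)
    # ones[i] = number of 1s among the first i letters.
    ones = [0] * (n + 1)
    for i, letter in enumerate(word):
        ones[i + 1] = ones[i] + (letter == "1")
    for k in range(1, n + 1):
        in_prefix = ones[k]
        # Compare every other window of length k with the prefix.
        if any(ones[i + k] - ones[i] > in_prefix for i in range(1, n - k + 1)):
            return False
    return True


def count_prefix_normal(n):
    """Number of prefix normal words of length n, by exhaustive enumeration."""
    return sum(is_prefix_normal("".join(bits)) for bits in product("01", repeat=n))


if __name__ == "__main__":
    print(is_prefix_normal("1101"), is_prefix_normal("1011"))
    print([count_prefix_normal(n) for n in range(1, 11)])
```

For example, `1011` fails the test because the substring `11` has two 1s while the prefix `10` has only one.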
Multiplicative measures on free groups
We introduce a family of atomic measures on free groups generated by
no-return random walks. These measures are shown to be very convenient for
comparing "relative sizes" of subgroups, context-free and regular subsets
(that is, subsets generated by finite automata) of free groups. Many asymptotic
characteristics of subsets and subgroups are naturally expressed as analytic
properties of related generating functions. We introduce a hierarchy of
asymptotic behaviour "at infinity" of subsets in the free groups, more
sensitive than the traditionally used asymptotic density, and apply it to
normal subgroups and regular subsets. Comment: LaTeX, requires amssymb.sty; 31 pp. Version 3: more detail in Example
2 and Tauberian theorem
Distributional convergence for the number of symbol comparisons used by QuickSort
Most previous studies of the sorting algorithm QuickSort have used the number
of key comparisons as a measure of the cost of executing the algorithm. Here we
suppose that the n independent and identically distributed (i.i.d.) keys are
each represented as a sequence of symbols from a probabilistic source and that
QuickSort operates on individual symbols, and we measure the execution cost as
the number of symbol comparisons. Assuming only a mild "tameness" condition on
the source, we show that there is a limiting distribution for the number of
symbol comparisons after normalization: first centering by the mean and then
dividing by n. Additionally, under a condition that grows more restrictive as p
increases, we have convergence of moments of orders p and smaller. In
particular, we have convergence in distribution and convergence of moments of
every order whenever the source is memoryless, that is, whenever each key is
generated as an infinite string of i.i.d. symbols. This is somewhat surprising;
even for the classical model that each key is an i.i.d. string of unbiased
("fair") bits, the mean exhibits periodic fluctuations of order n. Comment: Published at http://dx.doi.org/10.1214/12-AAP866 in the Annals of
Applied Probability (http://www.imstat.org/aap/) by the Institute of
Mathematical Statistics (http://www.imstat.org
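The cost measure described in this abstract can be illustrated with a short experiment. The sketch below runs a simple first-element-pivot QuickSort on keys given as bit strings and counts every symbol comparison made while comparing two keys. Representing each key by a finite string of fair bits is an assumption made for illustration (the abstract's memoryless model uses infinite strings), and all names are hypothetical.

```python
import random


def compare_keys(a, b, counter):
    """Three-way lexicographic comparison that counts symbol comparisons.

    `counter` is a one-element list used as a mutable tally.
    """
    for x, y in zip(a, b):
        counter[0] += 1
        if x != y:
            return -1 if x < y else 1
    # One key is a prefix of the other (or they are equal).
    return (len(a) > len(b)) - (len(a) < len(b))


def quicksort(keys, counter):
    """QuickSort with the first key as pivot; returns a new sorted list."""
    if len(keys) <= 1:
        return list(keys)
    pivot, rest = keys[0], keys[1:]
    smaller, larger = [], []
    for key in rest:
        if compare_keys(key, pivot, counter) < 0:
            smaller.append(key)
        else:
            larger.append(key)
    return quicksort(smaller, counter) + [pivot] + quicksort(larger, counter)


def symbol_comparisons(n, length=64, seed=0):
    """Symbol comparisons used to sort n random fair-bit keys of a given length."""
    rng = random.Random(seed)
    keys = ["".join(rng.choice("01") for _ in range(length)) for _ in range(n)]
    counter = [0]
    quicksort(keys, counter)
    return counter[0]


if __name__ == "__main__":
    for n in (100, 200, 400):
        print(n, symbol_comparisons(n))
```

Each key comparison costs at least one symbol comparison when the keys are non-empty, so the tally is never smaller than the usual key-comparison count.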