11 research outputs found

    The effect of flexible parsing for dynamic dictionary-based data compression

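    A simple implementation of the Lempel-Ziv-Yokoo data compression method and an experimental analysis of its redundancy (original title: Lempel-Ziv-Yokoo データ圧縮法の簡素な実現と冗長度の実験的解析)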

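    The Lempel-Ziv (LZ) algorithm is a practical dictionary encoder proposed by Ziv and Lempel in 1978. Lempel-Ziv-Yokoo (LZY) is a simple lossless data compression method that uses an incremental dictionary tree, and it learns more efficiently than the similar LZ78 method. In this study, we incorporate a post-processing method into the LZY algorithm, change how the incremental dictionary tree is built, and give a simple implementation that updates the incremental dictionary tree one bit at a time. The dictionary construction, encoding, and decoding are first implemented with enumerative coding, and a simple post-processing mechanism is then applied. The result is a method with high duality between encoder and decoder and low program complexity. Regarding the redundancy of the LZY compression method, two problems remain open: its theoretical analysis and its experimental analysis. The aim of this study is therefore to carry out an experimental analysis of the redundancy. Chapter 2 describes how the LZY algorithm is implemented (dictionary construction, encoding, decoding). Chapter 3 describes the structure of the source code for the LZY algorithm. Chapter 4 presents the experimental analysis of the redundancy: the redundancy obtained so far by theoretical analysis is O(log log N / log N), but the experimental results clearly indicate O(1/log N). Finally, Chapter 5 discusses the theoretical analysis of the redundancy of the LZY algorithm as a basis for future research. The University of Electro-Communications, 201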

    On the variance of a class of inductive valuations of data structures for digital search

    Let an inductive valuation $L$ on the family of binary tries, Patricia tries, or digital search trees be defined in the following way: $L(t) = L(t_l) + L(t_r) + R(t)$, where $t_l$ and $t_r$ denote the left and right subtrees of $t$ and $R$ depends only on the size (the number of records) $|t|$ of $t$. Let $L_N$ denote $L$ restricted to the trees of size $N$. In Theorem 1 we give sufficient conditions on the sequence $r_{|t|} := R(t)$ for the variance $\operatorname{Var} L_N$ to be of exact order $N$, if the family of tries (resp. Patricia tries, resp. digital search trees) is equipped with the Bernoulli model. For the symmetric Bernoulli model we prove the existence of a continuous periodic function $\delta$ with period 1, such that $\operatorname{Var} L_N \sim \delta(\log_2 N) \cdot N$ holds.
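
    The recurrence in this abstract can be illustrated for binary tries with a small sketch. It is not taken from the paper: the function name `trie_valuation`, the `toll` parameter (standing in for $R$), and the convention that trees of size 0 or 1 contribute 0 are assumptions made for illustration. Keys are assumed to be distinct bit strings of equal length.

```python
def trie_valuation(keys, toll, depth=0):
    """Evaluate L(t) = L(t_l) + L(t_r) + R(t) on the binary trie of `keys`.

    `toll(n)` plays the role of R for a subtree holding n records.
    Assumption: subtrees with 0 or 1 records are leaves and contribute 0.
    """
    n = len(keys)
    if n <= 1:
        return 0
    # Split the records on the bit at the current depth.
    left = [k for k in keys if k[depth] == "0"]
    right = [k for k in keys if k[depth] == "1"]
    return (trie_valuation(left, toll, depth + 1)
            + trie_valuation(right, toll, depth + 1)
            + toll(n))


keys = ["000", "001", "100"]
external_path_length = trie_valuation(keys, lambda n: n)  # toll = subtree size
internal_nodes = trie_valuation(keys, lambda n: 1)        # toll = 1 per internal node
print(external_path_length, internal_nodes)
```

    With the toll equal to the subtree size the valuation is the external path length of the trie, and with a constant toll of 1 it is the number of internal nodes, two standard parameters that fit this scheme.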

    Distribution of tail symbols in a search tree with Markov sources (original title: Distribution des symboles finaux dans un arbre de recherche avec des sources de Markov)

    Lempel-Ziv'78 is one of the most popular data compression algorithms on words. Over the last decades we have uncovered its fascinating behavior and come to understand many of its beautiful properties better. Among others, in 1995, by settling the Ziv conjecture, we proved that for a memoryless source (i.e., when a sequence is generated by a source without memory) the number of LZ'78 phrases satisfies the Central Limit Theorem (CLT). Since then the quest has been to extend this result to Markov sources; despite several attempts, the problem is still open. In this conference paper, we revisit the issue and focus on a much simpler, but not trivial, problem that may lead to the resolution of the LZ'78 dilemma. We consider the associated Digital Search Tree (DST) version of the problem, in which the DST is built over a fixed number of Markov-generated sequences. In such a model we count the number of so-called "tail symbols", that is, the symbols that follow the last inserted symbol. Our goal is to analyze this new quantity under the Markovian assumption, since it plays a crucial role in the analysis of the original LZ'78 problem. We establish the mean, the variance, and the central limit theorem for the number of tail symbols. We accomplish this by applying techniques of analytic combinatorics on words, also known as analytic pattern matching.

    The expected profile of digital search trees

    A digital search tree (DST) is a fundamental data structure on words that finds various applications, from the popular Lempel-Ziv'78 data compression scheme to distributed hash tables. The profile of a DST measures the number of nodes at the same distance from the root; it depends on the number of stored strings and the distance from the root. Most parameters of a DST (e.g., depth, height, fillup) can be expressed in terms of the profile. We study here the asymptotics of the average profile in a DST built from sequences generated independently by a memoryless source. After representing the average profile by a recurrence, we solve it using a wide range of analytic tools. This analysis is surprisingly demanding, but once it is carried out it reveals an unusually intriguing and interesting behavior. The average profile undergoes phase transitions when moving from the root to the longest path: at first it resembles a full tree until it abruptly starts growing polynomially and oscillating in this range. These results are derived by methods of analytic combinatorics such as generating functions, the Mellin transform, poissonization and depoissonization, the saddle point method, singularity analysis, and uniform asymptotic analysis.
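
    To make the notion of a profile concrete, the following sketch builds a DST from binary strings and counts the nodes at each depth. It is an illustration only, not the paper's analysis: the names `dst_profile` and `memoryless_strings` and the parameters `p` and `seed` are hypothetical, and the strings are assumed to be long enough that every insertion finds a free node before running out of bits.

```python
import random
from collections import Counter


def dst_profile(strings):
    """Insert bit strings into a digital search tree; return depth -> node count.

    A node is identified by the bit path leading to it; the root is "".
    Assumption: each string has enough bits to reach a free position.
    """
    occupied = set()
    profile = Counter()
    for s in strings:
        path = ""
        # Follow the string's bits until an unoccupied position is reached.
        while path in occupied:
            path += s[len(path)]
        occupied.add(path)
        profile[len(path)] += 1
    return profile


def memoryless_strings(n, length, p, seed=0):
    """Generate n bit strings whose bits are independently '1' with probability p."""
    rng = random.Random(seed)
    return ["".join("1" if rng.random() < p else "0" for _ in range(length))
            for _ in range(n)]


profile = dst_profile(memoryless_strings(n=200, length=64, p=0.3))
for depth in sorted(profile):
    print(depth, profile[depth])
```

    Averaging such profiles over many seeds gives an empirical counterpart to the average profile studied in the abstract; level `d` can hold at most `2**d` nodes, which is the "full tree" regime near the root.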

    String Sanitization Under Edit Distance: Improved and Generalized

    Let $W$ be a string of length $n$ over an alphabet $\Sigma$, $k$ be a positive integer, and $S$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $S$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $\Sigma$ is the same in $W$ and in $X_{\mathrm{ED}}$; and (iii) $X_{\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and $S$ represents a set of confidential patterns, the ETFS problem asks for transforming $W$ to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in $O(n^2 k)$ time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in $O(n^{2-\delta})$ time, for any $\delta > 0$, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an $O(n^2 \log^2 k)$-time algorithm to solve ETFS; and (ii) an $O(n^2 \log^2 n)$-time algorithm to solve AETFS, a generalization of ETFS in which the elements of $S$ can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Let us also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, to be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
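
    Condition (iii) relies on edit distance, possibly with arbitrary weights. As background only, here is the textbook quadratic-time dynamic program for weighted edit distance. This is not the ETFS or AETFS algorithm from the paper; the function name and the cost parameters `ins`, `dele`, and `sub` are assumptions chosen for illustration.

```python
def weighted_edit_distance(a, b, ins=1, dele=1, sub=1):
    """Standard dynamic program for edit distance with per-operation weights.

    Cost of turning `a` into `b` using insertions (ins), deletions (dele),
    and substitutions (sub); matching characters cost 0.
    """
    # prev[j] = cost of turning the current prefix of a into b[:j]
    prev = [j * ins for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * dele]
        for j in range(1, len(b) + 1):
            match_or_sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else sub)
            cur.append(min(prev[j] + dele,     # delete a[i-1]
                           cur[j - 1] + ins,   # insert b[j-1]
                           match_or_sub))
        prev = cur
    return prev[-1]


print(weighted_edit_distance("kitten", "sitting"))
print(weighted_edit_distance("kitten", "sitting", sub=3))
```

    Raising the substitution weight above the cost of a deletion plus an insertion makes the program prefer indel pairs, which is the kind of weighting flexibility the abstract refers to.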

    Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital Search Trees

    The Lempel-Ziv parsing scheme finds a wide range of applications, most notably in data compression and algorithms on words. It partitions a sequence of length $n$ into variable-length phrases such that each new phrase is the shortest substring not seen in the past as a phrase. The parameter of interest is the number $M_n$ of phrases that one can construct from a sequence of length $n$. In this paper, for the memoryless source with unequal probabilities of symbol generation, we derive the limiting distribution of $M_n$, which turns out to be normal. This settles a long-standing open problem. In fact, to obtain this result we solved another open problem, namely, that of establishing the limiting distribution of the internal path length in a digital search tree. The latter is a consequence of an asymptotic solution of a multiplicative differential-functional equation often arising in the analysis of algorithms on words. Interestingly enough, our findings are proved by a combination of probabilistic techniques, such as the renewal equation and uniform integrability, and analytical techniques, such as the Mellin transform, differential-functional equations, de-Poissonization, and so forth. In concluding remarks we indicate a possibility of extending our results to Markovian models.
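
    The parsing rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the paper: the function name `lz78_phrases` is hypothetical, and the final phrase is allowed to repeat an earlier one when the input ends mid-phrase.

```python
def lz78_phrases(sequence):
    """Split `sequence` into phrases, each the shortest prefix of the
    remaining input that has not yet appeared as a phrase."""
    seen = set()
    phrases = []
    current = ""
    for symbol in sequence:
        current += symbol
        if current not in seen:
            # First time this string appears as a phrase: close it here.
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Input ended inside a phrase; keep the incomplete remainder.
        phrases.append(current)
    return phrases


phrases = lz78_phrases("1011010100010")
print(phrases)       # ['1', '0', '11', '01', '010', '00', '10']
print(len(phrases))  # M_n for this sequence: 7
```

    Here `len(phrases)` corresponds to $M_n$ for one input; the abstract concerns the distribution of this count when the input is drawn from a memoryless source.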