3,756 research outputs found
ISSDC: Digram Coding Based Lossless Data Compression Algorithm
In this paper, a new lossless data compression method based on digram coding is introduced. The method uses semi-static dictionaries: in a first pass, all of the characters used in the source and its most frequently used two-character blocks (digrams) are found and inserted into a dictionary; compression is performed in a second pass. This two-pass structure is repeated several times, with a particular number of elements inserted into the dictionary in each iteration, until the dictionary is filled. The algorithm (ISSDC: Iterative Semi-Static Digram Coding) also includes mechanisms that can decide the total number of iterations and the dictionary size whenever these values are not given by the user. Our experiments show that ISSDC achieves better compression ratios than LZW/GIF and BPE. It is worse than DEFLATE on text and binary data, but better than PNG (which uses DEFLATE compression) on lossless compression of simple images.
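A minimal Python sketch of the two-pass digram pass the abstract describes (illustrative names; ISSDC's iteration scheme, which grows the dictionary over several such passes, and its automatic dictionary-sizing heuristics are not reproduced here):

    from collections import Counter

    def digram_pass(data: bytes, dict_size: int = 256):
        # Pass 1: collect every character used in the source and the
        # most frequent digrams, and insert them into the dictionary.
        chars = sorted(set(data))
        digrams = Counter(data[i:i + 2] for i in range(len(data) - 1))
        table = [bytes([c]) for c in chars]
        table += [d for d, _ in digrams.most_common(dict_size - len(table))]
        index = {entry: i for i, entry in enumerate(table)}

        # Pass 2: encode greedily, preferring one digram code to two
        # single-character codes.
        codes, i = [], 0
        while i < len(data):
            pair = data[i:i + 2]
            if len(pair) == 2 and pair in index:
                codes.append(index[pair]); i += 2
            else:
                codes.append(index[data[i:i + 1]]); i += 1
        return table, codes

Decoding is a single table lookup per code; iterating the pass with a partially filled dictionary, as ISSDC does, would wrap this function in a loop.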
A Universal Parallel Two-Pass MDL Context Tree Compression Algorithm
Computing problems that handle large amounts of data necessitate the use of lossless data compression for efficient storage and transmission. We present a novel lossless universal data compression algorithm that uses parallel computational units to increase throughput. The input sequence is partitioned into blocks; processing each block independently of the others can accelerate the computation by a factor equal to the number of blocks, but degrades the compression quality. Instead, our approach is to first estimate the minimum description length (MDL) context tree source underlying the entire input, and then encode each of the blocks in parallel based on the MDL source. With this two-pass approach, the compression loss incurred by using more parallel units is insignificant. Our algorithm is work-efficient, i.e., its total computational complexity is of the same order as that of the sequential algorithm. Its redundancy above Rissanen's lower bound on universal compression performance is small, with respect to any context tree source whose maximal depth is suitably bounded. We improve the compression by using different quantizers for the states of the context tree, based on the number of symbols corresponding to those states. Numerical results from a prototype implementation suggest that our algorithm offers a better trade-off between compression and throughput than competing universal data compression algorithms.
Comment: Accepted to the Journal of Selected Topics in Signal Processing special issue on Signal Processing for Big Data (expected publication date June 2015). 10 pages, double column, 6 figures, and 2 tables. arXiv admin note: substantial text overlap with arXiv:1405.6322. Version: Mar 2015: corrected a typo.
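The two-pass structure is easy to sketch. In the Python sketch below, a toy order-1 Markov model stands in for the paper's MDL context tree (a deliberate simplification), the alphabet is assumed binary, the first block is given a fabricated start symbol 0, and only the total code length is computed rather than an actual encoded stream:

    import math
    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def estimate_model(seq):
        # Pass 1 (sequential): estimate one shared model from the whole input.
        return Counter(zip(seq, seq[1:])), Counter(seq[:-1])

    def block_code_length(args):
        # Pass 2 (parallel): code length of one block under the shared model,
        # with Krichevsky-Trofimov-style smoothing for a binary alphabet.
        block, prev, (pairs, ctx) = args
        bits = 0.0
        for c, x in zip([prev] + list(block[:-1]), block):
            bits -= math.log2((pairs[(c, x)] + 0.5) / (ctx[c] + 1.0))
        return bits

    def parallel_code_length(seq, n_blocks=4):
        model = estimate_model(seq)
        size = math.ceil(len(seq) / n_blocks)
        jobs = [(seq[i:i + size], seq[i - 1] if i else 0, model)
                for i in range(0, len(seq), size)]
        with ProcessPoolExecutor() as pool:   # blocks are independent
            return sum(pool.map(block_code_length, jobs))

Because every block is coded against the same model estimated from the entire input, adding parallel units increases throughput without changing the model, which is why the compression loss stays small.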
A Computable Measure of Algorithmic Probability by Finite Approximations with an Application to Integer Sequences
Given the widespread use of lossless compression algorithms to approximate algorithmic (Kolmogorov-Chaitin) complexity, and given that lossless compression algorithms fall short at characterizing patterns other than statistical ones, performing no differently from entropy estimations, here we explore an alternative and complementary approach. We study formal properties of a Levin-inspired measure calculated from the output distribution of small Turing machines. We introduce and justify finite approximations of this measure that have been used in some applications as an alternative to lossless compression algorithms for approximating algorithmic (Kolmogorov-Chaitin) complexity. We provide proofs of the relevant properties of both the measure and its finite approximations, and compare them to Levin's Universal Distribution. We provide error estimations of the approximations with respect to the full measure. Finally, we present an application to integer sequences from the Online Encyclopedia of Integer Sequences which suggests that our AP-based measures may characterize non-statistical patterns, and we report interesting correlations with the textual, function and program description lengths of these sequences.
Comment: As accepted by the journal Complexity (Wiley/Hindawi).
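The coding-theorem step that converts such an output distribution into a complexity value is compact. A minimal Python sketch, assuming a precomputed output-frequency distribution D (the toy values below are made up; in practice D comes from exhaustively running very large sets of small Turing machines):

    import math

    def ap_complexity(x: str, D: dict) -> float:
        # Coding theorem: strings produced often by random small machines
        # have high algorithmic probability, hence low complexity.
        return -math.log2(D[x])

    D = {"0": 0.25, "1": 0.25, "01": 0.10, "0101": 0.01}   # toy numbers
    print(ap_complexity("0", D))      # frequent output -> low complexity
    print(ap_complexity("0101", D))   # rare output -> higher complexity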
Training-free Measures Based on Algorithmic Probability Identify High Nucleosome Occupancy in DNA Sequences
We introduce and study a set of training-free methods, of an information-theoretic and algorithmic-complexity nature, applied to DNA sequences to assess their potential for determining nucleosomal binding sites. We test our measures on well-studied genomic sequences of different sizes drawn from different sources. The measures reveal the known in vivo versus in vitro predictive discrepancies and uncover their potential to pinpoint (high) nucleosome occupancy. We explore different possible signals within and beyond the nucleosome length and find that complexity indices are informative of nucleosome occupancy. We compare against the gold standard (the Kaplan model) and find similar and complementary results, with the main difference that our sequence complexity approach requires no training. For example, for high occupancy, complexity-based scores outperform the Kaplan model at predicting binding, representing a significant advancement in predicting the highest nucleosome occupancy with a training-free approach.
Comment: 8 pages main text (4 figures), 12 total with Supplementary (1 figure).
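As a rough, training-free illustration of scoring a DNA sequence window by window, here is a zlib-based proxy in Python (not the paper's algorithmic-probability measures; 147 bp is the canonical nucleosome-core length):

    import zlib

    def complexity_profile(dna: str, window: int = 147, step: int = 1):
        # Compressed size of each window serves as a training-free
        # complexity score: repetitive windows compress well and score low.
        return [len(zlib.compress(dna[i:i + window].encode(), 9))
                for i in range(0, len(dna) - window + 1, step)]

    print(complexity_profile("A" * 200)[:3])          # low-complexity tract
    print(complexity_profile("ACGTTGCA" * 25)[:3])    # more varied sequence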
Correlation of Automorphism Group Size and Topological Properties with Program-size Complexity Evaluations of Graphs and Complex Networks
We show that numerical approximations of Kolmogorov complexity (K) applied to graph adjacency matrices capture some group-theoretic and topological properties of graphs and empirical networks, ranging from metabolic to social networks. That K and the size of the automorphism group of a graph are correlated opens up interesting connections to problems in computational geometry, and thus connects several measures and concepts from complexity science. We show that approximations of K characterise synthetic and natural networks by their generating mechanisms, assigning lower algorithmic randomness to complex network models (Watts-Strogatz and Barabasi-Albert networks) and high Kolmogorov complexity to (random) Erdos-Renyi graphs. We derive these results via two different Kolmogorov complexity approximation methods applied to the adjacency matrices of the graphs and networks: the traditional lossless compression approach to Kolmogorov complexity, and a normalised version of the Block Decomposition Method (BDM), a measure based on algorithmic probability theory.
Comment: 15 two-column pages, 20 figures. Forthcoming in Physica A: Statistical Mechanics and its Applications.
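The traditional lossless-compression approximation mentioned above can be sketched directly in Python: flatten the adjacency matrix to a bit string and take its compressed size (zlib here is an illustrative stand-in; the paper's normalised BDM measure is computed differently):

    import random, zlib

    def graph_complexity(adj) -> int:
        # Compressed size of the flattened adjacency matrix approximates K.
        n = len(adj)
        bits = "".join("1" if adj[i][j] else "0"
                       for i in range(n) for j in range(n))
        return len(zlib.compress(bits.encode(), 9))

    def erdos_renyi(n, p, seed=0):
        rng, adj = random.Random(seed), [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if rng.random() < p:
                    adj[i][j] = adj[j][i] = 1
        return adj

    ring = [[1 if abs(i - j) in (1, 99) else 0 for j in range(100)]
            for i in range(100)]                     # regular ring lattice
    print(graph_complexity(ring))                    # compresses well: low K
    print(graph_complexity(erdos_renyi(100, 0.02)))  # random: higher K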