3,756 research outputs found
ISSDC: Digram Coding Based Lossless Data Compression Algorithm
In this paper, a new lossless data compression method based on digram coding is introduced. The method uses semi-static dictionaries: in a first pass, all of the characters used in the source and its most frequently used two-character blocks (digrams) are found and inserted into a dictionary; compression is performed in a second pass. This two-pass structure is repeated several times, with a particular number of elements inserted into the dictionary in each iteration, until the dictionary is filled. The algorithm (ISSDC: Iterative Semi-Static Digram Coding) also includes mechanisms that can decide the total number of iterations and the dictionary size whenever these values are not given by the user. Our experiments show that ISSDC achieves better compression ratios than LZW/GIF and BPE. It is worse than DEFLATE on text and binary data, but better than PNG (which uses DEFLATE compression) on lossless compression of simple images.
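A minimal Python sketch of the two-pass digram pass the abstract describes (illustrative names; ISSDC's iteration scheme, which grows the dictionary over several such passes, and its automatic dictionary-sizing heuristics are not reproduced here):

    from collections import Counter

    def digram_pass(data: bytes, dict_size: int = 256):
        # Pass 1: collect every character used in the source and the
        # most frequent digrams, and insert them into the dictionary.
        chars = sorted(set(data))
        digrams = Counter(data[i:i + 2] for i in range(len(data) - 1))
        table = [bytes([c]) for c in chars]
        table += [d for d, _ in digrams.most_common(dict_size - len(table))]
        index = {entry: i for i, entry in enumerate(table)}

        # Pass 2: encode greedily, preferring one digram code to two
        # single-character codes.
        codes, i = [], 0
        while i < len(data):
            pair = data[i:i + 2]
            if len(pair) == 2 and pair in index:
                codes.append(index[pair]); i += 2
            else:
                codes.append(index[data[i:i + 1]]); i += 1
        return table, codes

Decoding is a single table lookup per code; iterating the pass with a partially filled dictionary, as ISSDC does, would wrap this function in a loop.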
A Universal Parallel Two-Pass MDL Context Tree Compression Algorithm
Computing problems that handle large amounts of data necessitate the use of lossless data compression for efficient storage and transmission. We present a novel lossless universal data compression algorithm that uses parallel computational units to increase throughput. The input sequence is partitioned into blocks; processing each block independently of the others can accelerate the computation by a factor equal to the number of blocks, but degrades the compression quality. Instead, our approach is to first estimate the minimum description length (MDL) context tree source underlying the entire input, and then encode each of the blocks in parallel based on the MDL source. With this two-pass approach, the compression loss incurred by using more parallel units is insignificant. Our algorithm is work-efficient, i.e., its total computational complexity is of the same order as that of the sequential algorithm. Its redundancy above Rissanen's lower bound on universal compression performance is small, with respect to any context tree source whose maximal depth is suitably bounded. We improve the compression by using different quantizers for the states of the context tree, based on the number of symbols corresponding to those states. Numerical results from a prototype implementation suggest that our algorithm offers a better trade-off between compression and throughput than competing universal data compression algorithms.
Comment: Accepted to the Journal of Selected Topics in Signal Processing special issue on Signal Processing for Big Data (expected publication date June 2015). 10 pages, double column, 6 figures, and 2 tables. arXiv admin note: substantial text overlap with arXiv:1405.6322. Version: Mar 2015: corrected a typo.
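The two-pass structure is easy to sketch. In the Python sketch below, a toy order-1 Markov model stands in for the paper's MDL context tree (a deliberate simplification), the alphabet is assumed binary, the first block is given a fabricated start symbol 0, and only the total code length is computed rather than an actual encoded stream:

    import math
    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def estimate_model(seq):
        # Pass 1 (sequential): estimate one shared model from the whole input.
        return Counter(zip(seq, seq[1:])), Counter(seq[:-1])

    def block_code_length(args):
        # Pass 2 (parallel): code length of one block under the shared model,
        # with Krichevsky-Trofimov-style smoothing for a binary alphabet.
        block, prev, (pairs, ctx) = args
        bits = 0.0
        for c, x in zip([prev] + list(block[:-1]), block):
            bits -= math.log2((pairs[(c, x)] + 0.5) / (ctx[c] + 1.0))
        return bits

    def parallel_code_length(seq, n_blocks=4):
        model = estimate_model(seq)
        size = math.ceil(len(seq) / n_blocks)
        jobs = [(seq[i:i + size], seq[i - 1] if i else 0, model)
                for i in range(0, len(seq), size)]
        with ProcessPoolExecutor() as pool:   # blocks are independent
            return sum(pool.map(block_code_length, jobs))

Because every block is coded against the same model estimated from the entire input, adding parallel units increases throughput without changing the model, which is why the compression loss stays small.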
A Computable Measure of Algorithmic Probability by Finite Approximations with an Application to Integer Sequences
Given the widespread use of lossless compression algorithms to approximate algorithmic (Kolmogorov-Chaitin) complexity, and given that lossless compression algorithms fall short at characterizing patterns other than statistical ones, performing no differently from entropy estimations, here we explore an alternative and complementary approach. We study formal properties of a Levin-inspired measure calculated from the output distribution of small Turing machines. We introduce and justify finite approximations of this measure that have been used in some applications as an alternative to lossless compression algorithms for approximating algorithmic (Kolmogorov-Chaitin) complexity. We provide proofs of the relevant properties of both the measure and its finite approximations, and compare them to Levin's Universal Distribution. We provide error estimations of the approximations with respect to the full measure. Finally, we present an application to integer sequences from the Online Encyclopedia of Integer Sequences which suggests that our AP-based measures may characterize non-statistical patterns, and we report interesting correlations with the textual, function and program description lengths of these sequences.
Comment: As accepted by the journal Complexity (Wiley/Hindawi).
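The coding-theorem step that converts such an output distribution into a complexity value is compact. A minimal Python sketch, assuming a precomputed output-frequency distribution D (the toy values below are made up; in practice D comes from exhaustively running very large sets of small Turing machines):

    import math

    def ap_complexity(x: str, D: dict) -> float:
        # Coding theorem: strings produced often by random small machines
        # have high algorithmic probability, hence low complexity.
        return -math.log2(D[x])

    D = {"0": 0.25, "1": 0.25, "01": 0.10, "0101": 0.01}   # toy numbers
    print(ap_complexity("0", D))      # frequent output -> low complexity
    print(ap_complexity("0101", D))   # rare output -> higher complexity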
Training-free Measures Based on Algorithmic Probability Identify High Nucleosome Occupancy in DNA Sequences
We introduce and study a set of training-free methods, of an information-theoretic and algorithmic-complexity nature, applied to DNA sequences to assess their potential for determining nucleosomal binding sites. We test our measures on well-studied genomic sequences of different sizes drawn from different sources. The measures reveal the known in vivo versus in vitro predictive discrepancies and uncover their potential to pinpoint (high) nucleosome occupancy. We explore different possible signals within and beyond the nucleosome length and find that complexity indices are informative of nucleosome occupancy. We compare against the gold standard (the Kaplan model) and find similar and complementary results, with the main difference that our sequence complexity approach requires no training. For example, for high occupancy, complexity-based scores outperform the Kaplan model at predicting binding, representing a significant advancement in predicting the highest nucleosome occupancy with a training-free approach.
Comment: 8 pages main text (4 figures), 12 total with Supplementary (1 figure).
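As a rough, training-free illustration of scoring a DNA sequence window by window, here is a zlib-based proxy in Python (not the paper's algorithmic-probability measures; 147 bp is the canonical nucleosome-core length):

    import zlib

    def complexity_profile(dna: str, window: int = 147, step: int = 1):
        # Compressed size of each window serves as a training-free
        # complexity score: repetitive windows compress well and score low.
        return [len(zlib.compress(dna[i:i + window].encode(), 9))
                for i in range(0, len(dna) - window + 1, step)]

    print(complexity_profile("A" * 200)[:3])          # low-complexity tract
    print(complexity_profile("ACGTTGCA" * 25)[:3])    # more varied sequence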
Correlation of Automorphism Group Size and Topological Properties with Program-size Complexity Evaluations of Graphs and Complex Networks
We show that numerical approximations of Kolmogorov complexity (K) applied to graph adjacency matrices capture some group-theoretic and topological properties of graphs and empirical networks, ranging from metabolic to social networks. That K and the size of the automorphism group of a graph are correlated opens up interesting connections to problems in computational geometry, and thus connects several measures and concepts from complexity science. We show that approximations of K characterise synthetic and natural networks by their generating mechanisms, assigning lower algorithmic randomness to complex network models (Watts-Strogatz and Barabasi-Albert networks) and high Kolmogorov complexity to (random) Erdos-Renyi graphs. We derive these results via two different Kolmogorov complexity approximation methods applied to the adjacency matrices of the graphs and networks: the traditional lossless compression approach to Kolmogorov complexity, and a normalised version of the Block Decomposition Method (BDM), a measure based on algorithmic probability theory.
Comment: 15 two-column pages, 20 figures. Forthcoming in Physica A: Statistical Mechanics and its Applications.
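The traditional lossless-compression approximation mentioned above can be sketched directly in Python: flatten the adjacency matrix to a bit string and take its compressed size (zlib here is an illustrative stand-in; the paper's normalised BDM measure is computed differently):

    import random, zlib

    def graph_complexity(adj) -> int:
        # Compressed size of the flattened adjacency matrix approximates K.
        n = len(adj)
        bits = "".join("1" if adj[i][j] else "0"
                       for i in range(n) for j in range(n))
        return len(zlib.compress(bits.encode(), 9))

    def erdos_renyi(n, p, seed=0):
        rng, adj = random.Random(seed), [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if rng.random() < p:
                    adj[i][j] = adj[j][i] = 1
        return adj

    ring = [[1 if abs(i - j) in (1, 99) else 0 for j in range(100)]
            for i in range(100)]                     # regular ring lattice
    print(graph_complexity(ring))                    # compresses well: low K
    print(graph_complexity(erdos_renyi(100, 0.02)))  # random: higher K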