Image Characterization and Classification by Physical Complexity
We present a method for estimating the complexity of an image based on
Bennett's concept of logical depth. Bennett identified logical depth as the
appropriate measure of organized complexity, and hence as being better suited
to the evaluation of the complexity of objects in the physical world. Its use
results in a different, and in some sense a finer characterization than is
obtained through the application of the concept of Kolmogorov complexity alone.
We use this measure to classify images by their information content. The method
provides a means for classifying and evaluating the complexity of objects by
way of their visual representations. To the authors' knowledge, the method and
application inspired by the concept of logical depth presented herein are being
proposed and implemented for the first time. Comment: 30 pages, 21 figures
Reordering Rows for Better Compression: Beyond the Lexicographic Order
Sorting database tables before compressing them improves the compression
rate. Can we do better than the lexicographical order? For minimizing the
number of runs in a run-length encoding compression scheme, the best approaches
to row-ordering are derived from traveling salesman heuristics, although there
is a significant trade-off between running time and compression. A new
heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades
off compression for a major running-time speedup, is a good option for very
large tables. However, for some compression schemes, it is more important to
generate long runs rather than few runs. For this case, another novel
heuristic, Vortex, is promising. We find that we can improve run-length
encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%:
these gains are on top of the gains due to lexicographically sorting the table.
We prove that the new row reordering is optimal (within 10%) at minimizing the
runs of identical values within columns, in a few cases. Comment: to appear in ACM TODS
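
To make the run-counting objective above concrete, here is a minimal sketch that counts runs of identical values column by column, before and after a lexicographic sort. The toy table and the function name `count_runs` are illustrative assumptions; the heuristics named in the abstract (Multiple Lists, Vortex) are not implemented here.

```python
from itertools import groupby


def count_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    if not rows:
        return 0
    columns = zip(*rows)  # transpose: one tuple per column
    # groupby yields one group per maximal run of equal adjacent values
    return sum(sum(1 for _ in groupby(column)) for column in columns)


# Hypothetical two-column table, not taken from the paper.
table = [("b", 2), ("a", 1), ("b", 1), ("a", 2), ("a", 1), ("b", 2)]

unsorted_runs = count_runs(table)
sorted_runs = count_runs(sorted(table))  # lexicographic row order
print(unsorted_runs, sorted_runs)  # prints "10 6"
```

Lexicographic sorting groups the first column perfectly but leaves later columns fragmented, which is the gap that the row-reordering heuristics try to close.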
Efficiently Extracting Randomness from Imperfect Stochastic Processes
We study the problem of extracting a prescribed number of random bits by
reading the smallest possible number of symbols from non-ideal stochastic
processes. The related interval algorithm proposed by Han and Hoshi has
asymptotically optimal performance; however, it assumes that the distribution
of the input stochastic process is known. The motivation for our work is the
fact that, in practice, sources of randomness have inherent correlations and
are affected by measurement noise. Namely, it is hard to obtain an accurate
estimation of the distribution. This challenge was addressed by the concepts of
seeded and seedless extractors that can handle general random sources with
unknown distributions. However, known seeded and seedless extractors provide
extraction efficiencies that are substantially smaller than Shannon's entropy
limit. Our main contribution is the design of extractors that have a variable
input-length and a fixed output length, are efficient in the consumption of
symbols from the source, are capable of generating random bits from general
stochastic processes and approach the information theoretic upper bound on
efficiency. Comment: 2 columns, 16 pages
Computational Intelligence and Complexity Measures for Chaotic Information Processing
This dissertation investigates the application of computational intelligence methods in the analysis of nonlinear chaotic systems in the framework of many known and newly designed complex systems. Parallel comparisons are made between these methods. This provides insight into the difficult challenges facing nonlinear systems characterization and aids in developing a generalized algorithm for computing algorithmic complexity measures, Lyapunov exponents, information dimension and topological entropy. These metrics are implemented to characterize the dynamic patterns of discrete and continuous systems, and they make it possible to distinguish order from disorder in these systems. Steps required for computing Lyapunov exponents with a reorthonormalization method and a group theory approach are formalized. Procedures for implementing computational algorithms are designed and numerical results for each system are presented. The advance-time sampling technique is designed to overcome the scarcity of phase space samples and the buffer overflow problem in algorithmic complexity measure estimation in slow dynamics feedback-controlled systems. It is proved analytically and tested numerically that for a quasiperiodic system like a Fibonacci map, complexity grows logarithmically with the evolutionary length of the data block. It is concluded that a normalized algorithmic complexity measure can be used as a system classifier. This quantity turns out to be one for random sequences and a non-zero value less than one for chaotic sequences. For periodic and quasi-periodic responses, the normalized complexity approaches zero as the data strings grow, with a faster rate of decrease observed for periodic responses. Algorithmic complexity analysis is performed on a class of certain rate convolutional encoders. The degree of diffusion in random-like patterns is measured.
Simulation evidence indicates that the algorithmic complexity associated with a particular class of rate-1/n codes increases as the encoder constraint length increases. This occurs in parallel with the increase of the error correcting capacity of the decoder. Comparing groups of rate-1/n convolutional encoders, it is observed that as the encoder rate decreases from 1/2 to 1/7, the encoded data sequence manifests smaller algorithmic complexity with a larger free distance value.
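
The idea of a normalized complexity measure as a classifier can be illustrated with a rough sketch. The abstract does not say which estimator the dissertation uses, so the LZ78-style phrase count and the `c * log2(n) / n` normalization below are assumptions chosen for brevity. At finite lengths this estimate only approximates the limiting values quoted above.

```python
import math
import random


def lz78_phrase_count(bits: str) -> int:
    """Number of phrases in an LZ78-style incremental parsing of `bits`."""
    seen = set()
    phrase = ""
    count = 0
    for symbol in bits:
        phrase += symbol
        if phrase not in seen:  # shortest phrase not seen before ends here
            seen.add(phrase)
            count += 1
            phrase = ""
    if phrase:
        count += 1  # trailing phrase that repeats an earlier one
    return count


def normalized_complexity(bits: str) -> float:
    """Phrase count scaled by n / log2(n); assumes len(bits) >= 2."""
    n = len(bits)
    return lz78_phrase_count(bits) * math.log2(n) / n


rng = random.Random(0)  # fixed seed so the comparison is repeatable
random_bits = "".join(rng.choice("01") for _ in range(4096))
periodic_bits = "01" * 2048

print(normalized_complexity(periodic_bits), normalized_complexity(random_bits))
```

The periodic string scores well below the pseudo-random one, which is the ordering the classifier relies on.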
Lempel Ziv Welch data compression using associative processing as an enabling technology for real time application
Data compression refers to reducing the amount of data needed to represent information, whether for storage, for transmission, or both. A commonly used algorithm for compression is the Lempel-Ziv-Welch (LZW) method proposed by Terry A. Welch[1]. LZW is an adaptive, dictionary-based, lossless algorithm. This provides for a general compression mechanism that is applicable to a broad range of inputs. Furthermore, the lossless nature of LZW implies that it is a reversible process, so the original file or message is fully recoverable from its compressed form. A variant of this algorithm is currently the foundation of the UNIX compress program. Additionally, LZW is one of the compression schemes defined in the TIFF standard[12], as well as in the CCITT V.42bis standard. One of the challenges in designing an efficient compression mechanism, such as LZW, which can be used in real-time applications, is the speed of the search into the data dictionary. In this paper an Associative Processing implementation of the LZW algorithm is presented. This approach provides an efficient solution to this requirement. Additionally, it is shown that Associative Processing (ASP) allows for rapid and elegant development of the LZW algorithm that will generally outperform standard approaches in complexity, readability, and performance.
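
For readers unfamiliar with LZW, the following is a minimal textbook sketch in software, not the associative-processing implementation the paper describes. It emits a plain list of integer codes and omits details such as variable-width code packing and dictionary resets; the function names are illustrative. The dictionary membership test in `lzw_compress` is the search step whose speed the abstract identifies as the bottleneck.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Encode bytes as a list of LZW dictionary codes."""
    dictionary = {bytes([i]): i for i in range(256)}  # all single bytes
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:  # the dictionary search step
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the original bytes, reconstructing the dictionary on the fly."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:  # code refers to the entry being defined
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)


message = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(message)
restored = lzw_decompress(codes)
```

Because the decoder rebuilds the same dictionary from the code stream alone, no dictionary needs to be transmitted, which is what makes the scheme adaptive.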
New Algorithms and Lower Bounds for Sequential-Access Data Compression
This thesis concerns sequential-access data compression, i.e., by algorithms
that read the input one or more times from beginning to end. In one chapter we
consider adaptive prefix coding, for which we must read the input character by
character, outputting each character's self-delimiting codeword before reading
the next one. We show how to encode and decode each character in constant
worst-case time while producing an encoding whose length is worst-case optimal.
In another chapter we consider one-pass compression with memory bounded in
terms of the alphabet size and context length, and prove a nearly tight
tradeoff between the amount of memory we can use and the quality of the
compression we can achieve. In a third chapter we consider compression in the
read/write streams model, which allows us passes and memory both
polylogarithmic in the size of the input. We first show how to achieve
universal compression using only one pass over one stream. We then show that
one stream is not sufficient for achieving good grammar-based compression.
Finally, we show that two streams are necessary and sufficient for achieving
entropy-only bounds. Comment: draft of PhD thesis
Compressing DNA sequence databases with coil
Background: Publicly available DNA sequence databases such as GenBank are large, and are
growing at an exponential rate. The sheer volume of data being dealt with presents serious storage
and data communications problems. Currently, sequence data is usually kept in large "flat files,"
which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which
rarely achieves good compression ratios. While much research has been done on compressing
individual DNA sequences, surprisingly little has focused on the compression of entire databases
of such sequences. In this study we introduce the sequence database compression software coil.
Results: We have designed and implemented a portable software package, coil, for compressing
and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared
towards achieving high compression ratios at the expense of execution time and memory usage
during compression – the compression time represents a "one-off investment" whose cost is
quickly amortised if the resulting compressed file is transmitted many times. Decompression
requires little memory and is extremely fast. We demonstrate a 5% improvement in compression
ratio over state-of-the-art general-purpose compression tools for a large GenBank database file
containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental
additions to a sequence database.
Conclusion: coil presents a compelling alternative to conventional compression of flat files for the
storage and distribution of DNA sequence databases having a narrow distribution of sequence
lengths, such as EST data. Increasing compression levels for databases having a wide distribution of
sequence lengths is a direction for future work.
Hierarchical Relative Lempel-Ziv Compression
Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string S is compressed relative to a second string R (called the reference) by parsing S into a sequence of substrings that occur in R. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. Now that DNA sequencing is cheap, such datasets have become extremely abundant and are growing rapidly. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we propose a new compression scheme, hierarchical relative Lempel-Ziv (HRLZ), which forms a rooted tree (or hierarchy) on the strings and then compresses each string using RLZ with its parent as the reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing children with respect to their parent. We show that this approach leads to a twofold improvement in compression on bacterial genome datasets, with negligible effect on decompression time compared to the standard single-reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a complete weighted digraph of the strings, with weights given by the number of phrases in the RLZ parsing of the source and destination vertices. We further show that instead of computing the complete graph, a sparse graph derived using locality-sensitive hashing can significantly reduce the cost of computing a good hierarchy, without adversely affecting compression performance.
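
The RLZ parsing step described above can be sketched as a greedy longest-match loop. This is a simple quadratic-time illustration built on `str.find`, not the paper's implementation; practical tools would use a suffix array or similar index on the reference. The phrase representation, the literal fallback for characters absent from the reference, and the function names are all assumptions made for this sketch.

```python
def rlz_parse(reference: str, target: str) -> list[tuple]:
    """Greedily parse `target` into phrases that are substrings of `reference`."""
    phrases = []
    i = 0
    while i < len(target):
        position, length = -1, 0
        # Extend the match one character at a time while it occurs in reference.
        while i + length < len(target):
            found = reference.find(target[i:i + length + 1])
            if found == -1:
                break
            position, length = found, length + 1
        if length == 0:
            phrases.append(("lit", target[i]))  # character absent from reference
            i += 1
        else:
            phrases.append(("copy", position, length))
            i += length
    return phrases


def rlz_decode(reference: str, phrases: list[tuple]) -> str:
    """Rebuild the target string from the reference and the phrase list."""
    parts = []
    for phrase in phrases:
        if phrase[0] == "lit":
            parts.append(phrase[1])
        else:
            _, position, length = phrase
            parts.append(reference[position:position + length])
    return "".join(parts)


reference = "ACGTACGTTAGC"  # toy strings, not real genome data
target = "ACGTTAGCACGA"
phrases = rlz_parse(reference, target)
print(phrases)  # prints [('copy', 4, 8), ('copy', 0, 3), ('copy', 0, 1)]
```

The length of the phrase list, here `len(phrases)`, is the kind of quantity the abstract uses as an edge weight when choosing which string should serve as the reference for another.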