Compressing DNA sequence databases with coil
Background: Publicly available DNA sequence databases such as GenBank are large, and are
growing at an exponential rate. The sheer volume of data being dealt with presents serious storage
and data communications problems. Currently, sequence data is usually kept in large "flat files,"
which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which
rarely achieves good compression ratios. While much research has been done on compressing
individual DNA sequences, surprisingly little has focused on the compression of entire databases
of such sequences. In this study we introduce the sequence database compression software coil.
Results: We have designed and implemented a portable software package, coil, for compressing
and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared
towards achieving high compression ratios at the expense of execution time and memory usage
during compression – the compression time represents a "one-off investment" whose cost is
quickly amortised if the resulting compressed file is transmitted many times. Decompression
requires little memory and is extremely fast. We demonstrate a 5% improvement in compression
ratio over state-of-the-art general-purpose compression tools for a large GenBank database file
containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental
additions to a sequence database.
Conclusion: coil presents a compelling alternative to conventional compression of flat files for the
storage and distribution of DNA sequence databases having a narrow distribution of sequence
lengths, such as EST data. Increasing compression levels for databases having a wide distribution of
sequence lengths is a direction for future work.
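
The abstract's remark that Lempel-Ziv compression of flat files rarely achieves good ratios can be explored with a small experiment. The sketch below is not coil and does not implement edit-tree coding; it only compares `zlib` (the DEFLATE implementation behind gzip) with a naive 2-bits-per-base packing. The synthetic uniform-random sequence, the helper name `pack_2bit`, and the sequence length are assumptions made for illustration; real databases contain redundancy that random data lacks.

```python
import random
import zlib


def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte (hypothetical helper)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | code[base]
        # Left-align a short final chunk so every byte has the same layout.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


# Assumption: a synthetic, uniformly random sequence stands in for real data.
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(100_000))

raw = seq.encode("ascii")          # one byte per base, as in a flat file
deflated = zlib.compress(raw, 9)   # Lempel-Ziv + Huffman (DEFLATE)
packed = pack_2bit(seq)            # fixed 2 bits per base

print("flat file bytes:", len(raw))
print("zlib bytes:     ", len(deflated))
print("2-bit bytes:    ", len(packed))
```

Running the script prints the three sizes side by side. The 2-bit figure is a fixed baseline of one quarter of the flat-file size, so it gives a quick reference point for judging how much a general-purpose compressor gains on a given input.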
Data Discovery and Anomaly Detection Using Atypicality: Theory
A central question in the era of 'big data' is what to do with the enormous
amount of information. One possibility is to characterize it through
statistics, e.g., averages, or classify it using machine learning, in order to
understand the general structure of the overall data. The perspective in this
paper is the opposite, namely that most of the value in the information in some
applications is in the parts that deviate from the average, that are unusual,
atypical. We define what we mean by 'atypical' in an axiomatic way as data that
can be encoded with fewer bits in itself rather than using the code for the
typical data. We show that this definition has good theoretical properties. We
then develop an implementation based on universal source coding, and apply this
to a number of real-world data sets.
Comment: 40 pages
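
The definition in the abstract, data that can be encoded with fewer bits "in itself" than with the code for typical data, can be illustrated for binary sequences. The sketch below is a simplification under stated assumptions, not the paper's implementation: the typical data are assumed to be i.i.d. Bernoulli with a known parameter, and the "in itself" code is assumed to be a two-part code costing the empirical entropy plus (1/2) log2 n bits for the estimated parameter. The function names and the optional `penalty_bits` argument are hypothetical.

```python
import math


def typical_code_length(bits, p_one):
    """Ideal code length (in bits) under the assumed typical model Bernoulli(p_one)."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * math.log2(p_one) + zeros * math.log2(1 - p_one))


def self_code_length(bits):
    """Assumed two-part code: n * empirical entropy + (1/2) log2 n parameter cost."""
    n = len(bits)
    p_hat = sum(bits) / n
    entropy = 0.0
    for p in (p_hat, 1 - p_hat):
        if p > 0:
            entropy -= p * math.log2(p)
    return n * entropy + 0.5 * math.log2(n)


def is_atypical(bits, p_one, penalty_bits=0.0):
    """True if coding the sequence 'in itself' is shorter than the typical code."""
    return self_code_length(bits) + penalty_bits < typical_code_length(bits, p_one)


skewed = [1] * 60 + [0] * 4   # far from the fair-coin model
balanced = [0, 1] * 32        # matches the fair-coin symbol counts

print(is_atypical(skewed, p_one=0.5))
print(is_atypical(balanced, p_one=0.5))
```

Under the fair-coin model both sequences cost 64 bits with the typical code. The skewed sequence is much cheaper to describe with its own statistics, while the balanced one pays the parameter overhead and gains nothing, which is the trade-off the definition captures.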