Optimal Prefix Codes with Fewer Distinct Codeword Lengths are Faster to Construct
A new method for constructing minimum-redundancy binary prefix codes is
described. Our method does not explicitly build a Huffman tree; instead it uses
a property of optimal prefix codes to compute the codeword lengths
corresponding to the input weights. Let n be the number of weights and k be
the number of distinct codeword lengths as produced by the algorithm for the
optimum codes. The running time of our algorithm is . Following
our previous work in \cite{be}, no algorithm can possibly construct optimal
prefix codes in time. When the given weights are presorted, our
algorithm performs comparisons.
Comment: 23 pages, a preliminary version appeared in STACS 2006
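The abstract does not include the algorithm itself; for orientation, the sketch below computes codeword lengths from the weights using ordinary Huffman merging, i.e. a conventional baseline rather than the paper's Huffman-tree-free method. The function name and the example weights are assumptions made here.

    # Baseline sketch (ordinary Huffman merging, not the paper's method):
    # compute the optimal codeword length of each input weight.
    import heapq

    def huffman_code_lengths(weights):
        """Return the codeword length assigned to each weight by a Huffman code."""
        n = len(weights)
        if n <= 1:
            return [1] * n
        # Heap entries: (subtree weight, unique tie-breaker, indices of the leaves below)
        heap = [(w, i, [i]) for i, w in enumerate(weights)]
        heapq.heapify(heap)
        lengths = [0] * n
        uid = n
        while len(heap) > 1:
            w1, _, leaves1 = heapq.heappop(heap)
            w2, _, leaves2 = heapq.heappop(heap)
            merged = leaves1 + leaves2
            for i in merged:              # every leaf under the new node gets one bit deeper
                lengths[i] += 1
            heapq.heappush(heap, (w1 + w2, uid, merged))
            uid += 1
        return lengths

    # Near-uniform weights: eight weights but only one distinct codeword length.
    print(huffman_code_lengths([10, 11, 12, 13, 14, 15, 16, 17]))   # -> [3, 3, 3, 3, 3, 3, 3, 3]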
Prefix Codes: Equiprobable Words, Unequal Letter Costs
Describes a near-linear-time algorithm for a variant of Huffman coding, in
which the letters may have non-uniform lengths (as in Morse code), but with the
restriction that each word to be encoded has equal probability. [See also
"Huffman Coding with Unequal Letter Costs" (2002).]
Comment: proceedings version in ICALP (1994)
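As a hedged illustration of the problem setting (not the paper's near-linear-time algorithm), the sketch below greedily grows a code tree by always expanding its cheapest leaf, in the spirit of Varn coding for equiprobable words over letters of unequal cost. The function name and the Morse-like costs are assumptions made here.

    # Sketch of the problem: unequal letter costs, equally probable words.
    # Simple greedy in the spirit of Varn coding; not the paper's algorithm.
    import heapq

    def greedy_equiprobable_code(n, letter_costs):
        """Grow a code tree by repeatedly expanding its cheapest leaf; keep the n cheapest leaves."""
        # Heap of leaves: (accumulated cost, codeword written as a string of letter indices)
        leaves = [(c, str(j)) for j, c in enumerate(letter_costs)]
        heapq.heapify(leaves)
        while len(leaves) < n:
            cost, word = heapq.heappop(leaves)        # cheapest leaf becomes internal...
            for j, c in enumerate(letter_costs):      # ...and is replaced by its children
                heapq.heappush(leaves, (cost + c, word + str(j)))
        return sorted(leaves)[:n]

    # Two letters with Morse-like costs 1 (dot) and 3 (dash); six equiprobable words.
    for cost, word in greedy_equiprobable_code(6, [1, 3]):
        print(word, cost)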
Data compression on machines with limited memory
We consider two problems in which machines with limited internal memory are used to compress and decompress data. In the first application, a powerful encoder transmits a coded file to a decoder that has severely constrained memory. A data structure that achieves minimum storage is presented, and alternative methods that sacrifice a small amount of storage to attain faster decoding are described. The second problem we address is that of encoding and decoding in limited memory. Methods for representing context models succinctly are described. These methods provide compression performance that is superior to state-of-the-art techniques, and competitive with newer approaches that use five times as much internal memory
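The abstract does not spell out the minimum-storage decoder structure. As one hedged illustration of how compact a constrained decoder's state can be, the sketch below decodes a canonical Huffman code from nothing more than the per-length codeword counts and the symbols in canonical order; this is a classical technique, not necessarily the structure proposed in the paper, and the names and toy data are ours.

    # Illustration of compact decode-side state (classical canonical Huffman
    # decoding, not necessarily the paper's structure): no code tree is stored,
    # only the number of codewords of each length and the symbols in order.

    def canonical_decode(bits, counts, symbols):
        """Decode a list of bits.

        counts[l-1] : number of codewords of length l
        symbols     : symbols in canonical order (by length, then code value)
        """
        # first_code[l-1] = smallest code value of length l; first_sym[l-1] = its symbol offset
        first_code, first_sym = [], []
        code = sym = 0
        for cnt in counts:
            first_code.append(code)
            first_sym.append(sym)
            code = (code + cnt) << 1
            sym += cnt

        out, i = [], 0
        while i < len(bits):
            value, length = 0, 0
            while True:                      # extend the current code one bit at a time
                value = (value << 1) | bits[i]
                i += 1
                length += 1
                cnt = counts[length - 1]
                if cnt and value - first_code[length - 1] < cnt:
                    out.append(symbols[first_sym[length - 1] + value - first_code[length - 1]])
                    break
        return out

    # Symbols a, b, c, d with canonical codes 0, 10, 110, 111 (lengths 1, 2, 3, 3).
    print(''.join(canonical_decode([0, 1, 0, 1, 1, 1, 1, 1, 0],
                                   counts=[1, 1, 2], symbols=list("abcd"))))   # -> abdc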
Fast dictionary-based compression for inverted indexes
Dictionary-based compression schemes provide fast decoding, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.
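The abstract does not give the paper's dictionary structure or parsing strategy; the toy sketch below only illustrates the underlying observation that inverted lists stored as d-gaps repeat short patterns that can be replaced by references into a small shared dictionary. The pattern length, dictionary size, and literal escape used here are arbitrary choices of ours.

    # Toy sketch of dictionary coding over d-gaps (not the paper's codec).
    from collections import Counter

    def build_dictionary(gap_lists, pattern_len=2, size=256):
        """Collect the most frequent fixed-length gap patterns across all lists."""
        freq = Counter()
        for gaps in gap_lists:
            for i in range(0, len(gaps) - pattern_len + 1, pattern_len):
                freq[tuple(gaps[i:i + pattern_len])] += 1
        return [p for p, _ in freq.most_common(size)]

    def encode(gaps, dictionary, pattern_len=2):
        """Greedy parse: dictionary hit -> ('D', index), otherwise a literal gap."""
        index = {p: i for i, p in enumerate(dictionary)}
        out, i = [], 0
        while i < len(gaps):
            pattern = tuple(gaps[i:i + pattern_len])
            if len(pattern) == pattern_len and pattern in index:
                out.append(('D', index[pattern]))
                i += pattern_len
            else:
                out.append(('L', gaps[i]))     # literal escape for rare gaps
                i += 1
        return out

    postings = [3, 4, 7, 8, 10, 14, 15, 16]                      # document ids
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    dictionary = build_dictionary([gaps])
    print(encode(gaps, dictionary))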
Efficient homology search for genomic sequence databases
Genomic search tools can provide valuable insights into the chemical structure, evolutionary origin and biochemical function of genetic material. A homology search algorithm compares a protein or nucleotide query sequence to each entry in a large sequence database and reports alignments with highly similar sequences. The exponential growth of public data banks such as GenBank has necessitated the development of fast, heuristic approaches to homology search. The versatile and popular blast algorithm, developed by researchers at the US National Center for Biotechnology Information (NCBI), uses a four-stage heuristic approach to efficiently search large collections for analogous sequences while retaining a high degree of accuracy. Despite an abundance of alternative approaches to homology search, blast remains the only method to offer fast, sensitive search of large genomic collections on modern desktop hardware. As a result, the tool has found widespread use, with millions of queries posed each day. A significant investment of computing resources is required to process this large volume of genomic searches, and a cluster of over 200 workstations is employed by the NCBI to handle queries posed through the organisation's website. As the growth of sequence databases continues to outpace improvements in modern hardware, blast searches are becoming slower each year, and novel, faster methods for sequence comparison are required.
In this thesis we propose new techniques for fast yet accurate homology search that result in significantly faster blast searches. First, we describe improvements to the final, gapped alignment stages, where the query and sequences from the collection are aligned to provide a fine-grain measure of similarity. We describe three new methods for aligning sequences that roughly halve the time required to perform this computationally expensive stage.
Next, we investigate improvements to the first stage of search, where short regions of similarity between a pair of sequences are identified. We propose a novel deterministic finite automaton data structure that is significantly smaller than the codeword lookup table employed by ncbi-blast, resulting in improved cache performance and faster search times.
We also discuss fast methods for nucleotide sequence comparison. We describe novel approaches for processing sequences that are compressed using the byte-packed format already utilised by blast, where four nucleotide bases from a strand of DNA are stored in a single byte. Rather than decompress sequences to perform pairwise comparisons, our innovations permit sequences to be processed in their compressed form, four bases at a time. Our techniques roughly halve average query evaluation times for nucleotide searches with no effect on the sensitivity of blast.
Finally, we present a new scheme for managing the high degree of redundancy that is prevalent in genomic collections. Near-duplicate entries in sequence data banks are highly detrimental to retrieval performance; however, existing methods for managing redundancy are both slow, requiring almost ten hours to process the GenBank database, and crude, because they simply purge highly-similar sequences to reduce the level of internal redundancy. We describe a new approach for identifying near-duplicate entries that is roughly six times faster than the most successful existing approaches, and a novel approach to managing redundancy that reduces collection size and search times but still provides accurate and comprehensive search results.
Our improvements to blast have been integrated into our own version of the tool. We find that our innovations more than halve average search times for nucleotide and protein searches, and have no significant effect on search accuracy. Given the enormous popularity of blast, this represents a very significant advance in computational methods to aid life science research.
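The byte-packed nucleotide format mentioned above packs four 2-bit bases into each byte. The sketch below illustrates that representation and a byte-at-a-time comparison of two aligned sequences; it is an illustration of the idea only, not blast's actual comparison kernels, and the names and example sequences are ours.

    # Sketch: 2-bit packing of DNA and comparison four bases at a time.
    BASE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def pack(seq):
        """Pack a nucleotide string into bytes, four bases (2 bits each) per byte."""
        padded = seq + 'A' * (-len(seq) % 4)
        out = bytearray()
        for i in range(0, len(padded), 4):
            b = 0
            for ch in padded[i:i + 4]:
                b = (b << 2) | BASE[ch]
            out.append(b)
        return bytes(out)

    # For a pair of packed bytes, the XOR has a zero 2-bit field exactly where the
    # bases agree; precompute the number of agreeing fields for every XOR value.
    MATCHES = [sum((x >> s) & 3 == 0 for s in (0, 2, 4, 6)) for x in range(256)]

    def matching_bases(packed_a, packed_b):
        """Count agreeing base positions of two equal-length packed sequences."""
        return sum(MATCHES[a ^ b] for a, b in zip(packed_a, packed_b))

    print(matching_bases(pack("ACGTACGTACGT"), pack("ACGTTCGAACGT")))   # -> 10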
Recommended from our members
Codes for Synchronization in Channels and Sources with Edits
Edit channels are a class of communication channels where the output of the channel is
an edited version of the input. The edits are considered to be deletions and insertions.
DNA-based data storage systems are one of the motivations for this model. This thesis
studies various problems related to edit channels, as well as the edit synchronization problem.
Varshamov-Tenengolts (VT) codes are first introduced. These codes can correct a
single deletion or insertion and have a linear-time decoder. The problem of efficiently
encoding the non-binary version of VT codes is then addressed: a simple linear-time
method is proposed that systematically maps binary message sequences onto VT
codewords.
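For reference, the sketch below shows the classical binary VT construction and its linear-time single-deletion decoder, which are standard material the thesis builds on; it does not reproduce the thesis's systematic non-binary encoder.

    # Classical binary Varshamov-Tenengolts codes (standard background, not the
    # thesis's new contributions): VT_a(n) = { x in {0,1}^n : sum_i i*x_i = a (mod n+1) }.

    def vt_syndrome(x):
        """Weighted checksum sum_i i*x_i with 1-based positions."""
        return sum((i + 1) * b for i, b in enumerate(x))

    def vt_decode_deletion(y, n, a):
        """Recover the length-n codeword of VT_a(n) from y, which has lost one bit."""
        assert len(y) == n - 1
        w = sum(y)                            # weight of the received word
        d = (a - vt_syndrome(y)) % (n + 1)    # checksum deficiency caused by the deletion
        if d <= w:
            # A 0 was deleted: reinsert it with exactly d ones to its right.
            idx, ones = len(y), 0
            while ones < d:
                idx -= 1
                ones += y[idx]
            return y[:idx] + [0] + y[idx:]
        # A 1 was deleted: reinsert it with exactly d - w - 1 zeros to its left.
        idx, zeros = 0, 0
        while zeros < d - w - 1:
            zeros += 1 - y[idx]
            idx += 1
        return y[:idx] + [1] + y[idx:]

    x = [0, 1, 0, 1, 1, 0]                    # syndrome 11, so x belongs to VT_4(6)
    a = vt_syndrome(x) % (len(x) + 1)
    y = x[:2] + x[3:]                         # the channel deletes one bit
    print(vt_decode_deletion(y, len(x), a) == x)   # -> True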
Another model studied is the segmented edit channel, where we have the
additional assumption that the channel input sequence is implicitly divided into
segments such that at most one edit can occur within a segment. A code construction
is proposed for this model based on subsets of VT codes chosen with pre-determined
prefixes and/or suffixes. An upper bound is also derived on the rate of any zero-error
code for the segmented edit channel in terms of the segment length. This upper bound
shows that the rate scaling of the proposed codes as the segment length increases is
the same as that of the maximal code.
Edit synchronization is another problem studied in this thesis. In this model, there
are two remote nodes (encoder and decoder), each having a binary sequence. The
sequence X, available at the encoder, is the updated sequence and differs from Y
(available at the decoder) by a small number of edits. The goal is to construct a message
M, to be sent via a one-way error-free link, such that the decoder can reconstruct X
using M and Y. A coding scheme is devised for this one-way synchronization model.
The scheme is based on multiple layers of VT codes combined with off-the-shelf linear
error-correcting codes and uses a list decoder.
Motivated by the sequence reconstruction problem from traces in DNA-based storage, the problem of designing codes for the deletion channel when multiple observations
(or traces) are available to the decoder is considered. A simple binary and non-binary
code is proposed that splits the codeword into blocks and employs a VT code in each
block. The availability of multiple traces helps the decoder to identify deletion-free
copies of a block, and to avoid mis-synchronization while decoding. The encoding
complexity of the proposed scheme is linear in the codeword length; the decoding
complexity is linear in the codeword length and quadratic in the number of deletions
and the number of traces. The list decoding technique for the proposed code is also
considered
Gbit/second lossless data compression hardware
This thesis investigates how to improve the performance of lossless data compression hardware
as a tool to reduce the cost per bit stored in a computer system or transmitted over a
communication network.
Lossless data compression allows the exact reconstruction of the original data after
decompression. Its deployment in some high-bandwidth applications has been hampered by
performance limitations in the compression hardware, which must match the throughput of the
original system to avoid becoming a bottleneck. Advancing lossless data compression
hardware therefore offers the potential of doubling the performance of the
system that incorporates it with minimal investment.
This work starts by presenting an analysis of current compression methods with the objective of
identifying the factors that limit performance and those that increase it. [Continues.]