Optimal Prefix Codes with Fewer Distinct Codeword Lengths are Faster to Construct
A new method for constructing minimum-redundancy binary prefix codes is
described. Our method does not explicitly build a Huffman tree; instead it uses
a property of optimal prefix codes to compute the codeword lengths
corresponding to the input weights. Let n be the number of weights and k be
the number of distinct codeword lengths as produced by the algorithm for the
optimum codes. The running time of our algorithm is . Following
our previous work in \cite{be}, no algorithm can possibly construct optimal
prefix codes in time. When the given weights are presorted, our
algorithm performs comparisons.
Comment: 23 pages, a preliminary version appeared in STACS 2006
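The abstract does not include the algorithm itself; for orientation, the sketch below computes codeword lengths from the weights using ordinary Huffman merging, i.e. a conventional baseline rather than the paper's Huffman-tree-free method. The function name and the example weights are assumptions made here.

    # Baseline sketch (ordinary Huffman merging, not the paper's method):
    # compute the optimal codeword length of each input weight.
    import heapq

    def huffman_code_lengths(weights):
        """Return the codeword length assigned to each weight by a Huffman code."""
        n = len(weights)
        if n <= 1:
            return [1] * n
        # Heap entries: (subtree weight, unique tie-breaker, indices of the leaves below)
        heap = [(w, i, [i]) for i, w in enumerate(weights)]
        heapq.heapify(heap)
        lengths = [0] * n
        uid = n
        while len(heap) > 1:
            w1, _, leaves1 = heapq.heappop(heap)
            w2, _, leaves2 = heapq.heappop(heap)
            merged = leaves1 + leaves2
            for i in merged:              # every leaf under the new node gets one bit deeper
                lengths[i] += 1
            heapq.heappush(heap, (w1 + w2, uid, merged))
            uid += 1
        return lengths

    # Near-uniform weights: eight weights but only one distinct codeword length.
    print(huffman_code_lengths([10, 11, 12, 13, 14, 15, 16, 17]))   # -> [3, 3, 3, 3, 3, 3, 3, 3]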
Prefix Codes: Equiprobable Words, Unequal Letter Costs
Describes a near-linear-time algorithm for a variant of Huffman coding, in
which the letters may have non-uniform lengths (as in Morse code), but with the
restriction that each word to be encoded has equal probability. [See also
"Huffman Coding with Unequal Letter Costs" (2002).]
Comment: proceedings version in ICALP (1994)
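As a hedged illustration of the problem setting (not the paper's near-linear-time algorithm), the sketch below greedily grows a code tree by always expanding its cheapest leaf, in the spirit of Varn coding for equiprobable words over letters of unequal cost. The function name and the Morse-like costs are assumptions made here.

    # Sketch of the problem: unequal letter costs, equally probable words.
    # Simple greedy in the spirit of Varn coding; not the paper's algorithm.
    import heapq

    def greedy_equiprobable_code(n, letter_costs):
        """Grow a code tree by repeatedly expanding its cheapest leaf; keep the n cheapest leaves."""
        # Heap of leaves: (accumulated cost, codeword written as a string of letter indices)
        leaves = [(c, str(j)) for j, c in enumerate(letter_costs)]
        heapq.heapify(leaves)
        while len(leaves) < n:
            cost, word = heapq.heappop(leaves)        # cheapest leaf becomes internal...
            for j, c in enumerate(letter_costs):      # ...and is replaced by its children
                heapq.heappush(leaves, (cost + c, word + str(j)))
        return sorted(leaves)[:n]

    # Two letters with Morse-like costs 1 (dot) and 3 (dash); six equiprobable words.
    for cost, word in greedy_equiprobable_code(6, [1, 3]):
        print(word, cost)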
Data compression on machines with limited memory
We consider two problems in which machines with limited internal memory are used to compress and decompress data. In the first application, a powerful encoder transmits a coded file to a decoder that has severely constrained memory. A data structure that achieves minimum storage is presented, and alternative methods that sacrifice a small amount of storage to attain faster decoding are described. The second problem we address is that of encoding and decoding in limited memory. Methods for representing context models succinctly are described. These methods provide compression performance that is superior to state-of-the-art techniques, and competitive with newer approaches that use five times as much internal memory
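The abstract does not spell out the minimum-storage decoder structure. As one hedged illustration of how compact a constrained decoder's state can be, the sketch below decodes a canonical Huffman code from nothing more than the per-length codeword counts and the symbols in canonical order; this is a classical technique, not necessarily the structure proposed in the paper, and the names and toy data are ours.

    # Illustration of compact decode-side state (classical canonical Huffman
    # decoding, not necessarily the paper's structure): no code tree is stored,
    # only the number of codewords of each length and the symbols in order.

    def canonical_decode(bits, counts, symbols):
        """Decode a list of bits.

        counts[l-1] : number of codewords of length l
        symbols     : symbols in canonical order (by length, then code value)
        """
        # first_code[l-1] = smallest code value of length l; first_sym[l-1] = its symbol offset
        first_code, first_sym = [], []
        code = sym = 0
        for cnt in counts:
            first_code.append(code)
            first_sym.append(sym)
            code = (code + cnt) << 1
            sym += cnt

        out, i = [], 0
        while i < len(bits):
            value, length = 0, 0
            while True:                      # extend the current code one bit at a time
                value = (value << 1) | bits[i]
                i += 1
                length += 1
                cnt = counts[length - 1]
                if cnt and value - first_code[length - 1] < cnt:
                    out.append(symbols[first_sym[length - 1] + value - first_code[length - 1]])
                    break
        return out

    # Symbols a, b, c, d with canonical codes 0, 10, 110, 111 (lengths 1, 2, 3, 3).
    print(''.join(canonical_decode([0, 1, 0, 1, 1, 1, 1, 1, 0],
                                   counts=[1, 1, 2], symbols=list("abcd"))))   # -> abdc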
Fast dictionary-based compression for inverted indexes
Dictionary-based compression schemes provide fast decoding, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches. In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off between compression effectiveness and compression efficiency can be achieved. Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.
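The abstract does not give the paper's dictionary structure or parsing strategy; the toy sketch below only illustrates the underlying observation that inverted lists stored as d-gaps repeat short patterns that can be replaced by references into a small shared dictionary. The pattern length, dictionary size, and literal escape used here are arbitrary choices of ours.

    # Toy sketch of dictionary coding over d-gaps (not the paper's codec).
    from collections import Counter

    def build_dictionary(gap_lists, pattern_len=2, size=256):
        """Collect the most frequent fixed-length gap patterns across all lists."""
        freq = Counter()
        for gaps in gap_lists:
            for i in range(0, len(gaps) - pattern_len + 1, pattern_len):
                freq[tuple(gaps[i:i + pattern_len])] += 1
        return [p for p, _ in freq.most_common(size)]

    def encode(gaps, dictionary, pattern_len=2):
        """Greedy parse: dictionary hit -> ('D', index), otherwise a literal gap."""
        index = {p: i for i, p in enumerate(dictionary)}
        out, i = [], 0
        while i < len(gaps):
            pattern = tuple(gaps[i:i + pattern_len])
            if len(pattern) == pattern_len and pattern in index:
                out.append(('D', index[pattern]))
                i += pattern_len
            else:
                out.append(('L', gaps[i]))     # literal escape for rare gaps
                i += 1
        return out

    postings = [3, 4, 7, 8, 10, 14, 15, 16]                      # document ids
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    dictionary = build_dictionary([gaps])
    print(encode(gaps, dictionary))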
Efficient homology search for genomic sequence databases
Genomic search tools can provide valuable insights into the chemical structure, evolutionary origin and biochemical function of genetic material. A homology search algorithm compares a protein or nucleotide query sequence to each entry in a large sequence database and reports alignments with highly similar sequences. The exponential growth of public data banks such as GenBank has necessitated the development of fast, heuristic approaches to homology search. The versatile and popular blast algorithm, developed by researchers at the US National Center for Biotechnology Information (NCBI), uses a four-stage heuristic approach to efficiently search large collections for analogous sequences while retaining a high degree of accuracy. Despite an abundance of alternative approaches to homology search, blast remains the only method to offer fast, sensitive search of large genomic collections on modern desktop hardware. As a result, the tool has found widespread use, with millions of queries posed each day. A significant investment of computing resources is required to process this large volume of genomic searches, and a cluster of over 200 workstations is employed by the NCBI to handle queries posed through the organisation's website. As the growth of sequence databases continues to outpace improvements in modern hardware, blast searches are becoming slower each year, and novel, faster methods for sequence comparison are required.
In this thesis we propose new techniques for fast yet accurate homology search that result in significantly faster blast searches. First, we describe improvements to the final, gapped alignment stages, where the query and sequences from the collection are aligned to provide a fine-grain measure of similarity. We describe three new methods for aligning sequences that roughly halve the time required to perform this computationally expensive stage.
Next, we investigate improvements to the first stage of search, where short regions of similarity between a pair of sequences are identified. We propose a novel deterministic finite automaton data structure that is significantly smaller than the codeword lookup table employed by ncbi-blast, resulting in improved cache performance and faster search times.
We also discuss fast methods for nucleotide sequence comparison. We describe novel approaches for processing sequences that are compressed using the byte-packed format already utilised by blast, where four nucleotide bases from a strand of DNA are stored in a single byte. Rather than decompress sequences to perform pairwise comparisons, our innovations permit sequences to be processed in their compressed form, four bases at a time. Our techniques roughly halve average query evaluation times for nucleotide searches with no effect on the sensitivity of blast.
Finally, we present a new scheme for managing the high degree of redundancy that is prevalent in genomic collections. Near-duplicate entries in sequence data banks are highly detrimental to retrieval performance; however, existing methods for managing redundancy are both slow, requiring almost ten hours to process the GenBank database, and crude, because they simply purge highly-similar sequences to reduce the level of internal redundancy. We describe a new approach for identifying near-duplicate entries that is roughly six times faster than the most successful existing approaches, and a novel approach to managing redundancy that reduces collection size and search times but still provides accurate and comprehensive search results.
Our improvements to blast have been integrated into our own version of the tool. We find that our innovations more than halve average search times for nucleotide and protein searches, and have no significant effect on search accuracy. Given the enormous popularity of blast, this represents a very significant advance in computational methods to aid life science research.
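The byte-packed nucleotide format mentioned above packs four 2-bit bases into each byte. The sketch below illustrates that representation and a byte-at-a-time comparison of two aligned sequences; it is an illustration of the idea only, not blast's actual comparison kernels, and the names and example sequences are ours.

    # Sketch: 2-bit packing of DNA and comparison four bases at a time.
    BASE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def pack(seq):
        """Pack a nucleotide string into bytes, four bases (2 bits each) per byte."""
        padded = seq + 'A' * (-len(seq) % 4)
        out = bytearray()
        for i in range(0, len(padded), 4):
            b = 0
            for ch in padded[i:i + 4]:
                b = (b << 2) | BASE[ch]
            out.append(b)
        return bytes(out)

    # For a pair of packed bytes, the XOR has a zero 2-bit field exactly where the
    # bases agree; precompute the number of agreeing fields for every XOR value.
    MATCHES = [sum((x >> s) & 3 == 0 for s in (0, 2, 4, 6)) for x in range(256)]

    def matching_bases(packed_a, packed_b):
        """Count agreeing base positions of two equal-length packed sequences."""
        return sum(MATCHES[a ^ b] for a, b in zip(packed_a, packed_b))

    print(matching_bases(pack("ACGTACGTACGT"), pack("ACGTTCGAACGT")))   # -> 10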
Recommended from our members
Codes for Synchronization in Channels and Sources with Edits
Edit channels are a class of communication channels where the output of the channel is
an edited version of the input. The edits are considered to be deletions and insertions.
DNA-based data storage systems are one of the motivations for this model. This thesis
studies various problems related to edit channels, as well as the edit synchronization problem.
Varshamov-Tenengolts (VT) codes are first introduced. These codes can correct a
single deletion or insertion and have a linear-time decoder. The problem of efficiently
encoding the non-binary version of VT codes is then addressed: a simple linear-time
method is proposed that systematically maps binary message sequences onto VT
codewords.
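For reference, the sketch below shows the classical binary VT construction and its linear-time single-deletion decoder, which are standard material the thesis builds on; it does not reproduce the thesis's systematic non-binary encoder.

    # Classical binary Varshamov-Tenengolts codes (standard background, not the
    # thesis's new contributions): VT_a(n) = { x in {0,1}^n : sum_i i*x_i = a (mod n+1) }.

    def vt_syndrome(x):
        """Weighted checksum sum_i i*x_i with 1-based positions."""
        return sum((i + 1) * b for i, b in enumerate(x))

    def vt_decode_deletion(y, n, a):
        """Recover the length-n codeword of VT_a(n) from y, which has lost one bit."""
        assert len(y) == n - 1
        w = sum(y)                            # weight of the received word
        d = (a - vt_syndrome(y)) % (n + 1)    # checksum deficiency caused by the deletion
        if d <= w:
            # A 0 was deleted: reinsert it with exactly d ones to its right.
            idx, ones = len(y), 0
            while ones < d:
                idx -= 1
                ones += y[idx]
            return y[:idx] + [0] + y[idx:]
        # A 1 was deleted: reinsert it with exactly d - w - 1 zeros to its left.
        idx, zeros = 0, 0
        while zeros < d - w - 1:
            zeros += 1 - y[idx]
            idx += 1
        return y[:idx] + [1] + y[idx:]

    x = [0, 1, 0, 1, 1, 0]                    # syndrome 11, so x belongs to VT_4(6)
    a = vt_syndrome(x) % (len(x) + 1)
    y = x[:2] + x[3:]                         # the channel deletes one bit
    print(vt_decode_deletion(y, len(x), a) == x)   # -> True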
Another model studied is the segmented edit channel, where we have the
additional assumption that the channel input sequence is implicitly divided into
segments such that at most one edit can occur within a segment. A code construction
is proposed for this model based on subsets of VT codes chosen with pre-determined
prefixes and/or suffixes. An upper bound is also derived on the rate of any zero-error
code for the segmented edit channel in terms of the segment length. This upper bound
shows that the rate scaling of the proposed codes as the segment length increases is
the same as that of the maximal code.
Edit synchronization is another problem studied in this thesis. In this model, there
are two remote nodes (encoder and decoder), each having a binary sequence. The
sequence X, available at the encoder, is the updated sequence and differs from Y
(available at the decoder) by a small number of edits. The goal is to construct a message
M, to be sent via a one-way error-free link, such that the decoder can reconstruct X
using M and Y. A coding scheme is devised for this one-way synchronization model.
The scheme is based on multiple layers of VT codes combined with off-the-shelf linear
error-correcting codes and uses a list decoder.
Motivated by the sequence reconstruction problem from traces in DNA-based storage, the problem of designing codes for the deletion channel when multiple observations
(or traces) are available to the decoder is considered. A simple binary and non-binary
code is proposed that splits the codeword into blocks and employs a VT code in each
block. The availability of multiple traces helps the decoder to identify deletion-free
copies of a block, and to avoid mis-synchronization while decoding. The encoding
complexity of the proposed scheme is linear in the codeword length; the decoding
complexity is linear in the codeword length and quadratic in the number of deletions
and the number of traces. The list decoding technique for the proposed code is also
considered
Gbit/second lossless data compression hardware
This thesis investigates how to improve the performance of lossless data compression hardware
as a tool to reduce the cost per bit stored in a computer system or transmitted over a
communication network.
Lossless data compression allows the exact reconstruction of the original data after
decompression. Its deployment in some high-bandwidth applications has been hampered by
performance limitations in the compression hardware, which must match the throughput of the
original system to avoid becoming a bottleneck. Advancing lossless data compression
hardware therefore offers the potential of doubling the performance of the
system that incorporates it with minimal investment.
This work starts by presenting an analysis of current compression methods with the objective of
identifying the factors that limit performance and those that increase it. [Continues.]