184 research outputs found

    Prospects and limitations of full-text index structures in genome analysis

    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article provides a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, and their practical limitations are explained and compared.

    Burrows–Wheeler compression: Principles and reflections

    After a general description of the Burrows–Wheeler transform and a brief survey of recent work on processing its output, the paper examines the coding of the zero-runs from the MTF recoding stage, an aspect with little prior treatment. It is concluded that the original scheme proposed by Wheeler is extremely efficient and unlikely to be much improved. The paper then proposes some new interpretations and uses of the Burrows–Wheeler transform, with new insights and approaches to lossless compression, perhaps including techniques from error correction.

    Empirical analysis of BWT-based lossless image compression

    The Burrows-Wheeler Transformation (BWT) is a text transformation algorithm originally designed to improve the coherence in text data. This coherence can be exploited by compression algorithms such as run-length encoding or arithmetic coding. However, there is still a debate on its performance on images. Motivated by a theoretical analysis of the performance of BWT and MTF, we perform a detailed empirical study on the role of MTF in compressing images with the BWT. This research studies the compression performance of BWT on digital images using different predictors and context partitions. The major interest of the research is in finding efficient ways to make BWT suitable for lossless image compression. This research studied three different approaches to improve the compression of image data by BWT. First, the idea of preprocessing the image data before sending it to the BWT compression scheme is studied by using different mapping and prediction schemes. Second, different variations of MTF were investigated to see which one works best for image compression with BWT. Third, the concept of context partitioning for BWT output, applied before the output is forwarded to the next stage in the compression scheme, was examined. For lossless image compression, this thesis proposes the removal of the MTF stage from the BWT compression pipeline and the usage of a context partitioning method. The compression performance is further improved by using the MED predictor on the image data along with the 8-bit mapping of the prediction residuals before they are processed by BWT. This thesis proposes two schemes for BWT-based image coding, namely BLIC and BLICx, the latter being based on the context-ordering property of the BWT. Our methods outperformed other text compression algorithms such as PPM, GZIP, direct BWT, and WinZip in compressing images. Final results showed that our methods performed better than the state-of-the-art lossless image compression algorithms, such as JPEG-LS, JPEG2000, CALIC, EDP and PPAM, on natural images.
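The preprocessing step mentioned in the abstract above (MED prediction followed by an 8-bit mapping of the residuals) might look roughly like the sketch below. The MED (median edge detector) rule is the one known from LOCO-I/JPEG-LS. The abstract does not define the 8-bit mapping or the border handling, so the wrap-around modulo 256 and the zero-valued out-of-image neighbours are assumptions, and the function names are hypothetical.

```python
from typing import List

Image = List[List[int]]  # rows of 8-bit grey values (0..255)


def med_predict(a: int, b: int, c: int) -> int:
    """MED predictor: a = left, b = above, c = above-left neighbour."""
    if c >= max(a, b):
        return min(a, b)
    if c <= min(a, b):
        return max(a, b)
    return a + b - c


def _neighbours(img: Image, y: int, x: int):
    # Assumption: neighbours outside the image are treated as 0.
    a = img[y][x - 1] if x > 0 else 0
    b = img[y - 1][x] if y > 0 else 0
    c = img[y - 1][x - 1] if x > 0 and y > 0 else 0
    return a, b, c


def med_residuals(img: Image) -> Image:
    """Replace each pixel by its prediction residual, wrapped to 0..255."""
    out = [[0] * len(row) for row in img]
    for y, row in enumerate(img):
        for x, pixel in enumerate(row):
            pred = med_predict(*_neighbours(img, y, x))
            out[y][x] = (pixel - pred) % 256  # assumed 8-bit mapping
    return out


def med_reconstruct(res: Image) -> Image:
    """Invert med_residuals by replaying the same predictions in scan order."""
    img = [[0] * len(row) for row in res]
    for y, row in enumerate(res):
        for x, r in enumerate(row):
            pred = med_predict(*_neighbours(img, y, x))
            img[y][x] = (pred + r) % 256
    return img
```

Because the prediction always lies between the smallest and largest neighbour, it stays within 0..255, so the modulo wrap is losslessly invertible. The residual rows would then be serialized and passed to the BWT stage.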

    On Undetected Redundancy in the Burrows-Wheeler Transform

    The Burrows-Wheeler Transform (BWT) is an invertible permutation of a text known to be highly compressible but also useful for sequence analysis, which makes the BWT highly attractive for lossless data compression. In this paper, we present a new technique to reduce the size of a BWT using its combinatorial properties, while keeping it invertible. The technique can be applied to any BWT-based compressor, and, as experiments show, is able to reduce the encoding size by 8-16% on average and up to 33-57% in the best cases (depending on the BWT-compressor used), making BWT-based compressors competitive or even superior to today's best lossless compressors.

    Burrows‐Wheeler post‐transformation with effective clustering and interpolative coding

    Lossless compression methods based on the Burrows‐Wheeler transform (BWT) are regarded as an excellent compromise between speed and compression efficiency: they provide compression rates close to the PPM algorithms, with the speed of dictionary‐based methods. Instead of the laborious statistics‐gathering process used in PPM, the BWT reversibly sorts the input symbols, using as the sort key as many following characters as necessary to make the sort unique. Characters occurring in similar contexts are sorted close together, resulting in a clustered symbol sequence. Run‐length encoding and Move‐to‐Front (MTF) recoding, combined with a statistical Huffman or arithmetic coder, are then typically used to exploit the clustering. A drawback of the MTF recoding is that knowledge of the character that produced the MTF number is lost. In this paper, we present a new, competitive Burrows‐Wheeler post‐transform stage that takes advantage of interpolative coding, a fast binary encoding method for integer sequences that can exploit clusters without requiring explicit statistics. We introduce a fast and simple way to retain knowledge of the run characters during the MTF recoding and use this to improve the clustering of MTF numbers and run‐lengths by applying reversible, stable sorting, with the run characters as sort keys, achieving significant improvement in the compression rate, as shown here by experiments on common text corpora.
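The classical pipeline described in the abstract above (BWT followed by MTF recoding) can be illustrated with a small sketch. It uses the naive sorted-rotations construction with an assumed end-of-text sentinel (`"\0"`), which is adequate for short strings only. It is not the post-transform stage proposed in the paper, and all function names are hypothetical.

```python
from typing import List


def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive BWT: sort all rotations of text + sentinel, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last: str) -> str:
    """Invert the BWT with the LF mapping; assumes the sentinel sorts first."""
    n = len(last)
    # A stable sort of positions by symbol yields the first column;
    # order[j] = i means first-column row j holds the symbol last[i].
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for j, i in enumerate(order):
        lf[i] = j
    out = []
    row = 0  # row 0 is the rotation that starts with the sentinel
    for _ in range(n - 1):
        out.append(last[row])  # symbol preceding the current rotation
        row = lf[row]
    return "".join(reversed(out))


def mtf_encode(data: str, alphabet: List[str]) -> List[int]:
    """Move-to-Front: emit each symbol's list position, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in data:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out


def mtf_decode(codes: List[int], alphabet: List[str]) -> str:
    """Inverse of mtf_encode, given the same initial alphabet order."""
    table = list(alphabet)
    out = []
    for idx in codes:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)
```

For `"banana"` the transform groups equal symbols into runs, which MTF turns into zeros. The MTF output alone no longer reveals which character produced each number, which is the drawback the abstract points out.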

    Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

    Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time, $O(m + occ)$, within $O(r \log\log_w(\sigma + n/r))$ space, for a text of length $n$ over an alphabet of size $\sigma$ on a RAM machine with words of $w = \Omega(\log n)$ bits. Within that space, our index can also count in optimal time, $O(m)$. Multiplying the space by $O(w/\log\sigma)$, we support count and locate in $O(\lceil m \log(\sigma)/w \rceil)$ and $O(\lceil m \log(\sigma)/w \rceil + occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r \log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r) + \ell \log(\sigma)/w)$. Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in $O(\log(n/r))$ time per operation. Comment: submitted version; optimal count and locate in smaller space: $O(r \log\log_w(n/r + \sigma))$.
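To make the quantities in the abstract above concrete, the sketch below computes the number of BWT runs (the measure r) and counts pattern occurrences with plain FM-index backward search. It is an uncompressed illustration only: rank queries are answered by scanning the BWT, so it achieves neither the O(r) space nor the time bounds of the paper. The sentinel `"\0"` and all function names are assumptions.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive BWT via sorted rotations of text + sentinel (small inputs only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(bwt_str: str) -> int:
    """Number of maximal runs of equal symbols in the BWT (the measure r)."""
    return sum(1 for i, ch in enumerate(bwt_str) if i == 0 or ch != bwt_str[i - 1])


def fm_count(bwt_str: str, pattern: str) -> int:
    """Count occurrences of a non-empty pattern by backward search."""
    # smaller[c] = number of symbols in the text that sort before c.
    freq = Counter(bwt_str)
    smaller = {}
    total = 0
    for ch in sorted(freq):
        smaller[ch] = total
        total += freq[ch]

    lo, hi = 0, len(bwt_str)  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in smaller:
            return 0
        # rank(ch, i) is computed naively with str.count on the prefix [0, i).
        lo = smaller[ch] + bwt_str.count(ch, 0, lo)
        hi = smaller[ch] + bwt_str.count(ch, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern symbol triggers two rank queries. A run-length index answers those queries from per-run data rather than from the full BWT, which is why its space can be bounded in terms of r.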

    ON THE COMPRESSION OF DIGITAL HOLOGRAMS

    This thesis investigates the compression of computer-generated transmission holograms through lossless schemes such as the Burrows-Wheeler compression scheme (BWCS). Ever since Gabor’s discovery of holography, much research has been done to adapt the recording and viewing of holograms to more convenient uses such as video viewing. However, the compression of holograms where recording is performed from virtual scenes has not received much attention. Phase-shift digital holograms, on the other hand, have received more attention due to their practical application in object recognition, imaging, and video sequencing of physical objects. This study is performed for virtually recorded computer-generated holograms in order to understand compression factors in virtually recorded holograms. We also investigate the application of lossless compression schemes to holograms with reduced precision for the intensity and phase values. The overall objective is to explore the factors that affect effective compression of virtual holograms. As a result, this work can be used to assist in the designing of better compression algorithms for applications such as virtual object simulations, video gaming applications, and holographic video viewing.

    Hadooping the genome: The impact of big data tools on biology

    This essay examines the consequences of the so-called ‘big data’ technologies in biomedicine. Analyzing algorithms and data structures used by biologists can provide insight into how biologists perceive and understand their objects of study. As such, I examine some of the most widely used algorithms in genomics: those used for sequence comparison or sequence mapping. These algorithms are derived from the powerful tools for text searching and indexing that have been developed since the 1950s and now play an important role in online search. In biology, sequence comparison algorithms have been used to assemble genomes, process next-generation sequence data, and, most recently, for ‘precision medicine.’ I argue that the predominance of a specific set of text-matching and pattern-finding tools has influenced problem choice in genomics. It allowed genomics to continue to think of genomes as textual objects and to increasingly lock genomics into ‘big data’-driven text-searching methods. Many ‘big data’ methods are designed for finding patterns in human-written texts. However, genomes and other ’omic data are not human-written and are unlikely to be meaningful in the same way.

    ALFALFA : fast and accurate mapping of long next generation sequencing reads
