
    Data structures and compression algorithms for high-throughput sequencing technologies

    Abstract
    Background: High-throughput sequencing (HTS) technologies play important roles in the life sciences by allowing the rapid parallel sequencing of very large numbers of relatively short nucleotide sequences, in applications ranging from genome sequencing and resequencing to digital microarrays and ChIP-Seq experiments. As experiments scale up, HTS technologies create new bioinformatics challenges for the storage and sharing of HTS data.
    Results: We develop data structures and compression algorithms for HTS data. A processing stage maps short sequences to a reference genome or a large table of sequences. Then the integers representing the short sequence absolute or relative addresses, their lengths, and the substitutions they may contain are compressed and stored using various entropy coding algorithms, including both old and new fixed codes (e.g. Golomb, Elias Gamma, MOV) and variable codes (e.g. Huffman). The general methodology is illustrated and applied to several HTS data sets. Results show that the information contained in HTS files can be compressed by a factor of 10 or more, depending on the statistical properties of the data sets and various other choices and constraints. Our algorithms fare well against general-purpose compression programs such as gzip, bzip2 and 7zip; timing results show that our algorithms are consistently faster than the best general-purpose compression programs.
    Conclusions: It is not likely that exactly one encoding strategy will be optimal for all types of HTS data. Different experimental conditions will generate different data distributions, for which one encoding strategy can be more effective than another. We have implemented some of our encoding algorithms in the software package GenCompress, which is available upon request from the authors. With the advent of HTS technology and increasingly new experimental protocols for using the technology, sequence databases are expected to continue rising in size. The methodology we have proposed is general, and these advanced compression techniques should allow researchers to manage and share their HTS data in a more timely fashion.
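    The address-coding step described above can be illustrated with a small sketch: mapped read positions are sorted, turned into gaps (relative addresses), and each gap is written with a Golomb code, one of the fixed codes named in the abstract. The Python below is a minimal, hypothetical illustration of that idea, not the GenCompress implementation; the parameter m, the function names, and the toy positions are assumptions.

    ```python
    def golomb_encode(n, m):
        """Golomb-code a non-negative integer n with parameter m >= 1.

        The quotient n // m is written in unary (q ones then a terminating
        zero); the remainder n % m is written in truncated binary so the
        smaller remainders save one bit. Returns a '0'/'1' string.
        """
        q, r = divmod(n, m)
        out = "1" * q + "0"                  # unary-coded quotient
        b = (m - 1).bit_length()             # ceil(log2(m)); 0 when m == 1
        cutoff = (1 << b) - m                # remainders below this fit in b-1 bits
        if r < cutoff:
            out += format(r, "b").zfill(b - 1) if b > 1 else ""
        else:
            out += format(r + cutoff, "b").zfill(b) if b > 0 else ""
        return out


    def encode_positions(sorted_positions, m=64):
        """Encode sorted mapped-read positions as Golomb-coded gaps.

        The first position is coded as-is; each later position is coded as
        the difference to its predecessor, which is typically small and
        therefore compresses well.
        """
        bits, prev = [], 0
        for p in sorted_positions:
            bits.append(golomb_encode(p - prev, m))
            prev = p
        return "".join(bits)


    if __name__ == "__main__":
        reads = [1042, 1050, 1077, 1121, 1500]   # toy mapped positions
        print(encode_positions(reads, m=32))
    ```

    In practice the Golomb parameter would be tuned to the gap distribution of the data set, which is exactly the kind of data-dependent choice the conclusions above refer to.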

    Hall Normalization Constants for the Bures Volumes of the n-State Quantum Systems

    We report the results of certain integrations of quantum-theoretic interest, relying, in this regard, upon recently developed parameterizations of Boya et al. of the n x n density matrices, in terms of squared components of the unit (n-1)-sphere and the n x n unitary matrices. Firstly, we express the normalized volume elements of the Bures (minimal monotone) metric for n = 2 and 3, obtaining thereby "Bures prior probability distributions" over the two- and three-state systems. Then, as an essential first step in extending these results to n > 3, we determine that the "Hall normalization constant" (C_{n}) for the marginal Bures prior probability distribution over the (n-1)-dimensional simplex of the n eigenvalues of the n x n density matrices is, for n = 4, equal to 71680/pi^2. Since we also find that C_{3} = 35/pi, it follows that C_{4} is simply equal to 2^{11} C_{3}/pi. (C_{2} itself is known to equal 2/pi.) The constant C_{5} is also found. It too is associated with a remarkably simple decomposition, involving the product of the eight consecutive prime numbers from 2 to 23. We also preliminarily investigate several cases, n > 5, with the use of quasi-Monte Carlo integration. We hope that the various analyses reported will prove useful in deriving a general formula (which evidence suggests will involve the Bernoulli numbers) for the Hall normalization constant for arbitrary n. This would have diverse applications, including quantum inference and universal quantum coding. Comment: 14 pages, LaTeX, 6 postscript figures. Revised version to appear in J. Phys. A. We make a few slight changes from the previous version, but also add a subsection (III G) in which several variations of the basic problem are newly studied. Rather strong evidence is adduced that the Hall constants are related to partial sums of denominators of the even-indexed Bernoulli numbers, although a general formula is still lacking.
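    As a quick arithmetic check, the constants quoted above are mutually consistent (values as reported in the abstract, not an independent derivation):

    ```latex
    % Consistency of the Hall normalization constants quoted in the abstract
    \[
      C_{2} = \frac{2}{\pi}, \qquad
      C_{3} = \frac{35}{\pi}, \qquad
      C_{4} = \frac{2^{11}\, C_{3}}{\pi}
            = \frac{2048 \cdot 35}{\pi^{2}}
            = \frac{71680}{\pi^{2}}.
    \]
    ```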

    Handling Massive N-Gram Datasets Efficiently

    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry, thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step, thanks to exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach. Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2.
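    The context-based remapping behind that compressed trie can be sketched as follows: for a fixed context of k preceding words, each distinct successor word is given a local identifier bounded by the number of words observed after that context, so the stored integers stay small. The Python below is only an illustrative sketch of this idea, with invented names and toy trigrams; the paper's data structure additionally stores these identifiers in compressed integer sequences inside a trie.

    ```python
    from collections import defaultdict

    def build_context_codes(ngrams):
        """Map each (context, word) pair to a small local integer id.

        For a fixed context of k preceding words, each distinct successor
        word gets an id in [0, #successors-of-context), so it fits in
        roughly log2(#successors) bits instead of log2(vocabulary size).
        """
        codes = defaultdict(dict)          # context -> {word: local id}
        for *context, word in ngrams:
            successors = codes[tuple(context)]
            if word not in successors:
                successors[word] = len(successors)
        return codes

    if __name__ == "__main__":
        trigrams = [("the", "quick", "fox"),
                    ("the", "quick", "dog"),
                    ("a", "quick", "fox")]
        codes = build_context_codes(trigrams)
        print(codes[("the", "quick")])     # {'fox': 0, 'dog': 1} -> 1 bit each
    ```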

    Attribute Value Reordering For Efficient Hybrid OLAP

    The normalization of a data cube is the ordering of the attribute values. For large multidimensional arrays where dense and sparse chunks are stored differently, proper normalization can lead to improved storage efficiency. We show that it is NP-hard to compute an optimal normalization even for 1x3 chunks, although we find an exact algorithm for 1x2 chunks. When dimensions are nearly statistically independent, we show that dimension-wise attribute frequency sorting is an optimal normalization and takes time O(d n log(n)) for data cubes of size n^d. When dimensions are not independent, we propose and evaluate several heuristics. The hybrid OLAP (HOLAP) storage mechanism is already 19%-30% more efficient than ROLAP, but normalization can improve it further by 9%-13%, for a total gain of 29%-44% over ROLAP.
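    The frequency-sorting normalization mentioned above can be sketched concretely: along each dimension, attribute values are reordered by decreasing frequency of allocated cells, so that the non-empty cells tend to cluster into a few dense chunks. The snippet below is a hypothetical Python illustration of that heuristic, not the authors' implementation; the toy cube and the function name are assumptions.

    ```python
    import numpy as np

    def frequency_sort_normalization(coords, shape):
        """Reorder the attribute values of each dimension by decreasing frequency.

        coords: array of shape (num_cells, d) listing the allocated (non-empty)
        cells of a d-dimensional cube of the given shape. Returns remapped
        coordinates in which frequent attribute values receive the smallest
        indices, so allocated cells cluster toward one corner of the cube.
        """
        coords = np.asarray(coords)
        remapped = np.empty_like(coords)
        for dim in range(coords.shape[1]):
            counts = np.bincount(coords[:, dim], minlength=shape[dim])
            order = np.argsort(-counts, kind="stable")     # most frequent value first
            inverse = np.empty_like(order)
            inverse[order] = np.arange(len(order))         # old index -> new index
            remapped[:, dim] = inverse[coords[:, dim]]
        return remapped

    if __name__ == "__main__":
        cells = [(0, 2), (3, 2), (3, 0), (3, 1), (1, 2)]   # toy sparse 4x3 cube
        print(frequency_sort_normalization(cells, shape=(4, 3)))
    ```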

    A "Learned" Approach to Quicken and Compress Rank/Select Dictionaries

    We address the well-known problem of designing, implementing and experimenting compressed data structures for supporting rank and select queries over a dictionary of integers. This problem has been studied far and wide since the end of the ‘80s, with tons of important theoretical and practical results. Following a recent line of research on the so-called learned data structures, we first show that this problem has a surprising connection with the geometry of a set of points in the Cartesian plane suitably derived from the input integers. We then build upon some classical results in computational geometry to introduce the first “learned” scheme for implementing a compressed rank/select dictionary. We prove theoretical bounds on its time and space performance both in the worst case and in the case of input distributions with finite mean and variance. We corroborate these theoretical results with a large set of experiments over datasets originating from a variety of sources and applications (Web, DNA sequencing, information retrieval and natural language processing), and we show that a carefully engineered version of our approach provides new interesting space-time trade-offs with respect to several well-established implementations of Elias-Fano, RRR-vector, and random-access vectors of Elias γ/δ-coded gaps.
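    The geometric idea can be made concrete: the sorted integers x_0 < x_1 < ... are viewed as points (i, x_i) in the plane, a (piecewise) linear model predicts x_i from i, and only small corrections to the prediction are stored. The class below is a toy single-segment sketch of that idea, with invented names; it is not the paper's engineered structure, which uses an optimal piecewise linear approximation with a bounded error and succinct encodings of the corrections.

    ```python
    class LearnedSelect:
        """Toy 'learned' rank/select structure over a sorted list of integers.

        The points (i, x_i) are approximated by a single line x ~ a*i + b;
        only the residuals x_i - round(a*i + b) are kept. select(i) is the
        model prediction plus the residual, and rank(x) is a binary search
        driven entirely by select(), so the original list is not stored.
        """
        def __init__(self, xs):
            n = len(xs)
            self.a = (xs[-1] - xs[0]) / (n - 1) if n > 1 else 0.0
            self.b = xs[0]
            self.residuals = [x - round(self.a * i + self.b) for i, x in enumerate(xs)]

        def select(self, i):
            """Return the i-th smallest stored integer (0-based)."""
            return round(self.a * i + self.b) + self.residuals[i]

        def rank(self, x):
            """Return how many stored integers are strictly smaller than x."""
            lo, hi = 0, len(self.residuals)
            while lo < hi:
                mid = (lo + hi) // 2
                if self.select(mid) < x:
                    lo = mid + 1
                else:
                    hi = mid
            return lo

    if __name__ == "__main__":
        d = LearnedSelect([3, 7, 12, 20, 25, 31, 40, 47])
        print(d.select(3), d.rank(25))     # 20 4
    ```

    When the data follow the model closely, the residuals are small and compress well, which is where the space savings over gap-coded bitvectors would come from.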