Data structures and compression algorithms for high-throughput sequencing technologies
Abstract

Background: High-throughput sequencing (HTS) technologies play important roles in the life sciences by allowing the rapid parallel sequencing of very large numbers of relatively short nucleotide sequences, in applications ranging from genome sequencing and resequencing to digital microarrays and ChIP-Seq experiments. As experiments scale up, HTS technologies create new bioinformatics challenges for the storage and sharing of HTS data.

Results: We develop data structures and compression algorithms for HTS data. A processing stage maps short sequences to a reference genome or a large table of sequences. The integers representing the short sequences' absolute or relative addresses, their lengths, and the substitutions they may contain are then compressed and stored using various entropy-coding algorithms, including both old and new fixed codes (e.g., Golomb, Elias Gamma, MOV) and variable codes (e.g., Huffman). The general methodology is illustrated and applied to several HTS data sets. Results show that the information contained in HTS files can be compressed by a factor of 10 or more, depending on the statistical properties of the data sets and various other choices and constraints. Our algorithms fare well against general-purpose compression programs such as gzip, bzip2, and 7zip; timing results show that our algorithms are consistently faster than the best general-purpose compression programs.

Conclusions: It is unlikely that a single encoding strategy will be optimal for all types of HTS data. Different experimental conditions will generate different data distributions, under which one encoding strategy can be more effective than another. We have implemented some of our encoding algorithms in the software package GenCompress, which is available upon request from the authors. With the advent of HTS technology and increasingly new experimental protocols for using the technology, sequence databases are expected to continue growing in size. The methodology we have proposed is general, and these compression techniques should allow researchers to manage and share their HTS data in a more timely fashion.
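As an illustration of the kind of entropy coding the abstract mentions, here is a minimal Golomb-coding sketch (ours, not GenCompress's actual on-disk format) that encodes the gaps between mapped read positions; the parameter m = 4 and the bit-string representation are illustrative assumptions:

```python
def golomb_encode(n, m):
    """Encode non-negative integer n with Golomb parameter m as a bit string."""
    q, r = divmod(n, m)
    bits = "1" * q + "0"               # unary quotient, 0-terminated
    # truncated binary encoding of the remainder
    b = m.bit_length()
    cutoff = (1 << b) - m              # remainders below cutoff use b-1 bits
    if r < cutoff:
        bits += format(r, f"0{b - 1}b") if b > 1 else ""
    else:
        bits += format(r + cutoff, f"0{b}b")
    return bits

# gaps between consecutive mapped positions are small, so Golomb codes are short
positions = sorted([105, 210, 212, 400, 405])
gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
encoded = "".join(golomb_encode(g, 4) for g in gaps)
```

Because mapped reads cluster along the reference, the gaps are typically much smaller than the absolute addresses, which is exactly the situation Golomb-style codes are designed for.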
Hall Normalization Constants for the Bures Volumes of the n-State Quantum Systems
We report the results of certain integrations of quantum-theoretic interest,
relying, in this regard, upon recently developed parameterizations by Boya et
al. of the n x n density matrices, in terms of squared components of the unit
(n-1)-sphere and the n x n unitary matrices. Firstly, we express the normalized
volume elements of the Bures (minimal monotone) metric for n = 2 and 3,
obtaining thereby "Bures prior probability distributions" over the two- and
three-state systems. Then, as an essential first step in extending these
results to n > 3, we determine that the "Hall normalization constant" (C_{n})
for the marginal Bures prior probability distribution over the
(n-1)-dimensional simplex of the n eigenvalues of the n x n density matrices
is, for n = 4, equal to 71680/pi^2. Since we also find that C_{3} = 35/pi, it
follows that C_{4} is simply equal to 2^{11} C_{3}/pi. (C_{2} itself is known
to equal 2/pi.) The constant C_{5} is also found. It too is associated with a
remarkably simple decomposition, involving the product of the eight consecutive
prime numbers from 2 to 23.
We also preliminarily investigate several cases, n > 5, with the use of
quasi-Monte Carlo integration. We hope that the various analyses reported will
prove useful in deriving a general formula (which evidence suggests will
involve the Bernoulli numbers) for the Hall normalization constant for
arbitrary n. This would have diverse applications, including quantum inference
and universal quantum coding.

Comment: 14 pages, LaTeX, 6 postscript figures. Revised version to appear in
J. Phys. A. We make a few slight changes from the previous version, but also
add a subsection (III G) in which several variations of the basic problem are
newly studied. Rather strong evidence is adduced that the Hall constants are
related to partial sums of denominators of the even-indexed Bernoulli
numbers, although a general formula is still lacking.
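The relation between the quoted constants is easy to verify numerically; a small sanity check (ours, using only the values stated in the abstract):

```python
from math import pi, isclose

# constants as given in the abstract
C2 = 2 / pi
C3 = 35 / pi
C4 = 71680 / pi**2

# check the claimed relation C_4 = 2^{11} * C_3 / pi
assert isclose(C4, 2**11 * C3 / pi, rel_tol=1e-12)
# equivalently, as an exact integer identity: 2^{11} * 35 = 71680
assert 2**11 * 35 == 71680
```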
Handling Massive N-Gram Datasets Efficiently
This paper deals with the two fundamental problems concerning the handling of
large n-gram language models: indexing, that is compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, which have emerged as the de facto choice for language modeling
in both academia and industry, thanks to their relatively low perplexity
performance. Estimating such models from large textual sources poses the
challenge of devising algorithms that make a parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory: we show
an improved construction that requires only one sorting step thanks to
exploiting the properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5X on the total running time of the state-of-the-art approach.

Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2.
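The context-bounded integer encoding described above can be sketched as follows (a simplified illustration of the idea, not the paper's actual trie layout; all names are ours):

```python
from collections import defaultdict

def build_successor_maps(ngrams):
    """Map each context (all words but the last) to its ordered list of successors."""
    succ = defaultdict(list)
    for *context, word in ngrams:
        key = tuple(context)
        if word not in succ[key]:
            succ[key].append(word)
    return succ

def encode(ngrams, succ):
    """Replace each n-gram's last word by its index in the context's successor list.

    The resulting integers are bounded by the per-context branching factor,
    which is typically tiny in natural language, rather than by the vocabulary size.
    """
    return [succ[tuple(ctx)].index(w) for *ctx, w in ngrams]

bigrams = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("the", "cat")]
succ = build_successor_maps(bigrams)
codes = encode(bigrams, succ)   # small, compressible integers
```

The point of the remapping is that the integers to be compressed now live in a range proportional to how many distinct words follow each context, which is what enables the space reductions the abstract reports.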
Attribute Value Reordering For Efficient Hybrid OLAP
The normalization of a data cube is the ordering of the attribute values. For
large multidimensional arrays where dense and sparse chunks are stored
differently, proper normalization can lead to improved storage efficiency. We
show that it is NP-hard to compute an optimal normalization even for 1x3
chunks, although we find an exact algorithm for 1x2 chunks. When dimensions are
nearly statistically independent, we show that dimension-wise attribute
frequency sorting is an optimal normalization and takes time O(d n log(n)) for
data cubes of size n^d. When dimensions are not independent, we propose and
evaluate several heuristics. The hybrid OLAP (HOLAP) storage mechanism is
already 19%-30% more efficient than ROLAP, but normalization can improve it
further by 9%-13%, for a total gain of 29%-44% over ROLAP.
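A minimal sketch of dimension-wise attribute frequency sorting as we understand it from the abstract (function names and the tie-breaking rule are our assumptions):

```python
from collections import Counter

def frequency_sort_dimension(cells, dim):
    """Return a value -> new position map for one dimension of a sparse cube."""
    counts = Counter(cell[dim] for cell in cells)
    # most frequent attribute values first; ties broken by value for determinism
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: rank for rank, value in enumerate(ordered)}

def normalize(cells):
    """Reorder every dimension independently; sorting dominates, O(d n log n) total."""
    d = len(cells[0])
    maps = [frequency_sort_dimension(cells, i) for i in range(d)]
    return [tuple(maps[i][c[i]] for i in range(d)) for c in cells]

# each tuple is the coordinates of a nonempty cell in a 2-dimensional cube
cells = [("x", 1), ("x", 2), ("y", 1), ("x", 1)]
normalized = normalize(cells)
```

Sorting each dimension by frequency concentrates the frequent values (and hence the dense cells) in the same corner of the array, which is what lets the dense chunks be stored compactly.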
A "Learned" Approach to Quicken and Compress Rank/Select Dictionaries
We address the well-known problem of designing, implementing and experimenting compressed data structures for supporting rank and select queries over a dictionary of integers. This problem has been studied far and wide since the end of the '80s with tons of important theoretical and practical results.
Following a recent line of research on the so-called learned data structures, we first show that this problem has a surprising connection with the geometry of a set of points in the Cartesian plane suitably derived from the input integers. We then build upon some classical results in computational geometry to introduce the first "learned" scheme for implementing a compressed rank/select dictionary. We prove theoretical bounds on its time and space performance both in the worst case and in the case of input distributions with finite mean and variance.
We corroborate these theoretical results with a large set of experiments over datasets originating from a variety of sources and applications (Web, DNA sequencing, information retrieval and natural language processing), and we show that a carefully engineered version of our approach provides new interesting space-time trade-offs with respect to several well-established implementations of Elias-Fano, RRR-vector, and random-access vectors of Elias γ/δ-coded gaps.
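The geometric idea, treating the sorted integers as points (i, x_i) in the plane and approximating them with linear segments, can be illustrated with a deliberately simplified single-segment sketch (ours; the actual scheme is piecewise linear with bounded residuals):

```python
def build(xs):
    """Fit one line through the points (i, xs[i]) and store per-point corrections.

    Assumes xs is sorted and has at least two elements. A real learned
    rank/select structure would use many segments so corrections stay small.
    """
    n = len(xs)
    slope = (xs[-1] - xs[0]) / (n - 1)
    residuals = [x - round(xs[0] + slope * i) for i, x in enumerate(xs)]
    return slope, xs[0], residuals

def select(model, i):
    """select(i): evaluate the line at i, then add the stored correction."""
    slope, x0, residuals = model
    return round(x0 + slope * i) + residuals[i]

xs = [3, 7, 12, 20, 25, 31]
model = build(xs)
```

When the input is close to linear (i.e., the gaps are nearly uniform), the residuals are small integers, so the line's two parameters plus compressed corrections can undercut a verbatim encoding of the dictionary.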