Handling Massive N-Gram Datasets Efficiently
This paper deals with the two fundamental problems concerning the handling of
large n-gram language models: indexing, that is compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, which have emerged as the de facto choice for language modeling
in both academia and industry thanks to their relatively low perplexity.
Estimating such models from large textual sources poses the
challenge of devising algorithms that make parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory; we show
an improved construction that requires only one sorting step by
exploiting the properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5X on the total running time of the state-of-the-art approach.
Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2
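The context-based encoding described in this abstract can be illustrated with a minimal sketch. It is not the paper's compressed trie: the helper names (`build_successor_tables`, `remap`, `unmap`) and the dictionary-of-sorted-lists layout are assumptions made for illustration. The point it shows is that a word is stored as its rank among the words seen after its k-word context, so the stored integer is bounded by the number of successors of that context rather than by the vocabulary size.

```python
from bisect import bisect_left
from collections import defaultdict


def build_successor_tables(ngrams, k):
    """Map every k-word context to the sorted list of words that follow it."""
    successors = defaultdict(set)
    for gram in ngrams:
        for i in range(len(gram) - k):
            context = tuple(gram[i:i + k])
            successors[context].add(gram[i + k])
    return {ctx: sorted(words) for ctx, words in successors.items()}


def remap(context, word, tables):
    """Encode `word` as its rank among the successors of `context`."""
    words = tables[tuple(context)]
    rank = bisect_left(words, word)
    if rank == len(words) or words[rank] != word:
        raise KeyError(f"{word!r} never follows {context!r}")
    return rank


def unmap(context, code, tables):
    """Decode a rank back into the word it stands for."""
    return tables[tuple(context)][code]


if __name__ == "__main__":
    grams = [("the", "cat", "sat"), ("the", "cat", "ran"), ("the", "dog", "sat")]
    tables = build_successor_tables(grams, k=1)
    # The same word gets a different small code under different contexts.
    print(remap(("cat",), "sat", tables), remap(("dog",), "sat", tables))
```

Because natural-language contexts tend to have few distinct successors, the ranks stay small and can be stored in few bits, at the cost of one extra lookup in the context's successor list when decoding.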
Regular Expression Search on Compressed Text
We present an algorithm for searching regular expression matches in
compressed text. The algorithm reports the number of matching lines in the
uncompressed text in time linear in the size of its compressed version. We
define efficient data structures that yield nearly optimal complexity bounds
and provide a sequential implementation --zearch-- that requires up to 25% less
time than the state of the art.
Comment: 10 pages, published in Data Compression Conference (DCC'19)
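One standard way to run an automaton over grammar-compressed text without decompressing it is sketched below. It is a simplified illustration, not the zearch algorithm: it assumes a deterministic automaton, a straight-line grammar whose rules each have exactly two symbols on the right-hand side and are listed bottom-up, and it only decides whether the whole text is accepted rather than counting matching lines. The function name `transition_maps` and the data layout are hypothetical.

```python
def transition_maps(delta, num_states, alphabet, rules):
    """For every grammar symbol, compute the map state -> state reached
    after reading the symbol's full expansion.

    delta: dict mapping (state, char) to the next state.
    rules: list of (variable, (left, right)) pairs, listed bottom-up,
           so both right-hand symbols are already known when a rule is seen.
    """
    maps = {}
    for ch in alphabet:
        maps[ch] = tuple(delta[(q, ch)] for q in range(num_states))
    for var, (left, right) in rules:
        first, second = maps[left], maps[right]
        # Reading "left right" means applying left's map, then right's map.
        maps[var] = tuple(second[first[q]] for q in range(num_states))
    return maps


if __name__ == "__main__":
    # DFA over {a, b} whose state 2 is an accepting sink reached once "ab" is seen.
    delta = {
        (0, "a"): 1, (0, "b"): 0,
        (1, "a"): 1, (1, "b"): 2,
        (2, "a"): 2, (2, "b"): 2,
    }
    # X1 -> "ba", X2 -> X1 X1 ("baba"), X3 -> X2 X2 ("babababa").
    rules = [("X1", ("b", "a")), ("X2", ("X1", "X1")), ("X3", ("X2", "X2"))]
    maps = transition_maps(delta, 3, "ab", rules)
    accepted = maps["X3"][0] == 2  # start in state 0, accept in state 2
    print(accepted)
```

Each rule is processed once with work proportional to the number of automaton states, so the running time depends on the size of the grammar and not on the length of the uncompressed text.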
Sparse Recovery with Very Sparse Compressed Counting
Compressed sensing (sparse signal recovery) often encounters nonnegative data
(e.g., images). Recently we developed the methodology of using (dense)
Compressed Counting for recovering nonnegative K-sparse signals. In this paper,
we adopt very sparse Compressed Counting for nonnegative signal recovery. Our
design matrix is sampled from a maximally-skewed p-stable distribution (0<p<1),
and we sparsify the design matrix so that on average (1-g)-fraction of the
entries become zero. The idea is related to very sparse stable random
projections (Li et al 2006 and Li 2007), the prior work for estimating summary
statistics of the data.
In our theoretical analysis, we show that, when p->0, it suffices to use M =
K/(1-exp(-gK)) log N measurements, so that all coordinates can be recovered in
one scan of the coordinates. If g = 1 (i.e., dense design), then M = K log N.
If g = 1/K or 2/K (i.e., very sparse design), then M = 1.58K log N or M = 1.16K
log N, respectively. This means the design matrix can indeed be very sparse at only a minor
inflation of the sample complexity.
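The constants quoted above follow directly from the factor 1/(1-exp(-gK)) in the bound for p->0. The short check below evaluates that factor; the function names are illustrative, and the use of the natural logarithm for log N is an assumption, since the abstract does not state the base.

```python
import math


def measurement_factor(g, K):
    """Multiplier c in M = c * K * log N, with c = 1 / (1 - exp(-g * K))."""
    return 1.0 / (1.0 - math.exp(-g * K))


def num_measurements(g, K, N):
    """M = K / (1 - exp(-g * K)) * log N (natural log assumed here)."""
    return measurement_factor(g, K) * K * math.log(N)


if __name__ == "__main__":
    K = 100  # illustrative sparsity level, not a value from the paper
    for label, g in [("g = 1", 1.0), ("g = 1/K", 1.0 / K), ("g = 2/K", 2.0 / K)]:
        print(label, round(measurement_factor(g, K), 2))
```

With g = 1/K the factor is 1/(1 - e^{-1}), about 1.58, and with g = 2/K it is 1/(1 - e^{-2}), about 1.16. With g = 1 the exponential term is negligible for any sizeable K, so the factor is essentially 1.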
Interestingly, as p->1, the required number of measurements is essentially M
= 2.7K log N, provided g = 1/K. It turns out that this result is a general
worst-case bound.
Efficient high-dimensional entanglement imaging with a compressive sensing, double-pixel camera
We implement a double-pixel, compressive sensing camera to efficiently
characterize, at high resolution, the spatially entangled fields produced by
spontaneous parametric downconversion. This technique leverages sparsity in
spatial correlations between entangled photons to improve acquisition times
over raster-scanning by a scaling factor up to n^2/log(n) for n-dimensional
images. We image at resolutions up to 1024 dimensions per detector and
demonstrate a channel capacity of 8.4 bits per photon. By comparing the
classical mutual information in conjugate bases, we violate an entropic
Einstein-Podolsky-Rosen separability criterion for all measured resolutions.
More broadly, our result indicates compressive sensing can be especially
effective for higher-order measurements on correlated systems.
Comment: 10 pages, 7 figures