2,378 research outputs found
Handling Massive N-Gram Datasets Efficiently
This paper deals with the two fundamental problems concerning the handling of
large n-gram language models: indexing, that is compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, that have emerged as the de-facto choice for language modeling
in both academia and industry, thanks to their relatively low perplexity
performance. Estimating such models from large textual sources poses the
challenge of devising algorithms that make a parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory: we show
an improved construction that requires only one sorting step thanks to
exploiting the properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5X on the total running time of the state-of-the-art approach.Comment: Published in ACM Transactions on Information Systems (TOIS), February
2019, Article No: 2
Round-Hashing for Data Storage: Distributed Servers and External-Memory Tables
This paper proposes round-hashing, which is suitable for data storage on distributed servers and for implementing external-memory tables in which each lookup retrieves at most one single block of external memory, using a stash. For data storage, round-hashing is like consistent hashing as it avoids a full rehashing of the keys when new servers are added. Experiments show that the speed to serve requests is tenfold or more than the state of the art. In distributed data storage, this guarantees better throughput for serving requests and, moreover, greatly reduces decision times for which data should move to new servers as rescanning data is much faster
Learned Monotone Minimal Perfect Hashing
A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of
keys is a function that maps each key in S to its rank. On keys not in S, the
function returns an arbitrary value. Applications range from databases, search
engines, data encryption, to pattern-matching algorithms.
In this paper, we describe LeMonHash, a new technique for constructing MMPHFs
for integers. The core idea of LeMonHash is surprisingly simple and effective:
we learn a monotone mapping from keys to their rank via an error-bounded
piecewise linear model (the PGM-index), and then we solve the collisions that
might arise among keys mapping to the same rank estimate by associating small
integers with them in a retrieval data structure (BuRR). On synthetic random
datasets, LeMonHash needs 35% less space than the next best competitor, while
achieving about 16 times faster queries. On real-world datasets, the space
usage is very close to or much better than the best competitors, while
achieving up to 19 times faster queries than the next larger competitor. As far
as the construction of LeMonHash is concerned, we get an improvement by a
factor of up to 2, compared to the competitor with the next best space usage.
We also investigate the case of keys being variable-length strings,
introducing the so-called LeMonHash-VL: it needs space within 10% of the best
competitors while achieving up to 3 times faster queries
InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure
Volumetric models have become a popular representation for 3D scenes in
recent years. One breakthrough leading to their popularity was KinectFusion,
which focuses on 3D reconstruction using RGB-D sensors. However, monocular SLAM
has since also been tackled with very similar approaches. Representing the
reconstruction volumetrically as a TSDF leads to most of the simplicity and
efficiency that can be achieved with GPU implementations of these systems.
However, this representation is memory-intensive and limits applicability to
small-scale reconstructions. Several avenues have been explored to overcome
this. With the aim of summarizing them and providing for a fast, flexible 3D
reconstruction pipeline, we propose a new, unifying framework called InfiniTAM.
The idea is that steps like camera tracking, scene representation and
integration of new data can easily be replaced and adapted to the user's needs.
This report describes the technical implementation details of InfiniTAM v3,
the third version of our InfiniTAM system. We have added various new features,
as well as making numerous enhancements to the low-level code that
significantly improve our camera tracking performance. The new features that we
expect to be of most interest are (i) a robust camera tracking module; (ii) an
implementation of Glocker et al.'s keyframe-based random ferns camera
relocaliser; (iii) a novel approach to globally-consistent TSDF-based
reconstruction, based on dividing the scene into rigid submaps and optimising
the relative poses between them; and (iv) an implementation of Keller et al.'s
surfel-based reconstruction approach.Comment: This article largely supersedes arxiv:1410.0925 (it describes version
3 of the InfiniTAM framework
Knowledge Extraction in Video Through the Interaction Analysis of Activities
Video is a massive amount of data that contains complex interactions between moving objects. The extraction of knowledge from this type of information creates a demand for video analytics systems that uncover statistical relationships between activities and learn the correspondence between content and labels. However, those are open research problems that have high complexity when multiple actors simultaneously perform activities, videos contain noise, and streaming scenarios are considered. The techniques introduced in this dissertation provide a basis for analyzing video. The primary contributions of this research consist of providing new algorithms for the efficient search of activities in video, scene understanding based on interactions between activities, and the predicting of labels for new scenes
Recommended from our members
Distributed simulation and the grid: Position statements
The Grid provides a new and unrivaled technology for large scale distributed simulation as it enables collaboration and the use of distributed computing resources. This panel paper presents the views of four researchers in the area of Distributed Simulation and the Grid. Together we try to identify the main research issues involved in applying Grid technology to distributed simulation and the key future challenges that need to be solved to achieve this goal. Such challenges include not only technical challenges, but also political ones such as management methodology for the Grid and the development of standards. The benefits of the Grid to end-user simulation modelers also are discussed
- …