Estimating Cardinalities with Deep Sketches
We introduce Deep Sketches, which are compact models of databases that allow
us to estimate the result sizes of SQL queries. Deep Sketches are powered by a
new deep learning approach to cardinality estimation that can capture
correlations between columns, even across tables. Our demonstration allows
users to define such sketches on the TPC-H and IMDb datasets, monitor the
training process, and run ad-hoc queries against trained sketches. We also
estimate query cardinalities with HyPer and PostgreSQL to visualize the gains
over traditional cardinality estimators. Comment: To appear in SIGMOD'1
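The abstract gives no implementation details, so the following is only a hedged illustration of learned cardinality estimation in general, not the Deep Sketches architecture: a small regression model is trained to predict the cardinality of range predicates over one synthetic column. The featurization, model choice, and all names are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # stand-in model; the paper's architecture differs

rng = np.random.default_rng(0)

# Synthetic single-column table with a skewed value distribution.
column = rng.zipf(1.5, size=100_000)
column = column[column < 1000]

def featurize(lo, hi):
    # Encode a range predicate lo <= col < hi as normalized bounds (an assumed featurization).
    return [lo / 1000.0, hi / 1000.0]

# Training data: random range predicates paired with their true result sizes.
queries, cards = [], []
for _ in range(2000):
    lo, hi = sorted(rng.integers(0, 1000, size=2))
    queries.append(featurize(lo, hi))
    cards.append(np.count_nonzero((column >= lo) & (column < hi)))

# Predict log-cardinality, a common trick to cope with the large dynamic range.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
model.fit(queries, np.log1p(cards))

est = np.expm1(model.predict([featurize(100, 200)]))[0]
true = np.count_nonzero((column >= 100) & (column < 200))
print(f"estimated ~{est:.0f} rows, true {true} rows")
```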
Statistical structures for internet-scale data management
Efficient query processing in traditional database management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statistics management still poses major challenges. With the work in this paper we aim to endow peer-to-peer data management over structured overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge. To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We show how to construct several types of histograms, such as simple Equi-Width, Average-Shifted Equi-Width, and Equi-Depth histograms. We present a full-fledged open-source implementation of these tools for distributed statistical synopses, and report on a comprehensive experimental evaluation of our contributions in terms of efficiency, accuracy, and scalability.
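As a hedged, purely local illustration of the synopses named above (not the paper's decentralized overlay algorithms), the sketch below computes the basic aggregates and an equi-width histogram over an in-memory array; all names are placeholders and the peer-to-peer aspects are omitted.

```python
import numpy as np

def equi_width_histogram(values, num_buckets, lo, hi):
    """Equi-width histogram: equal-size value ranges with per-bucket counts.

    In the paper these counts would be assembled in a decentralized way over a
    structured overlay; this local version only illustrates the synopsis itself.
    """
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), num_buckets - 1)  # clamp out-of-range values into the edge buckets
        counts[idx] += 1
    return counts

data = np.random.default_rng(1).normal(50, 15, size=10_000)

# The key aggregates from the abstract, computed locally for illustration.
count = len(data)
total = float(np.sum(data))
average = total / count
count_distinct = len(set(np.round(data, 1)))  # distinct values after coarse rounding, for demonstration

print("Count:", count, "Sum:", round(total, 1), "Average:", round(average, 2), "CountDistinct:", count_distinct)
print("Equi-width histogram:", equi_width_histogram(data, 10, 0, 100))
```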
One Table to Count Them All: Parallel Frequency Estimation on Single-Board Computers
Sketches are probabilistic data structures that can provide approximate
results within mathematically proven error bounds while using orders of
magnitude less memory than traditional approaches. They are tailored for
streaming data analysis, even on architectures with limited memory such as the
single-board computers widely used for IoT and edge computing.
Since these devices offer multiple cores, efficient parallel sketching
schemes enable them to handle high-volume data streams. However, since
their caches are relatively small, careful parallelization is required. In
this work, we focus on the frequency estimation problem and evaluate the
performance of a high-end server, a 4-core Raspberry Pi and an 8-core Odroid.
As a sketch, we employed the widely used Count-Min Sketch. To hash the stream
in parallel and in a cache-friendly way, we applied a novel tabulation approach
and rearranged the auxiliary tables into a single one. To parallelize the
process efficiently, we modified the workflow and applied a form of
buffering between hash computations and sketch updates. Today, many
single-board computers have heterogeneous processors in which slow and fast
cores are equipped together. To utilize all these cores to their full
potential, we proposed a dynamic load-balancing mechanism which significantly
increased the performance of frequency estimation. Comment: 12 pages, 4 figures, 3 algorithms, 1 table, submitted to EuroPar'1
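As a hedged illustration of the underlying data structure, the sketch below implements a basic sequential Count-Min Sketch; the paper's tabulation hashing, merged tables, buffering, and load balancing are not reproduced, and the simple seeded hash is an assumption.

```python
import numpy as np

class CountMinSketch:
    """Basic Count-Min Sketch: depth rows of width counters, one hash per row.

    The paper merges the tabulation-hashing tables and buffers updates for
    cache efficiency; this sequential version shows only the core structure.
    """

    def __init__(self, width=2048, depth=4, seed=7):
        self.width, self.depth = width, depth
        self.table = np.zeros((depth, width), dtype=np.uint32)
        # Per-row hash seeds (simple seeded hashing stands in for tabulation hashing).
        self.seeds = np.random.default_rng(seed).integers(1, 2**31, size=depth)

    def _index(self, row, item):
        return hash((int(self.seeds[row]), item)) % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row, self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows is the tightest estimate.
        return int(min(self.table[row, self._index(row, item)] for row in range(self.depth)))

# Usage: feed a skewed stream and query a heavy hitter's frequency.
cms = CountMinSketch()
stream = np.random.default_rng(3).zipf(1.3, size=50_000)
for x in stream:
    cms.update(int(x))
print("estimated freq of 1:", cms.estimate(1), "true:", int(np.count_nonzero(stream == 1)))
```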
Cross-Paced Representation Learning with Partial Curricula for Sketch-based Image Retrieval
In this paper we address the problem of learning robust cross-domain
representations for sketch-based image retrieval (SBIR). While most SBIR
approaches focus on extracting low- and mid-level descriptors for direct
feature matching, recent works have shown the benefit of learning coupled
feature representations to describe data from two related sources. However,
cross-domain representation learning methods are typically cast as non-convex
minimization problems that are difficult to optimize, leading to unsatisfactory
performance. Inspired by self-paced learning, a learning methodology designed
to overcome convergence issues related to local optima by exploiting the
samples in a meaningful order (i.e. easy to hard), we introduce the cross-paced
partial curriculum learning (CPPCL) framework. Compared with existing
self-paced learning methods which only consider a single modality and cannot
deal with prior knowledge, CPPCL is specifically designed to assess the
learning pace by jointly handling data from dual sources and modality-specific
prior information provided in the form of partial curricula. Additionally,
thanks to the learned dictionaries, we demonstrate that the proposed CPPCL
embeds robust coupled representations for SBIR. Our approach is extensively
evaluated on four publicly available datasets (CUFS, Flickr15K, QueenMary
SBIR, and TU-Berlin Extension), showing superior performance over
competing SBIR methods.
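As a hedged illustration of the generic self-paced selection rule that motivates CPPCL (not the CPPCL objective itself, which jointly handles two modalities and partial curricula), the snippet below applies the standard hard-threshold weighting in which a growing pace parameter admits progressively harder samples; all names and values are illustrative.

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Hard self-paced weights: include a sample only if its loss is below the pace lambda."""
    return (losses < lam).astype(float)

rng = np.random.default_rng(0)
losses = rng.gamma(shape=2.0, scale=1.0, size=10)  # stand-in per-sample losses

# As the pace parameter lambda grows, harder (higher-loss) samples are admitted.
for lam in (0.5, 1.5, 3.0):
    weights = self_paced_weights(losses, lam)
    print(f"lambda={lam}: {int(weights.sum())} of {len(losses)} samples selected")
```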