Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space
For a set of n points in \mathbb{R}^d, and parameters k and \eps, we present
a data structure that answers (1+\eps,k)-\ANN queries in logarithmic time.
Surprisingly, the space used by the data structure is \Otilde(n/k); that
is, the space used is sublinear in the input size if k is sufficiently large.
Our approach provides a novel way to summarize geometric data, such that
meaningful proximity queries on the data can be carried out using this sketch.
Using this, we provide a sublinear space data-structure that can estimate the
density of a point set under various measures, including:
(i) the sum of distances of the k closest points to the query point, and
(ii) the sum of squared distances of the k closest points to the query point.
Our approach generalizes to other distance based estimation of densities of
similar flavor. We also study the problem of approximating some of these
quantities when using sampling. In particular, we show that a sample of size
\Otilde (n /k) is sufficient, in some restricted cases, to estimate the above
quantities. Remarkably, the sample size has only linear dependency on the
dimension.
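For orientation, the two density measures named in this abstract can be computed exactly by brute force in linear time per query. The sketch below is only that baseline, not the sublinear-space structure the abstract describes; the function name `knn_distance_sum` and its parameters are illustrative assumptions.

```python
import heapq
import math


def knn_distance_sum(points, query, k, squared=False):
    """Exact sum of (optionally squared) distances from `query` to its k nearest points.

    Brute-force baseline: scans every point, so it uses linear time per query.
    The name and signature are hypothetical, chosen for illustration only.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    # Euclidean distance from the query to every input point.
    dists = (math.dist(p, query) for p in points)
    # Keep only the k smallest distances.
    nearest = heapq.nsmallest(k, dists)
    if squared:
        return sum(d * d for d in nearest)
    return sum(nearest)
```

A small value indicates a dense neighborhood around the query; a large value indicates a sparse one.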
Robust Proximity Search for Balls using Sublinear Space
Given a set of n disjoint balls b_1, \ldots, b_n in \mathbb{R}^d, we provide a data
structure, of near linear size, that can answer (1 \pm \epsilon)-approximate
kth-nearest neighbor queries in O(log n + 1/\epsilon^d) time, where k and
\epsilon are provided at query time. If k and \epsilon are provided in advance,
we provide a data structure to answer such queries, that requires (roughly)
O(n/k) space; that is, the data structure has sublinear space requirement if k
is sufficiently large.
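As a point of reference, the exact kth-nearest-ball query can be answered by a linear scan, assuming the distance from a query point to a ball is the distance to its center minus its radius, clamped at zero when the point lies inside. This is a brute-force sketch under that assumption, not the data structure from the abstract; `kth_nearest_ball_distance` is a hypothetical name.

```python
import math


def kth_nearest_ball_distance(balls, query, k):
    """Exact distance from `query` to its kth-nearest ball.

    `balls` is a list of (center, radius) pairs. Assumption: the distance to a
    ball is max(0, |query - center| - radius), so points inside a ball are at
    distance 0. Linear scan plus sort; for illustration only.
    """
    if not 1 <= k <= len(balls):
        raise ValueError("k must be between 1 and len(balls)")
    dists = sorted(
        max(0.0, math.dist(center, query) - radius) for center, radius in balls
    )
    return dists[k - 1]
```

A (1 \pm \epsilon)-approximate answer would be any value within a factor of (1 \pm \epsilon) of what this function returns.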
Using Fuzzy Linguistic Representations to Provide Explanatory Semantics for Data Warehouses
A data warehouse integrates large amounts of extracted and summarized data from multiple sources for direct querying and analysis. While it provides decision makers with easy access to such historical and aggregate data, the real meaning of the data has been ignored. For example, whether a total sales amount of 1,000 items indicates good or bad sales performance is still unclear. From the decision makers' point of view, the semantics that convey the meaning of the data, rather than the raw numbers, are very important. In this paper, we explore the use of fuzzy technology to provide these semantics for the summarizations and aggregates developed in data warehousing systems. A three-layered data warehouse semantic model, consisting of quantitative (numerical) summarization, qualitative (categorical) summarization, and quantifier summarization, is proposed for capturing and explicating the semantics of warehoused data. Based on the model, several algebraic operators are defined. We also extend the SQL language to allow for flexible queries against such enhanced data warehouses.
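To make the step from a numerical summary to a categorical one concrete, the sketch below maps a sales total to fuzzy membership degrees in three linguistic terms. The terms ("low", "medium", "high"), the breakpoints (500, 1000, 1500), and all function names are assumptions chosen for illustration; the abstract does not specify the paper's membership functions or operators.

```python
def ramp_up(x, start, end):
    """Return 0 at or below `start`, 1 at or above `end`, linear in between."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


def sales_memberships(total):
    """Membership degrees of a sales total in three hypothetical linguistic terms."""
    rising_to_medium = ramp_up(total, 500, 1000)   # assumed breakpoints
    rising_to_high = ramp_up(total, 1000, 1500)    # assumed breakpoints
    return {
        "low": 1.0 - rising_to_medium,
        "medium": min(rising_to_medium, 1.0 - rising_to_high),
        "high": rising_to_high,
    }


def linguistic_label(total):
    """Pick the term with the largest membership degree."""
    memberships = sales_memberships(total)
    return max(memberships, key=memberships.get)
```

Under these assumed breakpoints, a total of 1,000 items is fully "medium", while 1,250 items is half "medium" and half "high".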
Approximate Nearest Neighbor Search for Low Dimensional Queries
We study the Approximate Nearest Neighbor problem for metric spaces where the
query points are constrained to lie on a subspace of low doubling dimension,
while the data is high-dimensional. We show that this problem can be solved
efficiently despite the high dimensionality of the data. Comment: 25 pages
Data Cube Approximation and Mining using Probabilistic Modeling
On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction levels in a dimension hierarchy. However, such techniques are not aimed at mining multidimensional data.
Since data cubes are nothing but multi-way tables, we propose to analyze the potential of two probabilistic modeling techniques, namely non-negative multi-way array factorization and log-linear modeling, with the ultimate objective of compressing and mining aggregate and multidimensional values. With the first technique, we compute the set of components that best fit the initial data set and whose superposition coincides with the original data; with the second technique we identify a parsimonious model (i.e., one with a reduced set of parameters), highlight strong associations among dimensions and discover possible outliers in data cells. A real-life example will be used to (i) discuss the potential benefits of the modeling output on cube exploration and mining, (ii) show how OLAP queries can be answered in an approximate way, and (iii) illustrate the strengths and limitations of these modeling approaches.
Indexability, concentration, and VC theory
Degrading performance of indexing schemes for exact similarity search in high
dimensions has long been linked to histograms of distributions of
distances and other 1-Lipschitz functions becoming concentrated. We discuss this
observation in the framework of the phenomenon of concentration of measure on
the structures of high dimension and the Vapnik-Chervonenkis theory of
statistical learning. Comment: 17 pages, final submission to J. Discrete Algorithms (an expanded,
improved and corrected version of the SISAP'2010 invited paper, this e-print,
v3
An Adaptive Mechanism for Accurate Query Answering under Differential Privacy
We propose a novel mechanism for answering sets of counting queries under
differential privacy. Given a workload of counting queries, the mechanism
automatically selects a different set of "strategy" queries to answer
privately, using those answers to derive answers to the workload. The main
algorithm proposed in this paper approximates the optimal strategy for any
workload of linear counting queries. With no cost to the privacy guarantee, the
mechanism improves significantly on prior approaches and achieves near-optimal
error for many workloads, when applied under (\epsilon, \delta)-differential
privacy. The result is an adaptive mechanism which can help users achieve good
utility without requiring that they reason carefully about the best formulation
of their task. Comment: VLDB2012. arXiv admin note: substantial text overlap with
arXiv:1103.136
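The idea of answering "strategy" queries privately and deriving workload answers from them can be illustrated with the simplest fixed strategy. The sketch below assumes an identity strategy (one noisy count per histogram bin) with Laplace noise, which gives pure \epsilon-differential privacy; it does not perform the adaptive strategy selection the abstract describes, and the names `laplace_noise` and `answer_workload` are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def answer_workload(histogram, workload, epsilon, rng=None):
    """Answer linear counting queries from privately answered strategy queries.

    Assumed strategy: the identity (one count per bin). Adding or removing one
    record changes exactly one bin by 1, so the L1 sensitivity is 1 and
    Laplace noise of scale 1/epsilon per bin suffices for epsilon-DP.
    """
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    # Step 1: answer the strategy queries privately.
    noisy_counts = [count + laplace_noise(scale, rng) for count in histogram]
    # Step 2: derive each workload answer from the noisy strategy answers.
    # This is post-processing, so it costs no additional privacy budget.
    return [
        sum(weight * value for weight, value in zip(row, noisy_counts))
        for row in workload
    ]
```

Each row of `workload` is a vector of weights over the bins, so a range query is a row of zeros and ones. A different strategy can reduce error for a given workload, which is the choice the mechanism in the abstract automates.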