5,818 research outputs found
Indexing the Earth Mover's Distance Using Normal Distributions
Querying uncertain data sets (represented as probability distributions)
presents many challenges due to the large amount of data involved and the
difficulties comparing uncertainty between distributions. The Earth Mover's
Distance (EMD) has increasingly been employed to compare uncertain data due to
its ability to effectively capture the differences between two distributions.
Computing the EMD entails finding a solution to the transportation problem,
which is computationally intensive. In this paper, we propose a new lower bound
to the EMD and an index structure to significantly improve the performance of
EMD based K-nearest neighbor (K-NN) queries on uncertain databases. We propose
a new lower bound to the EMD that approximates the EMD on a projection vector.
Each distribution is projected onto a vector and approximated by a normal
distribution, as well as an accompanying error term. We then represent each
normal as a point in a Hough transformed space. We then use the concept of
stochastic dominance to implement an efficient index structure in the
transformed space. We show that our method significantly decreases K-NN query
time on uncertain databases. The index structure also scales well with database
cardinality. It is well suited for heterogeneous data sets, helping to keep EMD
based queries tractable as uncertain data sets become larger and more complex.Comment: VLDB201
Reverse Nearest Neighbor Heat Maps: A Tool for Influence Exploration
We study the problem of constructing a reverse nearest neighbor (RNN) heat
map by finding the RNN set of every point in a two-dimensional space. Based on
the RNN set of a point, we obtain a quantitative influence (i.e., heat) for the
point. The heat map provides a global view on the influence distribution in the
space, and hence supports exploratory analyses in many applications such as
marketing and resource management. To construct such a heat map, we first
reduce it to a problem called Region Coloring (RC), which divides the space
into disjoint regions within which all the points have the same RNN set. We
then propose a novel algorithm named CREST that efficiently solves the RC
problem by labeling each region with the heat value of its containing points.
In CREST, we propose innovative techniques to avoid processing expensive RNN
queries and greatly reduce the number of region labeling operations. We perform
detailed analyses on the complexity of CREST and lower bounds of the RC
problem, and prove that CREST is asymptotically optimal in the worst case.
Extensive experiments with both real and synthetic data sets demonstrate that
CREST outperforms alternative algorithms by several orders of magnitude.Comment: Accepted to appear in ICDE 201
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search as well as domains experts that participated in the CHORUS Think-thanks and workshops, this document reports on the state of the art related to multimedia content search from, a technical, and socio-economic perspective.
The technical perspective includes an up to date view on content based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark inititiatives to measure the performance of multimedia search engines.
From a socio-economic perspective we inventorize the impact and legal consequences of these technical advances and point out future directions of research
Enhancing SpatialHadoop with Closest Pair Queries
Given two datasets P and Q, the K Closest Pair Query (KCPQ) finds the K closest pairs of objects from P Ă—Q. It is an operation widely adopted by many spatial and GIS applications. As a combination of the K Nearest Neighbor (KNN) and the spatial join queries, KCPQ is an expensive operation. Given the increasing volume of spatial data, it is difficult to perform a KCPQ on a centralized machine efficiently. For this reason, this paper addresses the problem of computing the KCPQ on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports spatial operations efficiently, and proposes a novel algorithm in SpatialHadoop to perform efficient parallel KCPQ on large-scale spatial datasets. We have evaluated the performance of the algorithm in several situations with big synthetic and real-world datasets. The experiments have demonstrated the efficiency and scalability of our proposal
- …