The Case for Learned Spatial Indexes
Spatial data is ubiquitous. Massive amounts of data are generated every day
from billions of GPS-enabled devices such as cell phones, cars, sensors, and
various consumer-based applications such as Uber, Tinder, location-tagged posts
in Facebook, Twitter, Instagram, etc. This exponential growth in spatial data
has led the research community to focus on building systems and applications
that can process spatial data efficiently. In the meantime, recent research has
introduced learned index structures. In this work, we use techniques proposed
by a state-of-the-art learned multi-dimensional index structure (namely,
Flood) and apply them to five classical multi-dimensional indexes to be able to
answer spatial range queries. By tuning each partitioning technique for optimal
performance, we show that (i) machine learned search within a partition is
faster by 11.79% to 39.51% than binary search when using filtering on one
dimension, (ii) the bottleneck for tree structures is index lookup, which could
potentially be improved by linearizing the indexed partitions, (iii) filtering
on one dimension and refining using machine learned indexes is 1.23x to 1.83x
faster than the closest competitor, which filters on two dimensions, and (iv)
learned indexes can have a significant impact on the performance of
low-selectivity queries while being less effective under higher selectivities.
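
The "filter on one dimension, refine with a learned index" idea from point
(iii) can be sketched as follows. This is a minimal illustration, not the
authors' implementation: it assumes a single partition whose points are sorted
by x, uses a plain least-squares line as the learned model, and falls back to a
full binary search whenever the model's error window misses. The names
fit_linear, learned_lower_bound, and range_query are hypothetical.

```python
import bisect

def fit_linear(keys):
    """Fit position ~ slope * key + intercept over sorted keys (least squares).

    Returns (slope, intercept, err), where err bounds the prediction error
    on the indexed keys and sizes the local search window.
    """
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    if var == 0:
        slope = 0.0
    else:
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
        slope = cov / var
    intercept = mean_p - slope * mean_k
    err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))
    return slope, intercept, int(err) + 1

def learned_lower_bound(keys, model, target):
    """Index of the first key >= target, found near the model's prediction."""
    slope, intercept, err = model
    n = len(keys)
    guess = int(slope * target + intercept)
    lo = max(0, min(n, guess - err))
    hi = max(0, min(n, guess + err + 1))
    # The error bound only holds for indexed keys, so widen the window
    # if the answer provably lies outside it.
    if lo > 0 and keys[lo - 1] >= target:
        lo = 0
    if hi < n and keys[hi] < target:
        hi = n
    return bisect.bisect_left(keys, target, lo, hi)

def range_query(points, xs, model, x_lo, x_hi, y_lo, y_hi):
    """Filter on x via the learned search, then refine on y by scanning."""
    start = learned_lower_bound(xs, model, x_lo)
    result = []
    for x, y in points[start:]:
        if x > x_hi:
            break
        if y_lo <= y <= y_hi:
            result.append((x, y))
    return result

points = sorted([(1, 5), (2, 1), (3, 7), (5, 2), (8, 3), (9, 9)])
xs = [p[0] for p in points]
model = fit_linear(xs)
print(range_query(points, xs, model, 2, 8, 2, 7))  # [(3, 7), (5, 2), (8, 3)]
```

The model only narrows the window that the final binary search inspects, so
the result stays exact even when the prediction is poor.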
Efficient Analytical Queries on Semantic Web Data Cubes
The amount of multidimensional data published on the semantic web (SW) is
constantly increasing, due to initiatives such as Open Data and Open Government
Data, among others. Models, languages, and tools that allow users to obtain
valuable information efficiently are thus required. Multidimensional data are
typically represented as data cubes, and exploited using Online Analytical
Processing (OLAP) techniques. The RDF Data Cube Vocabulary, also denoted QB, is
the current W3C standard to represent statistical data on the SW. Since QB does
not include key features needed for OLAP analysis, in previous work we have
proposed an extension, denoted QB4OLAP, to overcome this problem without the
need to modify already published data. Once data cubes are represented on
the SW, we need tools to analyze them. However, writing efficient analytical
queries over SW cubes demands a deep knowledge of RDF and SPARQL. These skills
are not common in typical analytical users. Also, OLAP languages like MDX are
far from being easily understood by the end user. The lack of user-friendly tools
to exploit multidimensional data on the SW is a barrier that needs to be broken
to promote the publication of such data. We address this problem in this paper.
Our approach is based on allowing analytical users to write queries using OLAP
operations over cubes, without dealing with SW standards. For this, we devised
CQL (standing for Cube Query Language), a simple, high-level query language
that operates over cubes. Using the metadata provided by QB4OLAP, we translate
CQL queries into SPARQL. Then, we propose query improvement strategies to
produce efficient SPARQL queries, adapting SPARQL query optimization
techniques. We evaluate our approach using the Star-Schema benchmark, showing
that our proposal outperforms others. A web application that allows querying SW
data cubes using CQL completes our contributions.
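
As a point of reference for the kind of OLAP operation such a query language
exposes, a roll-up can be sketched over a small in-memory cube. This is only an
illustration of the operation's semantics under assumed data: it does not use
CQL syntax, QB4OLAP metadata, or SPARQL, and the function name roll_up and the
sample facts are hypothetical.

```python
from collections import defaultdict

def roll_up(facts, dim, parent_of, measure):
    """Aggregate `measure` after replacing each `dim` member by its parent.

    facts: list of dicts, one per cube cell (dimension values plus a measure).
    parent_of: maps each member of `dim` to its parent in the hierarchy.
    Returns a dict keyed by the rolled-up coordinates, with summed measures.
    """
    totals = defaultdict(float)
    for fact in facts:
        coords = tuple(
            (name, parent_of[value] if name == dim else value)
            for name, value in sorted(fact.items())
            if name != measure
        )
        totals[coords] += fact[measure]
    return dict(totals)

# Hypothetical cube: sales by city and year, rolled up from city to country.
facts = [
    {"city": "Paris", "year": 2020, "sales": 10.0},
    {"city": "Lyon", "year": 2020, "sales": 5.0},
    {"city": "Berlin", "year": 2020, "sales": 7.0},
]
country_of = {"Paris": "France", "Lyon": "France", "Berlin": "Germany"}
by_country = roll_up(facts, "city", country_of, "sales")
```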
Efficient Graph Computation for Node2Vec
Node2Vec is a state-of-the-art general-purpose feature learning method for
network analysis. However, current solutions cannot run Node2Vec on large-scale
graphs with billions of vertices and edges, which are common in real-world
applications. The existing distributed Node2Vec on Spark incurs significant
space and time overhead. It runs out of memory even for mid-sized graphs with
millions of vertices. Moreover, it considers at most 30 edges for every vertex
in generating random walks, causing poor result quality. In this paper, we
propose Fast-Node2Vec, a family of efficient Node2Vec random walk algorithms on
a Pregel-like graph computation framework. Fast-Node2Vec computes transition
probabilities during random walks to reduce memory space consumption and
computation overhead for large-scale graphs. The Pregel-like scheme avoids
space and time overhead of Spark's read-only RDD structures and shuffle
operations. Moreover, we propose a number of optimization techniques to further
reduce the computation overhead for popular vertices with large degrees.
Empirical evaluation shows that Fast-Node2Vec is capable of computing Node2Vec
on graphs with billions of vertices and edges on a mid-sized machine cluster.
Compared to Spark-Node2Vec, Fast-Node2Vec achieves 7.7x to 122x speedups.
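
Computing transition probabilities during the walk, rather than precomputing
them for every edge pair, can be sketched on a single machine as follows. This
is a minimal sketch of the standard Node2Vec second-order walk with return
parameter p and in-out parameter q on an unweighted graph; it is not the
Pregel-based Fast-Node2Vec algorithm, and the names step_weights and
node2vec_walk are hypothetical.

```python
import random

def step_weights(graph, prev, cur, p, q):
    """Unnormalized Node2Vec weights for leaving `cur` after arriving from `prev`.

    graph: dict mapping each vertex to a list of its neighbors (unweighted).
    A neighbor gets 1/p if it is `prev`, 1 if it is also adjacent to `prev`,
    and 1/q otherwise.
    """
    prev_nbrs = set(graph[prev])
    weights = []
    for nxt in graph[cur]:
        if nxt == prev:
            weights.append(1.0 / p)
        elif nxt in prev_nbrs:
            weights.append(1.0)
        else:
            weights.append(1.0 / q)
    return weights

def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Generate one walk of up to `length` vertices, computing weights per step."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = graph[cur]
        if not nbrs:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            nxt = rng.choice(nbrs)  # first step has no previous vertex
        else:
            weights = step_weights(graph, walk[-2], cur, p, q)
            nxt = rng.choices(nbrs, weights=weights, k=1)[0]
        walk.append(nxt)
    return walk

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(step_weights(graph, 0, 1, p=2.0, q=4.0))  # [0.5, 1.0, 0.25]
```

Only the weights for the current (previous, current) pair are ever held in
memory, which is the space saving the abstract attributes to on-the-fly
computation.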