UT Linked Data Informal Learning Discussion Series Bibliography
The following bibliographies were created as part of the UT Linked Data Informal Learning Discussion Series hosted at UT Libraries.
HAQWA: a Hash-based and Query Workload Aware Distributed RDF Store
Like most data models encountered in the Big Data ecosystem, RDF stores manage large data sets by partitioning triples across a cluster of machines. Nevertheless, the graph nature of RDF data, together with its associated SPARQL query execution model, makes efficient data distribution more involved than in other data models, e.g., relational. In this paper, we propose a novel system characterized by a trade-off between the complexity of data partitioning and the efficiency of query answering in cases where a query workload is known. The prototype is implemented over the Apache Spark framework, ensuring high availability, fault tolerance and scalability. This short paper presents the main features of the system and highlights the omnipresence of parallel computation across the data fragmentation and allocation, encoding and query processing tasks.
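The hash-based fragmentation the abstract describes can be illustrated with a minimal sketch. This is an assumption-laden simplification, not HAQWA's actual scheme: it only hashes the subject term so that all triples sharing a subject land in the same fragment, and plain Python dictionaries stand in for Spark partitions.

```python
import hashlib

def partition_id(subject: str, num_partitions: int) -> int:
    # Hash the subject IRI so all triples with the same subject map to
    # the same fragment (hypothetical simplification of hash partitioning).
    digest = hashlib.md5(subject.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Toy triples; ex:/foaf: prefixes are illustrative only.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob", "foaf:name", '"Bob"'),
]

partitions = {}
for s, p, o in triples:
    partitions.setdefault(partition_id(s, 4), []).append((s, p, o))
```

Subject-hash placement makes subject-centric star queries local to one machine, which is the kind of partitioning/answering trade-off the paper weighs against workload-aware schemes.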
On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark
Querying very large RDF data sets in an efficient manner requires a
sophisticated distribution strategy. Several innovative solutions have recently
been proposed for optimizing data distribution with predefined query workloads.
This paper presents an in-depth analysis and experimental comparison of five
representative and complementary distribution approaches. To ensure fair
experimental results, we use Apache Spark as a common parallel computing
framework, rewriting the concerned algorithms against the Spark API. Spark
provides guarantees in terms of fault tolerance, high availability and
scalability which are essential in such systems. Our implementations aim to
highlight the fundamental, implementation-independent characteristics of
each approach in terms of data preparation, load balancing, data replication
and, to some extent, query answering cost and performance. The presented
measures are obtained by testing each system on one synthetic and one
real-world data set over query workloads with differing characteristics and
different partitioning constraints.
Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1
We propose an efficient and scalable architecture for processing generalized
graph-pattern queries as they are specified by the current W3C recommendation
of the SPARQL 1.1 "Query Language" component. Specifically, the class of
queries we consider consists of sets of SPARQL triple patterns with labeled
property paths. From a relational perspective, this class resolves to
conjunctive queries of relational joins with additional graph-reachability
predicates. For the scalable, i.e., distributed, processing of such queries
over very large RDF collections, we develop a suitable partitioning and
indexing scheme, which allows us to shard the RDF triples over an entire
cluster of compute nodes and to process an incoming SPARQL query over all of
the relevant graph partitions (and thus compute nodes) in parallel. Unlike most
prior works in this field, we specifically aim at the unified optimization and
distributed processing of queries consisting of both relational joins and
graph-reachability predicates. All communication among the compute nodes is
established via a proprietary, asynchronous communication protocol based on the
Message Passing Interface (MPI).
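The graph-reachability predicates mentioned above correspond to SPARQL 1.1 property paths such as `foaf:knows+`. A single-machine sketch of that semantics (a plain BFS over edges with one fixed label; the paper's distributed evaluation is far more involved) looks like this:

```python
from collections import deque

def reachable(triples, start, predicate):
    # Nodes reachable from `start` via one or more edges labeled
    # `predicate` -- the semantics of the SPARQL path `predicate+`.
    adj = {}
    for s, p, o in triples:
        if p == predicate:
            adj.setdefault(s, []).append(o)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

triples = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("c", "likes", "d"),
]
print(sorted(reachable(triples, "a", "knows")))  # ['b', 'c']
```

In the distributed setting, the frontier of such a traversal can cross partition boundaries, which is why reachability has to be co-optimized with the relational joins rather than evaluated as an afterthought.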
Amada: Web Data Repositories in the Amazon Cloud
We present Amada, a platform for storing Web data (in particular, XML documents and RDF graphs) based on the Amazon Web Services (AWS) cloud infrastructure. Amada operates in a Software as a Service (SaaS) approach, allowing users to upload, index, store, and query large volumes of Web data. The demonstration shows (i) the step-by-step procedure for building and exploiting the warehouse (storing, indexing, querying) and (ii) the monitoring tools enabling one to control the expenses (monetary costs) charged by AWS for the operations involved while running Amada.
Scalable RDF Data Compression using X10
The Semantic Web comprises enormous volumes of semi-structured data elements.
For interoperability, these elements are represented by long strings. Such
representations are not efficient for the purposes of Semantic Web applications
that perform computations over large volumes of information. A typical method
for alleviating the impact of this problem is through the use of compression
methods that produce more compact representations of the data. The use of
dictionary encoding for this purpose is particularly prevalent in Semantic Web
database systems. However, centralized implementations present performance
bottlenecks, giving rise to the need for scalable, efficient distributed
encoding schemes. In this paper, we describe an encoding implementation based
on the asynchronous partitioned global address space (APGAS) parallel
programming model. We evaluate performance on a cluster of up to 384 cores and
datasets of up to 11 billion triples (1.9 TB). Compared to the state-of-the-art
MapReduce algorithm, we demonstrate a speedup of 2.6-7.4x and excellent
scalability. These results illustrate the strong potential of the APGAS model
for efficient implementation of dictionary encoding and contribute to the
engineering of larger-scale Semantic Web applications.
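The core of dictionary encoding, before any distribution concerns, is simple: replace each long RDF term with a compact integer ID and keep the mapping for decoding. A minimal single-machine sketch (the paper's contribution is doing this at scale under APGAS, which this does not attempt):

```python
def encode(triples):
    # Map each distinct RDF term (a long string) to a compact integer
    # ID, and rewrite the triples over those IDs.
    dictionary = {}
    encoded = []
    for triple in triples:
        ids = []
        for term in triple:
            if term not in dictionary:
                dictionary[term] = len(dictionary)
            ids.append(dictionary[term])
        encoded.append(tuple(ids))
    return dictionary, encoded
```

Because IRIs repeat heavily across triples, the encoded form is much smaller and joins reduce to integer comparisons; the distributed challenge is agreeing on a consistent term-to-ID mapping across all workers.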
Parallel SPARQL Query Execution using Apache Spark
Title from PDF of title page, viewed on February 21, 2017. Thesis advisor: Praveen Rao. Vita. Includes bibliographical references (pages 48-52). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2016.
Semantic Web technologies such as Resource Description Framework (RDF) and
SPARQL are increasingly being adopted by applications on the Web, as well as in domains such
as healthcare, finance, and national security and intelligence. While we have witnessed an era of
many different techniques for RDF indexing and SPARQL query processing, the rapid growth in
the size of RDF knowledge bases demands scalable techniques that can leverage the power of
cluster computing. Big data ecosystems like Apache Spark provide new opportunities for
designing scalable RDF indexing and query processing techniques.
In this thesis, we present new ideas on storing, indexing, and query processing of RDF
datasets with billions of RDF statements. In our approach, we leverage Resilient Distributed
Datasets (RDDs) and MapReduce in Spark and the graph processing capability of GraphX. The
key idea is to partition the RDF dataset, build indexes on the partitions, and execute a query in
parallel on the collection of indexes. A key theme of our design is to enable in-memory
processing of the indexes for fast query processing.
Introduction -- Background and motivation -- Proposed approach -- Implementation of the system -- Performance evaluation -- Conclusion and future work
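The partition-then-index-then-query-in-parallel idea can be sketched without Spark. In this toy version (all names and the subject-only index are my simplifications, not the thesis design), each partition gets an in-memory subject index, and the driver evaluates a lookup against every index; Spark would run the per-partition lookups in parallel rather than in a loop.

```python
def build_indexes(partitions):
    # One in-memory subject index per partition: subject -> its triples.
    return [
        {s: [t for t in part if t[0] == s] for s, _, _ in part}
        for part in partitions
    ]

def query_subject(indexes, subject):
    # A Spark driver would evaluate this lookup on every partition in
    # parallel; here we simply iterate over the indexes.
    results = []
    for idx in indexes:
        results.extend(idx.get(subject, []))
    return results

partitions = [
    [("a", "p", "b")],
    [("a", "q", "c"), ("d", "p", "e")],
]
indexes = build_indexes(partitions)
print(query_subject(indexes, "a"))  # [('a', 'p', 'b'), ('a', 'q', 'c')]
```

Keeping the indexes in memory (as RDDs cached across the cluster) is what avoids re-scanning the raw triples on every query.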