69 research outputs found

UT Linked Data Informal Learning Discussion Series Bibliography

Author: Carbajal Itza
Publication date: 01/01/2019

The following bibliographies were created as part of the UT Linked Data Informal Learning Discussion Series hosted at UT Libraries.

Texas ScholarWorks

HAQWA: a Hash-based and Query Workload Aware Distributed RDF Store

Publication date: 05/03/2020

Abstract. Like most data models encountered in the Big Data ecosystem, RDF stores manage large data sets by partitioning triples across a cluster of machines. Nevertheless, the graph-based nature of RDF data, together with its associated SPARQL query execution model, makes efficient data distribution more involved than in other data models, e.g., relational. In this paper, we propose a novel system that is characterized by a trade-off between complexity of data partitioning and efficiency of query answering in cases where a query workload is known. The prototype is implemented over the Apache Spark framework, ensuring high availability, fault tolerance and scalability. This short paper presents the main features of the system and highlights the omnipresence of parallel computation across data fragmentation and allocation, encoding and query processing tasks.

CiteSeerX

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Author: Amann Bernd
Baazizi Mohamed-Amine
Curé Olivier
Naacke Hubert
Publication date: 08/07/2015

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper presents an in-depth analysis and experimental comparison of five representative and complementary distribution approaches. To achieve fair experimental results, we use Apache Spark as a common parallel computing framework by rewriting the concerned algorithms using the Spark API. Spark provides guarantees in terms of fault tolerance, high availability and scalability, which are essential in such systems. Our different implementations aim to highlight the fundamental implementation-independent characteristics of each approach in terms of data preparation, load balancing, data replication and, to some extent, query answering cost and performance. The presented measures are obtained by testing each system on one synthetic and one real-world data set over query workloads with differing characteristics and different partitioning constraints. Comment: 16 pages, 3 figures

arXiv.org e-Print Archive

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1

Author: Gurajada Sairam
Theobald Martin
Publication date: 01/01/2016

We propose an efficient and scalable architecture for processing generalized graph-pattern queries as they are specified by the current W3C recommendation of the SPARQL 1.1 "Query Language" component. Specifically, the class of queries we consider consists of sets of SPARQL triple patterns with labeled property paths. From a relational perspective, this class resolves to conjunctive queries of relational joins with additional graph-reachability predicates. For the scalable, i.e., distributed, processing of this kind of query over very large RDF collections, we develop a suitable partitioning and indexing scheme, which allows us to shard the RDF triples over an entire cluster of compute nodes and to process an incoming SPARQL query over all of the relevant graph partitions (and thus compute nodes) in parallel. Unlike most prior works in this field, we specifically aim at the unified optimization and distributed processing of queries consisting of both relational joins and graph-reachability predicates. All communication among the compute nodes is established via a proprietary, asynchronous communication protocol based on the Message Passing Interface.

arXiv.org e-Print Archive

Open Repository and Bibliography - Luxembourg

MPG.PuRe

Amada: Web Data Repositories in the Amazon Cloud

Author: Andrés Aranda-Andújar
Dario Colazzo
Francesca Bugiotti
François Goasdoué
Ioana Manolescu
Jesús Camacho-Rodríguez
Zoi Kaoudi
Publication date: 03/04/2020

Abstract. We present Amada, a platform for storing Web data (in particular, XML documents and RDF graphs) based on the Amazon Web Services (AWS) cloud infrastructure. Amada operates in a Software as a Service (SaaS) approach, allowing users to upload, index, store, and query large volumes of Web data. The demonstration shows (i) the step-by-step procedure for building and exploiting the warehouse (storing, indexing, querying) and (ii) the monitoring tools enabling one to control the expenses (monetary costs) charged by AWS for the operations involved while running Amada.

CiteSeerX

Scalable RDF Data Compression using X10

Author: Cheng Long
Kotoulas Spyros
Malik Avinash
Theodoropoulos Georgios
Ward Tomas E
Publication date: 01/01/2014

The Semantic Web comprises enormous volumes of semi-structured data elements. For interoperability, these elements are represented by long strings. Such representations are not efficient for the purposes of Semantic Web applications that perform computations over large volumes of information. A typical method for alleviating the impact of this problem is through the use of compression methods that produce more compact representations of the data. The use of dictionary encoding for this purpose is particularly prevalent in Semantic Web database systems. However, centralized implementations present performance bottlenecks, giving rise to the need for scalable, efficient distributed encoding schemes. In this paper, we describe an encoding implementation based on the asynchronous partitioned global address space (APGAS) parallel programming model. We evaluate performance on a cluster of up to 384 cores and datasets of up to 11 billion triples (1.9 TB). Compared to the state-of-the-art MapReduce algorithm, we demonstrate a speedup of 2.6-7.4x and excellent scalability. These results illustrate the strong potential of the APGAS model for efficient implementation of dictionary encoding and contribute to the engineering of larger scale Semantic Web applications.

arXiv.org e-Print Archive

MURAL - Maynooth University Research Archive Library

NUI Maynooth Eprint Archive

Maynooth University ePrints and eTheses Archive

Parallel SPARQL Query Execution using Apache Spark

Author: Jangid Hastimal
Publication venue: University of Missouri--Kansas City

Title from PDF of title page, viewed on February 21, 2017.

Thesis advisor: Praveen Rao.

Vita.

Includes bibliographical references (pages 48-52).

Thesis (M.S.)--School of Computing and Engineering. University of Missouri--Kansas City, 2016.

Semantic Web technologies such as Resource Description Framework (RDF) and SPARQL are increasingly being adopted by applications on the Web, as well as in domains such as healthcare, finance, and national security and intelligence. While we have witnessed an era of many different techniques for RDF indexing and SPARQL query processing, the rapid growth in the size of RDF knowledge bases demands scalable techniques that can leverage the power of cluster computing. Big data ecosystems like Apache Spark provide new opportunities for designing scalable RDF indexing and query processing techniques. In this thesis, we present new ideas on storing, indexing, and query processing of RDF datasets with billions of RDF statements. In our approach, we will leverage Resilient Distributed Datasets (RDDs) and MapReduce in Spark and the graph processing capability of GraphX. The key idea is to partition the RDF dataset, build indexes on the partitions, and execute a query in parallel on the collection of indexes. A key theme of our design is to enable in-memory processing of the indexes for fast query processing.

Introduction -- Background and motivation -- Proposed approach -- Implementation of the system -- Performance evaluation -- Conclusion and future work

University of Missouri: MOspace