
    On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

    Querying very large RDF data sets efficiently requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper presents an in-depth analysis and experimental comparison of five representative and complementary distribution approaches. To obtain fair experimental results, we use Apache Spark as a common parallel computing framework and rewrite the algorithms concerned using the Spark API. Spark provides guarantees in terms of fault tolerance, high availability and scalability, which are essential in such systems. Our implementations aim to highlight the fundamental, implementation-independent characteristics of each approach in terms of data preparation, load balancing, data replication and, to some extent, query answering cost and performance. The presented measures are obtained by testing each system on one synthetic and one real-world data set over query workloads with differing characteristics and different partitioning constraints. Comment: 16 pages, 3 figures

    Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1

    We propose an efficient and scalable architecture for processing generalized graph-pattern queries as they are specified by the current W3C recommendation of the SPARQL 1.1 "Query Language" component. Specifically, the class of queries we consider consists of sets of SPARQL triple patterns with labeled property paths. From a relational perspective, this class resolves to conjunctive queries of relational joins with additional graph-reachability predicates. For the scalable, i.e., distributed, processing of this kind of query over very large RDF collections, we develop a suitable partitioning and indexing scheme, which allows us to shard the RDF triples over an entire cluster of compute nodes and to process an incoming SPARQL query over all of the relevant graph partitions (and thus compute nodes) in parallel. Unlike most prior work in this field, we specifically aim at the unified optimization and distributed processing of queries consisting of both relational joins and graph-reachability predicates. All communication among the compute nodes is established via a proprietary, asynchronous communication protocol based on the Message Passing Interface.

    Towards Making Distributed RDF Processing FLINKer

    In the last decade, the Resource Description Framework (RDF) has become the de-facto standard for publishing semantic data on the Web. This steady adoption has led to a significant increase in the number and volume of available RDF datasets, exceeding the capabilities of traditional RDF stores. This scenario has introduced severe big semantic data challenges when it comes to managing and querying RDF data at Web scale. Despite the existence of various off-the-shelf Big Data platforms, processing RDF in a distributed environment remains a significant challenge. In this position paper, based on an in-depth analysis of the state of the art, we propose to manage large RDF datasets in Flink, a well-known scalable distributed Big Data processing framework. Our approach, which we refer to as FLINKer, extends the native graph abstraction of Flink, called Gelly, with RDF graph and SPARQL query processing capabilities.

    NORA: Scalable OWL reasoner based on NoSQL databases and Apache Spark

    Reasoning is the process of inferring new knowledge and identifying inconsistencies within ontologies. Traditional techniques often prove inadequate when reasoning over large Knowledge Bases containing millions or billions of facts. This article introduces NORA, a persistent and scalable OWL reasoner built on top of Apache Spark, designed to address the challenges of reasoning over extensive and complex ontologies. NORA exploits the scalability of NoSQL databases to effectively apply inference rules to Big Data ontologies with large ABoxes. To facilitate scalable reasoning, OWL data, including class and property hierarchies and instances, are materialized in the Apache Cassandra database. Spark programs are then evaluated iteratively, uncovering new implicit knowledge from the dataset and leading to enhanced performance and more efficient reasoning over large-scale ontologies. NORA has undergone a thorough evaluation with different benchmarking ontologies of varying sizes to assess the scalability of the developed solution. Funding for open access charge: Universidad de Málaga / CBUA. This work has been partially funded by grant (funded by MCIN/AEI/10.13039/501100011033/) PID2020-112540RB-C41, AETHER-UMA (A smart data holistic approach for context-aware data analytics: semantics and context exploitation). Antonio Benítez-Hidalgo is supported by Grant PRE2018-084280 (Spanish Ministry of Science, Innovation and Universities).