
    Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML

    OpenAIRE, the Open Access Infrastructure for Research in Europe, comprises a database of all EC FP7 and H2020 funded research projects, including metadata of their results (publications and datasets). These data are stored in an HBase NoSQL database, post-processed, and exposed as HTML for human consumption, and as XML through a web service interface. As an intermediate format to facilitate statistical computations, CSV is generated internally. To interlink the OpenAIRE data with related data on the Web, we aim at exporting them as Linked Open Data (LOD). The LOD export is required to integrate into the overall data processing workflow, where derived data are regenerated from the base data every day. We thus faced the challenge of identifying the best-performing conversion approach. We evaluated the performance of creating LOD by a MapReduce job on top of HBase, by mapping the intermediate CSV files, and by mapping the XML output. Comment: Accepted in the 9th Metadata and Semantics Research Conference.
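
    As a minimal sketch of the CSV mapping path evaluated above, the snippet below turns rows of project metadata into RDF triples with rdflib. The column names, base URI, and vocabulary are illustrative assumptions, not OpenAIRE's actual schema or conversion code.

```python
# Illustrative CSV-to-LOD mapping: column names, base URI, and vocabulary
# are assumptions, not OpenAIRE's actual schema.
import csv
import io

from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import DCTERMS

BASE = Namespace("http://example.org/openaire/")   # hypothetical base URI

rows = io.StringIO(
    "project_id,acronym,title\n"
    "123456,FOO,Foo Research Project\n"
)

g = Graph()
for row in csv.DictReader(rows):
    project = URIRef(BASE + "project/" + row["project_id"])
    g.add((project, RDF.type, BASE.Project))
    g.add((project, DCTERMS.title, Literal(row["title"])))
    g.add((project, BASE.acronym, Literal(row["acronym"])))

print(g.serialize(format="turtle"))
```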

    Temporal RDF(S) Data Storage and Query with HBase

    Resource Description Framework (RDF) is a metadata model recommended by the World Wide Web Consortium (W3C) for describing Web resources. With the arrival of the era of Big Data, very large amounts of RDF data are continuously being created and need to be stored for management. Traditional centralized RDF storage models cannot meet the need of large-scale RDF data storage. Meanwhile, the importance of temporal information management and processing has been acknowledged by academia and industry. In this paper, we propose a storage model to store temporal RDF based on HBase. The proposed storage model applies the built-in time mechanism of HBase. Our experiments on the LUBM dataset with temporal information added show that our storage model can store large temporal RDF data and obtain good query efficiency.
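
    The sketch below illustrates the general idea of exploiting HBase's built-in cell timestamps to version RDF triples. The table layout, names, and the use of the happybase client are assumptions for illustration, not the paper's implementation.

```python
# Sketch of timestamp-based temporal RDF storage on HBase: two versions of
# the same (subject, predicate) cell are stored, each carrying its own
# HBase timestamp. Table layout and names are illustrative assumptions.
import happybase

conn = happybase.Connection("localhost")   # assumes an HBase Thrift gateway
table = conn.table("temporal_rdf")         # row key = subject

# Store two temporal versions of the same cell.
table.put(b"ub:Student1", {b"p:advisor": b"ub:Prof1"}, timestamp=1000)
table.put(b"ub:Student1", {b"p:advisor": b"ub:Prof2"}, timestamp=2000)

# Retrieve the history: each version carries its timestamp.
for value, ts in table.cells(b"ub:Student1", b"p:advisor",
                             versions=10, include_timestamp=True):
    print(ts, value)
```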

    NORA: Scalable OWL reasoner based on NoSQL databases and Apache Spark

    Reasoning is the process of inferring new knowledge and identifying inconsistencies within ontologies. Traditional techniques often prove inadequate when reasoning over large Knowledge Bases containing millions or billions of facts. This article introduces NORA, a persistent and scalable OWL reasoner built on top of Apache Spark, designed to address the challenges of reasoning over extensive and complex ontologies. NORA exploits the scalability of NoSQL databases to effectively apply inference rules to Big Data ontologies with large ABoxes. To facilitate scalable reasoning, OWL data, including class and property hierarchies and instances, are materialized in the Apache Cassandra database. Spark programs are then evaluated iteratively, uncovering new implicit knowledge from the dataset and leading to enhanced performance and more efficient reasoning over large-scale ontologies. NORA has undergone a thorough evaluation with different benchmarking ontologies of varying sizes to assess the scalability of the developed solution. Funding for open access charge: Universidad de Málaga / CBUA. This work has been partially funded by grant PID2020-112540RB-C41 (funded by MCIN/AEI/10.13039/501100011033/), AETHER-UMA (A smart data holistic approach for context-aware data analytics: semantics and context exploitation). Antonio Benítez-Hidalgo is supported by Grant PRE2018-084280 (Spanish Ministry of Science, Innovation and Universities).
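
    The snippet below sketches the general technique this approach builds on: evaluating an inference rule iteratively on Spark until a fixpoint is reached, here the RDFS subClassOf-transitivity rule over a toy hierarchy. It is illustrative only; NORA materializes its data in Cassandra and covers a far richer rule set.

```python
# Fixpoint evaluation of subClassOf transitivity (rdfs11) on Spark RDDs:
# keep joining the closure with the base relation until no new pairs appear.
# Toy data; not NORA's implementation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdfs-fixpoint").getOrCreate()
sc = spark.sparkContext

# (subclass, superclass) pairs of a small class hierarchy.
sub = sc.parallelize([("A", "B"), ("B", "C"), ("C", "D")])

closure = sub
while True:
    # join (x, y) with (y, z) on the shared class y to derive (x, z)
    derived = (closure.map(lambda p: (p[1], p[0]))
                      .join(sub)
                      .map(lambda kv: (kv[1][0], kv[1][1])))
    new_closure = closure.union(derived).distinct().cache()
    if new_closure.count() == closure.count():
        break   # fixpoint: no new implicit knowledge was derived
    closure = new_closure

print(sorted(closure.collect()))
```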

    Translation of Heterogeneous Databases into RDF, and Application to the Construction of a SKOS Taxonomical Reference

    While the data deluge accelerates, most of the data produced remains locked in deep Web databases. For linked open data to benefit from the potential represented by this huge amount of data, it is crucial to come up with solutions to expose heterogeneous databases as linked data. The xR2RML mapping language is an endeavor towards this goal: it is designed to map various types of databases to RDF, by flexibly adapting to heterogeneous query languages and data models while remaining free from any specific language. It extends R2RML, the W3C recommendation for the mapping of relational databases to RDF, and relies on RML for the handling of various data formats. In this paper we present xR2RML, we analyse the data models of several modern databases as well as the format in which query results are returned, and we show how xR2RML translates any result data element into RDF, relying on existing languages such as XPath and JSONPath when necessary. We illustrate some features of xR2RML, such as the generation of RDF collections and containers, and the ability to deal with mixed data formats. We also describe a real-world use case in which we applied xR2RML to build a SKOS thesaurus aimed at supporting studies on the History of Zoology, Archaeozoology and Conservation Biology.
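
    To make the core idea concrete, the sketch below mimics what an xR2RML processor does for a JSON document store: a JSONPath expression is evaluated against a query result and each matched value becomes an RDF triple. The document, paths, and vocabulary are hypothetical; this is not the reference implementation.

```python
# Mimicking the xR2RML idea for a JSON store: JSONPath extracts values
# from a query result, rdflib emits the triples. All names are hypothetical.
from jsonpath_ng import parse
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")

# A result document as a MongoDB-style store might return it.
doc = {"id": "taxon42", "names": [{"label": "Canis lupus"},
                                  {"label": "grey wolf"}]}

g = Graph()
subject = URIRef(EX + doc["id"])
for match in parse("$.names[*].label").find(doc):
    g.add((subject, EX.label, Literal(match.value)))

print(g.serialize(format="turtle"))
```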

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions involved in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies for an optimized solution to a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet in this regard is the storage infrastructure, and NoSQL seems to be the right technology to fulfill its requirements. However, every big data application has variable data characteristics, and thus the corresponding data fits a different data model. This paper presents a feature and use case analysis and comparison of the four main data models, namely document-oriented, key-value, graph and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider while making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.
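
    As a small sketch of why columnar big data file formats matter in such comparisons, the snippet below writes a table as Parquet and reads back a single column, so a query touches only the data it needs. File names and data are illustrative, with pyarrow standing in for a Hadoop-side writer.

```python
# Columnar format in practice: Parquet supports column projection, so
# reading one column avoids scanning the whole file. Illustrative data.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id":    [1, 2, 3],
    "title": ["a", "b", "c"],
    "year":  [2019, 2020, 2021],
})

pq.write_table(table, "outputs.parquet")   # columnar, compressed on disk

# Column projection: only the 'year' column is read.
years = pq.read_table("outputs.parquet", columns=["year"])
print(years.to_pydict())
```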

    MapReduce-based Solutions for Scalable SPARQL Querying

    The use of RDF to expose semantic data on the Web has seen a dramatic increase over the last few years. Nowadays, RDF datasets are so big and interconnected that, in fact, classical mono-node solutions present significant scalability problems when trying to manage big semantic data. MapReduce, a standard framework for distributed processing of large quantities of data, is earning a place among the distributed solutions facing RDF scalability issues. In this article, we survey the most important works addressing RDF management and querying through diverse MapReduce approaches, with a focus on their main strategies, optimizations and results.
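
    The common strategy surveyed in this line of work can be sketched in a few lines: each SPARQL triple pattern becomes a scan-and-filter over the triples, and shared variables become joins. The snippet below shows this for a two-pattern query on toy data, using Spark's MapReduce-style API; it illustrates the general approach, not any specific system.

```python
# BGP evaluation as filter + join, the basic MapReduce SPARQL strategy.
# Query answered here (illustrative data and vocabulary):
#   SELECT ?s ?city WHERE { ?s :worksFor :acme . ?s :livesIn ?city }
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bgp-join").getOrCreate()
triples = spark.sparkContext.parallelize([
    ("alice", ":worksFor", ":acme"),
    ("bob",   ":worksFor", ":acme"),
    ("alice", ":livesIn",  ":paris"),
])

# One filter per triple pattern, keyed by the shared variable ?s.
p1 = triples.filter(lambda t: t[1] == ":worksFor" and t[2] == ":acme") \
            .map(lambda t: (t[0], True))
p2 = triples.filter(lambda t: t[1] == ":livesIn") \
            .map(lambda t: (t[0], t[2]))

# Reduce-side join on ?s produces the solution mappings.
for s, (_, city) in p1.join(p2).collect():
    print(s, city)
```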

    On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

    Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper presents an in-depth analysis and experimental comparison of five representative and complementary distribution approaches. For achieving fair experimental results, we use Apache Spark as a common parallel computing framework, rewriting the concerned algorithms using the Spark API. Spark provides guarantees in terms of fault tolerance, high availability and scalability which are essential in such systems. Our different implementations aim to highlight the fundamental implementation-independent characteristics of each approach in terms of data preparation, load balancing, data replication and, to some extent, query answering cost and performance. The presented measures are obtained by testing each system on one synthetic and one real-world data set over query workloads with differing characteristics and different partitioning constraints. Comment: 16 pages, 3 figures.
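
    A minimal sketch of the simplest kind of distribution strategy considered in such comparisons is shown below: hash-partitioning triples by subject, so that star-shaped (subject-subject) joins over co-partitioned data need no re-shuffling. The data and partition count are illustrative assumptions.

```python
# Subject-hash partitioning of triples on Spark: keying by subject and
# fixing the partitioning keeps all triples of a subject on one partition,
# making subject-subject joins local. Illustrative sketch only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("subject-hash").getOrCreate()
sc = spark.sparkContext

triples = sc.parallelize([
    ("s1", "p1", "o1"),
    ("s1", "p2", "o2"),
    ("s2", "p1", "o3"),
])

# Key by subject and fix the partitioning; co-partitioned RDDs join locally.
by_subject = triples.map(lambda t: (t[0], (t[1], t[2]))).partitionBy(4)

for pid, group in enumerate(by_subject.glom().collect()):
    print("partition", pid, "->", group)
```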