9 research outputs found

    RDF DATABASES – CASE STUDY AND PERFORMANCE EVALUATION

    The Resource Description Framework (RDF) data model and the SPARQL query language have been at the core of semantic web technologies since the early 2000s. In this article, we evaluate three RDF storage technologies. Our motivation is to find a storage solution that can be used to process “big data” RDF sets. Our method is based on measuring query response times with large samples (hundreds of thousands of RDF documents, millions of RDF statements). We find that all the evaluated technologies provide much better performance than querying RDF data stored in files. However, with 300,000 documents, even with the fastest technology, an aggregation query still takes more than 100 seconds in our environment. As a further performance improvement, we test the same data and queries with MongoDB, demonstrating its performance (10 seconds instead of 100) and scalability (up to 1,000,000 documents). However, despite these benefits, MongoDB’s data-representation and query limitations mean that it probably cannot serve as a generic store for all kinds of RDF documents.
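    The layout contrast the abstract draws can be sketched in a few lines of plain Python: the same aggregation run over statement-by-statement triples versus over per-subject records, the document-style grouping a store like MongoDB would hold. The data and predicate names below are invented for illustration, not taken from the paper.

```python
from collections import defaultdict

# Invented sample triples (subject, predicate, object).
triples = [
    ("doc1", "hasAuthor", "alice"),
    ("doc1", "hasYear", "2013"),
    ("doc2", "hasAuthor", "alice"),
    ("doc2", "hasYear", "2014"),
    ("doc3", "hasAuthor", "bob"),
]

# Triple-store view: the aggregation scans every statement.
def count_docs_per_author_triples(triples):
    counts = defaultdict(int)
    for s, p, o in triples:
        if p == "hasAuthor":
            counts[o] += 1
    return dict(counts)

# Document view: each subject becomes one record, the way a
# document store would group it.
def to_documents(triples):
    docs = defaultdict(dict)
    for s, p, o in triples:
        docs[s][p] = o
    return dict(docs)

def count_docs_per_author_docs(docs):
    counts = defaultdict(int)
    for doc in docs.values():
        author = doc.get("hasAuthor")
        if author:
            counts[author] += 1
    return dict(counts)
```

Both views return the same answer; the point is that the document grouping touches one record per subject instead of one row per statement, which is where the order-of-magnitude speedup the abstract reports comes from.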

    Self-organizing Structured RDF in MonetDB

    The semantic web uses RDF as its data model, providing users full flexibility to represent and evolve data without the need for a schema. Yet this flexibility poses challenges for implementing efficient RDF stores: query plans over a triple table contain very many self-joins, these plans are hard to optimize, and data locality is lost because, without a notion of multi-attribute data structures, clustered-indexing opportunities are missed. Apart from performance issues, users of huge RDF graphs often have trouble formulating queries because they lack any system-supported notion of the structure in the data. In this research, we exploit the observation that real RDF data, while not as regularly structured as relational data, still has the great majority of its triples conforming to regular patterns. We conjecture that a system that recognized this structure automatically would make RDF stores both more efficient and easier to use. Concretely, we propose self-organizing RDF that stores data in PSO format in such a way that the regular parts of the data physically correspond to relational columnar storage, and we propose RDFscan/RDFjoin algorithms that compute star patterns over these without wasting effort on self-joins. These regular parts, i.e. tables, are identified on ingestion by a schema discovery algorithm; as a result, users gain an SQL view of the regular part of the RDF data. This research aims to produce a state-of-the-art SPARQL frontend for MonetDB as a by-product, and we already present some preliminary results on this platform.
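    A toy version of the PSO idea can be sketched in plain Python: one sorted (subject, object) column per predicate, and a star-pattern scan that merges those columns directly instead of issuing one self-join per predicate. This is only a sketch of the storage principle under invented data; the actual RDFscan/RDFjoin operators in MonetDB are far more involved.

```python
from collections import defaultdict

# Invented triples; all names are hypothetical.
triples = [
    ("p1", "type", "Person"), ("p1", "name", "Ann"), ("p1", "age", 30),
    ("p2", "type", "Person"), ("p2", "name", "Ben"), ("p2", "age", 41),
]

# PSO layout: per predicate, a column of (subject, object) pairs sorted
# by subject -- the regular part behaves like relational columnar storage.
def to_pso(triples):
    cols = defaultdict(list)
    for s, p, o in triples:
        cols[p].append((s, o))
    for col in cols.values():
        col.sort()
    return cols

# Toy "RDFscan": assemble a star pattern (several predicates of the same
# subject) by merging the sorted columns, with no self-join of a triple table.
def star_scan(pso, predicates):
    rows = defaultdict(dict)
    for p in predicates:
        for s, o in pso.get(p, []):
            rows[s][p] = o
    return {s: r for s, r in rows.items() if len(r) == len(predicates)}
```

Because each column is sorted by subject, a production system would merge them in one synchronized pass; the dictionary here stands in for that merge.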

    Partout: A Distributed Engine for Efficient RDF Processing

    The increasing interest in Semantic Web technologies has led not only to a rapid growth of semantic data on the Web but also to an increasing number of backend applications, some already holding more than a trillion triples. Confronted with such huge amounts of data and their future growth, existing state-of-the-art systems for storing RDF and processing SPARQL queries are no longer sufficient. In this paper, we introduce Partout, a distributed engine for efficient RDF processing in a cluster of machines. We propose an effective approach for fragmenting RDF data sets based on a query log, allocating the fragments to nodes in a cluster, and finding the optimal configuration. Partout can efficiently handle updates, and its query optimizer produces efficient execution plans for ad-hoc SPARQL queries. Our experiments show the superiority of our approach over state-of-the-art approaches to partitioning and distributed SPARQL query processing.
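    The core intuition of query-log-driven fragmentation can be illustrated with a minimal stdlib-only sketch: predicates that co-occur in logged queries are grouped together, so triples a typical query touches end up on the same node. The query log, triples, and grouping heuristic below are all invented for illustration and much simpler than Partout's actual fragmentation and allocation model.

```python
from collections import defaultdict

# Hypothetical query log: each query listed by the predicates it touches.
query_log = [["worksAt", "name"], ["worksAt"], ["name", "age"]]

# Invented triples.
triples = [
    ("a", "worksAt", "acme"), ("a", "name", "Ann"),
    ("b", "age", 7), ("b", "name", "Ben"),
]

# Group predicates that co-occur in a query (union-find), then spread
# the groups over the nodes, so a typical query stays on one node.
def fragment_by_log(query_log, num_nodes):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for q in query_log:
        for p in q:
            find(p)          # register every predicate seen in the log
        for p in q[1:]:
            union(q[0], p)   # co-occurring predicates share a group
    groups = defaultdict(list)
    for p in parent:
        groups[find(p)].append(p)
    assignment = {}
    for i, group in enumerate(groups.values()):
        for p in group:
            assignment[p] = i % num_nodes  # round-robin groups onto nodes
    return assignment

# Place each triple on the node its predicate was assigned to.
def place(triples, assignment):
    return {t: assignment[t[1]] for t in triples}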

    Exploiting emergent schemas to make RDF systems more efficient

    We build on our earlier finding that more than 95% of the triples in actual RDF graphs have a remarkably tabular structure, whose schema does not necessarily follow from explicit metadata such as ontologies, but which an RDF store can derive automatically by looking at the data, using so-called “emergent schema” detection techniques. In this paper we investigate how computers, and in particular RDF stores, can take advantage of this emergent schema to store RDF data more compactly and to optimize and execute SPARQL queries more efficiently. To this end, we contribute techniques for efficient emergent-schema-aware RDF storage and new query operator algorithms for emergent-schema-aware scans and joins. Together, these techniques allow RDF stores to fully catch up with relational database techniques in terms of rich physical database design options and efficiency, without requiring a rigid upfront schema definition.
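    A minimal illustration of emergent-schema detection: compute each subject's "characteristic set" (the set of predicates it uses) and count how often each set occurs; frequent sets become candidate relational tables. The triples are invented, and this omits the merging and refinement steps a real detector would need.

```python
from collections import defaultdict, Counter

# Invented triples: most subjects share the same predicate combination,
# mimicking the tabular regularity the paper reports in real RDF data.
triples = [
    ("s1", "name", "A"), ("s1", "age", 1),
    ("s2", "name", "B"), ("s2", "age", 2),
    ("s3", "name", "C"), ("s3", "age", 3),
    ("s4", "label", "odd one out"),
]

# Emergent-schema detection in miniature: the characteristic set of a
# subject is the set of predicates it uses; frequent characteristic
# sets become the columns of an emergent relational table.
def characteristic_sets(triples):
    preds = defaultdict(set)
    for s, p, o in triples:
        preds[s].add(p)
    return Counter(frozenset(ps) for ps in preds.values())

cs = characteristic_sets(triples)
# The dominant set {name, age} covers 3 of 4 subjects: a candidate
# two-column table; the leftover subject stays in triple form.
```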

    A study of graph partitioning techniques for fast indexing and query processing of a large RDF graph

    Title from PDF of title page, viewed on October 21, 2013. Thesis advisor: Praveen R. Rao. Vita. Includes bibliographic references (pages 58-62). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2013.
    In recent years, the Resource Description Framework (RDF) [34] has become increasingly important for the Web and in domains such as defense and healthcare. Companies such as the New York Times [40], Best Buy [39], and Pfizer are leveraging RDF and other Semantic Web technologies for data management. Using RDF, any assertion can be represented as a (subject, predicate, object) triple; a collection of triples together represents a graph. Many techniques have been developed for RDF indexing and query processing, and the most popular among them store and process RDF data using an RDBMS. In this thesis, we study the impact of existing graph partitioning techniques on indexing and query processing of a large RDF graph (e.g., YAGO [40]) with millions of edges and vertices. Our goal is to partition a large RDF graph into smaller graphs and then index the smaller graphs efficiently for faster query processing. To cope with cut edges, we compute the 2-hop distance across each cut edge. Once partitions are computed, we construct an index using a recently developed technique called RIS; queries are also processed using RIS. We report the benefits and trade-offs of two different partitioning strategies on the YAGO dataset, using metrics such as index construction time, index size, and query processing time. The first strategy treats the original RDF graph as an unweighted graph during partitioning; the second treats it as a weighted graph. We compared the results obtained by RIS (on partitioned graphs) with RDF-3X [38], a state-of-the-art RDF query processing engine.
    Introduction -- Background -- Proposed design -- Implementation -- Performance evaluation -- Conclusion and future work -- Appendix A. Algorithm
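    The cut-edge problem the thesis addresses can be sketched in plain Python: given a vertex partition, find the edges that cross partitions, and for each crossing gather the 2-hop neighborhood on the far side so bounded-radius queries can still be answered locally. The graph and the fixed partition below are invented stand-ins (a real system would obtain the partition from a tool such as METIS).

```python
from collections import defaultdict

# Invented labeled edges (subject, predicate, object) of a tiny RDF graph.
edges = [
    ("a", "knows", "b"), ("b", "knows", "c"),
    ("c", "knows", "d"), ("d", "knows", "a"),
]
# A fixed two-way vertex partition, standing in for real partitioner output.
partition = {"a": 0, "b": 0, "c": 1, "d": 1}

# Edges whose endpoints land in different partitions.
def cut_edges(edges, partition):
    return [e for e in edges if partition[e[0]] != partition[e[2]]]

# For a cut edge's far endpoint, collect everything reachable within
# 2 hops, so cross-partition patterns of bounded radius resolve locally.
def two_hop_context(edges, start):
    adj = defaultdict(list)
    for s, p, o in edges:
        adj[s].append(o)
    frontier, seen = {start}, {start}
    for _ in range(2):
        frontier = {n for v in frontier for n in adj[v]} - seen
        seen |= frontier
    return seen
```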

    A new filtering index for fast processing of SPARQL queries

    Title from PDF of title page, viewed on October 21, 2013. Vita. Thesis advisor: Praveen Rao. Includes bibliographic references (pages 78-82). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2013.
    The Resource Description Framework (RDF) has become a popular data model for representing data on the Web. Using RDF, any assertion can be represented as a (subject, predicate, object) triple; essentially, RDF datasets can be viewed as directed, labeled graphs. Queries on RDF data are written in the SPARQL query language and contain basic graph patterns (BGPs). We present a new filtering index and query processing technique for processing large BGPs in SPARQL queries. Our approach, called RIS, treats RDF graphs as "first-class citizens." Unlike previous scalable approaches that store RDF data as triples in an RDBMS and process SPARQL queries by executing appropriate SQL queries, RIS aims to speed up query processing by reducing the cost of join operations. In RIS, RDF graphs are mapped to signatures, which are multisets. These signatures are grouped based on a similarity metric and indexed using Counting Bloom Filters. During query processing, the Counting Bloom Filters are checked to filter out non-matches, and the remaining candidates are verified using Apache Jena. The filtering step prunes away a large portion of the dataset and results in faster query processing. We conducted an in-depth performance evaluation using the Lehigh University Benchmark (LUBM) dataset and SPARQL queries containing large BGPs. We compared RIS with RDF-3X, a state-of-the-art scalable RDF querying engine that uses an RDBMS.
    Introduction -- Motivation and related work -- Background -- Bloom filters and Bloom counters -- System architecture -- Signature tree generation -- Querying the signature tree -- Evaluation -- Experiments -- Conclusion
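    The filtering step can be illustrated with a minimal counting Bloom filter: signature elements are hashed into counters, and a query can only match a graph if every element of the query's signature may be present. The filter sizes, hash scheme, and signature encoding below are invented for the sketch; RIS's actual signatures and index are more elaborate.

```python
import hashlib

# A minimal counting Bloom filter; counters (not bits) allow deletions.
class CountingBloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes = size, hashes
        self.counts = [0] * size

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests of the item.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.counts[pos] += 1

    def remove(self, item):
        for pos in self._positions(item):
            self.counts[pos] -= 1

    def may_contain(self, item):
        # No false negatives; occasional false positives are possible.
        return all(self.counts[pos] > 0 for pos in self._positions(item))

# Filtering step: a query can only match a data graph if every element
# of the query signature may be present in that graph's filter.
def may_match(filter_, query_signature):
    return all(filter_.may_contain(e) for e in query_signature)
```

Filters answer "definitely not" or "maybe"; surviving candidates still need exact verification (Apache Jena, in RIS), which is why the filter only prunes and never decides matches.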

    Emergent relational schemas for RDF
