RDF DATABASES – CASE STUDY AND PERFORMANCE EVALUATION
The Resource Description Framework (RDF) data presentation model and the SPARQL query language have been at the core of semantic web technologies since the early 2000s. In this article, we evaluate three RDF storage technologies. Our motivation is to find a storage solution that can be used to process "big data" RDF sets. Our method is based on measuring query response times on large samples (hundreds of thousands of RDF documents, millions of RDF statements). We find that all the evaluated technologies provide much better performance than querying RDF data stored in files. However, with 300,000 documents, even with the fastest technology, an aggregation query still takes more than 100 seconds in our environment. As a further performance improvement, we test the same data and queries with MongoDB, and demonstrate its performance (10 seconds instead of 100) and scalability (up to 1,000,000 documents). However, despite its benefits, we must note that because of its data presentation and query limitations, MongoDB probably cannot serve as a generic store for all kinds of RDF documents.
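To make the MongoDB comparison concrete, here is a minimal Python sketch of one plausible mapping: triples about the same subject are folded into a single JSON-like document, and an aggregation like the one timed in the article is expressed as a pipeline. All data and field names (`dc:title`, `dc:creator`) are illustrative assumptions, not the article's actual layout.

```python
from collections import defaultdict

# Toy RDF statements (all data made up for illustration).
triples = [
    ("ex:doc1", "dc:title",   "Intro to RDF"),
    ("ex:doc1", "dc:creator", "Alice"),
    ("ex:doc2", "dc:title",   "SPARQL Basics"),
    ("ex:doc2", "dc:creator", "Alice"),
]

def to_documents(triples):
    """Fold all triples about one subject into a single JSON-like document."""
    docs = defaultdict(dict)
    for s, p, o in triples:
        docs[s].setdefault(p, []).append(o)   # predicates may be multi-valued
    return [{"_id": s, **props} for s, props in docs.items()]

docs = to_documents(triples)

# A per-creator count, as timed in the article, could then be expressed
# as a MongoDB aggregation pipeline over these documents:
pipeline = [
    {"$unwind": "$dc:creator"},
    {"$group": {"_id": "$dc:creator", "n": {"$sum": 1}}},
]
```

The mapping also hints at the limitation the authors note: documents group by subject, so queries that do not follow the chosen grouping (e.g. arbitrary graph patterns) fit MongoDB's query model poorly.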
Self-organizing Structured RDF in MonetDB
The semantic web uses RDF as its data model, providing ultimate flexibility
for users to represent and evolve data without the need for a schema.
Yet this flexibility poses challenges for implementing efficient RDF stores:
query plans over a triple table contain very many self-joins that are hard
to optimize, and data locality is lost because, without a notion of
multi-attribute data structure, clustered indexing opportunities are missed.
Apart from these performance issues, users of huge RDF graphs often have problems
formulating queries, as they lack any system-supported notion of the structure in the data.
In this research, we exploit the observation that real RDF data, while not as regularly
structured as relational data, still has the great majority of its triples conforming to regular patterns.
We conjecture that a system that recognized this structure automatically
would make RDF stores both more efficient and easier to use.
Concretely, we propose to derive self-organizing RDF that stores data
in PSO format in such a way that the regular parts of the data physically
correspond to relational columnar storage, and propose RDFscan/RDFjoin algorithms
that compute star patterns over these without wasting effort on self-joins.
These regular parts, i.e. tables, are identified on ingestion by a schema discovery
algorithm; as such, users gain an SQL view of the regular part of the RDF data.
This research aims to produce a state-of-the-art SPARQL frontend for MonetDB
as a by-product, and we already present some preliminary results on this platform.
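The schema discovery step described above can be illustrated with a toy version of "characteristic set" counting: collect each subject's predicate set and count how many subjects share the same set. This is a sketch of the general idea only, not the paper's actual ingestion algorithm; all data is made up.

```python
from collections import Counter

# Toy triples: three "person-like" subjects plus one irregular triple.
triples = [
    ("p1", "name", "Alice"), ("p1", "age", "30"),
    ("p2", "name", "Bob"),   ("p2", "age", "25"),
    ("p3", "name", "Carol"), ("p3", "age", "41"),
    ("b1", "label", "misc"),
]

def characteristic_sets(triples):
    """Count how many subjects share each exact set of predicates."""
    preds = {}
    for s, p, _ in triples:
        preds.setdefault(s, set()).add(p)
    return Counter(frozenset(ps) for ps in preds.values())

cs = characteristic_sets(triples)
# {name, age} is shared by 3 subjects -> a candidate relational "table";
# the lone {label} subject stays in the irregular remainder.
```

Frequent predicate sets become columnar tables; the rare remainder stays in triple form, matching the paper's observation that most, but not all, triples are regular.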
Partout: A Distributed Engine for Efficient RDF Processing
The increasing interest in Semantic Web technologies has led not only to a
rapid growth of semantic data on the Web but also to an increasing number of
backend applications with already more than a trillion triples in some cases.
Confronted with such huge amounts of data and their expected future growth, existing
state-of-the-art systems for storing RDF and processing SPARQL queries are no
longer sufficient. In this paper, we introduce Partout, a distributed engine
for efficient RDF processing in a cluster of machines. We propose an effective
approach for fragmenting RDF data sets based on a query log, allocating the
fragments to nodes in a cluster, and finding the optimal configuration. Partout
can efficiently handle updates and its query optimizer produces efficient query
execution plans for ad-hoc SPARQL queries. Our experiments show the superiority
of our approach over state-of-the-art approaches to partitioning and distributed
SPARQL query processing.
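A greatly simplified sketch of query-log-driven fragmentation: triples are fragmented horizontally by predicate, and fragments are greedily allocated to the least-loaded node, most-queried predicate first. Partout's actual fragmentation and cost model are far more sophisticated; every name below is illustrative.

```python
from collections import Counter

# Toy triples and a query log of predicates touched by past queries.
triples = [
    ("a", "type", "Person"), ("a", "name", "Alice"),
    ("b", "type", "Person"), ("b", "name", "Bob"),
    ("a", "knows", "b"),
]
query_log = ["type", "type", "name"]      # predicate usage frequencies
freq = Counter(query_log)

# 1) Fragment horizontally by predicate.
fragments = {}
for t in triples:
    fragments.setdefault(t[1], []).append(t)

# 2) Allocate fragments to nodes: hottest fragment first, onto the
#    currently lightest node (a crude stand-in for a cost model).
nodes = {"node0": [], "node1": []}
for pred in sorted(fragments, key=lambda p: -freq[p]):
    target = min(nodes, key=lambda n: len(nodes[n]))
    nodes[target].extend(fragments[pred])
```

Grouping triples that past queries access together keeps ad-hoc queries local to few nodes, which is the intuition behind log-driven fragmentation.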
Exploiting emergent schemas to make RDF systems more efficient
We build on our earlier finding that more than 95% of the
triples in actual RDF triple graphs have a remarkably tabular structure,
whose schema does not necessarily follow from explicit metadata such as
ontologies, but which an RDF store can derive automatically by looking
at the data, using so-called "emergent schema" detection techniques.
In this paper we investigate how computers, and in particular RDF stores,
can take advantage of this emergent schema to store RDF data more compactly
and to optimize and execute SPARQL queries more efficiently.
To this end, we contribute techniques for efficient emergent-schema-aware
RDF storage and new query operator algorithms for emergent-schema-aware
scans and joins. In all, these techniques allow RDF systems to
fully catch up with relational database techniques in terms of rich
physical database design options and efficiency, without requiring a rigid
upfront schema definition.
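The "relational storage for the regular part" idea can be sketched with SQLite: subjects whose predicate set matches a discovered pattern are loaded into a relational table (one column per predicate), while irregular triples would stay in a triple-store remainder. Table and column names are assumptions for illustration, not the paper's storage layout.

```python
import sqlite3
from collections import defaultdict

triples = [
    ("p1", "name", "Alice"), ("p1", "age", 30),
    ("p2", "name", "Bob"),   ("p2", "age", 25),
    ("b1", "label", "misc"),           # irregular: not part of the table
]

# Group triples per subject (toy: one value per predicate).
subj = defaultdict(dict)
for s, p, o in triples:
    subj[s][p] = o

# Subjects matching the discovered {name, age} pattern go into a table.
cols = {"name", "age"}
regular = {s: ps for s, ps in subj.items() if set(ps) == cols}

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (s TEXT, name TEXT, age INTEGER)")
con.executemany("INSERT INTO person VALUES (?, ?, ?)",
                [(s, ps["name"], ps["age"]) for s, ps in regular.items()])

# A star-shaped SPARQL pattern over regular data becomes a plain
# relational scan instead of a chain of self-joins:
rows = con.execute("SELECT name FROM person WHERE age > 27").fetchall()
```

The point of the sketch: once the regular part sits in a wide table, a star pattern over one subject needs no self-joins at all, which is where the efficiency gains come from.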
A study of graph partitioning techniques for fast indexing and query processing of a large RDF graph
Title from PDF of title page, viewed on October 21, 2013. Thesis advisor: Praveen R. Rao. Vita. Includes bibliographic references (pages 58-62). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2013.
In recent years, the Resource Description Framework (RDF) [34] has become increasingly important for the Web and in domains such as defense and healthcare. Companies such as the New York Times [40], Best Buy [39], and Pfizer are leveraging RDF and other Semantic Web technologies for data management. Using RDF, any assertion can be represented as a (subject, predicate, object) triple. The collection of triples together represents a graph. Many techniques have been developed for RDF indexing and query processing, and the most popular among them store and process RDF data using an RDBMS. In this thesis, we study the impact of existing graph partitioning techniques on indexing and query processing of a large RDF graph (e.g., YAGO [40]) with millions of edges and vertices. Our goal is to partition a large RDF graph into smaller graphs and then index the smaller graphs efficiently for faster query processing. In order to cope with cut edges, we compute the 2-hop distance across each cut edge. Once partitions are computed, we construct an index using a recently developed technique called RIS. Queries are also processed using RIS. We report the benefits and trade-offs of two different partitioning strategies using the YAGO dataset on metrics such as index construction time, index size, and query processing time. The first partitioning strategy treats the original RDF graph as an unweighted graph during partitioning. The second strategy treats the original graph as a weighted graph during partitioning.
We compared the results obtained by RIS (on partitioned graphs) with RDF-3X [38], a state-of-the-art RDF query processing engine.
Introduction -- Background -- Proposed design -- Implementation -- Performance evaluation -- Conclusion and future work -- Appendix A. Algorithm
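The "2-hop distance across each cut edge" idea from the abstract above can be sketched as follows: after a toy 2-way partition, we identify cut edges and compute the 2-hop neighbourhood around their endpoints, i.e. the vertices a partition would replicate so that short path queries stay local. This is a schematic reading of the thesis, not its implementation; the graph and partition are made up.

```python
# Toy path graph and a 2-way partition (all made up).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
part = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}

# Cut edges cross the partition boundary.
cut = [(u, v) for u, v in edges if part[u] != part[v]]

# Undirected adjacency.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def two_hop(n):
    """Vertices within distance 2 of n -- the neighbourhood a partition
    would replicate around a cut-edge endpoint to keep short path
    queries from crossing partitions."""
    one = adj.get(n, set())
    return (one | {m for x in one for m in adj.get(x, set())}) - {n}
```

The trade-off the thesis measures follows directly: replicating 2-hop neighbourhoods enlarges each partition's index but avoids cross-partition joins at query time.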
A new filtering index for fast processing of SPARQL queries
Title from PDF of title page, viewed on October 21, 2013. Vita. Thesis advisor: Praveen Rao. Includes bibliographic references (pages 78-82). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2013.
The Resource Description Framework (RDF) has become a popular data model for
representing data on the Web. Using RDF, any assertion can be represented as a (subject,
predicate, object) triple. Essentially, RDF datasets can be viewed as directed, labeled
graphs. Queries on RDF data are written using the SPARQL query language and contain
basic graph patterns (BGPs). We present a new filtering index and query processing
technique for processing large BGPs in SPARQL queries. Our approach, called RIS, treats
RDF graphs as "first-class citizens." Unlike previous scalable approaches that store RDF
data as triples in an RDBMS and process SPARQL queries by executing appropriate SQL
queries, RIS aims to speed up query processing by reducing the processing cost of join
operations. In RIS, RDF graphs are mapped into signatures, which are multisets. These
signatures are grouped based on a similarity metric and indexed using Counting Bloom
Filters. During query processing, the Counting Bloom Filters are checked to filter out
non-matches, and finally the candidates are verified using Apache Jena. The filtering step
prunes away a large portion of the dataset and results in faster processing of queries. We
have conducted an in-depth performance evaluation using the Lehigh University
Benchmark (LUBM) dataset and SPARQL queries containing large BGPs. We compared RIS with RDF-3X, which is a state-of-the-art scalable RDF querying engine that uses an RDBMS. RIS can significantly outperform RDF-3X in terms of total execution time for the tested dataset and queries.
Introduction -- Motivation and related work -- Background -- Bloom filters and Bloom counters -- System architecture -- Signature tree generation -- Querying the signature tree -- Evaluation -- Experiments -- Conclusion
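A minimal sketch of the filtering idea in the abstract above: a Counting Bloom Filter summarizes a graph's signature multiset, and a query signature can only match if every element's counters meet its multiplicity; otherwise the graph is pruned before expensive verification. The sizing, hashing, and similarity-based grouping are simplifications, not RIS's actual parameters.

```python
import hashlib
from collections import Counter

class CountingBloomFilter:
    """Counting Bloom filter over a multiset (supports multiplicity checks)."""

    def __init__(self, size=64, k=3):
        self.size, self.k = size, k
        self.counts = [0] * size

    def _positions(self, item):
        # Derive k counter positions from one SHA-256 digest.
        h = hashlib.sha256(str(item).encode()).digest()
        return [int.from_bytes(h[4 * i:4 * i + 4], "big") % self.size
                for i in range(self.k)]

    def add(self, item):
        for i in self._positions(item):
            self.counts[i] += 1

    def may_contain(self, signature):
        """True unless some element's counters fall below its multiplicity.
        False means 'definitely no match' -> the graph can be pruned
        without running the expensive verification step."""
        need = Counter(signature)
        return all(all(self.counts[i] >= m for i in self._positions(e))
                   for e, m in need.items())

# Index a toy data-graph signature (a multiset of edge labels).
cbf = CountingBloomFilter()
for elem in ["name", "name", "age"]:
    cbf.add(elem)
```

As with any Bloom-style structure, `may_contain` admits false positives (hence the Jena verification step in RIS) but never false negatives, so pruning is always safe.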