113 research outputs found
Scalable RDF Data Compression using X10
The Semantic Web comprises enormous volumes of semi-structured data elements.
For interoperability, these elements are represented by long strings. Such
representations are not efficient for the purposes of Semantic Web applications
that perform computations over large volumes of information. A typical method
for alleviating the impact of this problem is through the use of compression
methods that produce more compact representations of the data. The use of
dictionary encoding for this purpose is particularly prevalent in Semantic Web
database systems. However, centralized implementations present performance
bottlenecks, giving rise to the need for scalable, efficient distributed
encoding schemes. In this paper, we describe an encoding implementation based
on the asynchronous partitioned global address space (APGAS) parallel
programming model. We evaluate performance on a cluster of up to 384 cores and
datasets of up to 11 billion triples (1.9 TB). Compared to the state-of-art
MapReduce algorithm, we demonstrate a speedup of 2.6-7.4x and excellent
scalability. These results illustrate the strong potential of the APGAS model
for efficient implementation of dictionary encoding and contributes to the
engineering of larger scale Semantic Web applications
Efficient Parallel Dictionary Encoding for RDF Data.
The SemanticWeb comprises enormous volumes of semi-structured data elements. For interoperability, these elements are represented by long strings. Such representations are not efficient for the purposes
of SemanticWeb applications that perform computations over
large volumes of information. A typical method for alleviating the impact of this problem is through the use of compression methods that produce more compact representations of the data. The use of dictionary encoding for this purpose is particularly prevalent in Semantic Web database systems. However, centralized implementations
present performance bottlenecks, giving rise to the
need for scalable, efficient distributed encoding schemes. In this paper, we describe a straightforward but very efficient encoding algorithm and evaluate its performance on a cluster of up to 384 cores and datasets of up to 11 billion triples (1.9 TB). Compared to the state-of-art MapReduce algorithm, we demonstrate a speedup of
2:6 - 7:4x and excellent scalability
Efficient Parallel Dictionary Encoding for RDF Data.
The SemanticWeb comprises enormous volumes of semi-structured data elements. For interoperability, these elements are represented by long strings. Such representations are not efficient for the purposes
of SemanticWeb applications that perform computations over
large volumes of information. A typical method for alleviating the impact of this problem is through the use of compression methods that produce more compact representations of the data. The use of dictionary encoding for this purpose is particularly prevalent in Semantic Web database systems. However, centralized implementations
present performance bottlenecks, giving rise to the
need for scalable, efficient distributed encoding schemes. In this paper, we describe a straightforward but very efficient encoding algorithm and evaluate its performance on a cluster of up to 384 cores and datasets of up to 11 billion triples (1.9 TB). Compared to the state-of-art MapReduce algorithm, we demonstrate a speedup of
2:6 - 7:4x and excellent scalability
A scalable analysis framework for large-scale RDF data
With the growth of the Semantic Web, the availability of RDF datasets from multiple domains
as Linked Data has taken the corpora of this web to a terabyte-scale, and challenges
modern knowledge storage and discovery techniques. Research and engineering on RDF
data management systems is a very active area with many standalone systems being introduced.
However, as the size of RDF data increases, such single-machine approaches meet
performance bottlenecks, in terms of both data loading and querying, due to the limited
parallelism inherent to symmetric multi-threaded systems and the limited available system
I/O and system memory. Although several approaches for distributed RDF data processing
have been proposed, along with clustered versions of more traditional approaches, their
techniques are limited by the trade-off they exploit between loading complexity and query
efficiency in the presence of big RDF data. This thesis then, introduces a scalable analysis
framework for processing large-scale RDF data, which focuses on various techniques to
reduce inter-machine communication, computation and load-imbalancing so as to achieve
fast data loading and querying on distributed infrastructures.
The first part of this thesis focuses on the study of RDF store implementation and parallel
hashing on big data processing. (1) A system-level investigation of RDF store implementation
has been conducted on the basis of a comparative analysis of runtime characteristics
of a representative set of RDF stores. The detailed time cost and system consumption is
measured for data loading and querying so as to provide insight into different triple store
implementation as well as an understanding of performance differences between different
platforms. (2) A high-level structured parallel hashing approach over distributed memory is
proposed and theoretically analyzed. The detailed performance of hashing implementations
using different lock-free strategies has been characterized through extensive experiments,
thereby allowing system developers to make a more informed choice for the implementation
of their high-performance analytical data processing systems.
The second part of this thesis proposes three main techniques for fast processing of large
RDF data within the proposed framework. (1) A very efficient parallel dictionary encoding
algorithm, to avoid unnecessary disk-space consumption and reduce computational complexity of query execution. The presented implementation has achieved notable speedups
compared to the state-of-art method and also has achieved excellent scalability. (2) Several
novel parallel join algorithms, to efficiently handle skew over large data during query processing.
The approaches have achieved good load balancing and have been demonstrated
to be faster than the state-of-art techniques in both theoretical and experimental comparisons.
(3) A two-tier dynamic indexing approach for processing SPARQL queries has been
devised which keeps loading times low and decreases or in some instances removes intermachine
data movement for subsequent queries that contain the same graph patterns. The
results demonstrate that this design can load data at least an order of magnitude faster than
a clustered store operating in RAM while remaining within an interactive range for query
processing and even outperforms current systems for various queries
LiteMat: a scalable, cost-efficient inference encoding scheme for large RDF graphs
The number of linked data sources and the size of the linked open data graph
keep growing every day. As a consequence, semantic RDF services are more and
more confronted with various "big data" problems. Query processing in the
presence of inferences is one them. For instance, to complete the answer set of
SPARQL queries, RDF database systems evaluate semantic RDFS relationships
(subPropertyOf, subClassOf) through time-consuming query rewriting algorithms
or space-consuming data materialization solutions. To reduce the memory
footprint and ease the exchange of large datasets, these systems generally
apply a dictionary approach for compressing triple data sizes by replacing
resource identifiers (IRIs), blank nodes and literals with integer values. In
this article, we present a structured resource identification scheme using a
clever encoding of concepts and property hierarchies for efficiently evaluating
the main common RDFS entailment rules while minimizing triple materialization
and query rewriting. We will show how this encoding can be computed by a
scalable parallel algorithm and directly be implemented over the Apache Spark
framework. The efficiency of our encoding scheme is emphasized by an evaluation
conducted over both synthetic and real world datasets.Comment: 8 pages, 1 figur
Extending the Internet of Things to the future Internet through IPv6 Support
Emerging Internet of Things (IoT)/Machine-to-Machine (M2M) systems require a transparent access to information and services through a seamless integration into the Future Internet. This integration exploits infrastructure and services found on the Internet by the IoT. On the one hand, the so-called Web of Things aims for direct Web connectivity by pushing its technology down to devices and smart things. On the other hand, the current and Future Internet offer stable, scalable, extensive, and tested protocols for node and service discovery, mobility, security, and auto-configuration, which are also required for the IoT. In order to integrate the IoT into the Internet, this work adapts, extends, and bridges using IPv6 the existing IoT building blocks (such as solutions from IEEE 802.15.4, BT-LE, RFID) while maintaining backwards compatibility with legacy networked embedded systems from building and industrial automation. Specifically, this work presents an extended Internet stack with a set of adaptation layers from non-IP towards the IPv6-based network layer in order to enable homogeneous access for applications and services
Selectivity Estimation for SPARQL Triple Patterns with Shape Expressions
International audienceShEx (Shape Expressions) is a language for expressing constraints on RDF graphs. In this work we optimize the evaluation of conjunctive SPARQL queries, on RDF graphs, by taking advantage of ShEx constraints. Our optimization is based on computing and assigning ranks to query triple patterns , dictating their order of execution. We first define a set of well formed ShEx schemas, that possess interesting characteristics for SPARQL query optimization. We then define our optimization method by exploiting information extracted from a ShEx schema. We finally report on evaluation results performed showing the advantages of applying our optimization on the top of an existing state-of-the-art query evaluation system
Sensor function virtualization to support distributed intelligence in the internet of things
It is estimated that-by 2020-billion devices will be connected to the Internet. This number not only includes TVs, PCs, tablets and smartphones, but also billions of embedded sensors that will make up the "Internet of Things" and enable a whole new range of intelligent services in domains such as manufacturing, health, smart homes, logistics, etc. To some extent, intelligence such as data processing or access control can be placed on the devices themselves. Alternatively, functionalities can be outsourced to the cloud. In reality, there is no single solution that fits all needs. Cooperation between devices, intermediate infrastructures (local networks, access networks, global networks) and/or cloud systems is needed in order to optimally support IoT communication and IoT applications. Through distributed intelligence the right communication and processing functionality will be available at the right place. The first part of this paper motivates the need for such distributed intelligence based on shortcomings in typical IoT systems. The second part focuses on the concept of sensor function virtualization, a potential enabler for distributed intelligence, and presents solutions on how to realize it
- …