9 research outputs found
Predicting large scale fine grain energy consumption
Today a large volume of energy-related data have been continuously collected. Extracting actionable knowledge from such data is a multi-step process that opens up a variety of interesting and novel research issues across two domains: energy and computer science. The computer science aim is to provide energy scientists with cutting-edge and scalable engines to effectively support them in their daily research activities. This paper presents SPEC, a scalable and distributed predictor of fine grain energy consumption in buildings. SPEC exploits a data stream methodology analysis over a sliding time window to train a prediction model tailored to each building. The building model is then exploited to predict the upcoming energy consumption at a time instant in the near future. SPEC currently integrates the artificial neural networks technique and the random forest regression algorithm. The SPEC methodology exploits the computational advantages of distributed computing frameworks as the current implementation runs on Spark. As a case study, real data of thermal energy consumption collected in a major city have been exploited to preliminarily assess the SPEC accuracy. The initial results are promising and represent a first step towards predicting fine grain energy consumption over a sliding time window
PyRQA -- Conducting Recurrence Quantification Analysis on Very Long Time Series Efficiently
PyRQA is a software package that efficiently conducts recurrence
quantification analysis (RQA) on time series consisting of more than one
million data points. RQA is a method from non-linear time series analysis that
quantifies the recurrent behaviour of systems. Existing implementations to RQA
are not capable of analysing such very long time series at all or require large
amounts of time to calculate the quantitative measures. PyRQA overcomes their
limitations by conducting the RQA computations in a highly parallel manner.
Building on the OpenCL framework, PyRQA leverages the computing capabilities of
a variety of parallel hardware architectures, such as GPUs. The underlying
computing approach partitions the RQA computations and enables to employ
multiple compute devices at the same time. The goal of this publication is to
demonstrate the features and the runtime efficiency of PyRQA. For this purpose
we employ a real-world example, comparing the dynamics of two climatological
time series, and a synthetic example, reducing the runtime regarding the
analysis of a series consisting of over one million data points from almost
eight hours using state-of-the-art RQA software to roughly 69 seconds using
PyRQA.Comment: 15 pages, 3 figure
Finite Automata Algorithms in Map-Reduce
In this thesis the intersection of several large nondeterministic finite automata (NFA's) as well as minimization of a large deterministic finite automaton (DFA) in map-reduce are studied. We have derived a lower bound on replication rate for computing NFA intersections and provided three concrete algorithms for the problem. Our investigation of the replication rate for each of all three algorithms shows where each algorithm could be applied through detailed experiments on large datasets of finite automata. Denoting n the number of states in DFA A, we propose an algorithm to minimize A in n map-reduce rounds in the worst-case. Our experiments, however, indicate that the number of rounds, in practice, is much smaller than n for all DFA's we examined. In other words, this algorithm converges in d iterations by computing the equivalence classes of each state, where d is the diameter of the input DFA
Flexible query processing of SPARQL queries
SPARQL is the predominant language for querying RDF data, which is the standard
model for representing web data and more specifically Linked Open Data (a
collection of heterogeneous connected data). Datasets in RDF form can be hard to
query by a user if she does not have a full knowledge of the structure of the dataset.
Moreover, many datasets in Linked Data are often extracted from actual web page
content which might lead to incomplete or inaccurate data.
We extend SPARQL 1.1 with two operators, APPROX and RELAX, previously
introduced in the context of regular path queries. Using these operators we are able
to support
exible querying over the property path queries of SPARQL 1.1. We call
this new language SPARQLAR.
Using SPARQLAR users are able to query RDF data without fully knowing the
structure of a dataset. APPROX and RELAX encapsulate different aspects of query flexibility: finding different answers and finding more answers, respectively. This
means that users can access complex and heterogeneous datasets without the need
to know precisely how the data is structured.
One of the open problems we address is how to combine the APPROX and
RELAX operators with a pragmatic language such as SPARQL. We also devise an
implementation of a system that evaluates SPARQLAR queries in order to study the
performance of the new language.
We begin by defining the semantics of SPARQLAR and the complexity of query
evaluation. We then present a query processing technique for evaluating SPARQLAR
queries based on a rewriting algorithm and prove its soundness and completeness.
During the evaluation of a SPARQLAR query we generate multiple SPARQL 1.1
queries that are evaluated against the dataset. Each such query will generate answers
with a cost that indicates their distance with respect to the exact form of the original
SPARQLAR query.
Our prototype implementation incorporates three optimisation techniques that
aim to enhance query execution performance: the first optimisation is a pre-computation
technique that caches the answers of parts of the queries generated by the rewriting
algorithm. These answers will then be reused to avoid the re-execution of those sub-queries. The second optimisation utilises a summary of the dataset to discard
queries that it is known will not return any answer. The third optimisation technique
uses the query containment concept to discard queries whose answers would
be returned by another query at the same or lower cost.
We conclude by conducting a performance study of the system on three different
RDF datasets: LUBM (Lehigh University Benchmark), YAGO and DBpedia