HYBRIDJOIN for near-real-time Data Warehousing
An important component of near-real-time data warehouses is the near-real-time integration layer, and a key element of near-real-time data integration is the join of a continuous input data stream with a disk-based relation. For high-throughput streams, stream-based algorithms such as Mesh Join (MESHJOIN) can be used; however, MESHJOIN's performance is inversely proportional to the size of the disk-based relation. The Index Nested Loop Join (INLJ) can be set up to process stream input and can deal with intermittencies in the update stream, but it has low throughput. This paper introduces a robust stream-based join algorithm called Hybrid Join (HYBRIDJOIN), which combines the two approaches. A theoretical result shows that HYBRIDJOIN is asymptotically as fast as the faster of the two algorithms. The authors present performance measurements of the implementation. In experiments using synthetic data based on a Zipfian distribution, HYBRIDJOIN performs significantly better for typical parameters of the Zipfian distribution and, in general, performs in accordance with the theoretical model, while the other two algorithms are unacceptably slow under various settings.
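The combination the abstract describes can be illustrated with a toy sketch: serve frequent stream keys from an in-memory cache (INLJ-style index probing) and resolve misses against the disk-based relation (here a plain dict standing in for disk access). The cache policy and all names are illustrative assumptions, not the paper's implementation.

```python
# Toy hybrid stream-relation join: index-style cache hits for hot keys,
# relation probing for misses. The relation dict stands in for disk I/O.
from collections import Counter, deque

def hybrid_join(stream, relation, cache_size=2):
    """Join stream tuples (key, payload) with a dict-backed 'disk' relation."""
    cache = {}        # hot fraction of the relation, probed INLJ-style
    freq = Counter()  # key popularity; Zipfian streams reward caching hot keys
    pending = deque() # misses, resolved by probing the relation
    out = []
    for key, payload in stream:
        freq[key] += 1
        if key in cache:                       # stream-probing phase
            out.append((key, payload, cache[key]))
        else:
            pending.append((key, payload))
        while pending:                         # disk-probing phase
            k, p = pending.popleft()
            if k in relation:
                out.append((k, p, relation[k]))
                # keep the most frequently seen keys cached
                if len(cache) < cache_size or freq[k] > min(
                        freq[c] for c in cache):
                    cache[k] = relation[k]
                    while len(cache) > cache_size:
                        cold = min(cache, key=lambda c: freq[c])
                        del cache[cold]
    return out
```

With a skewed stream, repeated keys hit the cache and skip the (simulated) disk probe, which is the effect HYBRIDJOIN exploits.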
Parallelizing Windowed Stream Joins in a Shared-Nothing Cluster
The availability of a large number of processing nodes in a parallel and
distributed computing environment enables sophisticated real-time processing
over high-speed data streams, as required by many emerging applications.
Sliding-window stream joins are among the most important operators in a stream
processing system. In this paper, we consider the issue of parallelizing a
sliding-window stream join operator over a shared-nothing cluster. We propose a
framework, based on a fixed, predefined communication pattern, to distribute
the join-processing load over the shared-nothing cluster. We consider the
various overheads that arise when scaling to a large number of nodes, and
propose methodologies to cope with them. We implement the algorithm over a
cluster using a message-passing system, and present experimental results
showing the effectiveness of the join-processing algorithm.
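A minimal sketch of the usual fixed communication pattern for such parallelization (not necessarily the paper's exact framework): hash-partition both streams' windows on the join key, so any pair of matching tuples meets on exactly one node and the per-node joins need no further communication.

```python
# Content-based hash partitioning of two stream windows across "nodes".
# Names and the in-process simulation of nodes are illustrative only.
NUM_NODES = 4

def route(tup):
    """Route a (key, value) tuple to a node by hashing its join key."""
    return hash(tup[0]) % NUM_NODES

def windowed_join(node_r, node_s):
    """Join the R- and S-window contents held by one node on equal keys."""
    return [(rk, rv, sv) for rk, rv in node_r for sk, sv in node_s if rk == sk]

# Distribute both windows with the same partitioning function.
r_window = [(1, 'r1'), (2, 'r2'), (5, 'r5')]
s_window = [(1, 's1'), (5, 's5'), (7, 's7')]
nodes_r = [[t for t in r_window if route(t) == n] for n in range(NUM_NODES)]
nodes_s = [[t for t in s_window if route(t) == n] for n in range(NUM_NODES)]
matches = [m for n in range(NUM_NODES)
           for m in windowed_join(nodes_r[n], nodes_s[n])]
```

Because the routing function is fixed up front, each node's workload is determined purely by key distribution, which is where the scaling overheads the abstract mentions (skew, window maintenance) come in.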
Towards Analytics Aware Ontology Based Access to Static and Streaming Data (Extended Version)
Real-time analytics that requires the integration and aggregation of
heterogeneous, distributed streaming and static data is a typical task in
many industrial scenarios, such as diagnostics of turbines at Siemens. The
ontology-based data access (OBDA) approach has great potential to facilitate
such tasks; however, it has a number of limitations in dealing with analytics
that restrict its use in important industrial applications. Based on our
experience with Siemens, we argue that in order to overcome those limitations,
OBDA should be extended to become analytics, source, and cost aware. In this
work we propose such an extension. In particular, we propose an ontology,
mapping, and query language for OBDA in which aggregate and other analytical
functions are first-class citizens. Moreover, we develop query optimisation
techniques that allow analytical tasks to be processed efficiently over static
and streaming data. We implement our approach in a system and evaluate it with
Siemens turbine data.
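The practical payoff of making aggregates first-class can be sketched with a toy contrast (not the paper's system; the schema and queries are invented): a standard rewriting ships raw tuples and aggregates in the mediator, while an analytics-aware rewriting pushes the aggregate down to the source.

```python
# Toy source: an in-memory SQLite table of turbine temperature readings.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE measurement (turbine TEXT, temp REAL)')
conn.executemany('INSERT INTO measurement VALUES (?, ?)',
                 [('t1', 70.0), ('t1', 90.0), ('t2', 60.0)])

# Aggregation-unaware rewriting: fetch every row, average client-side.
rows = conn.execute('SELECT turbine, temp FROM measurement').fetchall()
naive = {}
for turbine, temp in rows:
    naive.setdefault(turbine, []).append(temp)
naive = {k: sum(v) / len(v) for k, v in naive.items()}

# Analytics-aware rewriting: AVG is part of the rewritten query, so the
# source evaluates it and only one row per turbine crosses the wire.
pushed = dict(conn.execute(
    'SELECT turbine, AVG(temp) FROM measurement GROUP BY turbine'))
```

Both produce the same answer, but the pushed-down form transfers one row per group instead of every measurement, which is the kind of saving cost-aware optimisation targets.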
Effective Use Methods for Continuous Sensor Data Streams in Manufacturing Quality Control
This work outlines an approach for managing sensor data streams of continuous numerical data in product-manufacturing settings, emphasizing statistical process control, low computational and memory overhead, and retention of the information needed to reduce the impact of nonconformance to quality specifications. While there is extensive literature, knowledge, and documentation about standard data sources and databases, the high volume and velocity of sensor data streams often make traditional analysis infeasible. To that end, an overview of data stream fundamentals is given first. An analysis of commonly used stream preprocessing and load-shedding methods follows, then a discussion of aggregation procedures. Stream storage and querying systems are the next topics. Further, existing machine learning techniques for data streams are presented, with a focus on regression. Finally, the work describes a novel methodology for managing sensor data streams in which data stream management systems record aggregate data from small time intervals, together with the individual stream measurements that are nonconforming. The aggregates are continually entered into control charts and regressed on. To conserve memory, old data are periodically reaggregated at higher levels.
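The methodology in the final sentences can be sketched as follows; window sizes, specification limits, and function names are illustrative assumptions, not the work's parameters.

```python
# Per-window aggregation, retention of out-of-spec raw readings, and
# periodic reaggregation of old windows to bound memory use.

def summarize(window):
    """Aggregate one window of readings into (count, mean, min, max)."""
    n = len(window)
    return (n, sum(window) / n, min(window), max(window))

def process_stream(readings, window_size, lsl, usl):
    """Return per-window aggregates plus the raw nonconforming readings,
    i.e. those outside the lower/upper specification limits."""
    aggregates, nonconforming = [], []
    for i in range(0, len(readings), window_size):
        window = readings[i:i + window_size]
        aggregates.append(summarize(window))
        nonconforming += [x for x in window if not lsl <= x <= usl]
    return aggregates, nonconforming

def reaggregate(aggregates, factor):
    """Merge consecutive aggregates (e.g. minutes -> hours): counts add,
    means combine weighted by count, extremes carry over."""
    merged = []
    for i in range(0, len(aggregates), factor):
        chunk = aggregates[i:i + factor]
        n = sum(c[0] for c in chunk)
        mean = sum(c[0] * c[1] for c in chunk) / n
        merged.append((n, mean,
                       min(c[2] for c in chunk),
                       max(c[3] for c in chunk)))
    return merged
```

The per-window means would feed a control chart, while only the nonconforming raw measurements are kept at full resolution.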
Parallelisation of a cache-based stream-relation join for a near-real-time data warehouse
Near-real-time data warehousing is an important area of research, as business organisations want to analyse their sales with minimal latency. Therefore, sales data generated by data sources needs to be reflected in the data warehouse immediately. This requires near-real-time transformation of the stream of sales data against a disk-based relation, called master data, in the staging area. For this purpose, a stream-relation join is required. The main problem in stream-relation joins is the different nature of the inputs: stream data is fast and bursty, whereas the disk-based relation is slow due to high disk-I/O cost. To resolve this problem, a well-known algorithm, CACHEJOIN (cache join), was published in the literature. The algorithm has two phases, a disk-probing phase and a stream-probing phase, which execute sequentially; this means stream tuples wait unnecessarily, and the sequential execution prevents the algorithm from exploiting CPU resources optimally. In this paper, we address this issue by presenting a robust algorithm called PCSRJ (parallelised cache-based stream-relation join). The new algorithm enables the disk-probing and stream-probing phases of CACHEJOIN to execute in parallel. The algorithm distributes the disk-based relation over two separate nodes and enables parallel execution of CACHEJOIN on each node. It also implements a strategy of splitting the stream data on each node depending on the relevant part of the relation. We developed a cost model for PCSRJ and validated it empirically. We compared the service rates of both algorithms using a synthetic dataset. Our experiments showed that PCSRJ significantly outperforms CACHEJOIN.
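The distribution strategy described above can be sketched in miniature (the two-way key-range split, thread-based "nodes", and all names are assumptions for illustration, not PCSRJ itself): the master data is split over two nodes, and each stream tuple is routed to whichever node holds the relevant part of the relation, so both nodes probe in parallel.

```python
# Two simulated nodes each hold one partition of the master data and
# probe it for the stream tuples routed to them.
from concurrent.futures import ThreadPoolExecutor

def node_join(partition, tuples):
    """One node probes its partition of the disk-based relation."""
    return [(k, v, partition[k]) for k, v in tuples if k in partition]

def two_node_join(stream, relation, split_key):
    # Split the relation (master data) across two nodes by key range.
    part = [{k: v for k, v in relation.items() if k < split_key},
            {k: v for k, v in relation.items() if k >= split_key}]
    # Split the stream the same way, so each tuple meets its master data.
    feed = [[t for t in stream if t[0] < split_key],
            [t for t in stream if t[0] >= split_key]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = pool.map(node_join, part, feed)
    return [m for node_out in results for m in node_out]
```

In the real algorithm each node would additionally run its own disk-probing and stream-probing phases concurrently; this sketch only shows the relation/stream splitting.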
EasyBDI: automatic big data integration and high-level analytical queries
The emergence of new areas, such as the Internet of Things, that require
access to the most recent data for analytics and decision-making has created
constraints on the execution of analytical queries over traditional data
warehouse architectures.
In addition, the growth of semi-structured and unstructured data led to the
creation of new databases designed for these types of data, namely NoSQL
databases. As a result, information is now stored in several different
systems, each with characteristics suited to different use cases, which makes
it difficult to access data spread across various systems with different
models and characteristics.
In this work, a system capable of performing real-time analytical queries
over distributed and heterogeneous data sources is proposed: EasyBDI. The
system integrates data logically, without materializing it, creating an
overview of the data and thus offering an abstraction over the distribution
and heterogeneity of the data sources. Queries are executed interactively
against the data sources, which means the most recent data is always used.
The system provides a user interface that helps configure data sources and
automatically proposes a global schema presenting a generic, simplified view
of the data, which can be modified by the user. The system allows the
creation of multiple star schemas from the global schema. Finally, analytical
queries are also issued through a user interface that uses drag-and-drop
elements.
EasyBDI thus addresses these recent problems with recent solutions: it hides
the details of the underlying data sources while allowing users with little
database expertise to perform real-time analytical
queries over distributed and heterogeneous data sources.
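A minimal sketch of the logical-integration idea, assuming two hypothetical sources (a relational table and a NoSQL-style document store); the global schema and mappings are invented for illustration, not EasyBDI's actual design:

```python
# Two heterogeneous sources with different models of the same entity.
sql_rows = [(1, 'widget', 9.5), (2, 'gadget', 12.0)]       # (id, name, price)
nosql_docs = [{'_id': 3, 'title': 'gizmo', 'cost': 7.25}]  # document store

def global_view():
    """Map each source's model onto a global (id, name, price) schema,
    at query time, without materializing any integrated copy."""
    for pid, name, price in sql_rows:                      # relational mapping
        yield {'id': pid, 'name': name, 'price': price}
    for doc in nosql_docs:                                 # document mapping
        yield {'id': doc['_id'], 'name': doc['title'], 'price': doc['cost']}

def query(predicate):
    """Queries run over the logical view, so results always reflect the
    sources' most recent data."""
    return [row for row in global_view() if predicate(row)]
```

Because nothing is materialized, a change to either source is visible in the very next query, which is what makes the approach "real-time" in the sense used above.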
A comparison of statistical machine learning methods in heartbeat detection and classification
In health care, patients with heart problems require quick responsiveness in a clinical setting or in the operating theatre. Towards that end, automated classification of heartbeats is vital, as some heartbeat irregularities are time-consuming to detect. Therefore, analysis of electrocardiogram (ECG) signals is an active area of research. The methods proposed in the literature depend on the structure of a heartbeat cycle. In this paper, we use interval- and amplitude-based features together with a few samples from the ECG signal as a feature vector. We studied a variety of classification algorithms, focusing especially on a type of arrhythmia known as the ventricular ectopic beat (VEB). We compare the performance of the classifiers against algorithms proposed in the literature and make recommendations regarding features, sampling rate, and the choice of classifier to apply in a real-time clinical setting. The extensive study is based on the MIT-BIH arrhythmia database. Our main contributions are the evaluation of existing classifiers over a range of sampling rates, a recommendation of a detection methodology to employ in a practical setting, and an extension of the notion of a mixture of experts to a larger class of algorithms.
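A toy illustration of the interval/amplitude feature setup (the feature values and the nearest-centroid classifier are placeholders, not the study's methods): VEBs typically show a short pre-RR interval (the beat is premature) followed by a compensatory pause, which simple features can already separate.

```python
# Per-beat features and a nearest-centroid classifier over two classes.

def features(beat):
    """Feature vector: (pre-RR interval, post-RR interval, R amplitude)."""
    return (beat['pre_rr'], beat['post_rr'], beat['r_amp'])

def centroid(vectors):
    """Component-wise mean of a list of feature vectors."""
    return tuple(sum(v[i] for v in vectors) / len(vectors)
                 for i in range(len(vectors[0])))

def classify(beat, centroids):
    """Assign the label whose class centroid is nearest (squared Euclidean)."""
    x = features(beat)
    return min(centroids, key=lambda lab: sum(
        (a - b) ** 2 for a, b in zip(x, centroids[lab])))

# Invented training beats: intervals in seconds, amplitude in mV.
train = {
    'normal': [{'pre_rr': 0.8, 'post_rr': 0.8, 'r_amp': 1.0},
               {'pre_rr': 0.9, 'post_rr': 0.9, 'r_amp': 1.1}],
    'VEB':    [{'pre_rr': 0.4, 'post_rr': 1.2, 'r_amp': 1.6},
               {'pre_rr': 0.5, 'post_rr': 1.1, 'r_amp': 1.5}],
}
centroids = {lab: centroid([features(b) for b in beats])
             for lab, beats in train.items()}
```

Real systems would replace the centroid rule with the classifiers the paper compares, trained on annotated MIT-BIH beats rather than hand-picked values.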