A Distributed Stream Processing Middleware Framework for Real-Time Analysis of Heterogeneous Data on Big Data Platform: Case of Environmental Monitoring
In recent years, the wide adoption of Internet of Things (IoT) technologies has driven a proliferation of monitoring systems and, consequently, an exponential increase in the volume of heterogeneous data generated. Processing and analysing this massive amount of data is cumbersome, and the field is gradually moving from the classical batch extract, transform, load (ETL) approach to real-time processing. In the environmental monitoring and management domain, for instance, time-series data and historical datasets are crucial for prediction models. However, the domain still relies on legacy systems and batch processing, which complicates real-time analysis of essential data and integration with big data platforms. Herein, as a solution, a distributed stream processing middleware framework for real-time analysis of heterogeneous environmental monitoring and management data is presented and tested on a cluster using open-source technologies in a big data environment. The system ingests datasets from legacy systems and sensor data from heterogeneous automated weather systems, irrespective of data type, into Apache Kafka topics using the Kafka Connect API, for processing by the Kafka Streams processing engine. The stream processing engine executes the predictive numerical models and algorithms, expressed in event processing (EP) languages, for real-time analysis of the data streams. To prove the feasibility of the proposed framework, we implemented the system for a case study of drought prediction and forecasting based on the Effective Drought Index (EDI) model. First, we transform the predictive model into a form that can be executed by the streaming engine for real-time computing. Second, the model is applied to the ingested data streams and datasets to predict drought through persistent querying of the unbounded streams to detect anomalies. Finally, a performance evaluation of the distributed stream processing middleware infrastructure is conducted to determine the real-time effectiveness of the framework.
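The core idea of a persistent query over an unbounded stream can be sketched as a running computation over a sliding window. The following minimal, self-contained sketch is not the paper's implementation: it uses a simplified Byun-and-Wilhite-style effective-precipitation summation to compute a standardised drought index incrementally, flagging an anomaly when the index falls below a threshold. The window size and alert threshold are illustrative assumptions.

```python
from collections import deque

def effective_precipitation(window):
    """Simplified effective precipitation: recent rainfall weighted by
    recency (a Byun & Wilhite-style summation over sub-windows)."""
    w = len(window)
    values = list(window)
    # EP = sum over n of (mean of the n most recent daily totals)
    return sum(sum(values[w - n:]) / n for n in range(1, w + 1))

def stream_edi(daily_precip, window_size=30, alert_threshold=-1.0):
    """Persistent query over a (conceptually unbounded) stream: for each
    new observation, emit the running standardised index and a drought
    flag when it drops below the alert threshold."""
    window = deque(maxlen=window_size)
    ep_history = []
    for p in daily_precip:
        window.append(p)
        ep = effective_precipitation(window)
        ep_history.append(ep)
        mean = sum(ep_history) / len(ep_history)
        var = sum((x - mean) ** 2 for x in ep_history) / len(ep_history)
        std = var ** 0.5
        edi = (ep - mean) / std if std > 0 else 0.0
        yield edi, edi < alert_threshold  # (index value, drought flag)
```

Feeding the generator a wet spell followed by a prolonged dry spell drives the index negative and raises the flag, which is the anomaly the framework's persistent queries are meant to detect in real time.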
QoS-aware Resource-utilisation Self-adaptive (QRS) Framework for Distributed Data Stream Management Systems
The last decade witnessed a vast number of big data applications in science and industry alike. Such applications generate large amounts of streaming data and real-time event-based information, which must be analysed under specific quality-of-service (QoS) constraints and within extremely low latencies.
Many distributed data stream processing approaches are based on a best-effort QoS principle and lack the capability to adapt dynamically to fluctuations in data input rates. Most proposed solutions either drop some of the input data (load shedding) or degrade the level of QoS the system provides. Another approach is to limit the data ingestion rate using techniques such as backpressure heartbeats, which can affect the worker nodes and cause output delays. Such approaches are unsuitable for certain mission-critical applications, such as critical infrastructure surveillance, monitoring and signalling, vital health-care monitoring, and military command-and-control streaming applications.
This research presents a novel QoS-aware, Resource-utilisation Self-adaptive (QRS) framework for managing data stream processing systems. The framework proposes a comprehensive usage model that combines proactive operations with simultaneous prompt actions. The prompt actions instantly collect and analyse performance and QoS metrics alongside the running data streams, ensuring that the data does not lose its current value, whereas the proactive operations construct a prediction model that anticipates QoS violations and performance degradation in the system. The model triggers the decision process for dynamically tuning resources or adopting a new scheduling strategy.
A proof-of-concept model was built that accurately represents the working conditions of a distributed data stream management ecosystem, and the proposed framework was validated and verified. Several of the framework's components were fully implemented on Apache Storm, an emerging and prevalent distributed data stream processing system.
The framework predicts the system's capacity to handle data load and input rate with up to 81% accuracy, rising to 100% when anomaly-detection techniques are incorporated. Moreover, the framework compares favourably with Storm's default round-robin and resource-aware schedulers: it handles high data rates better by re-balancing the topology and re-scheduling resources, based on the prediction models, well ahead of any congestion or QoS degradation.
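The proactive side of this approach, predicting a QoS violation before it occurs and triggering adaptation instead of shedding load, can be sketched with a simple trend extrapolation over observed input rates. This is an illustrative sketch, not the authors' model: the class name, the least-squares linear predictor, and the capacity threshold are all assumptions standing in for the paper's prediction models.

```python
from collections import deque

class QoSMonitor:
    """Illustrative sketch of proactive QoS management: collect runtime
    metrics, predict the near-future input rate with a linear trend, and
    signal adaptation (e.g. rescheduling) before capacity is exceeded.
    Names and thresholds are hypothetical, not from the paper."""

    def __init__(self, capacity, horizon=5, history=20):
        self.capacity = capacity  # max tuples/sec the topology sustains
        self.horizon = horizon    # how many steps ahead to predict
        self.rates = deque(maxlen=history)

    def observe(self, input_rate):
        self.rates.append(input_rate)

    def predict(self):
        """Least-squares linear extrapolation of the input rate."""
        n = len(self.rates)
        if n < 2:
            return self.rates[-1] if self.rates else 0.0
        mean_x = (n - 1) / 2
        mean_y = sum(self.rates) / n
        cov = sum((x - mean_x) * (y - mean_y)
                  for x, y in enumerate(self.rates))
        var = sum((x - mean_x) ** 2 for x in range(n))
        slope = cov / var
        return mean_y + slope * (n - 1 + self.horizon - mean_x)

    def should_adapt(self):
        """True when the *predicted* rate would violate capacity, so the
        system can reschedule before congestion rather than drop data."""
        return self.predict() > self.capacity
```

With a steadily rising input rate that is still under capacity, `should_adapt()` fires early because the extrapolated rate crosses the limit within the prediction horizon; with a flat rate it stays quiet.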
Evolving Large-Scale Data Stream Analytics based on Scalable PANFIS
Many distributed machine learning frameworks have recently been built to speed up large-scale data learning. However, most of the machine learning used in these frameworks still relies on offline algorithms, which cannot cope with data streams. In fact, large-scale data are mostly generated by non-stationary data streams whose patterns evolve over time. To address this problem, we propose a novel Evolving Large-scale Data Stream Analytics framework based on a Scalable Parsimonious Network based on Fuzzy Inference System (Scalable PANFIS), in which the PANFIS evolving algorithm is distributed over worker nodes in the cloud to learn large-scale data streams. The Scalable PANFIS framework incorporates an active learning (AL) strategy and two model fusion methods. AL accelerates the distributed learning process to generate the initial evolving large-scale data stream models (initial models), whereas the two model fusion methods aggregate the initial models to generate the final model. The final model represents the current state of large-scale data knowledge, which can be used to infer future data. The framework is validated through extensive experiments measuring the accuracy and running time of four combinations of Scalable PANFIS against other Spark-based built-in algorithms. The results indicate that Scalable PANFIS with AL trains almost twice as fast as Scalable PANFIS without AL. They also show that the rule-merging and voting mechanisms yield generally similar accuracy across the Scalable PANFIS algorithms, and that both are generally better than the Spark-based algorithms. In terms of running time, Scalable PANFIS outperforms all Spark-based algorithms when classifying numerous benchmark datasets.
Comment: 20 pages, 5 figures
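The voting-based model fusion described above can be illustrated with a toy example. In the sketch below, simple threshold classifiers stand in for the per-worker evolved models (a hypothetical simplification; the paper fuses fuzzy rule bases, not plain callables), and the final model returns the majority label across the workers' votes.

```python
def majority_vote(models, x):
    """Voting-based model fusion: each worker's local model casts a
    prediction and the fused model returns the majority label."""
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

# Hypothetical worker models: threshold classifiers trained on different
# partitions of the stream (stand-ins for evolved PANFIS models).
workers = [
    lambda x: 1 if x > 0.4 else 0,
    lambda x: 1 if x > 0.5 else 0,
    lambda x: 1 if x > 0.6 else 0,
]
```

An input near the decision boundary shows the effect: at `x = 0.55` two of the three workers vote 1, so the fused model outputs 1 even though one worker disagrees. Rule merging, the alternative fusion method in the paper, would instead combine the workers' rule bases into a single model rather than polling them at inference time.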