112,680 research outputs found

    A Distributed Stream Processing Middleware Framework for Real-Time Analysis of Heterogeneous Data on Big Data Platform: Case of Environmental Monitoring

    Get PDF
    ArticleIn recent years, the application and wide adoption of Internet of Things (IoT)-based technologies have increased the proliferation of monitoring systems, which has consequently exponentially increased the amounts of heterogeneous data generated. Processing and analysing the massive amount of data produced is cumbersome and gradually moving from classical ‘batch’ processing—extract, transform, load (ETL) technique to real-time processing. For instance, in environmental monitoring and management domain, time-series data and historical dataset are crucial for prediction models. However, the environmental monitoring domain still utilises legacy systems, which complicates the real-time analysis of the essential data, integration with big data platforms and reliance on batch processing. Herein, as a solution, a distributed stream processing middleware framework for real-time analysis of heterogeneous environmental monitoring and management data is presented and tested on a cluster using open source technologies in a big data environment. The system ingests datasets from legacy systems and sensor data from heterogeneous automated weather systems irrespective of the data types to Apache Kafka topics using Kafka Connect APIs for processing by the Kafka streaming processing engine. The stream processing engine executes the predictive numerical models and algorithms represented in event processing (EP) languages for real-time analysis of the data streams. To prove the feasibility of the proposed framework, we implemented the system using a case study scenario of drought prediction and forecasting based on the Effective Drought Index (EDI) model. Firstly, we transform the predictive model into a form that could be executed by the streaming engine for real-time computing. Secondly, the model is applied to the ingested data streams and datasets to predict drought through persistent querying of the infinite streams to detect anomalies. As a conclusion of this study, a performance evaluation of the distributed stream processing middleware infrastructure is calculated to determine the real-time effectiveness of the framework

    QoS-aware Resource-utilisation Self-adaptive (QRS) Framework for Distributed Data Stream Management Systems

    Get PDF
    The last decade witnessed a vast number of Big Data applications in the science and industry fields alike. Such applications generate large amounts of streaming data and real-time event-based information. Such data needs to be analysed under the specific quality of service constraints, which must be done within extremely low latencies. Many distributed data stream processing approaches are based on the best-effort QoS principle that lack the capability of dynamic adaptation to the fluctuations in data input rates. Most of the proposed solutions tend to either drop some of the input data (load shedding) or degrade the level of QoS provided by the system. Another approach is to limit the data ingestion input rate using techniques like backpressure heartbeats, which can affect the worker nodes that causes an output delay. Such approaches are not suitable to handle certain types of mission-critical applications such as critical infrastructure surveillance, monitoring and signalling, vital health care monitoring, and military command and control streaming applications. This research presents a novel QoS-aware, Resource-utilisation Self-adaptive (QRS) Framework for managing data stream processing systems. The framework proposes a comprehensive usage model that encompasses proactive operations followed by simultaneous prompt actions. The simultaneous prompt actions instantly collect and analyse the performance and QoS metrics along with running data streams, ensuring that data does not lose its current values, whereas the proactive operations construct the prediction model that anticipate QoS violations and performance degradation in the system. The model triggers essential decision process for dynamic tuning of resources or adapting a new scheduling strategy. A proof of concept model was built that accurately represents the working conditions of the distributed data stream management ecosystem. The proposed framework is validated and verified. The framework’s several components were fully implemented over the emerging and prevalent distributed data streaming processing system, Apache Storm. The framework performs accurate prediction up to 81% about the system’s capacity to handle data load and input rate. The accuracy reaches up to 100% by incorporating abnormal detection techniques. Moreover, the framework performs well compared with the default round-robin and resource-aware schedulers within Storm. It provides a better ability to handle high data rates by re-balancing the topology and re-scheduling resources based on the prediction models well ahead of any congestion or QoS degradation

    Evolving Large-Scale Data Stream Analytics based on Scalable PANFIS

    Full text link
    Many distributed machine learning frameworks have recently been built to speed up the large-scale data learning process. However, most distributed machine learning used in these frameworks still uses an offline algorithm model which cannot cope with the data stream problems. In fact, large-scale data are mostly generated by the non-stationary data stream where its pattern evolves over time. To address this problem, we propose a novel Evolving Large-scale Data Stream Analytics framework based on a Scalable Parsimonious Network based on Fuzzy Inference System (Scalable PANFIS), where the PANFIS evolving algorithm is distributed over the worker nodes in the cloud to learn large-scale data stream. Scalable PANFIS framework incorporates the active learning (AL) strategy and two model fusion methods. The AL accelerates the distributed learning process to generate an initial evolving large-scale data stream model (initial model), whereas the two model fusion methods aggregate an initial model to generate the final model. The final model represents the update of current large-scale data knowledge which can be used to infer future data. Extensive experiments on this framework are validated by measuring the accuracy and running time of four combinations of Scalable PANFIS and other Spark-based built in algorithms. The results indicate that Scalable PANFIS with AL improves the training time to be almost two times faster than Scalable PANFIS without AL. The results also show both rule merging and the voting mechanisms yield similar accuracy in general among Scalable PANFIS algorithms and they are generally better than Spark-based algorithms. In terms of running time, the Scalable PANFIS training time outperforms all Spark-based algorithms when classifying numerous benchmark datasets.Comment: 20 pages, 5 figure
    • …
    corecore