Cloud-based Fault Detection and Classification for Oil & Gas Industry
The oil & gas industry relies on automated, mission-critical equipment and
complex systems built upon their interaction and cooperation. To assure
continuous operation without constant manual supervision, architects embed
Distributed Control Systems (DCS), a.k.a. Supervisory Control and Data
Acquisition (SCADA) systems, on top of their equipment to generate data,
monitor state, and make critical online & offline decisions.
In this paper, we propose a new Lambda architecture for the oil & gas
industry for unified data and analytical processing on data received from
DCS, discuss cloud integration issues, and share our experiences with the
implementation of sensor fault-detection and classification modules inside
the proposed architecture.
Comment: Part of DM4OG 2017 proceedings (arXiv:1705.03451)
A MapReduce-based rotation forest classifier for epileptic seizure prediction
Big data applications, including biomedical ones, have become increasingly
attractive as data generation and storage have grown in recent years.
Extracting knowledge from big data is challenging because traditional data
mining techniques are not adapted to the new requirements. In this study, we
analyse EEG signals for epileptic seizure detection in the big data scenario
using the Rotation Forest classifier. Specifically, multiscale principal
component analysis (MSPCA) is used for denoising, wavelet packet
decomposition (WPD) for feature extraction, and Rotation Forest for
classification in a MapReduce framework to correctly predict epileptic
seizures. This paper presents a MapReduce-based distributed ensemble
algorithm for epileptic seizure prediction and trains a Rotation Forest on
each dataset in parallel using a cluster of computers. The results of the
MapReduce-based Rotation Forest show that the proposed framework reduces the
training time significantly while achieving a high level of classification
performance.
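The paper's actual pipeline (MSPCA denoising, WPD features, Rotation Forest) is not reproduced here, but the MapReduce ensemble pattern it describes — train one model per data partition in the map phase, combine predictions by majority vote in the reduce phase — can be sketched with the standard library. The decision-stump learner below is a purely illustrative stand-in for Rotation Forest, and all names are ours:

```python
from collections import Counter

# Illustrative stand-in for Rotation Forest: a 1-D decision stump.
def train_stump(partition):
    """Map step: fit a threshold classifier on one data partition."""
    xs = sorted(partition, key=lambda p: p[0])
    best = None
    for i in range(1, len(xs)):
        thr = (xs[i - 1][0] + xs[i][0]) / 2
        # Accuracy of the rule "predict x > thr" on this partition.
        acc = sum((x > thr) == y for x, y in partition) / len(partition)
        if best is None or acc > best[1]:
            best = (thr, acc)
    return best[0]

def predict_ensemble(thresholds, x):
    """Reduce step: majority vote over the per-partition models."""
    votes = Counter(x > t for t in thresholds)
    return votes.most_common(1)[0][0]

# Labeled data where "x > 5" is the true concept, split into partitions
# that would each be handled by one mapper in the cluster.
data = [(x, x > 5) for x in range(11)]
partitions = [data[0:4], data[4:8], data[7:11]]
models = [train_stump(p) for p in partitions]  # map phase (parallelizable)
print(predict_ensemble(models, 9.0))  # → True
print(predict_ensemble(models, 1.0))  # → False
```

The key property the abstract exploits is that the map phase is embarrassingly parallel: each partition's model is trained independently, so training time shrinks roughly with the number of workers.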
Big Data Computing Using Cloud-Based Technologies, Challenges and Future Perspectives
The excessive amounts of data generated by devices and Internet-based sources
on a regular basis constitute big data. This data can be processed and
analyzed to develop useful applications for specific domains. Several
mathematical and data analytics techniques have found use in this sphere. This
has given rise to the development of computing models and tools for big data
computing. However, the storage and processing requirements are overwhelming
for traditional systems and technologies. Therefore, there is a need for
infrastructures that can adjust the storage and processing capability in
accordance with the changing data dimensions. Cloud Computing serves as a
potential solution to this problem. However, big data computing in the cloud
has its own set of challenges and research issues. This chapter surveys the big
data concept, discusses the mathematical and data analytics techniques that can
be used for big data and gives a taxonomy of the existing tools, frameworks and
platforms available for different big data computing models. Besides this, it
also evaluates the viability of cloud-based big data computing, examines
existing challenges and opportunities, and provides future research directions
in this field.
Storage and Memory Characterization of Data Intensive Workloads for Bare Metal Cloud
As the cost-per-byte of storage systems dramatically decreases, SSDs are
finding their way into emerging cloud infrastructure. A similar trend is
happening in the main memory subsystem, as advanced DRAM technologies with
higher capacity, frequency, and channel counts are being deployed for
cloud-scale solutions, especially for non-virtualized environments where
cloud subscribers can exactly specify the configuration of the underlying
hardware. Given the performance
sensitivity of standard workloads to the memory hierarchy parameters, it is
important to understand the role of memory and storage for data intensive
workloads. In this paper, we investigate how the choice of DRAM (high-end vs
low-end) impacts the performance of Hadoop, Spark, and MPI based Big Data
workloads in the presence of different storage types on bare metal cloud.
Through a methodical experimental setup, we have analyzed the impact of DRAM
capacity, operating frequency, the number of channels, storage type, and
scale-out factors on the performance of these popular frameworks. Based on
micro-architectural analysis, we classified data-intensive workloads into
three groups, namely I/O bound, compute bound, and memory bound. The
characterization results show that neither DRAM capacity, frequency, nor the
number of channels plays a significant role in the performance of the studied
Hadoop workloads, as they are mostly I/O bound. On the other hand, our
results reveal that iterative tasks (e.g. machine learning) in Spark and MPI
benefit from high-end DRAM, in particular high frequency and a large number
of channels, as they are memory or compute bound. Our results show that using
PCIe SSDs cannot shift the bottleneck from storage to memory, while it can
change the workload behavior from I/O bound to compute bound.
Comment: 8 pages, research draft
A survey of systems for massive stream analytics
The immense growth of data demands switching from traditional data processing
solutions to systems that can process a continuous stream of real-time data.
Various applications employ stream processing systems to provide solutions to
emerging Big Data problems. Open-source solutions such as Storm, Spark
Streaming, and S4 are attempts to answer key stream processing questions. The
recent introduction of commercial real-time stream processing solutions such
as Amazon Kinesis and IBM InfoSphere Streams reflects industry requirements.
The system- and application-level challenges of handling massive streams of
real-time data are an active field of research.
In this paper, we present a comparative analysis of the existing
state-of-the-art stream processing solutions. We also include various
application domains, which are transforming their business model to benefit
from these large-scale stream processing systems.
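None of the surveyed systems' APIs are reproduced here, but the micro-batch model that systems such as Spark Streaming popularized — chop an unbounded stream into small fixed-size batches and update an aggregate incrementally per batch — can be sketched with standard-library generators (all function names are illustrative):

```python
from collections import Counter
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an unbounded event stream into fixed-size micro-batches,
    the processing model used by micro-batch streaming engines."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def running_counts(stream, batch_size):
    """Maintain an incrementally updated aggregate over the stream,
    emitting the current state after each micro-batch."""
    totals = Counter()
    for batch in micro_batches(stream, batch_size):
        totals.update(batch)   # per-batch update, not a full rescan
        yield dict(totals)     # snapshot sent downstream

events = ["click", "view", "click", "view", "view", "click"]
for snapshot in running_counts(events, batch_size=2):
    print(snapshot)
```

Replacing the finite `events` list with a socket or message-queue iterator turns the same loop into a long-running job; the tuple-at-a-time engines (Storm, S4) differ mainly in processing each event individually rather than per batch.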
A Hierarchical Distributed Processing Framework for Big Image Data
This paper introduces an effective processing framework named ICP (Image
Cloud Processing) to cope with the data explosion in the image processing
field. While most previous research focuses on optimizing image processing
algorithms to gain higher efficiency, our work is dedicated to providing a
general framework for those image processing algorithms that can be
implemented in parallel, so as to achieve a boost in time efficiency without
compromising result quality as the image scale increases. The
proposed ICP framework consists of two mechanisms, i.e. SICP (Static ICP) and
DICP (Dynamic ICP). Specifically, SICP is aimed at processing the big image
data pre-stored in the distributed system, while DICP is proposed for dynamic
input. To accomplish SICP, two novel data representations named P-Image and
Big-Image are designed to cooperate with MapReduce to achieve more optimized
configuration and higher efficiency. DICP is implemented through a parallel
processing procedure working with the traditional processing mechanism of the
distributed system. Representative results of comprehensive experiments on the
challenging ImageNet dataset are selected to validate the capacity of our
proposed ICP framework over traditional state-of-the-art methods, in both
time efficiency and quality of results.
A Distributed Deep Representation Learning Model for Big Image Data Classification
This paper describes an effective and efficient image classification
framework named the distributed deep representation learning model (DDRL).
The aim is to strike a balance between computationally intensive deep
learning approaches (with tuned parameters), which are intended for
distributed computing, and approaches that focus on designed parameters but
are often limited by sequential computing and cannot scale up. In our
evaluation, DDRL is shown to achieve state-of-the-art classification accuracy
efficiently on both medium and large datasets. The result implies that our
approach is more efficient than conventional deep learning approaches and can
be applied to big data that is too complex for approaches focused on
parameter design. More specifically, DDRL contains two main components, i.e.,
feature extraction and selection. A hierarchical distributed deep
representation learning algorithm is designed to extract image statistics,
and a nonlinear mapping algorithm is used to map the inherent statistics into
abstract features. Both algorithms are carefully designed to avoid tuning
millions of parameters. This leads to a more compact solution for image
classification of big data. We note that the proposed approach is designed to
be friendly to parallel computing. It is generic and easy to deploy to
different distributed computing resources. In the experiments, large-scale
image datasets are classified with a DDRL implementation on Hadoop MapReduce,
which shows high scalability and resilience.
A Survey on Big Data for Network Traffic Monitoring and Analysis
Network Traffic Monitoring and Analysis (NTMA) represents a key component of
network management, especially to guarantee the correct operation of
large-scale networks such as the Internet. As the complexity of Internet
services and the volume of traffic continue to increase, it becomes difficult
to design scalable NTMA applications. Applications such as traffic
classification and policing require real-time and scalable approaches.
Anomaly detection and security mechanisms must quickly identify and react to
unpredictable events while processing millions of heterogeneous events.
Finally, the system has to collect, store, and process massive sets of
historical data for post-mortem analysis. These are precisely the challenges
faced by general big data approaches: Volume, Velocity, Variety, and
Veracity. This survey brings together NTMA and big data. We catalog previous
work on NTMA that adopts big data approaches to understand to what extent the
potential of big data is being explored in NTMA. The survey mainly focuses on
approaches and technologies to manage the big NTMA data, and additionally
briefly discusses big data analytics (e.g., machine learning) in the context
of NTMA. Finally, we provide guidelines for future work, discussing lessons
learned and research directions.
CFM-BD: a distributed rule induction algorithm for building Compact Fuzzy Models in Big Data classification problems
Interpretability has always been a major concern for fuzzy rule-based
classifiers. The usage of human-readable models allows them to explain the
reasoning behind their predictions and decisions. However, when it comes to Big
Data classification problems, fuzzy rule-based classifiers have not been able
to maintain the good trade-off between accuracy and interpretability that has
characterized these techniques in non-Big Data environments. The most
accurate methods build overly complex models composed of a large number of
rules and fuzzy sets, while approaches focusing on interpretability do not
provide
state-of-the-art discrimination capabilities. In this paper, we propose a new
distributed learning algorithm named CFM-BD to construct accurate and compact
fuzzy rule-based classification systems for Big Data. This method has been
specifically designed from scratch for Big Data problems and does not adapt or
extend any existing algorithm. The proposed learning process consists of three
stages: 1) pre-processing based on the probability integral transform theorem;
2) rule induction inspired by CHI-BD and Apriori algorithms; 3) rule selection
by means of a global evolutionary optimization. We conducted a complete
empirical study to test the performance of our approach in terms of accuracy,
complexity, and runtime. The results obtained were compared and contrasted with
four state-of-the-art fuzzy classifiers for Big Data (FBDT, FMDT, Chi-Spark-RS,
and CHI-BD). According to this study, CFM-BD is able to provide competitive
discrimination capabilities using significantly simpler models composed of a
few rules of less than 3 antecedents, employing 5 linguistic labels for all
variables.
Comment: Appears in IEEE Transactions on Fuzzy Systems
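CFM-BD's actual algorithm is not reproduced here, but its first stage rests on the probability integral transform: pushing each variable through its (empirical) CDF yields values in [0, 1] that are approximately uniformly distributed, so a uniform fuzzy partition with the same 5 linguistic labels covers every feature evenly. A minimal sketch of that idea, with illustrative function names of our own:

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Build the empirical CDF of one variable from training values."""
    xs = sorted(sample)
    n = len(xs)
    return lambda v: bisect_right(xs, v) / n

def pit_transform(rows):
    """PIT-style preprocessing (illustrative, not the paper's code):
    map each variable through its empirical CDF so all features land
    in [0, 1] with roughly uniform spread."""
    cols = list(zip(*rows))
    cdfs = [empirical_cdf(col) for col in cols]
    return [tuple(cdf(v) for cdf, v in zip(cdfs, row)) for row in rows]

# Two variables on wildly different, skewed scales...
rows = [(1.0, 100.0), (2.0, 400.0), (3.0, 900.0), (4.0, 1600.0)]
# ...become comparable, evenly spread values after the transform.
print(pit_transform(rows))  # → [(0.25, 0.25), (0.5, 0.5), (0.75, 0.75), (1.0, 1.0)]
```

In a distributed setting the per-variable CDFs would be estimated once over the full dataset (or approximated per partition) before rule induction begins.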
A Survey of Parallel Sequential Pattern Mining
With the growing popularity of shared resources, large volumes of complex
data of different types are collected automatically. Traditional data mining
algorithms face challenges including huge memory cost, low processing speed,
and inadequate disk space. As a fundamental task of data mining, sequential
pattern mining (SPM) is used in a wide variety of real-life applications.
However, it is more complex and challenging than other pattern mining tasks,
such as frequent itemset mining and association rule mining, and also suffers
from the above challenges when handling large-scale data. To solve these
problems, mining sequential patterns in a parallel or distributed computing
environment has emerged as an important issue with many applications. In this
paper, we provide an in-depth survey of the current status of parallel
sequential pattern mining (PSPM), including a detailed categorization of
traditional serial SPM approaches and of state-of-the-art parallel SPM. We
review the related work on parallel sequential pattern mining in detail,
including partition-based algorithms for PSPM, Apriori-based PSPM,
pattern-growth-based PSPM, and hybrid algorithms for PSPM, and provide a deep
description (i.e., characteristics, advantages, disadvantages, and summary)
of these parallel approaches. Some advanced topics for PSPM, including
parallel quantitative / weighted / utility sequential pattern mining, PSPM
from uncertain data and stream data, and hardware acceleration for PSPM, are
further reviewed in detail. Besides, we review some well-known open-source
software for PSPM. Finally, we summarize some challenges and opportunities of
PSPM in the big data era.
Comment: Accepted by ACM Trans. on Knowl. Discov. Data, 33 pages
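For readers unfamiliar with the core task the survey builds on, SPM counts how many sequences in a database contain a given subsequence, with gaps allowed. A deliberately naive Apriori-style enumerator (our own illustrative sketch, not any surveyed algorithm) shows the idea; per-sequence counting is exactly the part that partition-based PSPM algorithms distribute across workers:

```python
from itertools import combinations

def frequent_patterns(sequences, min_support, max_len=2):
    """Count the support of every subsequence up to max_len items
    (gaps allowed) and keep those meeting min_support. Naive
    enumeration, purely for illustration."""
    counts = {}
    for seq in sequences:
        # All distinct ordered subsequences of this one sequence;
        # combinations() preserves the original item order.
        cands = set()
        for k in range(1, max_len + 1):
            cands.update(combinations(seq, k))
        for c in cands:                       # count each sequence once
            counts[c] = counts.get(c, 0) + 1
    return {p: s for p, s in counts.items() if s >= min_support}

seqs = [("a", "b", "c"), ("a", "c"), ("b", "a", "c")]
print(frequent_patterns(seqs, min_support=3))
# ("a", "c") is frequent: "a" precedes "c" in all three sequences,
# even with "b" in between; ("a", "b") is not, since the third
# sequence has "b" before "a".
```

A partition-based parallelization would run the inner loop on each data partition independently and sum the resulting count dictionaries in a reduce step.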