Cloud-based Fault Detection and Classification for Oil & Gas Industry
The oil & gas industry relies on automated, mission-critical equipment and
complex systems built upon their interaction and cooperation. To assure
continuous operation without constant manual supervision, architects embed
Distributed Control Systems (DCS), a.k.a. Supervisory Control and Data
Acquisition (SCADA) systems, on top of their equipment to generate data,
monitor state, and make critical online & offline decisions.
In this paper, we propose a new Lambda architecture for the oil & gas
industry for unified data and analytical processing on data received from
DCS, discuss cloud integration issues, and share our experiences with the
implementation of sensor fault-detection and classification modules inside
the proposed architecture.
Comment: Part of DM4OG 2017 proceedings (arXiv:1705.03451)
A MapReduce-based rotation forest classifier for epileptic seizure prediction
Big data applications, including biomedical ones, have become increasingly
attractive as data generation and storage have grown in recent years.
Extracting knowledge from big data is challenging because traditional data
mining techniques are not adapted to the new requirements. In this study, we
analyse EEG signals for epileptic seizure detection in the big data scenario
using the Rotation Forest classifier. Specifically, multiscale principal
component analysis (MSPCA) is used for denoising, wavelet packet
decomposition (WPD) for feature extraction, and Rotation Forest for
classification in a MapReduce framework to correctly predict epileptic
seizures. This paper presents a MapReduce-based distributed ensemble
algorithm for epileptic seizure prediction and trains a Rotation Forest on
each dataset in parallel using a cluster of computers. The results of the
MapReduce-based Rotation Forest show that the proposed framework reduces the
training time significantly while achieving a high level of classification
performance.
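The paper's actual pipeline (MSPCA denoising, WPD features, Rotation Forest) is not reproduced here, but the MapReduce ensemble pattern it describes — train one model per data partition in the map phase, combine predictions by majority vote in the reduce phase — can be sketched with the standard library. The decision-stump learner below is a purely illustrative stand-in for Rotation Forest, and all names are ours:

```python
from collections import Counter

# Illustrative stand-in for Rotation Forest: a 1-D decision stump.
def train_stump(partition):
    """Map step: fit a threshold classifier on one data partition."""
    xs = sorted(partition, key=lambda p: p[0])
    best = None
    for i in range(1, len(xs)):
        thr = (xs[i - 1][0] + xs[i][0]) / 2
        # Accuracy of the rule "predict x > thr" on this partition.
        acc = sum((x > thr) == y for x, y in partition) / len(partition)
        if best is None or acc > best[1]:
            best = (thr, acc)
    return best[0]

def predict_ensemble(thresholds, x):
    """Reduce step: majority vote over the per-partition models."""
    votes = Counter(x > t for t in thresholds)
    return votes.most_common(1)[0][0]

# Labeled data where "x > 5" is the true concept, split into partitions
# that would each be handled by one mapper in the cluster.
data = [(x, x > 5) for x in range(11)]
partitions = [data[0:4], data[4:8], data[7:11]]
models = [train_stump(p) for p in partitions]  # map phase (parallelizable)
print(predict_ensemble(models, 9.0))  # → True
print(predict_ensemble(models, 1.0))  # → False
```

The key property the abstract exploits is that the map phase is embarrassingly parallel: each partition's model is trained independently, so training time shrinks roughly with the number of workers.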
Big Data Computing Using Cloud-Based Technologies, Challenges and Future Perspectives
The excessive amounts of data generated by devices and Internet-based sources
on a regular basis constitute big data. This data can be processed and
analyzed to develop useful applications for specific domains. Several
mathematical and data analytics techniques have found use in this sphere. This
has given rise to the development of computing models and tools for big data
computing. However, the storage and processing requirements are overwhelming
for traditional systems and technologies. Therefore, there is a need for
infrastructures that can adjust the storage and processing capability in
accordance with the changing data dimensions. Cloud Computing serves as a
potential solution to this problem. However, big data computing in the cloud
has its own set of challenges and research issues. This chapter surveys the big
data concept, discusses the mathematical and data analytics techniques that can
be used for big data and gives a taxonomy of the existing tools, frameworks and
platforms available for different big data computing models. Besides this, it
also evaluates the viability of cloud-based big data computing, examines
existing challenges and opportunities, and provides future research directions
in this field.
Storage and Memory Characterization of Data Intensive Workloads for Bare Metal Cloud
As the cost-per-byte of storage systems dramatically decreases, SSDs are
finding their way into emerging cloud infrastructure. A similar trend is
happening in the main memory subsystem, as advanced DRAM technologies with
higher capacity, frequency, and channel counts are being deployed for
cloud-scale solutions, especially for non-virtualized environments where
cloud subscribers can exactly specify the configuration of the underlying
hardware. Given the performance
sensitivity of standard workloads to the memory hierarchy parameters, it is
important to understand the role of memory and storage for data intensive
workloads. In this paper, we investigate how the choice of DRAM (high-end vs
low-end) impacts the performance of Hadoop, Spark, and MPI based Big Data
workloads in the presence of different storage types on bare metal cloud.
Through a methodical experimental setup, we have analyzed the impact of DRAM
capacity, operating frequency, the number of channels, storage type, and
scale-out factors on the performance of these popular frameworks. Based on
micro-architectural analysis, we classified data-intensive workloads into
three groups, namely I/O bound, compute bound, and memory bound. The
characterization results show that neither DRAM capacity, frequency, nor the
number of channels plays a significant role in the performance of the studied
Hadoop workloads, as they are mostly I/O bound. On the other hand, our
results reveal that iterative tasks (e.g. machine learning) in Spark and MPI
benefit from high-end DRAM, in particular high frequency and a large number
of channels, as they are memory or compute bound. Our results show that using
PCIe SSDs cannot shift the bottleneck from storage to memory, while it can
change the workload behavior from I/O bound to compute bound.
Comment: 8 pages, research draft
A survey of systems for massive stream analytics
The immense growth of data demands switching from traditional data processing
solutions to systems that can process a continuous stream of real-time data.
Various applications employ stream processing systems to provide solutions to
emerging Big Data problems. Open-source solutions such as Storm, Spark
Streaming, and S4 are attempts to answer key stream processing questions. The
recent introduction of commercial real-time stream processing solutions such
as Amazon Kinesis and IBM InfoSphere Streams reflects industry requirements.
The system- and application-level challenges of handling massive streams of
real-time data are an active field of research.
In this paper, we present a comparative analysis of the existing
state-of-the-art stream processing solutions. We also include various
application domains, which are transforming their business model to benefit
from these large-scale stream processing systems.
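None of the surveyed systems' APIs are reproduced here, but the micro-batch model that systems such as Spark Streaming popularized — chop an unbounded stream into small fixed-size batches and update an aggregate incrementally per batch — can be sketched with standard-library generators (all function names are illustrative):

```python
from collections import Counter
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an unbounded event stream into fixed-size micro-batches,
    the processing model used by micro-batch streaming engines."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def running_counts(stream, batch_size):
    """Maintain an incrementally updated aggregate over the stream,
    emitting the current state after each micro-batch."""
    totals = Counter()
    for batch in micro_batches(stream, batch_size):
        totals.update(batch)   # per-batch update, not a full rescan
        yield dict(totals)     # snapshot sent downstream

events = ["click", "view", "click", "view", "view", "click"]
for snapshot in running_counts(events, batch_size=2):
    print(snapshot)
```

Replacing the finite `events` list with a socket or message-queue iterator turns the same loop into a long-running job; the tuple-at-a-time engines (Storm, S4) differ mainly in processing each event individually rather than per batch.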
A Hierarchical Distributed Processing Framework for Big Image Data
This paper introduces an effective processing framework named ICP (Image
Cloud Processing) to cope with the data explosion in the image processing
field. While most previous research focuses on optimizing image processing
algorithms to gain higher efficiency, our work is dedicated to providing a
general framework for those image processing algorithms that can be
implemented in parallel, so as to achieve a boost in time efficiency without
compromising result quality as the image scale increases. The
proposed ICP framework consists of two mechanisms, i.e. SICP (Static ICP) and
DICP (Dynamic ICP). Specifically, SICP is aimed at processing the big image
data pre-stored in the distributed system, while DICP is proposed for dynamic
input. To accomplish SICP, two novel data representations named P-Image and
Big-Image are designed to cooperate with MapReduce to achieve more optimized
configuration and higher efficiency. DICP is implemented through a parallel
processing procedure working with the traditional processing mechanism of the
distributed system. Representative results of comprehensive experiments on the
challenging ImageNet dataset are selected to validate the capacity of our
proposed ICP framework over traditional state-of-the-art methods, in both
time efficiency and quality of results.
A Distributed Deep Representation Learning Model for Big Image Data Classification
This paper describes an effective and efficient image classification
framework named the distributed deep representation learning model (DDRL).
The aim is to strike a balance between computationally intensive deep
learning approaches (with tuned parameters), which are intended for
distributed computing, and approaches that focus on designed parameters but
are often limited by sequential computing and cannot scale up. In our
evaluation, DDRL is shown to achieve state-of-the-art classification accuracy
efficiently on both medium and large datasets. The result implies that our
approach is more efficient than conventional deep learning approaches and can
be applied to big data that is too complex for approaches focused on
parameter design. More specifically, DDRL contains two main components, i.e.,
feature extraction and selection. A hierarchical distributed deep
representation learning algorithm is designed to extract image statistics,
and a nonlinear mapping algorithm is used to map the inherent statistics into
abstract features. Both algorithms are carefully designed to avoid tuning
millions of parameters. This leads to a more compact solution for image
classification of big data. We note that the proposed approach is designed to
be friendly to parallel computing. It is generic and easy to deploy to
different distributed computing resources. In the experiments, large-scale
image datasets are classified with a DDRL implementation on Hadoop MapReduce,
which shows high scalability and resilience.
A Survey on Big Data for Network Traffic Monitoring and Analysis
Network Traffic Monitoring and Analysis (NTMA) represents a key component of
network management, especially to guarantee the correct operation of
large-scale networks such as the Internet. As the complexity of Internet
services and the volume of traffic continue to increase, it becomes difficult
to design scalable NTMA applications. Applications such as traffic
classification and policing require real-time and scalable approaches.
Anomaly detection and security mechanisms must quickly identify and react to
unpredictable events while processing millions of heterogeneous events.
Finally, the system has to collect, store, and process massive sets of
historical data for post-mortem analysis. These are precisely the challenges
faced by general big data approaches: Volume, Velocity, Variety, and
Veracity. This survey brings together NTMA and big data. We catalog previous
work on NTMA that adopts big data approaches to understand to what extent the
potential of big data is being explored in NTMA. The survey mainly focuses on
approaches and technologies to manage the big NTMA data, and additionally
briefly discusses big data analytics (e.g., machine learning) in the context
of NTMA. Finally, we provide guidelines for future work, discussing lessons
learned and research directions.
CFM-BD: a distributed rule induction algorithm for building Compact Fuzzy Models in Big Data classification problems
Interpretability has always been a major concern for fuzzy rule-based
classifiers. The usage of human-readable models allows them to explain the
reasoning behind their predictions and decisions. However, when it comes to Big
Data classification problems, fuzzy rule-based classifiers have not been able
to maintain the good trade-off between accuracy and interpretability that has
characterized these techniques in non-Big Data environments. The most
accurate methods build overly complex models composed of a large number of
rules and fuzzy sets, while approaches focusing on interpretability do not
provide
state-of-the-art discrimination capabilities. In this paper, we propose a new
distributed learning algorithm named CFM-BD to construct accurate and compact
fuzzy rule-based classification systems for Big Data. This method has been
specifically designed from scratch for Big Data problems and does not adapt or
extend any existing algorithm. The proposed learning process consists of three
stages: 1) pre-processing based on the probability integral transform theorem;
2) rule induction inspired by CHI-BD and Apriori algorithms; 3) rule selection
by means of a global evolutionary optimization. We conducted a complete
empirical study to test the performance of our approach in terms of accuracy,
complexity, and runtime. The results obtained were compared and contrasted with
four state-of-the-art fuzzy classifiers for Big Data (FBDT, FMDT, Chi-Spark-RS,
and CHI-BD). According to this study, CFM-BD is able to provide competitive
discrimination capabilities using significantly simpler models composed of a
few rules of less than 3 antecedents, employing 5 linguistic labels for all
variables.
Comment: Appears in IEEE Transactions on Fuzzy Systems
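CFM-BD's actual algorithm is not reproduced here, but its first stage rests on the probability integral transform: pushing each variable through its (empirical) CDF yields values in [0, 1] that are approximately uniformly distributed, so a uniform fuzzy partition with the same 5 linguistic labels covers every feature evenly. A minimal sketch of that idea, with illustrative function names of our own:

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Build the empirical CDF of one variable from training values."""
    xs = sorted(sample)
    n = len(xs)
    return lambda v: bisect_right(xs, v) / n

def pit_transform(rows):
    """PIT-style preprocessing (illustrative, not the paper's code):
    map each variable through its empirical CDF so all features land
    in [0, 1] with roughly uniform spread."""
    cols = list(zip(*rows))
    cdfs = [empirical_cdf(col) for col in cols]
    return [tuple(cdf(v) for cdf, v in zip(cdfs, row)) for row in rows]

# Two variables on wildly different, skewed scales...
rows = [(1.0, 100.0), (2.0, 400.0), (3.0, 900.0), (4.0, 1600.0)]
# ...become comparable, evenly spread values after the transform.
print(pit_transform(rows))  # → [(0.25, 0.25), (0.5, 0.5), (0.75, 0.75), (1.0, 1.0)]
```

In a distributed setting the per-variable CDFs would be estimated once over the full dataset (or approximated per partition) before rule induction begins.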
A Survey of Parallel Sequential Pattern Mining
With the growing popularity of shared resources, large volumes of complex
data of different types are collected automatically. Traditional data mining
algorithms face challenges including huge memory cost, low processing speed,
and inadequate disk space. As a fundamental task of data mining, sequential
pattern mining (SPM) is used in a wide variety of real-life applications.
However, it is more complex and challenging than other pattern mining tasks,
such as frequent itemset mining and association rule mining, and also suffers
from the above challenges when handling large-scale data. To solve these
problems, mining sequential patterns in a parallel or distributed computing
environment has emerged as an important issue with many applications. In this
paper, we provide an in-depth survey of the current status of parallel
sequential pattern mining (PSPM), including a detailed categorization of
traditional serial SPM approaches and of state-of-the-art parallel SPM. We
review the related work on parallel sequential pattern mining in detail,
including partition-based algorithms for PSPM, Apriori-based PSPM,
pattern-growth-based PSPM, and hybrid algorithms for PSPM, and provide a deep
description (i.e., characteristics, advantages, disadvantages, and summary)
of these parallel approaches. Some advanced topics for PSPM, including
parallel quantitative / weighted / utility sequential pattern mining, PSPM
from uncertain data and stream data, and hardware acceleration for PSPM, are
further reviewed in detail. Besides, we review some well-known open-source
software for PSPM. Finally, we summarize some challenges and opportunities of
PSPM in the big data era.
Comment: Accepted by ACM Trans. on Knowl. Discov. Data, 33 pages
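For readers unfamiliar with the core task the survey builds on, SPM counts how many sequences in a database contain a given subsequence, with gaps allowed. A deliberately naive Apriori-style enumerator (our own illustrative sketch, not any surveyed algorithm) shows the idea; per-sequence counting is exactly the part that partition-based PSPM algorithms distribute across workers:

```python
from itertools import combinations

def frequent_patterns(sequences, min_support, max_len=2):
    """Count the support of every subsequence up to max_len items
    (gaps allowed) and keep those meeting min_support. Naive
    enumeration, purely for illustration."""
    counts = {}
    for seq in sequences:
        # All distinct ordered subsequences of this one sequence;
        # combinations() preserves the original item order.
        cands = set()
        for k in range(1, max_len + 1):
            cands.update(combinations(seq, k))
        for c in cands:                       # count each sequence once
            counts[c] = counts.get(c, 0) + 1
    return {p: s for p, s in counts.items() if s >= min_support}

seqs = [("a", "b", "c"), ("a", "c"), ("b", "a", "c")]
print(frequent_patterns(seqs, min_support=3))
# ("a", "c") is frequent: "a" precedes "c" in all three sequences,
# even with "b" in between; ("a", "b") is not, since the third
# sequence has "b" before "a".
```

A partition-based parallelization would run the inner loop on each data partition independently and sum the resulting count dictionaries in a reduce step.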