Hybrid Job-driven Scheduling for Virtual MapReduce Clusters
It is cost-efficient for a tenant with a limited budget to establish a
virtual MapReduce cluster by renting multiple virtual private servers (VPSs)
from a VPS provider. To provide an appropriate scheduling scheme for this type
of computing environment, we propose in this paper a hybrid job-driven
scheduling scheme (JoSS for short) from a tenant's perspective. JoSS provides
not only job level scheduling, but also map-task level scheduling and
reduce-task level scheduling. JoSS classifies MapReduce jobs based on job scale
and job type and designs an appropriate scheduling policy to schedule each
class of jobs. The goal is to improve data locality for both map tasks and
reduce tasks, avoid job starvation, and improve job execution performance. Two
variations of JoSS are further introduced to separately achieve better
map-data locality and faster task assignment. We conduct extensive
experiments to evaluate and compare the two variations with current scheduling
algorithms supported by Hadoop. The results show that the two variations
outperform the other tested algorithms in terms of map-data locality,
reduce-data locality, and network overhead without incurring significant
overhead. In addition, the two variations are separately suitable for different
MapReduce-workload scenarios and provide the best job performance among all
tested algorithms.
Comment: 13 pages and 17 figures
RePAD: Real-time Proactive Anomaly Detection for Time Series
During the past decade, many anomaly detection approaches have been
introduced in different fields such as network monitoring, fraud detection, and
intrusion detection. However, they require an understanding of data patterns and
often need a long off-line period to build a model or network for the target
data. Providing real-time and proactive anomaly detection for streaming time
series without human intervention and domain knowledge is highly valuable since
it greatly reduces human effort and enables appropriate countermeasures to be
undertaken before disastrous damage, failure, or another harmful event occurs.
However, this issue has not been well studied yet. To address it, this paper
proposes RePAD, which is a Real-time Proactive Anomaly Detection algorithm for
streaming time series based on Long Short-Term Memory (LSTM). RePAD utilizes
short-term historic data points to predict and determine whether or not the
upcoming data point is a sign that an anomaly is likely to happen in the near
future. By dynamically adjusting the detection threshold over time, RePAD is
able to tolerate minor pattern changes in time series and detect anomalies
either proactively or on time. Experiments based on two time series datasets
collected from the Numenta Anomaly Benchmark demonstrate that RePAD is able to
proactively detect anomalies and provide early warnings in real time without
human intervention and domain knowledge.
Comment: 12 pages, 8 figures, the 34th International Conference on Advanced
Information Networking and Applications (AINA 2020)
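The idea of a detection threshold that adjusts itself over time can be illustrated with a minimal sketch. This is not the RePAD algorithm itself: the LSTM forecast is replaced here by a plain moving average, and the threshold rule (mean plus `k` standard deviations of past errors) as well as the names `detect_anomalies`, `window`, `warmup`, and `k` are assumptions made only for illustration.

```python
from statistics import mean, pstdev


def detect_anomalies(series, window=3, warmup=5, k=3.0):
    """Flag indices whose prediction error exceeds a self-adjusting threshold.

    Assumption: a moving average over the last `window` points stands in
    for the LSTM forecast described in the abstract.
    """
    errors = []   # relative prediction errors observed so far
    flagged = []  # indices reported as anomalous
    for t in range(window, len(series)):
        predicted = mean(series[t - window:t])  # stand-in forecast
        actual = series[t]
        error = abs(actual - predicted) / max(abs(actual), 1e-9)
        # Only start judging once enough errors have been collected.
        if len(errors) >= warmup:
            # Hypothetical rule: the threshold is recomputed at every step
            # from all errors seen so far, so it follows slow pattern changes.
            threshold = mean(errors) + k * pstdev(errors)
            if error > threshold:
                flagged.append(t)
        errors.append(error)
    return flagged


if __name__ == "__main__":
    data = [10, 10.1, 9.9, 10, 10.2, 9.8, 10, 10.1, 9.9, 10, 50, 10, 10.1]
    print(detect_anomalies(data))
```

Because the threshold is derived from the stream itself, no per-dataset tuning or offline training period is needed in this sketch, which mirrors the "without human intervention" goal stated in the abstract.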
DALC: Distributed Automatic LSTM Customization for Fine-Grained Traffic Speed Prediction
Over the past decade, several approaches have been introduced for short-term
traffic prediction. However, providing fine-grained traffic prediction for
large-scale transportation networks where numerous detectors are geographically
deployed to collect traffic data is still an open issue. To address this issue,
in this paper, we formulate the problem of customizing an LSTM model for a
single detector into a finite Markov decision process and then introduce an
Automatic LSTM Customization (ALC) algorithm to automatically customize an LSTM
model for a single detector such that the corresponding prediction accuracy can
be as satisfactory as possible and the time consumption can be as low as
possible. Based on the ALC algorithm, we introduce a distributed approach
called Distributed Automatic LSTM Customization (DALC) to customize an LSTM
model for every detector in large-scale transportation networks. Our experiment
demonstrates that DALC provides higher prediction accuracy than several
approaches provided by Apache Spark MLlib.
Comment: 12 pages, 5 figures, the 34th International Conference on Advanced
Information Networking and Applications (AINA 2020), Springer
ReRe: A Lightweight Real-time Ready-to-Go Anomaly Detection Approach for Time Series
Anomaly detection is an active research topic in many different fields such
as intrusion detection, network monitoring, system health monitoring, IoT
healthcare, etc. However, many existing anomaly detection approaches require
either human intervention or domain knowledge, and may suffer from high
computational complexity, consequently hindering their applicability in
real-world scenarios. Therefore, a lightweight and ready-to-go approach that is
able to detect anomalies in real-time is highly sought-after. Such an approach
could be easily and immediately applied to perform time series anomaly
detection on any commodity machine. The approach could provide timely anomaly
alerts and thereby enable appropriate countermeasures to be undertaken as early
as possible. With these goals in mind, this paper introduces ReRe, which is a
Real-time Ready-to-go proactive Anomaly Detection algorithm for streaming time
series. ReRe employs two lightweight Long Short-Term Memory (LSTM) models to
predict and jointly determine whether or not an upcoming data point is
anomalous based on short-term historical data points and two long-term
self-adaptive thresholds. Experiments based on real-world time-series datasets
demonstrate the good performance of ReRe in real-time anomaly detection without
requiring human intervention or domain knowledge.
Comment: 10 pages, 9 figures, COMPSAC 202
Deep Data Locality on Apache Hadoop
The amount of data being collected in areas such as social media, networks, scientific instruments, mobile devices, and sensors is growing continuously, and the technology to process it is also advancing rapidly. One of the fundamental technologies for processing big data is Apache Hadoop, which has been adopted by many commercial products, such as InfoSphere by IBM or Spark by Cloudera. MapReduce on Hadoop has been widely used in many data science applications. Because Hadoop is a dominant big data processing platform, the performance of MapReduce on Hadoop has a significant impact on big data processing capability across multiple industries. Most research on improving the speed of big data analysis has focused on Hadoop modules such as Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop Yet Another Resource Negotiator (YARN), and Hadoop MapReduce. In this research, we focused on data locality on HDFS to improve the performance of MapReduce. MapReduce has used data locality to reduce the amount of data transfer. However, even though the majority of the processing cost occurs in the later stages, data locality has been used only in the early stages, which we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We have explored a new concept called Deep Data Locality (DDL), in which the data is pre-arranged to maximize locality in the later stages. Specifically, we introduce two implementation methods of DDL: block-based DDL and key-based DDL.
In block-based DDL, the data blocks are pre-arranged to reduce the block copying time in two ways. First, Rack-Local Map (RLM) blocks are eliminated. Under the conventional default block placement policy (DBPP), data blocks are placed randomly on any available slave nodes, which requires copying RLM blocks. In block-based DDL, blocks are placed so as to avoid RLMs and thereby reduce the block copy time. Second, block-based DDL concentrates the blocks on a smaller number of nodes and reduces the data transfer time among them. We analyzed the block distribution status with customer review data from TripAdvisor and measured performance with the TeraSort benchmark. Our test results show that the execution times of Map and Shuffle improved by up to 25% and 31%, respectively.
In key-based DDL, the input data is divided into several blocks and stored in HDFS before going into the Map stage. In comparison with conventional blocks that have random keys, each of our blocks has a unique key. This requires pre-sorting the key-value pairs, which can be done during the ETL process. This eliminates some data movement in the map, shuffle, and reduce stages, and thereby improves performance. In our experiments, MapReduce with key-based DDL performed 21.9% faster than default MapReduce and 13.3% faster than MapReduce with block-based DDL. Additionally, key-based DDL can be combined with other methods to further improve performance. When key-based DDL and block-based DDL were combined, Hadoop performance improved by 34.4%.
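The pre-sorting step behind key-based DDL can be sketched as follows. This is only an illustration under the reading that every block holds records for a single key; the function name `build_key_blocks` and the `block_size` parameter (counted in records rather than bytes) are hypothetical and do not come from the authors' implementation or from any Hadoop API.

```python
from collections import defaultdict


def build_key_blocks(records, block_size):
    """Group (key, value) pairs so that every block contains a single key.

    Hypothetical sketch: `block_size` is the maximum number of values per
    block, standing in for an HDFS block size limit.
    """
    by_key = defaultdict(list)
    for key, value in records:
        by_key[key].append(value)

    blocks = []
    for key in sorted(by_key):  # the pre-sorting step, e.g. during ETL
        values = by_key[key]
        # A key with many values is split into several single-key blocks.
        for start in range(0, len(values), block_size):
            blocks.append((key, values[start:start + block_size]))
    return blocks


if __name__ == "__main__":
    records = [("b", 1), ("a", 2), ("b", 3), ("a", 4), ("a", 5)]
    for key, values in build_key_blocks(records, block_size=2):
        print(key, values)
```

With blocks arranged this way, a map task reading one block only ever emits one key, which is the property that lets later stages skip part of the usual data movement.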
In this research, we developed MapReduce workflow models with a novel computational model. We also developed a numerical simulator that integrates the computational models. The model faithfully predicts Hadoop performance under various conditions.