A Combined Analytical Modeling Machine Learning Approach for Performance Prediction of MapReduce Jobs in Hadoop Clusters
Nowadays MapReduce and its open source implementation, Apache Hadoop, are the most widespread solutions for handling massive datasets on clusters of commodity hardware. At the expense of somewhat reduced performance in comparison to HPC technologies, the MapReduce framework provides fault tolerance and automatic parallelization without any effort by developers. Since in many cases Hadoop is adopted to support business critical activities, it is often important to predict with fair confidence the execution time of submitted jobs, for instance when SLAs are established with end-users. In this work, we propose and validate a hybrid approach exploiting both queuing networks and support vector regression, in order to achieve good accuracy without too many costly experiments on a real setup. The experimental results show that the proposed approach attains a 21% improvement in accuracy over applying machine learning techniques without any support from analytical models.
Predicting Scheduling Failures in the Cloud
Cloud Computing has emerged as a key technology to deliver and manage
computing, platform, and software services over the Internet. Task scheduling
algorithms play an important role in the efficiency of cloud computing services
as they aim to reduce the turnaround time of tasks and improve resource
utilization. Several task scheduling algorithms have been proposed in the
literature for cloud computing systems, the majority relying on the
computational complexity of tasks and the distribution of resources. However,
several tasks scheduled following these algorithms still fail because of
unforeseen changes in the cloud environments. In this paper, using task
execution and resource utilization data extracted from the execution traces of
real world applications at Google, we explore the possibility of predicting the
scheduling outcome of a task using statistical models. If we can successfully
predict task failures, we may be able to reduce the execution time of jobs by
rescheduling failed tasks earlier (i.e., before their actual failing time). Our
results show that statistical models can predict task failures with a precision
up to 97.4%, and a recall up to 96.2%. We simulate the potential benefits of
such predictions using the toolkit GloudSim and find that they can improve
the number of finished tasks by up to 40%. We also perform a case study using
the Hadoop framework of Amazon Elastic MapReduce (EMR) and the jobs of a gene
expression correlations analysis study from breast cancer research. We find
that when extending the scheduler of Hadoop with our predictive models, the
percentage of failed jobs can be reduced by up to 45%, with an overhead of less
than 5 minutes.
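The precision and recall figures quoted in this abstract are standard classification metrics. As a minimal sketch of how they are computed for a failure predictor, assuming the positive class is "task failed" and using made-up labels (none of this data comes from the paper):

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for the positive class True = task failed."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a and p)
    false_pos = sum(1 for a, p in pairs if not a and p)
    false_neg = sum(1 for a, p in pairs if a and not p)
    # Guard against empty denominators (no predicted or no actual failures).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical outcomes for six tasks, for illustration only.
actual = [True, True, True, False, False, True]
predicted = [True, True, False, False, True, True]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the share of predicted failures that really failed; recall is the share of real failures that were caught. A scheduler that reschedules on prediction pays for low precision with wasted rescheduling and for low recall with missed failures.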
Performance Prediction of Cloud-Based Big Data Applications
Big data analytics have become widespread as a means to extract knowledge from large datasets. Yet, the heterogeneity and irregularity usually associated with big data applications often overwhelm the existing software and hardware infrastructures. In such context, the flexibility and elasticity provided by the cloud computing paradigm offer a natural approach to cost-effectively adapting the allocated resources to the application’s current needs. However, these same characteristics impose extra challenges to predicting the performance of cloud-based big data applications, a key step to proper management and planning. This paper explores three modeling approaches for performance prediction of cloud-based big data applications. We evaluate two queuing-based analytical models and a novel fast ad hoc simulator in various scenarios based on different applications and infrastructure setups. The three approaches are compared in terms of prediction accuracy, finding that our best approaches can predict average application execution times with 26% relative error in the very worst case and about 7% on average.
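The accuracy figures in this abstract are relative errors. A minimal sketch of that computation follows, assuming the usual definition |predicted - measured| / measured, which the abstract does not spell out; the execution times are made up for illustration:

```python
def relative_errors(measured, predicted):
    """Per-run relative error, assuming |predicted - measured| / measured."""
    return [abs(p - m) / m for m, p in zip(measured, predicted)]


# Hypothetical execution times in seconds, not taken from the paper.
measured = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

errors = relative_errors(measured, predicted)
mean_error = sum(errors) / len(errors)   # the "on average" figure
worst_error = max(errors)                # the "very worst case" figure
print(f"mean={mean_error:.1%} worst={worst_error:.1%}")
```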
Geometric Approaches to Big Data Modeling and Performance Prediction
Big Data frameworks (e.g., Spark) have many configuration parameters, such as memory size, CPU allocation, and the number of nodes (parallelism). Regular users and even expert administrators struggle to understand the relationship between different parameter configurations and the overall performance of the system. In this work, we address this challenge by proposing a performance prediction framework to build performance models with varied configurable parameters on Spark. We take inspiration from the field of Computational Geometry to construct a d-dimensional mesh using Delaunay Triangulation over a selected set of features. From this mesh, we predict execution time for unknown feature configurations. To minimize the time and resources spent in building a model, we propose an adaptive sampling technique that allows us to collect only as many training points as required. Our evaluation on a cluster of computers using several workloads shows that our prediction error is lower than that of state-of-the-art methods while requiring fewer training samples.
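One common way to predict a value from a triangulated mesh is piecewise-linear (barycentric) interpolation inside the simplex that contains the query point. The sketch below shows only that step for a single 2D triangle; it assumes the triangulation is already built and the containing triangle already located, and it is not claimed to be the authors' exact procedure. The feature pairs and runtimes are hypothetical.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c)."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle")
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return w1, w2, 1.0 - w1 - w2


def predict_runtime(p, vertices, runtimes):
    """Linearly interpolate the runtime at p from the triangle's three samples."""
    weights = barycentric_weights(p, *vertices)
    if min(weights) < -1e-12:
        raise ValueError("point lies outside this triangle")
    return sum(w * t for w, t in zip(weights, runtimes))


# Hypothetical samples: (memory in GB, number of nodes) -> runtime in seconds.
vertices = [(4.0, 2.0), (8.0, 2.0), (4.0, 6.0)]
runtimes = [120.0, 90.0, 60.0]

estimate = predict_runtime((6.0, 3.0), vertices, runtimes)
print(f"estimated runtime: {estimate:.1f} s")
```

In higher dimensions the same idea applies with d+1 vertices per simplex; a library such as SciPy can build the triangulation, which this sketch deliberately leaves out.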
Performance modelling, analysis and prediction of Spark jobs in Hadoop cluster : a thesis by publications presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical & Computational Sciences, Massey University, Auckland, New Zealand
Big Data frameworks have received tremendous attention from industry and academic research over the past decade. Distributed computing frameworks such as Hadoop MapReduce and Spark offer an efficient solution for analysing large-scale datasets on a Hadoop cluster. Spark has been established as one of the most popular large-scale data processing engines because of its speed, low latency in-memory computation, and advanced analytics.
Spark computational performance heavily depends on the selection of suitable parameters, and the configuration of these parameters is a challenging task. Although Spark has default parameters and can deploy applications without much effort, a significant drawback of default parameter selection is that it is not always the best for cluster performance. A major limitation for Spark performance prediction using existing models is that it requires either large input data or system configuration that is time-consuming. Therefore, an analytical model could be a better solution for performance prediction and for establishing appropriate job configurations.
This thesis proposes two distinct parallelisation models for performance prediction: the 2D-Plate model and the Fully-Connected Node model. Both models were constructed based on serial boundaries for a certain arrangement of executors and size of the data. To evaluate cluster performance, various HiBench workloads were used, and the workloads’ empirical data were fitted with the models for performance prediction analysis. The developed models were benchmarked against existing models such as Amdahl’s law, Gustafson’s law, ERNEST, and machine learning. Our experimental results show that the two proposed models can quickly and accurately predict performance in terms of runtime, and that they can outperform the accuracy of machine learning models when extrapolating predictions.
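For context on two of the baselines named above, here is a minimal sketch of the textbook forms of Amdahl's and Gustafson's laws. It does not implement the thesis's 2D-Plate or Fully-Connected Node models; the parallel fraction, executor count, and serial runtime are hypothetical values chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's law: speedup on n workers for a fixed problem size."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def gustafson_speedup(parallel_fraction, n):
    """Gustafson's law: scaled speedup when the problem grows with n workers."""
    return (1.0 - parallel_fraction) + parallel_fraction * n


# Hypothetical job: 90% of the work parallelises, run on 10 executors.
p, executors = 0.9, 10
serial_runtime = 600.0  # seconds, made up for illustration

amdahl = amdahl_speedup(p, executors)
gustafson = gustafson_speedup(p, executors)
predicted_runtime = serial_runtime / amdahl  # fixed-size runtime estimate

print(f"Amdahl speedup:    {amdahl:.2f}")
print(f"Gustafson speedup: {gustafson:.2f}")
print(f"Predicted runtime: {predicted_runtime:.0f} s")
```

Amdahl's law caps the speedup at 1 / (1 - p) however many executors are added, while Gustafson's law grows linearly because it assumes the workload scales with the cluster.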
Learning-based Automatic Parameter Tuning for Big Data Analytics Frameworks
Big data analytics frameworks (BDAFs) have been widely used for data
processing applications. These frameworks provide a large number of
configuration parameters to users, which leads to a tuning issue that
overwhelms users. To address this issue, many automatic tuning approaches have
been proposed. However, it remains a critical challenge to generate enough
samples in a high-dimensional parameter space within a time constraint. In this
paper, we present AutoTune--an automatic parameter tuning system that aims to
optimize application execution time on BDAFs. AutoTune first constructs a
smaller-scale testbed from the production system so that it can generate more
samples, and thus train a better prediction model, under a given time
constraint. Furthermore, the AutoTune algorithm produces a set of samples that
can provide a wide coverage over the high-dimensional parameter space, and
searches for more promising configurations using the trained prediction model.
AutoTune is implemented and evaluated using the Spark framework and HiBench
benchmark deployed on a public cloud. Extensive experimental results illustrate
that AutoTune improves on default configurations by 63.70% on average, and on
the five state-of-the-art tuning algorithms by 6%-23%.
Comment: 12 pages, submitted to IEEE BigData 201
Exploring and Evaluating the Scalability and Efficiency of Apache Spark using Educational Datasets
Research into the combination of data mining and machine learning technology with web-based education systems (known as education data mining, or EDM) is becoming imperative in order to enhance the quality of education by moving beyond traditional methods. With the worldwide growth of information and communication technology (ICT), data are becoming available in very large volumes, with high velocity and extensive variety. In this thesis, four popular data mining methods are applied on Apache Spark, using large datasets from Online Cognitive Learning Systems to explore the scalability and efficiency of Spark. Datasets of various sizes are tested on Spark MLlib with different running configurations and parameter tunings. The thesis convincingly presents useful strategies for allocating computing resources and tuning to take full advantage of the in-memory system of Apache Spark to conduct the tasks of data mining and machine learning. Moreover, it offers insights that education experts and data scientists can use to manage and improve the quality of education, as well as to analyze and discover hidden knowledge in the era of big data.
Overcoming Challenges in Predictive Modeling of Laser-Plasma Interaction Scenarios. The Sinuous Route from Advanced Machine Learning to Deep Learning
The interaction of ultrashort and intense laser pulses with solid targets and dense plasmas is a rapidly developing area of physics, mostly due to the significant advancements in laser technology. There is, thus, a growing interest in diagnosing as accurately as possible the numerous phenomena related to the absorption and reflection of laser radiation. At the same time, envisaged experiments demand simulation software of increased accuracy. As laser-plasma interaction modeling is experiencing a transition from computationally-intensive to data-intensive problems, the traditional codes employed so far are starting to show their limitations. It is in this context that predictive modeling of laser-plasma interaction experiments is bound to reshape the definition of simulation software. This chapter focuses on an entire class of predictive systems incorporating big data, advanced machine learning algorithms and deep learning, with improved accuracy and speed. Making use of terabytes of already available information (literature as well as simulation and experimental data), these systems enable the discovery and understanding of various physical phenomena occurring during interaction, hence allowing researchers to set up controlled experiments at optimal parameters. A comparative discussion in terms of challenges, advantages, bottlenecks, performance and suitability of laser-plasma interaction predictive systems is ultimately provided.