On data skewness, stragglers, and MapReduce progress indicators
We tackle the problem of predicting the performance of MapReduce
applications, designing accurate progress indicators that keep programmers
informed on the percentage of completed computation time during the execution
of a job. Through extensive experiments, we show that state-of-the-art progress
indicators (including the one provided by Hadoop) can be seriously harmed by
data skewness, load imbalance, and straggling tasks. This is mainly due to
their implicit assumption that the running time depends linearly on the input
size. We thus design a novel profile-guided progress indicator, called
NearestFit, that operates without the linear hypothesis assumption and exploits
a careful combination of nearest neighbor regression and statistical curve
fitting techniques. Our theoretical progress model requires fine-grained
profile data, which can be very difficult to manage in practice. To overcome
this issue, we resort to computing accurate approximations for some of the
quantities used in our model through space- and time-efficient data streaming
algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive
empirical assessment over the Amazon EC2 platform on a variety of real-world
benchmarks shows that NearestFit is practical w.r.t. space and time overheads
and that its accuracy is generally very good, even in scenarios where
competitors incur non-negligible errors and wide prediction fluctuations.
Overall, NearestFit significantly improves the current state of the art in progress
analysis for MapReduce.
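The combination the abstract describes, nearest neighbour regression with a curve-fitting fallback, can be illustrated with a toy sketch. This is plain Python and not NearestFit's actual model or profile format: it predicts a task's running time from profiled (input size, time) pairs, averaging close neighbours when the profile is dense around the query and falling back to a fitted power law (rather than a linear model) when it is not.

```python
import math

def fit_power_law(points):
    # Least-squares fit of t = a * x^b in log-log space.
    # Assumes all sizes and times are positive.
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(t) for _, t in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b

def predict_time(profile, size, k=3, radius=0.25):
    # Nearest-neighbour regression: average the times of the k profile
    # points closest in input size, provided they all lie within a
    # relative distance `radius` of the query; otherwise fall back to
    # statistical curve fitting over the whole profile.
    nearest = sorted(profile, key=lambda p: abs(p[0] - size))[:k]
    if all(abs(x - size) <= radius * size for x, _ in nearest):
        return sum(t for _, t in nearest) / len(nearest)
    a, b = fit_power_law(profile)
    return a * size ** b
```

With a dense profile the neighbour average answers directly; with a sparse one the fitted curve extrapolates, which is the non-linear behaviour that defeats linear progress indicators.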
PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation
Online aggregation provides estimates to the final result of a computation
during the actual processing. The user can stop the computation as soon as the
estimate is accurate enough, typically early in the execution. This enables
interactive exploration of even the largest datasets. In this paper we
introduce the first framework for parallel online aggregation in which the
estimation virtually does not incur any overhead on top of the actual
execution. We define a generic interface to express any estimation model that
completely abstracts the execution details. We design a novel estimator
specifically targeted at parallel online aggregation. When executed by the
framework over a massive TPC-H instance, the estimator provides
accurate confidence bounds early in the execution even when the cardinality of
the final result is seven orders of magnitude smaller than the dataset size and
without incurring overhead.
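The core idea of online aggregation, an estimate with confidence bounds that tighten as processing proceeds, can be sketched with a running CLT-based bound over a uniform random sample of the table. This is a toy model, not PF-OLA's estimator interface; `total_rows` and the 95% z-value are illustrative parameters:

```python
import math

def online_sum_estimate(stream, total_rows, z=1.96):
    # As sampled values arrive, maintain a running estimate of the full
    # SUM together with a CLT-based confidence half-width. Variance is
    # accumulated with Welford's online algorithm, so each update is O(1).
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for value in stream:
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        if n > 1:
            stderr = math.sqrt(m2 / (n - 1)) / math.sqrt(n)
            yield total_rows * mean, total_rows * z * stderr
```

A user watching the yielded (estimate, half-width) pairs can stop as soon as the half-width is acceptably small, typically long before the scan completes.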
Understanding Spark System Performance for Image Processing in a Heterogeneous Commodity Cluster
In recent years, Apache Spark has seen a widespread adoption in industries and institutions due to its
cache mechanism for faster Big Data analytics. However, the speed advantage Spark provides, especially in
a heterogeneous cluster environment, is not obtainable out-of-the-box; it requires the right combination of
configuration parameters from the myriad of parameters provided by Spark developers. Recognizing this
challenge, this thesis undertakes a study to provide insight into Spark performance, particularly
regarding the impact of chosen parameter settings. These are parameters that are critical to fast job completion and
effective utilization of resources.
To this end, the study focuses on two example applications, namely flowerCounter and imageClustering,
for processing still image datasets of Canola plants collected during the Summer of 2016 from selected
plot fields using time-lapse cameras in a heterogeneous Spark cluster environment. These applications
were of initial interest to the Plant Phenotyping and Imaging Research Centre (P2IRC) at the University of
Saskatchewan. The P2IRC is responsible for developing systems that will aid fast analysis of large-scale seed
breeding to ensure global food security. The flowerCounter application estimates the count of flowers from
the images while the imageClustering application clusters images based on physical plant attributes. Two
clusters are used for the experiments: a 12-node and a 3-node cluster (each including a master node), with the Hadoop
Distributed File System (HDFS) as the storage medium for the image datasets.
Experiments with the two case study applications demonstrate that increasing the number of tasks does
not always speed up job processing due to increased communication overhead. Findings from other experiments
show that numerous tasks with one core per executor and a small memory allocation limit parallelism
within an executor and result in inefficient use of cluster resources. Executors with large CPU and memory,
on the other hand, do not speed up analytics due to processing delays and thread concurrency overheads. Further
experimental results indicate that application processing time depends on input data storage in conjunction
with locality levels, and executor run time is largely dominated by disk I/O time, especially the read
time cost. With respect to horizontal node scaling, Spark scales with increasing homogeneous computing
nodes, but the speed-up degrades with heterogeneous nodes. Finally, this study shows that the effectiveness
of speculative task execution in mitigating the impact of slow nodes varies across applications.
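The sizing findings above (avoid one-core, low-memory executors, but also avoid oversized ones) are typically applied through Spark's submit-time settings. The flags below are standard spark-submit options, but all the values, the application name, and the HDFS path are purely illustrative, not the thesis's tuned configuration:

```shell
# Hypothetical sizing for a 12-node cluster (one master, 11 workers);
# all numbers are illustrative. The idea: moderately sized executors
# (several cores, a few GB each) rather than 1-core/low-memory or giant
# ones, with parallelism set near twice the total core count.
spark-submit \
  --master yarn \
  --num-executors 11 \
  --executor-cores 4 \
  --executor-memory 6g \
  --conf spark.default.parallelism=88 \
  image_clustering.py hdfs:///data/canola/images
```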
Stateful data-parallel processing
Democratisation of data means that more people than ever are involved in the data analysis process. This is beneficial, since it brings domain-specific knowledge from broad fields, but data scientists do not have adequate tools to write algorithms and execute them at scale. The processing models of current data-parallel processing systems, designed for scalability and fault tolerance, are stateless. Stateless processing facilitates capturing parallelisation opportunities and hides the mechanics of fault tolerance. However, data scientists want to write stateful programs, with explicit state that they can update, such as matrices in machine learning algorithms, and are used to imperative-style languages. Such programs struggle to execute with high performance in stateless data-parallel systems.
Representing state explicitly makes data-parallel processing at scale challenging. To achieve scalability, state must be distributed and coordinated across machines. In the event of failures, state must be recovered to provide correct results. We introduce stateful data-parallel processing that addresses these challenges by: (i) representing state as a first-class citizen so that a system can manipulate it; (ii) introducing two distributed mutable state abstractions for scalability; and (iii) adopting an integrated approach to scale-out and fault tolerance that recovers large state spanning the memory of multiple machines. To support imperative-style programs, a static analysis tool analyses Java programs that manipulate state and translates them to a representation that can execute on SEEP, an implementation of a stateful data-parallel processing model. SEEP is evaluated with stateful Big Data applications and shows comparable or better performance than state-of-the-art stateless systems.
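The notion of distributed mutable state as a first-class citizen can be sketched in miniature: state hash-partitioned by key so that each worker owns a shard, with per-shard checkpoint and restore hooks for failure recovery. The class and method names here are illustrative, not SEEP's actual abstractions:

```python
class PartitionedState:
    # A minimal sketch of a distributed mutable state abstraction:
    # keys are hash-partitioned into shards (one per worker in a real
    # system), and each shard can be checkpointed and restored
    # independently so recovery touches only the failed partition.
    def __init__(self, num_partitions):
        self.shards = [dict() for _ in range(num_partitions)]

    def _shard(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def update(self, key, fn, default=0):
        # Apply a user-supplied mutation, e.g. an increment in an
        # iterative machine learning algorithm.
        shard = self._shard(key)
        shard[key] = fn(shard.get(key, default))

    def get(self, key, default=0):
        return self._shard(key).get(key, default)

    def checkpoint(self, partition):
        return dict(self.shards[partition])

    def restore(self, partition, snapshot):
        self.shards[partition] = dict(snapshot)
```

Because the system, not the user program, owns the partitioning and checkpointing, it can move shards to scale out and roll them back on failure without changing application logic.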
A prescriptive analytics approach for energy efficiency in datacentres.
Given the evolution of Cloud Computing in recent years, users and clients adopting Cloud Computing for both personal and business needs have increased at an unprecedented scale. This has naturally led to increased deployments of Cloud datacentres across the globe. As a consequence of this increasing adoption, Cloud datacentres have become massive energy consumers and environmental polluters. Whilst the energy implications of Cloud datacentres are being addressed from various research perspectives, predicting the future trends and behaviours of workloads at the datacentres, and thereby reducing the active server resources, is one dimension of green computing gaining the interest of researchers and Cloud providers. However, this entails various practical and analytical challenges imposed by the increased dynamism of Cloud systems. The behavioural characteristics of Cloud workloads and users are still not perfectly clear, which limits the reliability of the prediction accuracy of existing research in this context. To this end, this thesis presents a comprehensive descriptive analytics of Cloud workload and user behaviours, uncovering the causes and energy-related implications of Cloud Computing. Furthermore, the characteristics of Cloud workloads and users, including latency levels, job heterogeneity, user dynamicity, straggling task behaviours, energy implications of stragglers, job execution and termination patterns, and the inherent periodicity among Cloud workload and user behaviours, have been empirically presented. Driven by this descriptive analytics, a novel user behaviour forecasting framework has been developed, aimed at a tri-fold forecast of user behaviours: the session duration of users, the anticipated number of submissions, and the arrival trend of incoming workloads.
Furthermore, a novel resource optimisation framework has been proposed to provision the optimum level of resources for executing jobs with reduced server energy expenditure and fewer job terminations. This optimisation framework encompasses a resource estimation module to predict the anticipated resource consumption level for arriving jobs and a classification module to classify tasks based on their resource intensiveness. Both proposed frameworks have been verified theoretically and tested experimentally on Google Cloud trace logs. Experimental analysis demonstrates the effectiveness of the proposed frameworks in terms of the reliability of the forecast results and in reducing the server energy expended on executing jobs at the datacentres.
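As a toy illustration of the forecasting idea only (the thesis's actual model is not specified here), a per-user submission count for the next period can be forecast with simple exponential smoothing over the observed history:

```python
def forecast_submissions(history, alpha=0.5):
    # Simple exponential smoothing: an illustrative stand-in for
    # forecasting a user's anticipated number of job submissions.
    # `alpha` weights recent observations against the running level.
    level = history[0]
    for observed in history[1:]:
        level = alpha * observed + (1 - alpha) * level
    return level
```

A forecast like this feeds the resource-provisioning decision: servers beyond the predicted demand can be kept inactive, reducing energy expenditure.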
Intelligent Straggler Mitigation in Massive-Scale Computing Systems
In order to satisfy increasing demands for Cloud services, modern computing systems are often massive in scale, typically consisting of hundreds to thousands of heterogeneous machine nodes. Parallel computing frameworks such as MapReduce are widely deployed over such cluster infrastructure to provide reliable yet prompt services to customers. However, complex characteristics of Cloud workloads, including multi-dimensional resource requirements and highly changeable system environments, e.g. dynamic node performance, are introducing new challenges to service providers in terms of both customer experience and system efficiency. One primary challenge is the straggler problem, whereby a small subset of the parallelized tasks take abnormally long to execute compared with their siblings, leading to extended job response times and potential late-timing failures.
The state-of-the-art approach to straggler mitigation is speculative execution. Although it has been deployed in several real-world systems with a variety of implementation optimizations, the analysis in this thesis shows that speculative execution is often inefficient. According to various production tracelogs of data centers, the failure rate of speculative execution can be as high as 71%. Straggler mitigation is a complicated problem by its very nature: 1) stragglers may lead to different consequences for parallel job execution, possibly with different degrees of severity; 2) whether a task should be regarded as a straggler is highly subjective, depending upon different application and system conditions; 3) the efficiency of speculative execution would be improved if dynamic node performance could be modelled and predicted appropriately; and 4) there are other types of stragglers, e.g. those caused by data skew, that are beyond the capability of speculative execution.
This thesis starts with a quantitative and rigorous analysis of issues with stragglers, including their root causes and impacts, the execution environment running them, and the limitations of their mitigation. Scientific principles of straggler mitigation are investigated and new algorithms are developed. An intelligent system for straggler mitigation is then designed and developed, compatible with the majority of current parallel computing frameworks. Combining historical data analysis with online adaptation, the system is capable of mitigating stragglers intelligently: dynamically judging a task to be a straggler and handling it, avoiding currently weak nodes, and dealing with data skew, a special type of straggler, with a dedicated method. Comprehensive analysis and evaluation show that the system is able to reduce job response time by up to 55% compared with the speculator used in the default YARN system, while the optimal improvement a speculation-based method may achieve is around 66% in theory. The system also achieves a much higher success rate of speculation than other production systems, up to 89%.
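The point about straggler judgement being subjective can be made concrete with the common heuristic behind speculative execution: flag any task whose progress rate falls well below the median rate of its siblings. The 0.5 threshold below is illustrative, not the thesis's tuned value, and a real speculator would also weigh node health and remaining work:

```python
from statistics import median

def find_stragglers(progress_rates, threshold=0.5):
    # Flag tasks progressing at less than `threshold` times the median
    # rate of their sibling tasks; these become speculation candidates.
    med = median(progress_rates.values())
    return [task for task, rate in progress_rates.items()
            if rate < threshold * med]
```

The weaknesses the thesis identifies are visible even here: the threshold is arbitrary, a skew-induced straggler legitimately has more work and would be speculated in vain, and nothing accounts for the backup copy landing on another weak node.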
Real-time performance diagnosis and evaluation of big data systems in cloud datacenters
Modern big data processing systems are becoming very complex in terms of large scale, high concurrency and multi-tenancy. Thus, many failures and performance
reductions only happen at run-time and are very difficult to capture. Moreover, some
issues may only be triggered when some components are executed. To analyze the root
cause of these types of issues, we have to capture the dependencies of each component
in real-time.
Big data processing systems, such as Hadoop and Spark, usually work in large-scale,
highly concurrent, and multi-tenant environments that can easily cause hardware and
software malfunctions or failures, thereby leading to performance degradation. Several systems and methods exist to detect big data processing systems’ performance
degradation, perform root-cause analysis, and even overcome the issues causing such
degradation. However, these solutions focus on specific problems such as stragglers and
inefficient resource utilization. There is a lack of a generic and extensible framework
to support the real-time diagnosis of big data systems.
Performance diagnosis and prediction of big data systems are highly complex as these
frameworks are typically deployed in cloud data centers that are large-scale, highly
concurrent, and follow a multi-tenant model. Several factors, including hardware
heterogeneity, stochastic networks and application workloads may impact the performance of big data systems. The current state-of-the-art does not sufficiently address
the challenge of determining complex, usually stochastic and hidden relationships between these factors.
To handle performance diagnosis and evaluation of big data systems in cloud environments, this thesis proposes multilateral research towards monitoring and performance
diagnosis and prediction in cloud-based large-scale distributed systems by involving a
novel combination of an effective and efficient deployment pipeline. The key contributions of this dissertation are listed below:
• Designing a real-time big data monitoring system called SmartMonit that efficiently collects the runtime system information including computing resource
utilization and job execution information, and then correlates the collected information with the Execution Graph modeled as a directed acyclic graph (DAG).
• Developing AutoDiagn, an automated real-time diagnosis framework for big data
systems that automatically detects performance degradation and inefficient resource utilization problems, while providing online detection and semi-online
root-cause analysis for a big data system.
• Designing a novel root-cause analysis technique/system called BigPerf for big
data systems that analyzes and characterizes the performance of big data applications by incorporating Bayesian networks to determine uncertain and complex
relationships between performance-related factors.
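The Execution Graph idea underlying SmartMonit can be sketched as a DAG over job stages: given stage-dependency edges, a monitor that spots a slow stage can walk the reversed graph to collect the upstream stages that are candidates for root-cause analysis. This is an illustrative sketch, not SmartMonit's actual data model:

```python
from collections import defaultdict, deque

def ancestors(edges, stage):
    # `edges` is a list of (upstream, downstream) stage pairs forming a
    # DAG. Breadth-first search over the reversed edges collects every
    # stage that `stage` transitively depends on, i.e. the candidate
    # set when diagnosing why `stage` is slow.
    reverse = defaultdict(list)
    for upstream, downstream in edges:
        reverse[downstream].append(upstream)
    seen, queue = set(), deque([stage])
    while queue:
        for up in reverse[queue.popleft()]:
            if up not in seen:
                seen.add(up)
                queue.append(up)
    return seen
```

Attaching runtime metrics (CPU, memory, I/O) to each node of such a graph is what lets a diagnosis framework move from "this stage is slow" to "because that upstream stage starved it".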