PhD Thesis
Modern big data processing systems are becoming very complex in terms of large scale, high concurrency, and multi-tenancy. Thus, many failures and instances of performance
degradation occur only at run-time and are very difficult to capture. Moreover, some
issues may be triggered only when certain components are executed. To analyze the root
cause of these types of issues, we have to capture the dependencies of each component
in real time.
Big data processing systems, such as Hadoop and Spark, usually work in large-scale,
highly concurrent, and multi-tenant environments that can easily cause hardware and
software malfunctions or failures, thereby leading to performance degradation. Several systems and methods exist to detect big data processing systems’ performance
degradation, perform root-cause analysis, and even overcome the issues causing such
degradation. However, these solutions focus on specific problems such as stragglers and
inefficient resource utilization. There is a lack of a generic and extensible framework
to support the real-time diagnosis of big data systems.
Performance diagnosis and prediction of big data systems are highly complex, as these
frameworks are typically deployed in cloud data centers that are large-scale, highly
concurrent, and follow a multi-tenant model. Several factors, including hardware
heterogeneity, stochastic networks, and application workloads, may impact the performance of big data systems. The current state of the art does not sufficiently address
the challenge of determining the complex, usually stochastic and hidden relationships between these factors.
To handle performance diagnosis and evaluation of big data systems in cloud environments, this thesis pursues multilateral research on monitoring, performance
diagnosis, and performance prediction in cloud-based large-scale distributed systems, combined in a
novel way with an effective and efficient deployment pipeline.
The key contributions of this dissertation are listed below:
• Designing a real-time big data monitoring system called SmartMonit that efficiently collects runtime system information, including computing resource
utilization and job execution information, and then relates the collected information to the Execution Graph, modeled as a directed acyclic graph (DAG).
• Developing AutoDiagn, an automated real-time diagnosis framework for big data
systems that automatically detects performance degradation and inefficient resource utilization problems, while providing online detection and semi-online
root-cause analysis for a big data system.
• Designing a novel root-cause analysis technique/system called BigPerf for big
data systems that analyzes and characterizes the performance of big data applications by incorporating Bayesian networks to determine uncertain and complex
relationships between performance-related factors.
State of the Republic of Turkey and the Turkish Ministry
of National Educatio