27 research outputs found

Bankrupt Covert Channel: Turning Network Predictability into Vulnerability

Author: Grot Boris
Katebzadeh M.R. Siavash
Petrov Plamen
Ustiugov Dmitrii
Publication date: 03/08/2020

Recent years have seen a surge in the number of data leaks despite aggressive information-containment measures deployed by cloud providers. When attackers acquire sensitive data in a secure cloud environment, covert communication channels are a key tool for exfiltrating the data to the outside world. While the bulk of prior work has focused on covert channels within a single CPU, such channels require the spy (transmitter) and the receiver to share the CPU, which might be difficult to achieve in a cloud environment with hundreds or thousands of machines. This work presents Bankrupt, a high-rate, highly clandestine channel that enables covert communication between the spy and the receiver running on different nodes in an RDMA network. In Bankrupt, the spy communicates with the receiver by issuing RDMA network packets to a private memory region allocated to it on a different machine (an intermediary). The receiver similarly allocates a separate memory region on the same intermediary, also accessed via RDMA. By steering RDMA packets to a specific set of remote memory addresses, the spy causes deep queuing at one memory bank, which is the finest addressable internal unit of main memory. This exposes a timing channel that the receiver can listen on by issuing probe packets to addresses mapped to the same bank but in its own private memory region. The Bankrupt channel delivers 74 Kb/s throughput in CloudLab's public cloud while remaining undetectable to existing monitoring capabilities, such as CPU and NIC performance counters. Comment: Published in WOOT 2020 co-located with USENIX Security 202
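
The timing principle described in this abstract can be illustrated with a toy queueing model. The sketch below is only a simulation under stated assumptions: it touches no network or memory hardware, and the service time, burst size, and decision threshold are hypothetical values chosen for illustration, not figures from the paper.

```python
from typing import List

# All three constants are assumptions made for this illustration.
SERVICE_NS = 50       # time for one bank to serve one request
BURST = 8             # requests the sender queues at the bank to signal a 1
THRESHOLD_NS = 200    # latency above which the receiver reads a 1


def probe_latency(queued_requests: int) -> int:
    """Latency of a probe that waits behind `queued_requests` at the same bank."""
    return (queued_requests + 1) * SERVICE_NS


def transmit(bits: List[int]) -> List[int]:
    """Probe latency the receiver would observe in each bit slot."""
    return [probe_latency(BURST if bit else 0) for bit in bits]


def decode(latencies: List[int]) -> List[int]:
    """Map each observed latency back to a bit with a fixed threshold."""
    return [1 if latency > THRESHOLD_NS else 0 for latency in latencies]


message = [1, 0, 1, 1, 0]
observed = transmit(message)   # [450, 50, 450, 450, 50]
recovered = decode(observed)
```

In this model a contended bank yields a 450 ns probe and an idle bank a 50 ns probe, so a single threshold separates the two cases. A real system would see noisy latencies and would need calibration, which the sketch leaves out.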

arXiv.org e-Print Archive

Edinburgh Research Explorer

Perph: A Workload Co-location Agent with Online Performance Prediction and Resource Inference

Author: Hu C
Ouyang J
Wo T
Xu J
Xue S
Yang R
Zhu J
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 10/05/2021

Striking a balance between improved cluster utilization and guaranteed application QoS is a long-standing research problem in cluster resource management. The majority of current solutions require a large number of sandboxed experiments for different workload combinations and leverage them to predict possible interference for incoming workloads. This results in non-negligible time complexity that severely restricts their applicability to complex workload co-locations. Pure offline profiling may also lead to a model-aging problem that drastically degrades model precision. In this paper, we present Perph, a runtime agent on a per-node basis, which decouples ML-based performance prediction and resource inference from the centralized scheduler. We exploit the sensitivity of long-running applications to multiple resources to establish a relationship between resource allocation and the resulting performance. We use Online Gradient Boost Regression Tree (OGBRT) to enable continuous model evolution. Once performance degradation is detected, resource inference is conducted to work out a proper slice of resources that will be reallocated to recover the target performance. The integration with the Node Manager (NM) of Apache YARN shows that the throughput of a Kafka data-streaming application is 2.0x and 1.82x that of the isolation execution schemes in native YARN and the pure cgroup cpu subsystem, respectively. In TPC-C benchmarking, the throughput can also be improved by 35% and 23% against native YARN and the cgroup cpu subsystem, respectively
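
The resource inference step in this abstract, finding a slice of resources that recovers the target performance, can be sketched as a search over a performance predictor. This is a minimal sketch, not Perph's method: it swaps the OGBRT model for a hypothetical monotone toy function, considers a single resource (CPU cores), and uses plain bisection. The names `infer_cpu_share` and `toy_model` are invented for the example.

```python
from typing import Callable


def infer_cpu_share(predict: Callable[[float], float], target: float,
                    lo: float = 0.0, hi: float = 16.0,
                    tol: float = 0.01) -> float:
    """Return a near-minimal CPU share whose predicted throughput meets `target`.

    Assumes `predict` is non-decreasing in the CPU share (an assumption of
    this sketch, not a property claimed by the paper).
    """
    if predict(hi) < target:
        raise ValueError("target unreachable within the allowed CPU share")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if predict(mid) >= target:
            hi = mid      # mid is enough, try a smaller share
        else:
            lo = mid      # mid is too small
    return hi


def toy_model(cores: float) -> float:
    """Hypothetical predictor: 1000 msg/s per core, saturating at 8 cores."""
    return 1000.0 * min(cores, 8.0)


share = infer_cpu_share(toy_model, target=5000.0)
```

With the toy model, the search returns a share within 0.01 cores above 5.0. An online learner would replace `toy_model` and be refreshed as new measurements arrive.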

White Rose Research Online

Remote Memory for Virtualized Environments

Author: 조창연
Publication venue: Seoul National University Graduate School
Publication date: 01/08/2021

Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021.8. Bernhard Egger.
Because cloud environments remove the need to keep large computing resources running at all times and charge only for the amount of computation used at the moment it is needed, demand for them has grown sharply with the recent popularity of artificial intelligence and big data workloads. With the adoption of cloud computing, customers can greatly reduce the cost of maintaining servers, and service providers can maximize the utilization of their computing resources. In this scenario, improving the utilization efficiency of computing resources becomes an important goal for the data center. Given the rapidly growing scale of data centers in particular, even a small improvement in efficiency can create enormous economic value. Data center efficiency is affected by many factors, such as site selection, structural design, cooling systems, and hardware configuration, but this thesis deals specifically with the design and implementation of software that manages compute and memory resources. This thesis proposes two software-based techniques that substantially improve data center efficiency. First, it proposes a software-based memory disaggregation system for virtualized environments. Recent advances in high-speed networking have dramatically reduced the cost of remote memory access, and this thesis shows that, with high-performance networking hardware, virtual machines can run on remote memory without significant performance degradation. An evaluation of the proposed technique on the QEMU/KVM hypervisor shows that it improves the tail latency of remote paging by 98.2% compared with existing systems. In addition, in an experiment using a rack-scale job processing simulation, the proposed system reduces the total job processing time by 40.9% compared with existing systems. Second, it proposes an instant virtual machine migration technique that uses remote memory. Extending virtualized environments to use remote memory contributes greatly to resource utilization by itself, but performance can still degrade considerably when several applications compete for resources on one server. The instant virtual machine migration technique proposed in this thesis enables a virtual machine to be migrated by transferring only a very small amount of metadata over remote memory, and an evaluation based on virtual machines running an in-memory key-value database benchmark shows that it improves the effective service downtime by up to 92.6% compared with existing techniques.
The rising importance of big data and artificial intelligence (AI) has led to an unprecedented shift in moving local computation into the cloud. One of the key drivers behind this transformation was the exploding cost of owning and maintaining large computing systems powerful enough to process these new workloads. Customers experience a reduced cost by renting only the required resources and only when needed, while data center operators benefit from efficiency at scale. A key factor in operating a profitable data center is a high overall utilization of its resources. Due to the scale of modern data centers, small improvements in efficiency translate to significant savings in the total cost of ownership (TCO).
There are many important elements that constitute an efficient data center, such as its location, architecture, cooling system, or the employed hardware. In this thesis, we focus on software-related aspects, namely the utilization of computational and memory resources. Reports from data centers operated by Alibaba and Google show that the overall resource utilization has stagnated at a level of around 50 to 60 percent over the past decade. This low average utilization is mostly attributable to peak demand-driven resource allocation despite the high variability of modern workloads in their resource usage. In other words, data centers today lack an efficient way to put idle resources that are reserved but not used to work. In this dissertation we present RackMem, a software-based solution to address the problem of low resource utilization through two main contributions. First, we introduce a disaggregated memory system tailored for virtual environments. We observe that virtual machines can use remote memory without noticeable performance degradation under moderate memory pressure on modern networking infrastructure. We implement a specialized remote paging system for QEMU/KVM that reduces the remote paging tail latency by 98.2% in comparison to the state of the art. A job processing simulation at rack scale shows that the total makespan can be reduced by 40.9% under our memory system. While seamless disaggregated memory helps to balance memory usage across nodes, individual nodes can still suffer from overloaded resources if co-located workloads exhibit high resource usage at the same time. In a second contribution, we present a novel live migration technique for machines running on top of our remote paging system. Under this instant live migration technique, entire virtual machines can be migrated in as little as 100 milliseconds.
An evaluation with in-memory key-value database workloads shows that the presented migration technique improves the state of the art by a wide margin in all key performance metrics. The presented software-based solutions lay the technical foundations that allow data center operators to significantly improve the utilization of their computational and memory resources. As future work, we propose new job schedulers and load balancers to make full use of these new technical foundations.
Table of contents:
Chapter 1. Introduction: 1.1 Contributions of the Dissertation
Chapter 2. Background: 2.1 Resource Disaggregation; 2.2 Transparent Remote Paging; 2.3 Remote Direct Memory Access (RDMA); 2.4 Live Migration of Virtual Machines
Chapter 3. RackMem Overview: 3.1 RackMem Virtual Memory; 3.2 RackMem Distributed Virtual Storage; 3.3 RackMem Networking; 3.4 Instant VM Live Migration
Chapter 4. Virtual Memory: 4.1 Design Considerations for Achieving Low-latency; 4.2 Pagefault handling (4.2.1 Fast-path and slow-path in the pagefault handler; 4.2.2 State transition of RackVM page); 4.3 Latency Hiding Techniques; 4.4 Implementation (4.4.1 RackMem Virtual Memory Module; 4.4.2 Dynamic Rebalancing of Local Memory; 4.4.3 RackVM for Virtual Machines; 4.4.4 Running Unmodified Applications)
Chapter 5. RackMem Distributed Virtual Storage: 5.1 The distributed Storage Abstraction; 5.2 Memory Management (5.2.1 Remote memory allocation; 5.2.2 Remote memory reclamation); 5.3 Fault Tolerance (5.3.1 Fault-tolerance and Write-duplication); 5.4 Multiple Storage Support in RackMem; 5.5 Implementation (5.5.1 The Remote Memory Backend; 5.5.2 Linux Demand Paging on RackDVS)
Chapter 6. Networking: 6.1 Design of RackNet; 6.2 Implementation (6.2.1 RPC message layout; 6.2.2 RackNet RPC Implementation)
Chapter 7. Instant VM Live Migration: 7.1 Motivation (7.1.1 The need for a tailored live migration technique; 7.1.2 Software Bottlenecks; 7.1.3 Utilizing workload variability); 7.2 Design of Instant (7.2.1 Instant Region Migration); 7.3 Implementation (7.3.1 Extension of RackVM for Instant; 7.3.2 Instant region migration; 7.3.3 Pre-fetch optimizations; 7.3.4 Downtime optimizations; 7.3.5 QEMU modification for Instant)
Chapter 8. Evaluation - RackMem: 8.1 Execution Environment; 8.2 Pagefault Handler Latency; 8.3 Single Application Performance (8.3.1 Batch-oriented Applications; 8.3.2 Internal Pagesize and Performance; 8.3.3 Write-duplication overhead; 8.3.4 RackDVS slab size and performance; 8.3.5 Latency-oriented Applications; 8.3.6 Network Bandwidth Analysis; 8.3.7 Dynamic Local Memory Partitioning; 8.3.8 Rack-scale Job Processing Simulation)
Chapter 9. Evaluation - Instant VM Live Migration: 9.1 Experimental setup; 9.2 Target Applications; 9.3 Comparison targets; 9.4 Database and client setups; 9.5 Memory disaggregation scenarios; 9.6.1 Time-to-responsiveness; 9.6.2 Effective Downtime; 9.6.3 Effect of Instant optimizations
Chapter 10. Conclusion: 10.1 Future Directions
Abstract (in Korean)
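
The remote paging idea in this record can be illustrated with a generic demand-paging model: a small local page cache backed by a larger remote store. The sketch below is not RackMem's design. It assumes a plain LRU eviction policy, uses a dictionary as a stand-in for memory on another node, and the class name `PagedMemory` is invented for the example.

```python
from collections import OrderedDict


class PagedMemory:
    """Toy model of local memory backed by a remote page store (assumed LRU)."""

    def __init__(self, local_capacity: int) -> None:
        self.local = OrderedDict()   # page number -> contents, oldest first
        self.remote = {}             # stand-in for memory on another node
        self.capacity = local_capacity
        self.remote_fetches = 0      # counts slow-path page faults

    def _touch(self, page: int) -> None:
        """Make `page` resident locally, evicting the LRU page if needed."""
        if page in self.local:               # fast path: already resident
            self.local.move_to_end(page)
            return
        if page in self.remote:              # slow path: fetch from remote
            self.remote_fetches += 1
        data = self.remote.pop(page, 0)      # never-used pages start as 0
        if len(self.local) >= self.capacity:
            victim, victim_data = self.local.popitem(last=False)
            self.remote[victim] = victim_data   # write the victim back
        self.local[page] = data

    def read(self, page: int):
        self._touch(page)
        return self.local[page]

    def write(self, page: int, value) -> None:
        self._touch(page)
        self.local[page] = value


mem = PagedMemory(local_capacity=2)
mem.write(0, "a")
mem.write(1, "b")
mem.write(2, "c")        # evicts page 0 to the remote store
value = mem.read(0)      # page fault: fetches page 0, evicts page 1
```

In such a model, moving a workload to another node would only need the page-location metadata if the pages already live in the remote store, which is the intuition behind metadata-only migration as described in the abstract.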

SNU Open Repository and Archive

Performance-Aware Speculative Resource Oversubscription for Large-Scale Clusters

Author: Garraghan P
Hu C
Li C
Peng H
Sun X
Wen Z
Wo T
Xu J
Yang R
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 28/01/2020

It is a long-standing challenge to achieve a high degree of resource utilization in cluster scheduling. Resource oversubscription has become a common practice for improving resource utilization and reducing cost. However, current centralized approaches to oversubscription suffer from resource mismatch and fail to take into account other performance requirements, e.g., tail latency. In this article we present ROSE, a new resource management platform capable of conducting performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks on specific machines according to their suitability for oversubscription. Node agents of those machines can, however, avoid any excessive resource oversubscription by means of an admission control mechanism that uses multi-resource threshold control and performance-aware resource throttling. Experiments show that, in the case of mixed co-location of batch jobs and latency-sensitive LRAs, the CPU utilization and the disk utilization can reach 56.34 and 43.49 percent, respectively, while the 95th percentile of read latency in YCSB workloads increases by only 5.4 percent compared with executing the LRAs alone
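
The multi-resource threshold control mentioned in this abstract can be sketched as a simple admission check on a node agent. This is an illustrative sketch only: the function name `admit`, the resource names, and every threshold value are assumptions, and ROSE's performance-aware throttling is not modeled.

```python
from typing import Dict


def admit(usage: Dict[str, float], demand: Dict[str, float],
          thresholds: Dict[str, float]) -> bool:
    """Admit a speculative task only if every resource stays within its threshold.

    All values are utilization fractions in [0, 1]. Resources missing from
    `usage` or `demand` are treated as 0.
    """
    return all(
        usage.get(resource, 0.0) + demand.get(resource, 0.0) <= limit
        for resource, limit in thresholds.items()
    )


# Hypothetical per-node limits and current utilization.
thresholds = {"cpu": 0.8, "mem": 0.7, "disk": 0.6}
usage = {"cpu": 0.5, "mem": 0.4, "disk": 0.3}

small_task = {"cpu": 0.2, "mem": 0.1}
heavy_task = {"cpu": 0.4}

small_ok = admit(usage, small_task, thresholds)   # True: all limits hold
heavy_ok = admit(usage, heavy_task, thresholds)   # False: CPU would reach 0.9
```

Checking every resource rather than a single aggregate score means one saturated resource is enough to reject a task, which matches the intent of avoiding excessive oversubscription.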

Lancaster E-Prints

White Rose Research Online

Machine Learning Defence Mechanism for Securing the Cloud Environment

Author: L Girish
Raviprakash M L
Publication venue: Innovative Scientific Research Publisher, Railway Station Road, Gandinagar, Karnataka
Publication date: 09/03/2023

A computing paradigm known as "cloud computing" offers end users on-demand, scalable, and measurable services. Today's businesses rely heavily on computer technology for a variety of reasons, including cost savings, infrastructure, development platforms, data processing, data analytics, etc. End users can access the services of cloud service providers (CSPs) from any location at any time using a web application. The protection of the cloud infrastructure is of the highest significance, and several studies using a variety of technologies have been conducted to develop more effective defenses against cloud threats. In recent years, machine learning technology has been shown to be more effective in securing the cloud environment. To create models that can automate the process of identifying cloud threats with better accuracy than any other technology, machine learning algorithms are trained on a variety of real-world datasets. In this study, various recent research publications that used machine learning as a defense mechanism against cloud threats are reviewed

International Journal of Advanced Scientific Innovation - IJASI

Proactive Interference-aware Resource Management in Deep Learning Training Cluster

Author: Yeung Ging-Fung
Publication venue: Lancaster University
Publication date: 01/01/2022

Deep Learning (DL) applications are growing at an unprecedented rate across many domains, ranging from weather prediction and map navigation to medical imaging. However, training these deep learning models in large-scale compute clusters faces substantial challenges in terms of low cluster resource utilisation and high job waiting time. State-of-the-art DL cluster resource managers are needed to increase GPU utilisation and maximise throughput. While co-locating DL jobs within the same GPU has been shown to be an effective means towards achieving this, co-location subsequently incurs performance interference resulting in job slowdown. We argue that effective workload placement can minimise DL cluster interference at scheduling runtime by understanding the DL workload characteristics and their respective hardware resource consumption. However, existing DL cluster resource managers reserve isolated GPUs to perform online profiling to directly measure GPU utilisation and kernel patterns for each unique submitted job. Such a feedback-based reactive approach results in additional waiting times as well as reduced cluster resource efficiency and availability. In this thesis, we propose Horus: an interference-aware and prediction-based DL cluster resource manager. Through empirically studying a series of microbenchmarks and DL workload co-location combinations across heterogeneous GPU hardware, we demonstrate the negative effects of performance interference when co-locating DL workloads, and identify GPU utilisation as a general proxy metric to determine good placement decisions. From these findings, we design Horus, which, in contrast to existing approaches, proactively predicts the GPU utilisation of heterogeneous DL workloads, extrapolated from the DL model computation graph features, when performing placement decisions, removing the need for online profiling and isolated reserved GPUs.
By conducting empirical experimentation within a medium-scale DL cluster as well as a large-scale trace-driven simulation of a production system, we demonstrate that Horus improves cluster GPU utilisation, reduces cluster makespan and waiting time, and can scale to operate within hundreds of machines
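
Using predicted GPU utilisation as a proxy for placement, as this abstract describes, can be sketched with a simple rule. The rule below is an assumption made for illustration, not Horus's actual policy: it places a job on the GPU with the lowest combined utilisation after co-location and rejects GPUs that would exceed a cap. The function name `place` and all numbers are hypothetical.

```python
from typing import Dict, Optional


def place(job_util: float, gpu_utils: Dict[str, float],
          cap: float = 1.0) -> Optional[str]:
    """Pick the GPU with the lowest combined utilisation that stays under `cap`.

    `job_util` is the predicted utilisation of the incoming job and
    `gpu_utils` the current utilisation of each GPU, both as fractions.
    Returns None when no GPU can host the job.
    """
    combined = {
        gpu: current + job_util
        for gpu, current in gpu_utils.items()
        if current + job_util <= cap
    }
    if not combined:
        return None
    return min(combined, key=combined.get)


gpus = {"gpu0": 0.6, "gpu1": 0.3, "gpu2": 0.9}
chosen = place(0.35, gpus)      # "gpu1": lowest combined utilisation
rejected = place(0.8, gpus)     # None: every GPU would exceed the cap
```

Because the job's utilisation is predicted before it runs, a rule of this kind needs no profiling run on a reserved GPU, which is the property the abstract emphasises.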

Lancaster E-Prints

Edge-Facilitated Mobile Computing and Communication

Author: Zhou Pengyuan
Publication venue: 'University of Helsinki Libraries'
Publication date: 28/05/2020

The proliferation of IoT devices and rapidly developing wireless techniques boost the data volume and service demand at the edge of the Internet. Meanwhile, low-latency feedback has become a must for most popular mobile applications, e.g., Augmented Reality (AR), Virtual Reality (VR) and Connected Vehicles. To address these challenges, edge computing has emerged as a solution that extends cloud computing. This thesis studies edge computing-facilitated mobile computing and communication systems. We first propose solutions to improve edge resource utilization in general edge systems. We present a mechanism to cluster user requests based on similarity for better Content Delivery Network (CDN) performance. This mechanism works directly on the current CDN architecture and can be deployed incrementally. Then we extend the mechanism by adding a cache resource grouping algorithm, so that the system directs similar requests to the same servers and groups the servers that receive similar requests. This iterative mechanism optimizes edge utilization by concentrating resources on similar requests to achieve a higher cache hit ratio and computation efficiency. Thereafter, we present solutions for mobile edge systems, specifically for the three most promising use cases, i.e., Connected Vehicles, Mobile AR (MAR) and Smart city (traffic control). We explore the potential of edge computing in connected vehicular AR applications with real data sets. We design a lightweight edge system and data flow fit for general connected vehicular AR applications and implement a prototype. With an indoor test and real data set analysis, we find that our system can improve the performance of vehicular AR applications at reasonable cost.
To optimize the system, we formulate the problem of edge server allocation and task scheduling as a mutant multiprocessor scheduling problem and develop a two-stage edge-cloud decentralized algorithm as well as a centralized algorithm to schedule the offloading tasks on the fly. We conduct a raw road test and an extensive evaluation based on the road test results and large data sets from the real world. The results show that our system improves application performance at least twofold compared with cloud solutions. For MAR, we consider offloading tasks to multiple edge servers via multiple paths simultaneously to further improve MAR performance. We develop a fast scheduling algorithm to split the workloads among the available edge servers and show promising results with real implementations. Finally, we explore the potential of combining edge computing and machine learning techniques to realize intelligent traffic control by letting edge servers co-located with traffic lights learn the waiting traffic and adapt the light periods with reinforcement learning.
The spread of the Internet of Things and rapidly developing wireless technologies increase the volume of data and the demand for services at the edge of the Internet. At the same time, the growing requirement for low-latency feedback has become essential for the most popular mobile applications, e.g., augmented reality (AR), virtual reality (VR) and connected vehicles. Edge computing has emerged alongside cloud computing as a solution to these challenges. This dissertation studies computationally augmented mobile computing and communication systems. We first propose solutions for improving the use of edge resources in general edge systems. We present a mechanism for clustering user requests based on similarity to improve content delivery network (CDN) performance. This mechanism works directly on the current CDN architecture and can be deployed incrementally.
We then extend the mechanism by adding a cache resource grouping algorithm, so that the system directs similar requests to the same servers and groups the servers according to the requests. This iterative mechanism optimizes edge utilization by concentrating resources on similar requests to achieve a higher cache hit ratio and computation efficiency. After that, we present solutions for mobile edge systems, specifically for the three most promising use cases, i.e., connected vehicles, mobile augmented reality (MAR) and the smart city (in particular traffic control). We explore the potential of edge computing in AR applications for connected vehicles. We design a lightweight edge system and data flow that suit connected vehicular AR applications in general and implement a prototype. With an indoor test and real-world data, we find that our system can improve the performance of vehicular AR applications at reasonable cost. To optimize the system, we formulate the problem of edge server allocation and task scheduling as a variant of the multiprocessor scheduling problem and develop a two-stage decentralized algorithm suited to the edge cloud as well as a centralized algorithm for scheduling offloading tasks at run time. We carry out an experimental test in real driving and a data-driven evaluation based on the road test results and large real-world data sets. The results show that our system significantly improves application performance compared with cloud solutions. For MAR, we consider offloading tasks to multiple edge servers over multiple paths simultaneously to improve MAR performance. We develop a fast scheduling algorithm for splitting workloads among the available edge servers.
Finally, we explore the possibilities of combining edge computing and machine learning techniques to implement intelligent traffic light control with edge servers placed at traffic lights
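
The abstract above formulates edge server allocation and task scheduling as a multiprocessor scheduling problem. As a point of reference, the classic longest-processing-time-first (LPT) heuristic for that problem can be sketched as follows. This is a textbook heuristic swapped in for illustration, not the thesis's two-stage algorithm, and the task names and costs are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def schedule(task_costs: Dict[str, int],
             n_servers: int) -> Tuple[Dict[int, List[str]], int]:
    """Assign tasks to servers with the LPT heuristic.

    Tasks are taken in order of decreasing cost and each goes to the
    currently least-loaded server. Returns the assignment and the makespan.
    """
    heap = [(0, server) for server in range(n_servers)]   # (load, server id)
    assignment: Dict[int, List[str]] = {s: [] for s in range(n_servers)}
    ordered = sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True)
    for task, cost in ordered:
        load, server = heapq.heappop(heap)      # least-loaded server
        assignment[server].append(task)
        heapq.heappush(heap, (load + cost, server))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


tasks = {"a": 5, "b": 4, "c": 3, "d": 3, "e": 1}
assignment, makespan = schedule(tasks, n_servers=2)
# assignment == {0: ["a", "d"], 1: ["b", "c", "e"]}, makespan == 8
```

LPT runs in O(n log n) time and needs only the task costs, which makes it a common baseline. It ignores network paths and task arrival over time, both of which an online edge scheduler would have to handle.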

Helsingin yliopiston digitaalinen arkisto

On the use of intelligent models towards meeting the challenges of the edge mesh

Author: Anagnostopoulos Christos
Karanika Anna
Kolomvatsos Kostas
Oikonomou Panagiotis
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/07/2021

Nowadays, we are witnessing the advent of the Internet of Things (IoT), with numerous devices interacting with each other or with their environment. The huge number of devices leads to huge volumes of data that demand appropriate processing. The "legacy" approach is to rely on the Cloud, where increased computational resources can realize any desired processing. However, supporting real-time applications requires reduced latency in the provision of outcomes. Edge Computing (EC) comes as the "solver" of the latency problem. Various processing activities can be performed at EC nodes that have a direct connection with IoT devices. A number of challenges should be met before we arrive at a fully automated ecosystem where nodes can cooperate or understand their status to efficiently serve applications. In this article, we perform a survey of the relevant research activities towards the vision of Edge Mesh (EM), i.e., a "cover" of intelligence upon the EC. We present the necessary hardware and discuss research outcomes in every aspect of the functioning of EC/EM nodes. We present technologies and theories adopted for data, task, and resource management while discussing how machine learning and optimization can be adopted in the domain

Enlighten

Real-time performance diagnosis and evaluation of big data systems in cloud datacenters

Author: Demirbaga Umit
Publication venue: Newcastle University
Publication date: 01/01/2022

PhD Thesis. Modern big data processing systems are becoming very complex in terms of large scale, high concurrency and multi-tenancy. Thus, many failures and performance reductions only happen at run-time and are very difficult to capture. Moreover, some issues may only be triggered when certain components are executed. To analyze the root cause of these types of issues, we have to capture the dependencies of each component in real-time. Big data processing systems, such as Hadoop and Spark, usually work in large-scale, highly concurrent, and multi-tenant environments that can easily cause hardware and software malfunctions or failures, thereby leading to performance degradation. Several systems and methods exist to detect performance degradation in big data processing systems, perform root-cause analysis, and even overcome the issues causing such degradation. However, these solutions focus on specific problems such as stragglers and inefficient resource utilization. There is a lack of a generic and extensible framework to support the real-time diagnosis of big data systems. Performance diagnosis and prediction of big data systems are highly complex, as these frameworks are typically deployed in cloud data centers that are large-scale, highly concurrent, and follow a multi-tenant model. Several factors, including hardware heterogeneity, stochastic networks and application workloads, may impact the performance of big data systems. The current state of the art does not sufficiently address the challenge of determining complex, usually stochastic and hidden relationships between these factors.
To handle performance diagnosis and evaluation of big data systems in cloud environments, this thesis proposes multilateral research towards monitoring and performance diagnosis and prediction in cloud-based large-scale distributed systems by involving a novel combination of an effective and efficient deployment pipeline.
The key contributions of this dissertation are listed below:
• Designing a real-time big data monitoring system called SmartMonit that efficiently collects runtime system information, including computing resource utilization and job execution information, and then relates the collected information to the Execution Graph, modeled as directed acyclic graphs (DAGs).
• Developing AutoDiagn, an automated real-time diagnosis framework for big data systems that automatically detects performance degradation and inefficient resource utilization problems, while providing online detection and semi-online root-cause analysis for a big data system.
• Designing a novel root-cause analysis technique/system called BigPerf for big data systems that analyzes and characterizes the performance of big data applications by incorporating Bayesian networks to determine uncertain and complex relationships between performance-related factors.
State of the Republic of Turkey and the Turkish Ministry of National Educatio
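
An execution graph modeled as a DAG, as mentioned in the contributions above, lends itself to a critical-path calculation that shows which chain of stages bounds a job's run time. The sketch below is a generic illustration, not the thesis's SmartMonit or AutoDiagn implementation. It uses the standard-library `graphlib` module (Python 3.9 or later), and the stage names and durations are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


def critical_path(durations: Dict[str, int],
                  deps: Dict[str, Set[str]]) -> Tuple[List[str], int]:
    """Return the longest dependency chain in a DAG and its total duration.

    `deps` maps each stage to the set of stages it depends on.
    """
    finish: Dict[str, int] = {}
    previous: Dict[str, object] = {}
    # static_order() yields every stage after all of its dependencies.
    for stage in TopologicalSorter(deps).static_order():
        latest = max(deps.get(stage, ()), key=lambda d: finish[d], default=None)
        start = finish[latest] if latest is not None else 0
        finish[stage] = start + durations[stage]
        previous[stage] = latest
    end = max(finish, key=finish.get)
    path = []
    node = end
    while node is not None:          # walk back along the slowest predecessors
        path.append(node)
        node = previous[node]
    return path[::-1], finish[end]


durations = {"map": 4, "load": 2, "shuffle": 3, "join": 5}
deps = {"map": set(), "load": set(), "shuffle": {"map"},
        "join": {"shuffle", "load"}}
path, total = critical_path(durations, deps)
# path == ["map", "shuffle", "join"], total == 12
```

Stages on the critical path are the natural first candidates when diagnosing a slow job, since speeding up any other stage cannot shorten the overall run time.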

Newcastle University eTheses