
    Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters

    Cloud computing is becoming a fundamental facility of modern society. Large-scale public and private cloud datacenters, spanning millions of servers and operating as warehouse-scale computers, support most of the business of Fortune 500 companies and serve billions of users around the world. Unfortunately, industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only hurts both the operational and capital components of cost efficiency, but also becomes a scaling bottleneck because of the limited electricity that nearby utilities can deliver. Improving multi-resource efficiency across global datacenters is therefore both critical and challenging. In addition, with the commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous workloads on shared clusters, including online web services, batch processing, machine learning, streaming computing, interactive queries, and graph computation. Most of these are long-running workloads that use long-lived containers to execute tasks. We survey datacenter resource scheduling work from the last 15 years: most previous systems are designed to maximize cluster efficiency for short-lived tasks in batch processing frameworks such as Hadoop, and they are not well suited to modern long-running workloads built on microservices or on systems like Spark, Flink, Pregel, Storm, or TensorFlow. New scheduling and resource allocation approaches are urgently needed to improve efficiency in large-scale enterprise datacenters. This dissertation is the first work to define and identify the problems, challenges, and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenters. Such workloads rely on predictive scheduling to perform reservation, auto-scaling, migration, and rescheduling, which pushes us toward more intelligent scheduling techniques grounded in adequate predictive knowledge. We specify what intelligent scheduling is, which capabilities it requires, and how it can turn NP-hard online scheduling problems into tractable offline ones. We designed and implemented an intelligent cloud datacenter scheduler that automatically performs resource-to-performance modeling, predictive optimal reservation estimation, and QoS (interference)-aware predictive scheduling to maximize resource efficiency across multiple dimensions (CPU, memory, network, disk I/O) while strictly guaranteeing service level agreements (SLAs) for long-running workloads. Finally, we introduce large-scale co-location techniques for executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group, which improve cluster utilization from 10% to an average of 50%. Co-location at this scale goes far beyond scheduling alone and involves technical evolution of the IDC, network, physical datacenter topology, storage, server hardware, operating systems, and containerization. We demonstrate its effectiveness by analyzing the 2017 Alibaba public cluster trace, and we are the first to reveal a global view of the scenarios, challenges, and status of Alibaba's large-scale datacenters through data, including big promotion events such as Double 11. Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical to maximizing multi-resource efficiency in modern large-scale datacenters, especially for long-running workloads.
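The scheduler described above combines resource-to-performance modeling and predicted reservations with interference awareness across CPU, memory, network, and disk I/O. As a rough illustration only, and not the dissertation's actual algorithm, the sketch below scores candidate hosts with a simple quadratic contention penalty over those four dimensions and reserves the predicted demand on the chosen host; the class names, the penalty form, and the numbers in the usage example are all assumptions.

```python
# Hypothetical sketch of QoS (interference)-aware predictive placement across
# CPU / memory / network / disk I/O; illustrative only, not the dissertation's scheduler.
from dataclasses import dataclass

DIMS = ("cpu", "mem", "net", "disk")

@dataclass
class Host:
    name: str
    capacity: dict   # per-dimension capacity
    used: dict       # per-dimension predicted usage of already-placed workloads

@dataclass
class Workload:
    name: str
    demand: dict     # predicted peak demand per dimension (from a resource-to-performance model)

def interference_penalty(host, wl):
    # Assumed proxy: the tighter a shared dimension becomes, the higher the expected interference.
    return sum(((host.used[d] + wl.demand[d]) / host.capacity[d]) ** 2 for d in DIMS)

def place(workload, hosts):
    feasible = [h for h in hosts
                if all(h.used[d] + workload.demand[d] <= h.capacity[d] for d in DIMS)]
    if not feasible:
        return None  # a real system would fall back to rescheduling or migration here
    best = min(feasible, key=lambda h: interference_penalty(h, workload))
    for d in DIMS:
        best.used[d] += workload.demand[d]  # reserve the predicted demand
    return best

host_a = Host("a", {d: 100.0 for d in DIMS}, {d: 20.0 for d in DIMS})
host_b = Host("b", {d: 100.0 for d in DIMS}, {d: 70.0 for d in DIMS})
print(place(Workload("svc-1", {d: 25.0 for d in DIMS}), [host_a, host_b]).name)  # -> "a"
```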

    Mitigating Interference During Virtual Machine Live Migration through Storage Offloading

    Today's cloud landscape has evolved computing infrastructure into a dynamic, high-utilization, service-oriented paradigm. This shift has enabled the commoditization of large-scale storage and distributed computation, allowing engineers to tackle previously untenable problems without large upfront investment. A key enabler of flexibility in the cloud is the ability to transfer running virtual machines across subnets or even datacenters using live migration. However, live migration can be a costly process, one that has the potential to interfere with other applications not involved in the migration. This work investigates storage interference through experimentation with real-world systems and well-established benchmarks. To address migration interference in general, a buffering technique is presented that offloads the migration's read traffic, eliminating interference in the majority of scenarios.
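As a minimal sketch of the general idea only, and not the paper's implementation, the buffer below stages migration disk blocks while primary storage is otherwise idle, so the migration's reads are served from the staging buffer rather than competing with co-located applications; the class, its parameters, and the idle-detection callback are hypothetical.

```python
# Hypothetical read-offloading buffer for live migration; illustrative only.
import collections

class MigrationReadBuffer:
    def __init__(self, read_block, staging_capacity_blocks):
        self.read_block = read_block             # function: block_id -> bytes (hits primary storage)
        self.capacity = staging_capacity_blocks
        self.staged = collections.OrderedDict()  # block_id -> staged copy

    def prefetch(self, block_ids, io_is_idle):
        # Fill the staging buffer only while primary storage is idle, so prefetching
        # itself does not interfere with foreground I/O.
        for b in block_ids:
            if len(self.staged) >= self.capacity or not io_is_idle():
                break
            if b not in self.staged:
                self.staged[b] = self.read_block(b)

    def read_for_migration(self, block_id):
        # The migration consumes staged copies when available; only misses touch
        # primary storage and risk interfering with co-located applications.
        if block_id in self.staged:
            return self.staged.pop(block_id)
        return self.read_block(block_id)

buf = MigrationReadBuffer(read_block=lambda b: b"\0" * 4096, staging_capacity_blocks=1024)
buf.prefetch(range(256), io_is_idle=lambda: True)
data = buf.read_for_migration(7)   # served from the staging buffer, not primary storage
```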

    HSM: a hybrid slowdown model for multitasking GPUs

    Graphics Processing Units (GPUs) are increasingly widely used in the cloud to accelerate compute-heavy tasks. However, GPU-compute applications stress the GPU architecture in different ways, leading to suboptimal resource utilization when a single GPU is used to run a single application. One solution is to use the GPU in a multitasking fashion to improve utilization. Unfortunately, multitasking leads to destructive interference between co-running applications, which causes fairness issues and Quality-of-Service (QoS) violations. We propose the Hybrid Slowdown Model (HSM) to dynamically and accurately predict application slowdown due to interference. HSM overcomes the low accuracy of prior white-box models, and the training and implementation overheads of pure black-box models, with a hybrid approach. More specifically, the white-box component of HSM builds upon the fundamental insight that effective bandwidth utilization is proportional to DRAM row buffer hit rate, and the black-box component of HSM uses linear regression to relate row buffer hit rate to performance. HSM accurately predicts application slowdown with an average error of 6.8%, a significant improvement over the current state-of-the-art. In addition, we use HSM to guide various resource management schemes in multitasking GPUs: HSM-Fair significantly improves fairness (by 1.59x on average) compared to even partitioning, whereas HSM-QoS improves system throughput (by 18.9% on average) compared to proportional SM partitioning while maintaining the QoS target for the high-priority application in challenging mixed memory/compute-bound multi-program workloads.
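The abstract states HSM's two ingredients explicitly: effective bandwidth utilization is proportional to the DRAM row buffer hit rate, and linear regression relates row buffer hit rate to performance. The sketch below illustrates that black-box step with ordinary least squares and a toy slowdown estimate; the sample data, variable names, and the slowdown formula are illustrative assumptions rather than HSM's actual implementation.

```python
# Toy version of the black-box component: fit performance ~ row buffer hit rate,
# then estimate slowdown as predicted isolated performance over measured shared performance.
def fit_linear(xs, ys):
    # Ordinary least squares for y = a * x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Made-up profiling samples of (row buffer hit rate, normalized performance) for one application.
hit_rates   = [0.35, 0.50, 0.62, 0.71, 0.80]
performance = [0.40, 0.55, 0.66, 0.74, 0.83]
a, b = fit_linear(hit_rates, performance)

def predicted_slowdown(hit_rate_isolated, perf_shared):
    perf_isolated = a * hit_rate_isolated + b
    return perf_isolated / perf_shared

print(predicted_slowdown(hit_rate_isolated=0.80, perf_shared=0.55))  # roughly 1.5x slowdown
```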

    Artificial intelligence driven anomaly detection for big data systems

    The main goal of this thesis is to contribute to research on automated performance anomaly detection and interference prediction by implementing Artificial Intelligence (AI) solutions for complex distributed systems, especially Big Data platforms in cloud computing environments. Late detection and manual resolution of performance anomalies and system interference in Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose AI-based methodologies for anomaly detection and interference prediction tailored to Big Data and containerized batch platforms, in order to better analyze system performance and effectively utilize computing resources within cloud environments. New, precise, and efficient performance management methods are therefore key to handling performance anomalies and interference and to improving the efficiency of data center resources. The first part of this thesis contributes to performance anomaly detection for in-memory Big Data platforms. We examine the performance of Big Data platforms and justify our choice of the in-memory Apache Spark platform. An artificial neural network-driven methodology is proposed to detect and classify performance anomalies for batch workloads based on RDD characteristics and operating system monitoring metrics. Our method is evaluated against other popular machine learning (ML) algorithms on four different monitoring datasets. The results show that our proposed method outperforms the other ML methods, typically achieving 98-99% F-scores. Moreover, we show that a random start instant, a random duration, and overlapped anomalies do not significantly impact the performance of our proposed methodology. The second contribution addresses the challenge of anomaly identification within an in-memory streaming Big Data platform by investigating agile hybrid learning techniques. We develop TRACK (neural neTwoRk Anomaly deteCtion in sparK) and TRACK-Plus, two methods to efficiently train a class of machine learning models for performance anomaly detection using a fixed number of experiments. Our approach uses artificial neural networks with Bayesian Optimization (BO) to find the training dataset size and configuration parameters that efficiently train the anomaly detection model to high accuracy. The objective is to accelerate the search for the training dataset size, optimize neural network configurations, and improve the performance of anomaly classification. A validation based on several datasets from a real Apache Spark Streaming system demonstrates that the proposed methodology can efficiently identify performance anomalies, near-optimal configuration parameters, and a near-optimal training dataset size while reducing the number of experiments by up to 75% compared with naive anomaly detection training. The last contribution overcomes the challenges of predicting the completion time of containerized batch jobs and proactively avoiding performance interference by introducing an automated prediction solution that estimates interference among co-located batch jobs within the same computing environment. An AI-driven model is implemented to predict interference among batch jobs before it occurs within the system. Our interference detection model can estimate, and help alleviate, the task slowdown caused by interference. This model assists system operators in making accurate decisions to optimize job placement. Our model is agnostic to the business logic internal to each job; instead, it is learned from system performance data by applying artificial neural networks to predict the completion time of batch jobs within cloud environments. We compare our model with three baseline models (a queueing-theoretic model, operational analysis, and an empirical method) on historical measurements of job completion time and CPU run-queue size (i.e., the number of active threads in the system). The proposed model captures multithreading, operating system scheduling, sleeping time, and job priorities. A validation based on 4,500 experiments with the DaCapo benchmarking suite confirms the predictive efficiency and capabilities of the proposed model, achieving up to 10% MAPE compared with the other models.
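TRACK and TRACK-Plus, as described above, wrap neural network training in a Bayesian Optimization loop that searches for a training dataset size and configuration reaching high accuracy with few experiments. The sketch below shows that style of search using scikit-optimize as a generic BO library; the search space, the penalty term, and the evaluate_anomaly_detector stand-in are assumptions for illustration, not the thesis' code.

```python
# Hedged sketch of BO over training-set size and network hyperparameters; the
# evaluation function is a synthetic stand-in so the example runs end to end.
import math, random
from skopt import gp_minimize
from skopt.space import Integer, Real

def evaluate_anomaly_detector(train_size, hidden_units, learning_rate):
    # Stand-in for "train the anomaly-detection network on `train_size` monitoring
    # samples with these hyperparameters and report its F-score".
    random.seed(int(train_size) + int(hidden_units))
    saturation = 1.0 - math.exp(-train_size / 15_000)
    return min(0.99, 0.60 + 0.35 * saturation + random.uniform(-0.02, 0.02))

space = [
    Integer(1_000, 50_000, name="train_size"),
    Integer(16, 256, name="hidden_units"),
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
]

def objective(params):
    train_size, hidden_units, learning_rate = params
    f_score = evaluate_anomaly_detector(train_size, hidden_units, learning_rate)
    # Slightly penalize large training sets so the optimizer prefers the smallest
    # size that still reaches high accuracy, mirroring the goal of fewer experiments.
    return (1.0 - f_score) + 1e-7 * train_size

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best (train_size, hidden_units, learning_rate):", result.x)
```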

    Autonomous management of cost, performance, and resource uncertainty for migration of applications to infrastructure-as-a-service (IaaS) clouds

    Infrastructure-as-a-Service (IaaS) clouds abstract physical hardware to provide computing resources on demand as a software service. This abstraction leads to the simplistic view that computing resources are homogeneous and that infinite scaling potential exists to easily resolve all performance challenges. In practice, however, adoption of cloud computing presents many resource management challenges, forcing practitioners to balance cost and performance tradeoffs to successfully migrate applications. These challenges can be broken down into three primary concerns: determining what, where, and when infrastructure should be provisioned. In this dissertation we address these challenges, including: (1) performance variance from resource heterogeneity, virtualization overhead, and the plethora of vaguely defined resource types; (2) virtual machine (VM) placement, component composition, service isolation, provisioning variation, and resource contention under multitenancy; and (3) dynamic scaling and resource elasticity to alleviate performance bottlenecks. These resource management challenges are addressed through the development and evaluation of autonomous algorithms and methodologies that result in demonstrably better performance and lower monetary costs for application deployments to both public and private IaaS clouds. This dissertation makes three primary contributions to advance cloud infrastructure management for application hosting. First, it includes the design of resource utilization models, based on step-wise multiple linear regression and artificial neural networks, that support prediction of better-performing component compositions. The total number of possible compositions is governed by the Bell number, which results in a combinatorially explosive search space. Second, it includes algorithms to improve VM placement so as to mitigate resource heterogeneity and contention, using a load-aware VM placement scheduler and autonomous detection of under-performing VMs to spur replacement. Third, it describes a workload cost prediction methodology that harnesses regression models and heuristics to support determination of infrastructure alternatives that reduce hosting costs. Our methodology achieves infrastructure predictions with an average mean absolute error of only 0.3125 VMs across multiple workloads.
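The combinatorial explosion mentioned above comes from counting the ways n application components can be partitioned into service compositions, which is the Bell number B(n). A short Bell-triangle sketch makes that growth concrete.

```python
# Bell numbers via the Bell triangle: B(n) counts the partitions of n components.
def bell(n: int) -> int:
    row = [1]                      # starts from B(0) = 1
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

for n in (4, 8, 12, 16):
    print(n, "components ->", bell(n), "possible compositions")
```

Even 16 components already admit more than ten billion compositions, which is why the dissertation leans on regression models rather than exhaustive search.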

    Dynamic Optimization Techniques for Resource-Efficient Execution of Distributed Machine Learning

    Doctoral dissertation, Department of Computer Science and Engineering, College of Engineering, Seoul National University, February 2020 (advisor: Byung-Gon Chun). Machine Learning (ML) systems are widely used to extract insights from data. Ever-increasing dataset sizes and model complexity have given rise to many efforts towards efficient distributed machine learning systems. One popular approach to supporting large-scale data and complicated models is the parameter server (PS) approach. In this approach, a training job runs with distributed worker and server tasks, where workers iteratively compute gradients to update the global model parameters kept in servers. To improve PS system performance, this dissertation proposes two solutions that automatically optimize resource efficiency and system performance. First, we propose a solution that optimizes the resource configuration and workload partitioning of distributed ML training on a PS system. To find the best configuration, we build an Optimizer based on a cost model that works with online metrics. To apply the Optimizer's decisions efficiently, we design our runtime to be elastic, performing reconfiguration in the background with minimal overhead. The second solution optimizes the scheduling of resources and tasks of multiple ML training jobs in a shared cluster. Specifically, we co-locate jobs with complementary resource use to increase resource utilization, while executing their tasks in fine-grained units to avoid resource contention. To alleviate memory pressure from co-located jobs, we enable dynamic spill/reload of data, which adaptively changes the ratio of data kept on disk and in memory. We build a working system that implements both approaches; the two solutions share a runtime that can dynamically migrate jobs between machines and reallocate machine resources. We evaluate the system with popular ML applications to verify the effectiveness of our solutions.
    Contents: Chapter 1, Introduction (Distributed Machine Learning on Parameter Servers; Automating System Configuration of Distributed Machine Learning; Scheduling of Multiple Distributed Machine Learning Jobs; Contributions; Dissertation Structure). Chapter 2, Background. Chapter 3, Automating System Configuration of Distributed Machine Learning (System Configuration Challenges; Finding Good System Configuration: Cost Model, Cost Formulation, Optimization; Cruise: Optimizer, Elastic Runtime; Evaluation: Experimental Setup, Finding Baselines with Grid Search, Optimization in the Homogeneous Environment, Utilizing Opportunistic Resources, Optimization in the Heterogeneous Environment, Reconfiguration Speed; Related Work; Summary). Chapter 4, A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs (Resource Under-utilization Problems in PS ML Training; Harmony Overview; Multiplexing ML Jobs: Fine-grained Execution with Subtasks, Dynamic Grouping of Jobs, Dynamic Data Reloading; Evaluation: Baselines, Experimental Setup, Performance Comparison, Performance Breakdown, Workload Sensitivity Analysis, Accuracy of the Performance Model, Performance and Scalability of the Scheduling Algorithm, Dynamic Data Reloading; Discussion; Related Work; Summary). Chapter 5, Conclusion (Summary; Future Work: Other Communication Architecture Support, Deep Learning & GPU Resource Support). Abstract in Korean.

    Real-Time Virtualization and Cloud Computing

    In recent years, we have observed three major trends in the development of complex real-time embedded systems. First, to reduce cost and enhance flexibility, multiple systems are sharing common computing platforms via virtualization technology, instead of being deployed separately on physically isolated hosts. Second, multi-core processors are increasingly being used in real-time systems. Third, developers are exploring the possibilities of deploying real-time applications as virtual machines in a public cloud. The integration of real-time systems as virtual machines (VMs) atop common multi-core platforms in a public cloud raises significant new research challenges in meeting the real-time latency requirements of applications. To address the challenges of running real-time VMs in the cloud, we first present RT-Xen, a novel real-time scheduling framework within the popular Xen hypervisor. We start with single-core scheduling in RT-Xen and present the first work that empirically studies and compares different real-time scheduling schemes on the same platform. We then introduce RT-Xen 2.0, which focuses on multi-core scheduling and spans multiple design spaces, including priority schemes, server schemes, and scheduling policies. Experimental results demonstrate that, when combined with compositional scheduling theory, RT-Xen can deliver real-time performance to an application running in a VM, whereas the default credit scheduler cannot. After that, we present RT-OpenStack, a cloud management system designed to support co-hosting real-time and non-real-time VMs in a cloud. RT-OpenStack studies the problem of running real-time VMs together with non-real-time VMs in a public cloud. Leveraging the resource interface and real-time scheduling provided by RT-Xen, RT-OpenStack provides real-time performance guarantees to real-time VMs, while achieving high resource utilization by allowing non-real-time VMs to share the remaining CPU resources through a novel VM-to-host mapping scheme. Finally, we present RTCA, a real-time communication architecture for VMs sharing the same host, which maintains low latency for high-priority inter-domain communication (IDC) traffic in the face of low-priority IDC traffic.
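As a hedged sketch of the co-hosting idea, and not RT-OpenStack's actual VM-to-host mapping scheme, the code below packs real-time VMs first-fit by reserved CPU bandwidth under a per-host cap and lets best-effort VMs share whatever capacity remains; the cap value, data structures, and example numbers are illustrative assumptions.

```python
# Illustrative co-hosting of real-time (RT) and best-effort VMs on shared hosts.
from dataclasses import dataclass, field

@dataclass
class HostCpu:
    name: str
    cores: int
    rt_reserved: float = 0.0          # total CPU bandwidth reserved by RT VMs
    vms: list = field(default_factory=list)

def place_rt_vm(hosts, vm_name, utilization, rt_cap=0.7):
    # First fit: reserve bandwidth for the RT VM, capping total RT reservation per host
    # so the hypervisor's real-time scheduler retains headroom (cap value is assumed).
    for h in hosts:
        if h.rt_reserved + utilization <= rt_cap * h.cores:
            h.rt_reserved += utilization
            h.vms.append((vm_name, "rt", utilization))
            return h.name
    return None

def place_best_effort_vm(hosts, vm_name):
    # Non-RT VMs go to the host with the most unreserved CPU and simply share it.
    h = max(hosts, key=lambda h: h.cores - h.rt_reserved)
    h.vms.append((vm_name, "best-effort", None))
    return h.name

hosts = [HostCpu("h1", cores=4), HostCpu("h2", cores=4)]
print(place_rt_vm(hosts, "rt-vm1", utilization=1.5))   # -> h1
print(place_best_effort_vm(hosts, "batch-vm1"))        # -> h2 (most unreserved CPU)
```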