27 research outputs found

    Bankrupt Covert Channel: Turning Network Predictability into Vulnerability

    Recent years have seen a surge in the number of data leaks despite aggressive information-containment measures deployed by cloud providers. When attackers acquire sensitive data in a secure cloud environment, covert communication channels are a key tool for exfiltrating the data to the outside world. The bulk of prior work focused on covert channels within a single CPU, which require the spy (transmitter) and the receiver to share the CPU; this might be difficult to achieve in a cloud environment with hundreds or thousands of machines. This work presents Bankrupt, a high-rate, highly clandestine channel that enables covert communication between the spy and the receiver running on different nodes in an RDMA network. In Bankrupt, the spy communicates with the receiver by issuing RDMA network packets to a private memory region allocated to it on a different machine (an intermediary). The receiver similarly allocates a separate memory region on the same intermediary, also accessed via RDMA. By steering RDMA packets to a specific set of remote memory addresses, the spy causes deep queuing at one memory bank, which is the finest addressable internal unit of main memory. This exposes a timing channel that the receiver can listen on by issuing probe packets to addresses mapped to the same bank but in its own private memory region. The Bankrupt channel delivers 74 Kb/s throughput in CloudLab's public cloud while remaining undetectable to existing monitoring capabilities, such as CPU and NIC performance counters. Comment: Published in WOOT 2020 co-located with USENIX Security 202

    Perph: A Workload Co-location Agent with Online Performance Prediction and Resource Inference

    Striking a balance between improved cluster utilization and guaranteed application QoS is a long-standing research problem in cluster resource management. The majority of current solutions require a large number of sandboxed experiments for different workload combinations and use them to predict possible interference for incoming workloads. This results in non-negligible time complexity that severely restricts their applicability to complex workload co-locations. Pure offline profiling may also lead to a model aging problem that drastically degrades model precision. In this paper, we present Perph, a per-node runtime agent that decouples ML-based performance prediction and resource inference from the centralized scheduler. We exploit the sensitivity of long-running applications to multiple resources to establish a relationship between resource allocation and the resulting performance. We use Online Gradient Boost Regression Tree (OGBRT) to enable continuous model evolution. Once performance degradation is detected, resource inference is conducted to work out a proper slice of resources that will be reallocated to recover the target performance. Integration with the Node Manager (NM) of Apache YARN shows that the throughput of a Kafka data-streaming application is 2.0x and 1.82x that of the isolation execution schemes in native YARN and the pure cgroup cpu subsystem, respectively. In TPC-C benchmarking, throughput also improves by 35% and 23% against native YARN and the cgroup cpu subsystem, respectively.

    Remote Memory for Virtualized Environments

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Bernhard Egger.
Demand for cloud environments has grown sharply with the recent popularity of artificial intelligence and big data computation, because users no longer need to keep huge computing resources running at all times and pay only for the amount of computation they want, when they want it. With the adoption of cloud computing, customers can greatly reduce the cost of maintaining servers, and service providers can maximize the utilization of their computing resources. In this scenario, improving the utilization of computing resources becomes an important goal for the data center. Given the rapidly growing scale of data centers, even a small efficiency improvement can create enormous economic value. Data center efficiency depends on many factors, such as site selection, structural design, cooling systems, and hardware configuration; this thesis deals specifically with the design and implementation of software that manages computational and memory resources. It proposes two software-based techniques that substantially improve data center efficiency. First, it proposes a software-based memory disaggregation system for virtualized environments. Recent advances in high-speed networks have dramatically reduced the cost of remote memory access, and the thesis shows that, with high-performance networking hardware, virtual machines can run on remote memory without a large performance loss. An evaluation on the QEMU/KVM hypervisor shows that the proposed technique improves the tail latency of remote paging by 98.2% compared with existing systems. In addition, a rack-scale job processing simulation shows that the proposed system can reduce the total job processing time by 40.9% compared with existing systems. Second, it proposes an instant virtual machine migration technique that uses remote memory. Extending virtualized environments to use remote memory contributes greatly to resource utilization on its own, but performance can still degrade significantly when several applications compete for resources on one server. The proposed instant migration technique migrates a virtual machine running on remote memory by transferring only a very small amount of metadata. In an evaluation with virtual machines running in-memory key-value database benchmarks, it improves the effective service downtime by up to 92.6% compared with existing techniques.
The rising importance of big data and artificial intelligence (AI) has led to an unprecedented shift in moving local computation into the cloud. One of the key drivers behind this transformation was the exploding cost of owning and maintaining large computing systems powerful enough to process these new workloads. Customers experience a reduced cost by renting only the required resources and only when needed, while data center operators benefit from efficiency at scale. A key factor in operating a profitable data center is a high overall utilization of its resources. Due to the scale of modern data centers, small improvements in efficiency translate to significant savings in the total cost of ownership (TCO). There are many important elements that constitute an efficient data center such as its location, architecture, cooling system, or the employed hardware. In this thesis, we focus on software-related aspects, namely the utilization of computational and memory resources. 
Reports from data centers operated by Alibaba and Google show that the overall resource utilization has stagnated at a level of around 50 to 60 percent over the past decade. This low average utilization is mostly attributable to peak demand-driven resource allocation despite the high variability of modern workloads in their resource usage. In other words, data centers today lack an efficient way to put idle resources that are reserved but not used to work. In this dissertation we present RackMem, a software-based solution to address the problem of low resource utilization through two main contributions. First, we introduce a disaggregated memory system tailored for virtual environments. We observe that virtual machines can use remote memory without noticeable performance degradation under moderate memory pressure on modern networking infrastructure. We implement a specialized remote paging system for QEMU/KVM that reduces the remote paging tail-latency by 98.2% in comparison to the state of the art. A job processing simulation at rack-scale shows that the total makespan can be reduced by 40.9% under our memory system. While seamless disaggregated memory helps to balance memory usage across nodes, individual nodes can still suffer overloaded resources if co-located workloads exhibit high resource usage at the same time. In a second contribution, we present a novel live migration technique for machines running on top of our remote paging system. Under this instant live migration technique, entire virtual machines can be migrated in as little as 100 milliseconds. An evaluation with in-memory key-value database workloads shows that the presented migration technique improves the state of the art by a wide margin in all key performance metrics. The presented software-based solutions lay the technical foundations that allow data center operators to significantly improve the utilization of their computational and memory resources. 
As future work, we propose new job schedulers and load balancers to make full use of these new technical foundations.
Chapter 1. Introduction: 1.1 Contributions of the Dissertation
Chapter 2. Background: 2.1 Resource Disaggregation; 2.2 Transparent Remote Paging; 2.3 Remote Direct Memory Access (RDMA); 2.4 Live Migration of Virtual Machines
Chapter 3. RackMem Overview: 3.1 RackMem Virtual Memory; 3.2 RackMem Distributed Virtual Storage; 3.3 RackMem Networking; 3.4 Instant VM Live Migration
Chapter 4. Virtual Memory: 4.1 Design Considerations for Achieving Low-latency; 4.2 Pagefault handling (4.2.1 Fast-path and slow-path in the pagefault handler; 4.2.2 State transition of RackVM page); 4.3 Latency Hiding Techniques; 4.4 Implementation (4.4.1 RackMem Virtual Memory Module; 4.4.2 Dynamic Rebalancing of Local Memory; 4.4.3 RackVM for Virtual Machines; 4.4.4 Running Unmodified Applications)
Chapter 5. RackMem Distributed Virtual Storage: 5.1 The distributed Storage Abstraction; 5.2 Memory Management (5.2.1 Remote memory allocation; 5.2.2 Remote memory reclamation); 5.3 Fault Tolerance (5.3.1 Fault-tolerance and Write-duplication); 5.4 Multiple Storage Support in RackMem; 5.5 Implementation (5.5.1 The Remote Memory Backend; 5.5.2 Linux Demand Paging on RackDVS)
Chapter 6. Networking: 6.1 Design of RackNet; 6.2 Implementation (6.2.1 RPC message layout; 6.2.2 RackNet RPC Implementation)
Chapter 7. Instant VM Live Migration: 7.1 Motivation (7.1.1 The need for a tailored live migration technique; 7.1.2 Software Bottlenecks; 7.1.3 Utilizing workload variability); 7.2 Design of Instant (7.2.1 Instant Region Migration); 7.3 Implementation (7.3.1 Extension of RackVM for Instant; 7.3.2 Instant region migration; 7.3.3 Pre-fetch optimizations; 7.3.4 Downtime optimizations; 7.3.5 QEMU modification for Instant)
Chapter 8. Evaluation - RackMem: 8.1 Execution Environment; 8.2 Pagefault Handler Latency; 8.3 Single Application Performance (8.3.1 Batch-oriented Applications; 8.3.2 Internal Pagesize and Performance; 8.3.3 Write-duplication overhead; 8.3.4 RackDVS slab size and performance; 8.3.5 Latency-oriented Applications; 8.3.6 Network Bandwidth Analysis; 8.3.7 Dynamic Local Memory Partitioning; 8.3.8 Rack-scale Job Processing Simulation)
Chapter 9. Evaluation - Instant VM Live Migration: 9.1 Experimental setup; 9.2 Target Applications; 9.3 Comparison targets; 9.4 Database and client setups; 9.5 Memory disaggregation scenarios; 9.6.1 Time-to-responsiveness; 9.6.2 Effective Downtime; 9.6.3 Effect of Instant optimizations
Chapter 10. Conclusion: 10.1 Future Directions
Abstract (Korean)

    Performance-Aware Speculative Resource Oversubscription for Large-Scale Clusters

    It is a long-standing challenge to achieve a high degree of resource utilization in cluster scheduling. Resource oversubscription has become a common practice for improving resource utilization and reducing cost. However, current centralized approaches to oversubscription suffer from resource mismatch and fail to take into account other performance requirements, e.g., tail latency. In this article we present ROSE, a new resource management platform capable of conducting performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks on specific machines according to their suitability for oversubscription. Node agents on those machines can, however, avoid excessive resource oversubscription by means of an admission control mechanism that uses multi-resource threshold control and performance-aware resource throttling. Experiments show that with mixed co-location of batch jobs and latency-sensitive LRAs, CPU utilization and disk utilization can reach 56.34 and 43.49 percent, respectively, while the 95th percentile of read latency in YCSB workloads increases by only 5.4 percent compared with executing the LRAs alone.
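
    The node-side admission control described in this abstract can be illustrated with a minimal Python sketch. It is not ROSE's implementation: the `NodeState` fields, the threshold constants, and the function name `admit_speculative_task` are all hypothetical, chosen only to show how a multi-resource threshold check combined with a latency guard might reject a speculative task.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    cpu_util: float    # fraction of CPU currently in use (0.0 to 1.0)
    disk_util: float   # fraction of disk bandwidth currently in use (0.0 to 1.0)
    lra_p95_ms: float  # observed 95th-percentile latency of the co-located LRA


# Hypothetical limits, for illustration only (not taken from the paper).
CPU_LIMIT = 0.80
DISK_LIMIT = 0.70
LATENCY_TARGET_MS = 10.0


def admit_speculative_task(node: NodeState, task_cpu: float, task_disk: float) -> bool:
    """Admit a task only if the LRA meets its latency target and
    every resource stays under its threshold after admission."""
    if node.lra_p95_ms > LATENCY_TARGET_MS:
        return False  # performance-aware guard: the LRA is already suffering
    if node.cpu_util + task_cpu > CPU_LIMIT:
        return False  # CPU threshold would be exceeded
    if node.disk_util + task_disk > DISK_LIMIT:
        return False  # disk threshold would be exceeded
    return True
```

    A real node agent would presumably also throttle tasks that are already running; this sketch covers only the admission decision.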

    Machine Learning Defence Mechanism for Securing the Cloud Environment

    A computing paradigm known as "cloud computing" offers end users on-demand, scalable, and measurable services. Today's businesses rely heavily on computer technology for a variety of reasons, including cost savings, infrastructure, development platforms, data processing, data analytics, etc. End users can access the services of cloud service providers (CSPs) from any location at any time using a web application. The protection of the cloud infrastructure is of the highest significance, and several studies using a variety of technologies have been conducted to develop more effective defenses against cloud threats. In recent years, machine learning technology has been shown to be more effective in securing the cloud environment. To create models that can automate the process of identifying cloud threats with better accuracy than any other technology, machine learning algorithms are trained on a variety of real-world datasets. In this study, various recent research publications that used machine learning as a defense mechanism against cloud threats are reviewed.

    Proactive Interference-aware Resource Management in Deep Learning Training Cluster

    Deep Learning (DL) applications are growing at an unprecedented rate across many domains, from weather prediction and map navigation to medical imaging. However, training these deep learning models in large-scale compute clusters faces substantial challenges in terms of low cluster resource utilisation and high job waiting time. State-of-the-art DL cluster resource managers are needed to increase GPU utilisation and maximise throughput. While co-locating DL jobs within the same GPU has been shown to be an effective means towards achieving this, co-location subsequently incurs performance interference resulting in job slowdown. We argue that effective workload placement can minimise DL cluster interference at scheduling runtime by understanding the DL workload characteristics and their respective hardware resource consumption. However, existing DL cluster resource managers reserve isolated GPUs to perform online profiling to directly measure GPU utilisation and kernel patterns for each unique submitted job. Such a feedback-based reactive approach results in additional waiting times as well as reduced cluster resource efficiency and availability. In this thesis, we propose Horus: an interference-aware and prediction-based DL cluster resource manager. Through empirically studying a series of microbenchmarks and DL workload co-location combinations across heterogeneous GPU hardware, we demonstrate the negative effects of performance interference when co-locating DL workloads, and identify GPU utilisation as a general proxy metric to determine good placement decisions. From these findings, we design Horus, which, in contrast to existing approaches, proactively predicts the GPU utilisation of heterogeneous DL workloads, extrapolated from DL model computation graph features, when performing placement decisions, removing the need for online profiling and isolated reserved GPUs. 
By conducting empirical experimentation within a medium-scale DL cluster as well as a large-scale trace-driven simulation of a production system, we demonstrate Horus improves cluster GPU utilisation, reduces cluster makespan and waiting time, and can scale to operate within hundreds of machines
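
    As a rough illustration of using predicted GPU utilisation as a placement proxy, the Python sketch below picks a GPU for an incoming job. It is not Horus's scheduler: the function name `place_job`, the `cap` parameter, and the least-combined-utilisation policy are assumptions made for this example, and the predicted utilisation is simply taken as an input rather than derived from a computation graph.

```python
from typing import Dict, Optional


def place_job(predicted_util: float,
              gpu_loads: Dict[str, float],
              cap: float = 1.0) -> Optional[str]:
    """Return the GPU whose combined utilisation after co-location is
    lowest, or None if every GPU would exceed the cap (hypothetical policy)."""
    best_gpu = None
    best_total = None
    for gpu, load in gpu_loads.items():
        total = load + predicted_util
        if total > cap:
            continue  # co-locating here would oversubscribe the GPU
        if best_total is None or total < best_total:
            best_gpu, best_total = gpu, total
    return best_gpu
```

    For example, `place_job(0.3, {"gpu0": 0.6, "gpu1": 0.2})` selects `"gpu1"`, since the combined utilisation there is lowest.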

    Edge-Facilitated Mobile Computing and Communication

    The proliferation of IoT devices and rapidly developing wireless techniques boost the data volume and service demand at the edge of the Internet. Meanwhile, low-latency feedback has become a must for most popular mobile applications, e.g., Augmented Reality (AR), Virtual Reality (VR) and Connected Vehicles. To address these challenges, edge computing has emerged as an extension to cloud computing. This thesis studies edge computing-facilitated mobile computing and communication systems. We first propose solutions to improve edge resource utilization in general edge systems. We present a mechanism to cluster user requests based on similarity for better Content Delivery Network (CDN) performance. This mechanism works directly on the current CDN architecture and can be deployed incrementally. Then we extend the mechanism by adding a cache resource grouping algorithm, so that the system directs similar requests to the same servers and groups those servers which receive similar requests. This iterative mechanism optimizes edge utilization by concentrating resources on similar requests to achieve a higher cache hit ratio and computation efficiency. Thereafter, we present solutions for mobile edge systems, specifically for the three most promising use cases, i.e., Connected Vehicles, Mobile AR (MAR) and Smart city (traffic control). We explore the potential of edge computing in connected vehicular AR applications with real data sets. We design a lightweight edge system and data flow fit for general connected vehicular AR applications and implement a prototype. With an indoor test and real data set analysis, we find that our system can improve the performance of vehicular AR applications at reasonable cost. 
To optimize the system, we formulate the problem of edge server allocation and task scheduling as a mutant multiprocessor scheduling problem and develop a two-stage edge-cloud decentralized algorithm as well as a centralized algorithm to schedule the offloading tasks on the fly. We conduct a real road test and an extensive evaluation based on the road test results and large real-world data sets. The results show that our system at least doubles application performance compared with cloud solutions. For MAR, we consider offloading tasks to multiple edge servers via multiple paths simultaneously to further improve MAR performance. We develop a fast scheduling algorithm to split the workloads among the available edge servers and show promising results with real implementations. Finally, we explore the potential of combining edge computing and machine learning techniques to realize intelligent traffic control by letting edge servers co-located with traffic lights learn the waiting traffic and adapt the light periods with reinforcement learning.
The spread of the Internet of Things and rapidly developing wireless technologies increase the volume of data and the demand for services at the edge of the Internet. At the same time, low-latency feedback has become essential for the most popular mobile applications, e.g., augmented reality (AR), virtual reality (VR) and connected vehicles. Edge computing has emerged alongside cloud computing as a solution to these challenges. This dissertation studies computationally augmented mobile computing and communication systems. We first propose solutions for improving the use of edge resources in general edge systems. We present a mechanism that clusters user requests by similarity to improve the performance of a content delivery network (CDN). This mechanism works directly on current CDN architectures and can be deployed incrementally. We then extend the mechanism with a cache resource grouping algorithm, so that the system directs similar requests to the same servers and groups the servers according to the requests. This iterative mechanism optimizes edge utilization by concentrating resources on similar requests to achieve a higher cache hit ratio and computational efficiency. After that, we present solutions for mobile edge systems, specifically for the three most promising use cases: connected vehicles, mobile augmented reality (MAR) and the smart city (in particular traffic control). We explore the potential of edge computing in AR applications for connected vehicles. We design a lightweight edge system and data flow that suit connected-vehicle AR applications in general, and we implement a prototype. Using an indoor test and real-world data, we find that our system can improve the performance of vehicular AR applications at reasonable cost. To optimize the system, we formulate edge server allocation and task scheduling as a variant of the multiprocessor scheduling problem and develop a two-stage decentralized edge-cloud algorithm as well as a centralized algorithm for scheduling offloading tasks at run time. We carry out an experimental test in real driving and a data-driven evaluation based on the road test results and large real-world data sets. The results show that our system significantly improves application performance compared with cloud solutions. For MAR, we consider offloading tasks to multiple edge servers over multiple paths simultaneously to improve MAR performance. We develop a fast scheduling algorithm for splitting workloads among the available edge servers. Finally, we explore the possibility of combining edge computing and machine learning techniques to realize intelligent traffic light control with edge servers placed at traffic lights.
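
    The abstract above mentions splitting workloads among available edge servers. The thesis's own algorithm is not given here, so the following Python sketch swaps in a standard greedy heuristic for multiprocessor scheduling (longest task first, always onto the least-loaded server). The function name `split_workload` and the use of a single scalar cost per task are assumptions for illustration.

```python
import heapq
from typing import Dict, List, Sequence


def split_workload(task_costs: Sequence[float], n_servers: int) -> Dict[int, List[int]]:
    """Assign task indices to servers with a greedy longest-first rule.

    This is a generic heuristic, not the algorithm from the thesis.
    """
    # Min-heap of (current load, server id): the least-loaded server pops first.
    heap = [(0.0, server) for server in range(n_servers)]
    heapq.heapify(heap)
    assignment: Dict[int, List[int]] = {server: [] for server in range(n_servers)}

    # Place the most expensive tasks first so that small tasks fill the gaps.
    for idx, cost in sorted(enumerate(task_costs), key=lambda t: -t[1]):
        load, server = heapq.heappop(heap)
        assignment[server].append(idx)
        heapq.heappush(heap, (load + cost, server))
    return assignment
```

    With costs `[5, 3, 2, 2]` and two servers, the sketch assigns tasks 0 and 3 to server 0 and tasks 1 and 2 to server 1, giving loads of 7 and 5.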

    On the use of intelligent models towards meeting the challenges of the edge mesh

    Nowadays, we are witnessing the advent of the Internet of Things (IoT), with numerous devices interacting with each other or with their environment. The huge number of devices leads to huge volumes of data that demand appropriate processing. The "legacy" approach is to rely on the Cloud, where increased computational resources can realize any desired processing. However, supporting real-time applications requires reduced latency in the provision of outcomes. Edge Computing (EC) comes as the "solver" of the latency problem. Various processing activities can be performed at EC nodes that have a direct connection with IoT devices. A number of challenges should be met before we arrive at a fully automated ecosystem where nodes can cooperate or understand their status to efficiently serve applications. In this article, we survey the relevant research activities towards the vision of Edge Mesh (EM), i.e., a "cover" of intelligence upon the EC. We present the necessary hardware and discuss research outcomes in every aspect of the functioning of EC/EM nodes. We present technologies and theories adopted for data, task, and resource management while discussing how machine learning and optimization can be adopted in the domain.

    Real-time performance diagnosis and evaluation of big data systems in cloud datacenters

    PhD Thesis. Modern big data processing systems are becoming very complex in terms of large scale, high concurrency and multiple tenants. Thus, many failures and performance reductions only happen at run-time and are very difficult to capture. Moreover, some issues may only be triggered when some components are executed. To analyze the root cause of these types of issues, we have to capture the dependencies of each component in real-time. Big data processing systems, such as Hadoop and Spark, usually work in large-scale, highly-concurrent, and multi-tenant environments that can easily cause hardware and software malfunctions or failures, thereby leading to performance degradation. Several systems and methods exist to detect big data processing systems' performance degradation, perform root-cause analysis, and even overcome the issues causing such degradation. However, these solutions focus on specific problems such as stragglers and inefficient resource utilization. There is a lack of a generic and extensible framework to support the real-time diagnosis of big data systems. Performance diagnosis and prediction of big data systems are highly complex, as these frameworks are typically deployed in cloud data centers that are large-scale, highly concurrent, and follow a multi-tenant model. Several factors, including hardware heterogeneity, stochastic networks and application workloads, may impact the performance of big data systems. The current state-of-the-art does not sufficiently address the challenge of determining complex, usually stochastic and hidden relationships between these factors. 
To handle performance diagnosis and evaluation of big data systems in cloud environments, this thesis proposes multilateral research towards monitoring and performance diagnosis and prediction in cloud-based large-scale distributed systems by involving a novel combination of an effective and efficient deployment pipeline. The key contributions of this dissertation are listed below:
• Designing a real-time big data monitoring system called SmartMonit that efficiently collects runtime system information, including computing resource utilization and job execution information, and then links the collected information to the Execution Graph, modeled as directed acyclic graphs (DAGs).
• Developing AutoDiagn, an automated real-time diagnosis framework for big data systems that automatically detects performance degradation and inefficient resource utilization problems, while providing online detection and semi-online root-cause analysis for a big data system.
• Designing a novel root-cause analysis technique/system called BigPerf for big data systems that analyzes and characterizes the performance of big data applications by incorporating Bayesian networks to determine uncertain and complex relationships between performance-related factors.
State of the Republic of Turkey and the Turkish Ministry of National Educatio