238 research outputs found

    Overview of Caching Mechanisms to Improve Hadoop Performance

    Full text link
    Nowadays, in distributed computing environments, large amounts of data are generated from different resources at high velocity, rendering the data difficult to capture, manage, and process within existing relational databases. Hadoop is a tool to store and process large datasets in a parallel manner across a cluster of machines in a distributed environment. Hadoop brings many benefits such as flexibility, scalability, and high fault tolerance; however, it faces challenges in terms of data access time, I/O operations, and duplicate computations, resulting in extra overhead, resource wastage, and poor performance. Many researchers have utilized caching mechanisms to tackle these challenges. For example, they have presented approaches to improve data access time, enhance the data locality rate, remove repetitive calculations, reduce the number of I/O operations, decrease job execution time, and increase resource efficiency. In the current study, we provide a comprehensive overview of caching strategies to improve Hadoop performance. Additionally, a novel classification is introduced based on cache utilization. Using this classification, we analyze the impact on Hadoop performance and discuss the advantages and disadvantages of each group. Finally, a novel hybrid approach called Hybrid Intelligent Cache (HIC), which combines the benefits of two methods from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental results show that our hybrid method achieves an average improvement of 31.2% in job execution time.
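    To make the caching theme concrete, the sketch below shows a plain LRU block cache, the baseline eviction policy that intelligent variants such as H-SVM-LRU build on; the block-ID keying and fixed capacity are illustrative assumptions and do not reflect the HIC design evaluated in the paper.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Minimal LRU cache keyed by block ID: a hit moves the block to the front,
// a miss inserts it and, if the cache is full, evicts the least recently used block.
class LruBlockCache {
public:
    explicit LruBlockCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns true on a cache hit, false on a miss (after admitting the block).
    bool access(const std::string& blockId) {
        auto it = index_.find(blockId);
        if (it != index_.end()) {
            order_.splice(order_.begin(), order_, it->second);   // move to most-recent position
            return true;
        }
        if (order_.size() == capacity_) {                        // evict least recently used
            index_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(blockId);
        index_[blockId] = order_.begin();
        return false;
    }

private:
    std::size_t capacity_;
    std::list<std::string> order_;                               // front = most recently used
    std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```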

    MapReduce analysis for cloud-archived data

    Get PDF
    Public storage clouds have become a popular choice for archiving certain classes of enterprise data - for example, application and infrastructure logs. These logs contain sensitive information such as IP addresses or user logins, so regulatory and security requirements often require the data to be encrypted before it is moved to the cloud. In order to leverage such data for any business value, analytics systems (e.g., Hadoop/MapReduce) first download data from these public clouds, decrypt it, and then process it at the secure enterprise site. We propose VNCache: an efficient solution for MapReduce analysis of such cloud-archived log data without requiring an a priori data transfer and loading into the local Hadoop cluster. VNCache dynamically integrates cloud-archived data into a virtual namespace at the enterprise Hadoop cluster. Through a seamless data streaming and prefetching model, Hadoop jobs can begin execution as soon as they are launched, without requiring any a priori downloading. With VNCache's accurate prefetching and caching, jobs often run on a local cached copy of the data block, significantly improving performance. When no longer needed, data is safely evicted from the enterprise cluster, reducing the total storage footprint. Uniquely, VNCache is implemented with no changes to the Hadoop application stack. © 2014 IEEE
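    A minimal sketch of the kind of streaming read-ahead that VNCache's model implies, assuming a hypothetical fetchFromCloud helper that retrieves and decrypts one archived block: while block i is consumed locally, block i+1 is fetched in the background, so processing can start without an a priori bulk download. This is an illustration only, not the paper's implementation.

```cpp
#include <cstdio>
#include <future>
#include <vector>

// Assumed helper: retrieve and decrypt one archived block from the public cloud.
std::vector<char> fetchFromCloud(int blockId) {
    return std::vector<char>(4096, static_cast<char>(blockId));   // dummy payload
}

// Process blocks sequentially while always prefetching the next one in the background.
template <typename Consume>
void processArchivedBlocks(int numBlocks, Consume consume) {
    auto next = std::async(std::launch::async, fetchFromCloud, 0);
    for (int i = 0; i < numBlocks; ++i) {
        std::vector<char> block = next.get();                     // wait for the prefetched block
        if (i + 1 < numBlocks)
            next = std::async(std::launch::async, fetchFromCloud, i + 1);  // overlap the next fetch
        consume(block);                                           // local processing, e.g. a map task
    }
}

int main() {
    processArchivedBlocks(4, [](const std::vector<char>& b) { std::printf("%zu bytes\n", b.size()); });
    return 0;
}
```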

    Data Management and Prefetching Techniques for CUDA Unified Memory

    Get PDF
    Ph.D. dissertation -- Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, August 2022 (advisor: Jaejin Lee). Unified Memory (UM) is a component of the CUDA programming model that provides a memory pool with a single address space which can be accessed by both the host and the GPU. When UM is used, a CUDA program does not need to explicitly move data between the host and the device. It also allows GPU memory oversubscription by using CPU memory as a backing store. UM significantly lessens the burden on the programmer and provides great programmability. However, using UM alone does not guarantee good performance. To fully exploit UM and improve performance, the programmer needs to add user hints to the source code to prefetch pages that are going to be accessed during CUDA kernel execution. In this thesis, we propose three frameworks that exploit UM to improve ease of programming while maximizing application performance. The first framework is HUM, which hides the host-to-device memory copy time of a traditional CUDA program without any code modification. It overlaps the host-to-device memory copy with host computation or CUDA kernel computation by exploiting Unified Memory and its fault mechanism. The evaluation results show that executing applications under HUM is, on average, 1.21 times faster than executing them under original CUDA; the speedup is comparable to the average speedup of 1.22 achieved by hand-optimized implementations for Unified Memory. The second framework is DeepUM, which exploits UM to allow GPU memory oversubscription for deep neural networks. While UM allows memory oversubscription using a page fault mechanism, page fault handling introduces enormous overhead. We use a correlation prefetching technique to solve this problem and hide the overhead. The evaluation results show that DeepUM achieves performance comparable to other state-of-the-art approaches while running larger batch sizes that those methods fail to run. The last framework is SnuRHAC, which provides the illusion of a single GPU for the multiple GPUs in a cluster. Under SnuRHAC, a CUDA program designed to use a single GPU can utilize multiple GPUs in a cluster without any source code modification. SnuRHAC automatically distributes the workload to multiple GPUs in the cluster and manages data across the nodes. To manage data efficiently, SnuRHAC extends Unified Memory and exploits its page fault mechanism. We also propose two prefetching techniques to fully exploit UM and maximize performance. The evaluation results show that while SnuRHAC significantly improves ease of programming, it also delivers scalable performance in the cluster environment, depending on the application characteristics.
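    As a concrete illustration of the user hints mentioned above (a minimal sketch, not taken from the thesis), the CUDA fragment below allocates managed memory and prefetches it to the GPU before the kernel launch so that the kernel does not fault pages in on demand; the kernel, sizes, and launch configuration are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, float factor, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1 << 24;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));       // one pointer usable by CPU and GPU
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;     // pages are first touched on the host

    int device = 0;
    cudaGetDevice(&device);
    // The explicit user hint: migrate the pages to the GPU ahead of the kernel launch,
    // so the kernel does not pay the page-fault cost on first access.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();                           // pages migrate back on later CPU access
    cudaFree(data);
    return 0;
}
```

    Frameworks such as HUM and DeepUM aim to obtain this kind of benefit automatically, without requiring the programmer to insert such hints by hand.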

    Graph Locality Prefetcher for Graph Database

    Get PDF
    This work presents a hardware prefetcher to improve the performance of accessing graph data representing large and complex networks. We represent complex networks as graphs, and queries amount to traversals on the graph. Conventional memory hierarchies exploit spatial and temporal locality, but we observe that graph traversals do not necessarily exhibit these notions of locality, which degrades the performance of the memory hierarchy. Consequently, our hardware prefetcher exploits locality that is intrinsic to graph traversals, which we call graph locality, to improve the performance of the memory hierarchy. We design and evaluate our prototype using a micro-architectural simulator, with benchmarks from GDBench, a suite oriented toward evaluating the performance of graph database systems.
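    The paper proposes a hardware prefetcher; the CPU-side sketch below is only a rough software analogue of the graph-locality idea: while visiting a vertex, the traversal issues prefetch hints for the adjacency lists of its neighbours, which a traversal is likely to touch next even though they are not spatially adjacent in memory. The CSR arrays (rowPtr, colIdx) are an assumed graph layout.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Breadth-first traversal over a CSR graph that prefetches each neighbour's
// own edge list, the data the traversal is about to walk next.
void bfsWithPrefetch(const std::vector<int>& rowPtr,
                     const std::vector<int>& colIdx,
                     int source,
                     std::vector<uint8_t>& visited) {
    std::queue<int> frontier;
    frontier.push(source);
    visited[source] = 1;
    while (!frontier.empty()) {
        int v = frontier.front();
        frontier.pop();
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            __builtin_prefetch(colIdx.data() + rowPtr[u]);   // hint: neighbour's adjacency list
            if (!visited[u]) {
                visited[u] = 1;
                frontier.push(u);
            }
        }
    }
}
```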

    Holistic Performance Analysis and Optimization of Unified Virtual Memory

    Get PDF
    The programming difficulty of creating GPU-accelerated high performance computing (HPC) codes has been greatly reduced by the advent of Unified Memory technologies that abstract the management of physical memory away from the developer. However, these systems incur substantial overhead that paradoxically grows for the codes where these technologies are most useful. While these technologies are increasingly adopted in modern HPC frameworks and applications, the performance cost reduces the efficiency of these systems and deters some developers from adopting them entirely. These systems are naturally difficult to optimize due to the large number of interconnected hardware and software components that must be untangled to perform thorough analysis. In this thesis, we take the first deep dive into a functional implementation of a Unified Memory system, NVIDIA UVM, to evaluate the performance and characteristics of these systems. We show specific hardware and software interactions that cause serialization between the host and devices. We further provide a quantitative evaluation of fault handling for various applications under different scenarios, including prefetching and oversubscription. Through lower-level analysis, we find that the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. These findings indicate that the cost of host OS components is significant and present across UM implementations. We also provide a proof-of-concept asynchronous approach to memory management in UVM that reduces system overhead and improves application performance. This study provides constructive insight into future implementations and systems, such as Heterogeneous Memory Management.

    Grid-enabling Non-computer Resources

    Get PDF

    DSPatch: Dual Spatial Pattern Prefetcher

    Full text link
    High main memory latency continues to limit the performance of modern high-performance out-of-order cores. While DRAM latency has remained nearly the same over many generations, DRAM bandwidth has grown significantly due to higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5), and 3D-stacked memory packaging (HBM). Current state-of-the-art prefetchers do not do well at extracting higher performance when higher DRAM bandwidth is available. Prefetchers need the ability to adapt dynamically to the available bandwidth, boosting prefetch count and prefetch coverage when headroom exists and throttling down to achieve high accuracy when bandwidth utilization is close to peak. To this end, we present the Dual Spatial Pattern Prefetcher (DSPatch), which can be used as a standalone prefetcher or as a lightweight adjunct spatial prefetcher to the state-of-the-art delta-based Signature Pattern Prefetcher (SPP). DSPatch builds on a novel and intuitive use of modulated spatial bit-patterns. The key idea is to: (1) represent program accesses on a physical page as a bit-pattern anchored to the first "trigger" access, (2) learn two spatial access bit-patterns, one biased towards coverage and another biased towards accuracy, and (3) select one bit-pattern at run-time based on DRAM bandwidth utilization to generate prefetches. Across a diverse set of workloads, using only 3.6KB of storage, DSPatch improves performance over an aggressive baseline with a PC-based stride prefetcher at the L1 cache and the SPP prefetcher at the L2 cache by 6% (9% in memory-intensive workloads and up to 26%). Moreover, the performance of DSPatch+SPP scales with increasing DRAM bandwidth, growing from 6% over SPP to 10% when DRAM bandwidth is doubled.
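    A minimal sketch of the spatial bit-pattern idea the abstract describes: accesses within a physical page are recorded as a bit-vector anchored at the first "trigger" access, and a stored pattern can later be replayed as prefetch candidates for a new page. The table organization, the coverage- vs accuracy-biased patterns, and the bandwidth-based selection are not modeled here; names and sizes are illustrative.

```cpp
#include <bitset>
#include <cstdint>

constexpr int kLinesPerPage = 64;                 // 4 KB page / 64-byte cache lines

struct PagePattern {
    uint64_t triggerLine = 0;                     // line offset of the first ("trigger") access
    std::bitset<kLinesPerPage> accessed;          // lines touched, rotated so bit 0 is the trigger
};

// Record an access to a cache line within a page, anchored to the trigger access.
void record(PagePattern& p, uint64_t lineInPage, bool isTrigger) {
    if (isTrigger) p.triggerLine = lineInPage;
    p.accessed.set((lineInPage - p.triggerLine + kLinesPerPage) % kLinesPerPage);
}

// Replay a learned pattern as prefetch candidates for a new page, given its trigger line.
template <typename IssuePrefetch>
void replay(const PagePattern& learned, uint64_t newTriggerLine, IssuePrefetch issue) {
    for (int offset = 1; offset < kLinesPerPage; ++offset)
        if (learned.accessed.test(offset))
            issue((newTriggerLine + offset) % kLinesPerPage);   // line to prefetch in the new page
}
```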

    A Survey of Techniques for Architecting TLBs

    Get PDF
    A "translation lookaside buffer" (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.
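    As a rough, illustrative calculation of why TLB behaviour matters (the numbers are assumptions, not taken from the survey): with 4 KB pages, a 64-entry TLB only covers 64 x 4 KB = 256 KB of memory, far less than typical working sets, so misses are frequent and each one requires a costly page-table walk. The snippet below shows the address arithmetic a TLB lookup caches.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    constexpr uint64_t kPageSize = 4096;    // 4 KB pages
    constexpr uint64_t kTlbEntries = 64;    // illustrative L1 DTLB size
    uint64_t vaddr = 0x7f3a12345678ULL;
    uint64_t vpn = vaddr / kPageSize;       // virtual page number: what the TLB translates
    uint64_t offset = vaddr % kPageSize;    // page offset: unchanged by translation
    std::printf("vpn=0x%llx offset=0x%llx tlb reach=%llu KB\n",
                (unsigned long long)vpn, (unsigned long long)offset,
                (unsigned long long)(kTlbEntries * kPageSize / 1024));
    return 0;
}
```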

    Optimizing Virtual Machine I/O Performance in Cloud Environments

    Get PDF
    Maintaining closeness between data sources and data consumers is crucial for workload I/O performance. In cloud environments, this kind of closeness can be violated by system administrative events and storage architecture barriers. VM migration events are frequent in cloud environments. VM migration changes VM runtime interconnections or cache contexts, significantly degrading VM I/O performance. Virtualization is the backbone of cloud platforms. I/O virtualization adds additional hops to the workload data access path, prolonging I/O latencies. I/O virtualization overheads cap the throughput of high-speed storage devices and impose high CPU utilization and energy consumption on cloud infrastructures. To maintain the closeness between data sources and workloads during VM migration, we propose Clique, an affinity-aware migration scheduling policy, to minimize the aggregate wide-area communication traffic during storage migration in virtual cluster contexts. In host-side caching contexts, we propose Successor to recognize warm pages and prefetch them into the caches of destination hosts before migration completes. To bypass the I/O virtualization barriers, we propose VIP, an adaptive I/O prefetching framework, which utilizes a virtual I/O front-end buffer for prefetching so as to avoid the on-demand involvement of the I/O virtualization stack and accelerate I/O responses. Analysis of the traffic trace of a virtual cluster containing 68 VMs demonstrates that Clique can reduce inter-cloud traffic by up to 40%. Tests with the MPI Reduce_scatter benchmark show that Clique can keep VM performance during migration at up to 75% of the non-migration scenario, which is more than 3 times that of a random VM-choosing policy. In host-side caching environments, Successor performs better than existing cache warm-up solutions and achieves zero VM-perceived cache warm-up time with low resource costs. At the system level, we conducted a comprehensive quantitative analysis of I/O virtualization overheads. Our trace-replay-based simulation demonstrates the effectiveness of VIP for data prefetching with negligible additional cache resource costs.
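    A minimal sketch of the kind of front-end read-ahead a framework like VIP implies (an illustrative assumption, not the paper's design): a buffer on the guest-facing side of the I/O path watches for sequential block requests and, once a short streak is detected, reads a window of upcoming blocks ahead of time so that later requests can be served from the buffer instead of traversing the full I/O virtualization stack on demand.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Guest-facing read-ahead buffer: detects sequential access and prefetches a window ahead.
class FrontEndBuffer {
public:
    FrontEndBuffer(std::function<std::vector<char>(uint64_t)> backendRead, int windowSize)
        : backendRead_(std::move(backendRead)), window_(windowSize) {}

    std::vector<char> read(uint64_t block) {
        auto hit = cache_.find(block);
        std::vector<char> data = (hit != cache_.end()) ? hit->second : backendRead_(block);
        streak_ = (block == lastBlock_ + 1) ? streak_ + 1 : 0;   // track sequential streaks
        lastBlock_ = block;
        if (streak_ >= 2)                                        // confident: read ahead
            for (int i = 1; i <= window_; ++i)                   // synchronous here for simplicity;
                cache_.emplace(block + i, backendRead_(block + i));  // a real system would issue these asynchronously
        return data;
    }

private:
    std::function<std::vector<char>(uint64_t)> backendRead_;     // slow path through the virtualization stack
    std::unordered_map<uint64_t, std::vector<char>> cache_;
    uint64_t lastBlock_ = UINT64_MAX;
    int streak_ = 0;
    int window_;
};
```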