667 research outputs found

    Characterizing Deep-Learning I/O Workloads in TensorFlow

    Full text link
    The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. During training, a large number of relatively small files are first loaded and pre-processed on CPUs and then moved to the accelerator for computation. In addition, checkpoint and restart operations are carried out so that DL computing frameworks can restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our two benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator with the input pipeline on the CPU, eliminating the effective cost of I/O on the overall performance. Using a burst buffer to checkpoint to fast, small-capacity storage and asynchronously copy the checkpoints to slower, large-capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to the slower storage on our benchmark environment.
    Comment: Accepted for publication at pdsw-DISCS 201
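    The paper's benchmark code is not reproduced in this listing; the fragment below is only a minimal sketch of the two mechanisms the abstract describes, using the public tf.data API: prefetch overlaps the CPU input pipeline with accelerator compute, and a burst-buffer-style helper checkpoints to a fast local tier before draining asynchronously to slower storage. The file layout, paths, and preprocess body are illustrative assumptions, not details from the paper.

```python
import shutil
import threading

import tensorflow as tf

def preprocess(path):
    # Hypothetical per-file load step; the paper's AlexNet mini-app
    # reads many small files and pre-processes them on the CPU.
    data = tf.io.read_file(path)
    return tf.io.decode_raw(data, tf.uint8)

# Input pipeline: parallel reads plus prefetch keep the CPU one or more
# batches ahead of the accelerator, hiding the effective cost of I/O.
files = tf.data.Dataset.list_files("train/*.bin")  # assumed, fixed-size records
ds = (files.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

def checkpoint_via_burst_buffer(ckpt: tf.train.Checkpoint, step: int):
    """Burst-buffer idea: write the checkpoint to a fast, small-capacity
    tier, then drain it to slow, large-capacity storage off the critical
    path."""
    fast_dir = f"/local_nvme/ckpt-{step}"   # fast tier (assumed mount)
    ckpt.write(f"{fast_dir}/ckpt")          # synchronous, but on fast storage
    threading.Thread(                       # asynchronous copy to slow tier
        target=shutil.copytree,
        args=(fast_dir, f"/slow_pfs/ckpt-{step}"),
        daemon=True,
    ).start()
```

    With prefetch(tf.data.AUTOTUNE) the runtime sizes the prefetch buffer itself, which is how the input pipeline can fully hide I/O once per-batch compute dominates.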

    Swap System Optimization through Memory Swap Pattern Analysis

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. ์—ผํ—Œ์˜.The use of memory is one of the key parts of modern computer architecture (Von Neumann architecture) but when considering limited memory, it could be the most lethal part at the same time. Advances in hardware and software are making rapid strides in areas such as Big Data, HPC and machine learning and facing new turning points, while the use of memory increases along with those advances. In the server environment, various programs share resources which leads to a shortage of resources. Memory is one of those resources and needs to be managed. When the system is out of memory, the operating system evicts some of the pages out to storage and then loads the requested pages in memory. Given that the storage performance is slower than the memory, swap-induced delay is one of the critical issues in the overall performance degradation. Therefore, we designed and implemented a swpTracer to provide visualization to trace the swap in/out movement. To check the generality of the tool, we used mlock to optimize 429.mcf of Spec CPU 2006 based on the hint from swpTracer. The optimized program executes 2 to 3 times faster than the original program in a memory scarce environment. The scope of the performance improvement with previous system calls decreases when the memory limit increases. To sustain the improvement, we build a swap- prefetch to read ahead the swapped-out pages. The optimized application with swpTracer and swap-prefetch consistently exceeds the performance of the original code by 1.5x.๋ฉ”๋ชจ๋ฆฌ์˜ ์‚ฌ์šฉ์€ ํ˜„๋Œ€ ์ปดํ“จํ„ฐ ์•„ํ‚คํ…์ฒ˜(ํฐ ๋…ธ์ด๋งŒ ์•„ํ‚คํ…์ณ)์˜ ํ•ต์‹ฌ ๋ถ€๋ถ„ ์ค‘ ํ•˜ ๋‚˜์ด๊ธฐ ๋•Œ๋ฌธ์—, ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ถ€์กฑํ•œ ํ™˜๊ฒฝ์€ ์„ฑ๋Šฅ์— ์น˜๋ช…์ ์ธ๋‹ค. ํ•˜๋“œ์›จ์–ด์™€ ์†Œํ”„ํŠธ์›จ ์–ด์˜ ๋ฐœ์ „์œผ๋กœ ๋น…๋ฐ์ดํ„ฐ, HPC, ๋จธ์‹ ๋Ÿฌ๋‹๊ณผ ๊ฐ™์€ ๋ถ„์•ผ๋“ค์ด ๋น ๋ฅธ ์†๋„๋กœ ๋ฐœ์ „ํ•˜์—ฌ ๊ทธ์— ๋”ฐ๋ผ ๋ฉ”๋ชจ๋ฆฌ์˜ ์‚ฌ์šฉ๋Ÿ‰๋„ ์ฆ๊ฐ€ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ, ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ œํ•œ๋œ ์ž„๋ฒ ๋””๋“œ ํ™˜๊ฒฝ ์ด๋‚˜, ์—ฌ๋Ÿฌ ์ž‘์—…์ด ๋™์‹œ์— ์ˆ˜ํ–‰๋˜๋Š” ์„œ๋ฒ„์—์„œ ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ์œผ๋กœ ์ž‘์—…์ด ์ค‘๋‹จ๋˜๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. ์‹œ์Šคํ…œ์ด๋ฉ”๋ชจ๋ฆฌ๊ฐ€๋ถ€์กฑํ•˜๋ฉด์šด์˜์ฒด์ œ๋Š”์ผ๋ถ€ํŽ˜์ด์ง€๋ฅผ์ €์žฅ์†Œ๋กœ๋‚ด๋ณด๋‚ธ๋‹ค์Œ ์š”์ฒญ๋œ ํŽ˜์ด์ง€๋ฅผ ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œํ•œ๋‹ค. ์Šคํ† ๋ฆฌ์ง€ ์„ฑ๋Šฅ์ด ๋ฉ”๋ชจ๋ฆฌ๋ณด๋‹ค ๋Š๋ฆฌ๋‹ค๋Š” ์ ์— ์„œ ์Šค์™‘์— ์˜ํ•œ ์ง€์—ฐ์€ ์ „๋ฐ˜์ ์ธ ์„ฑ๋Šฅ ์ €ํ•˜์˜ ์ค‘์š”ํ•œ ๋ฌธ์ œ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๋”ฐ๋ผ์„œ ์Šค์™‘์ด ํ”„๋กœ๊ทธ๋žจ ์ˆ˜ํ–‰ ์‹œ๊ฐ„์— ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ์•Š๋„๋ก ํ”„๋กœ๊ทธ๋žจ์˜ ์Šค์™‘ ๋ฐœ์ƒ ์ถ”์ด๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์Šค์™‘ ๋ฐœ์ƒ์„ ์ค„์ผ ์ˆ˜ ์žˆ๋„๋ก ํžŒํŠธ๋ฅผ ์ฃผ๋Š” ๋„๊ตฌ์ธ swpTracer๋ฅผ ์„ค๊ณ„, ์‹ค ํ–‰ํ–ˆ๋‹ค. mlock์„ ์‚ฌ์šฉํ•˜์—ฌ Spec CPU 2006 ๋ฒค์น˜๋งˆํฌ ์ค‘ 429.mcf์— ์ ์šฉํ–ˆ์„ ๋•Œ ๊ธฐ์กด ํ”„๋กœ๊ทธ๋žจ ๋Œ€๋น„ 2, 3 ๋ฐฐ ์„ฑ๋Šฅ์ด ๋นจ๋ผ์กŒ๋‹ค. ๊ธฐ์กด์˜ ์‹œ์Šคํ…œ ์ฝœ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ตœ์ ํ™”ํ–ˆ์„ ๋•Œ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์‚ด์ง ๋ถ€์กฑํ•œ ๊ฒฝ์šฐ์—๋Š” ๋น„์Šทํ•œ์„ฑ๋Šฅ์„๋ณด์—ฌ์ฃผ์ง€๋งŒ, ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ 50% ๋ถ€์กฑํ•œ์ˆœ๊ฐ„๋ถ€ํ„ฐ์„ฑ๋Šฅํ–ฅ์ƒํญ์ด์ค„์—ˆ๋‹ค. ์ด๋ฅผ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด ์Šค์™‘ ์•„์›ƒ ๋˜์—ˆ๋˜ ํŽ˜์ด์ง€๋ฅผ ๋ฏธ๋ฆฌ ์ฝ์–ด๋‘๋Š” swap-prefetch๋ฅผ ๊ตฌํ˜„ํ–ˆ๋‹ค. ๋ฐฐ์—ด์„ 3๋ฒˆ ํšก๋‹จํ•˜๋Š” ํ”„๋กœ๊ทธ๋žจ์„ ๋Œ€์ƒ์œผ๋กœ ๋ฐฐ์—ด์˜ ํฌ๊ธฐ๋ฅผ ์กฐ์ ˆํ•˜๋ฉด์„œ swap-prefetch์˜ ์„ฑ๋Šฅ์„ ์‹œํ—˜ํ–ˆ๋‹ค. ์›๋ณธ ์ฝ”๋“œ์™€ ์‹œ์Šคํ…œ ํ•จ์ˆ˜์ธ madvise๋ฅผ ์‚ฌ์šฉ ํ–ˆ์„ ๋•Œ๋ณด๋‹ค ํ‰๊ท ์ ์œผ๋กœ 1.5 ์ข‹์•„์กŒ๋‹ค. 
๋˜, swap-prefetch๋ฅผ ๋‹ค๋ฅธ ์‹œ์Šคํ…œ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ์™€ mlock๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ ํ‰๊ท  1.25๋ฐฐ ์„ฑ๋Šฅ์ด ๋นจ๋ผ์กŒ๋‹ค.Abstract Chapter 1 Introduction 1 Chapter 2 Background 4 2.1 Page Reclamation Policy . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 Linux Swap Management . . . . . . . . . . . . . . . . . . . . . . 6 2.3 Linux System Calls . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Chapter 3 Design and Implementation 8 3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.2.1 Kernel Level . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.2.2 Application Level . . . . . . . . . . . . . . . . . . . . . . . 10 Chapter 4 Evaluation 15 4.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.2 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.2.1 Generality of swpTracer . . . . . . . . . . . . . . . . . . . 16 4.2.2 Memory Optimization Method Comparison . . . . . . . . 17 Chapter 5 Related Work 20 Chapter 6 Conclusion 22 Bibliography ์ดˆ๋ก 28Maste
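    swpTracer and swap-prefetch are kernel-level mechanisms, so the thesis's actual code cannot be recovered from the abstract. As a user-space analogue of the read-ahead idea only, the following sketch maps a hypothetical data file and uses madvise(MADV_WILLNEED), which asks Linux for asynchronous read-ahead before a traversal touches the pages (requires Python 3.8+ on Linux):

```python
import mmap
import os

# Map a large file, then hint the kernel to read a region back in before
# the next traversal touches it. MADV_WILLNEED triggers asynchronous
# read-ahead, so by the time the loop reaches the pages they are resident.
fd = os.open("big_array.bin", os.O_RDONLY)        # hypothetical data file
size = os.fstat(fd).st_size
buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)

buf.madvise(mmap.MADV_WILLNEED, 0, size)          # prefetch hint
total = 0
for off in range(0, size, mmap.PAGESIZE):         # touch one byte per page
    total += buf[off]
os.close(fd)
```

    mlock, the other hint the thesis exploits, can be applied similarly to pin hot regions in memory so they are never swapped out in the first place.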

    Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

    Full text link
    Maximizing the level of parallelism in applications can be achieved by minimizing overheads due to load imbalances and waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to this problem. The compiler can detect data dependencies in an application and analyze specific sections of code for their parallelization potential. However, the techniques a compiler provides are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and the desired application scalability. One solution to this challenge is the use of runtime methods, implemented by delaying a certain amount of code analysis until runtime. In this research, we improve the performance of parallel applications generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The results were evaluated using an Airfoil application, which showed a 40-50% improvement in parallel performance.
    Comment: 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017)
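    HPX is a C++ runtime system and OP2 generates C/C++ and Fortran, so the following is only a language-neutral sketch, in Python, of two of the runtime techniques named above: asynchronous tasking over loop chunks, with the chunk size chosen at run time rather than fixed at compile time. The mesh data, chunking policy, and loop body are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def loop_body(cells, lo, hi):
    # Stand-in for one OP2 parallel-loop body over a chunk of the mesh.
    return sum(cells[lo:hi])

def parallel_loop(cells, pool, n_workers):
    # Dynamic chunk sizing: derive the chunk size from runtime state
    # (here simply problem size over an oversubscribed worker count)
    # instead of a compile-time constant, then submit the chunks as
    # asynchronous tasks that idle workers pick up as they free up.
    chunk = max(1, len(cells) // (n_workers * 4))
    futures = [pool.submit(loop_body, cells, lo, min(lo + chunk, len(cells)))
               for lo in range(0, len(cells), chunk)]
    return sum(f.result() for f in as_completed(futures))

if __name__ == "__main__":
    cells = list(range(1_000_000))       # hypothetical mesh values
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(parallel_loop(cells, pool, n_workers=8))
```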

    Computing server power modeling in a data center: survey, taxonomy and performance evaluation

    Full text link
    Data centers are large-scale, energy-hungry infrastructures serving the increasing computational demands of a world becoming ever more connected through smart cities. The emergence of advanced technologies such as cloud-based services, the Internet of Things (IoT), and big data analytics has fueled the growth of global data centers, leading to high energy consumption. This upsurge in data center energy consumption not only incurs surging operational and maintenance costs but also adversely affects the environment. Dynamic power management in a data center environment requires cognizance of the correlation between system- and hardware-level performance counters and power consumption. Power consumption modeling captures this correlation and is crucial for designing energy-efficient optimization strategies based on resource utilization. Several power models have been proposed and used in the literature. However, these models have been evaluated using different benchmarking applications, power measurement techniques, and error-calculation formulas on different machines. In this work, we present a taxonomy and evaluation of 24 software-based power models using a unified environment, benchmarking applications, power measurement technique, and error formula, with the aim of achieving an objective comparison. We use different server architectures to assess the impact of heterogeneity on the models' comparison. The performance analysis of these models is elaborated in the paper.
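    As a concrete example of the simplest class of software-based power model covered by such taxonomies, the sketch below fits a linear CPU-utilization model, P(u) = P_idle + (P_max - P_idle) * u, by least squares and scores it with the mean absolute percentage error; the utilization/power samples are made up for illustration:

```python
import numpy as np

# Made-up (CPU utilization, measured watts) samples from one server.
u = np.array([0.05, 0.20, 0.45, 0.70, 0.90])
p = np.array([62.0, 78.0, 103.0, 128.0, 149.0])

# Least-squares fit of P(u) = p_idle + slope * u.
A = np.vstack([np.ones_like(u), u]).T
(p_idle, slope), *_ = np.linalg.lstsq(A, p, rcond=None)

def predict(util):
    return p_idle + slope * util

# Mean absolute percentage error, one common error formula in such surveys.
mape = np.mean(np.abs(predict(u) - p) / p) * 100
print(f"P(u) = {p_idle:.1f} + {slope:.1f}*u  (MAPE = {mape:.2f}%)")
```

    Models higher up such a taxonomy typically add further performance counters (memory, disk, network) as regressors, which is why a unified error formula matters for comparing them.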

    ๊ฐ€์ƒํ™” ํ™˜๊ฒฝ์„ ์œ„ํ•œ ์›๊ฒฉ ๋ฉ”๋ชจ๋ฆฌ

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Advisor: Bernhard Egger.
    The rising importance of big data and artificial intelligence (AI) has led to an unprecedented shift of local computation into the cloud. One of the key drivers behind this transformation was the exploding cost of owning and maintaining large computing systems powerful enough to process these new workloads. Customers experience a reduced cost by renting only the required resources and only when needed, while data center operators benefit from efficiency at scale. A key factor in operating a profitable data center is a high overall utilization of its resources.
    Due to the scale of modern data centers, small improvements in efficiency translate to significant savings in the total cost of ownership (TCO). Many elements constitute an efficient data center, such as its location, architecture, cooling system, and the employed hardware. In this thesis, we focus on software-related aspects, namely the utilization of computational and memory resources. Reports from data centers operated by Alibaba and Google show that overall resource utilization has stagnated at around 50 to 60 percent over the past decade. This low average utilization is mostly attributable to peak-demand-driven resource allocation despite the high variability of modern workloads in their resource usage. In other words, data centers today lack an efficient way to put idle resources that are reserved but not used to work. In this dissertation, we present RackMem, a software-based solution that addresses the problem of low resource utilization through two main contributions. First, we introduce a disaggregated memory system tailored to virtual environments. We observe that virtual machines can use remote memory without noticeable performance degradation under moderate memory pressure on modern networking infrastructure. We implement a specialized remote paging system for QEMU/KVM that reduces the remote-paging tail latency by 98.2% in comparison to the state of the art. A job-processing simulation at rack scale shows that the total makespan can be reduced by 40.9% under our memory system. While seamless disaggregated memory helps to balance memory usage across nodes, individual nodes can still suffer from overloaded resources if co-located workloads exhibit high resource usage at the same time. As a second contribution, we present a novel live-migration technique for machines running on top of our remote paging system. With this instant live-migration technique, entire virtual machines can be migrated in as little as 100 milliseconds, reducing the effective service downtime by up to 92.6% compared to existing approaches. An evaluation with in-memory key-value database workloads shows that the presented migration technique improves the state of the art by a wide margin in all key performance metrics. The presented software-based solutions lay the technical foundations that allow data center operators to significantly improve the utilization of their computational and memory resources. As future work, we propose new job schedulers and load balancers to make full use of these new technical foundations.
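    RackMem itself is implemented as Linux kernel modules with an RDMA backend, none of which this listing reproduces. Purely as a toy model of the fast-path/slow-path split in its pagefault handling, the sketch below keeps a bounded local pool in front of a "remote" store and evicts least-recently-used pages to it; every name and the dict-backed remote store are hypothetical stand-ins:

```python
from collections import OrderedDict

class ToyRackPager:
    """Toy model of RackMem-style demand paging: a bounded local pool
    (fast path: page already resident) backed by a remote store
    (slow path: fetch over the network, evicting an LRU victim)."""

    def __init__(self, local_slots, remote_store):
        self.local = OrderedDict()          # page_id -> bytes, LRU order
        self.slots = local_slots
        self.remote = remote_store          # stand-in for the RDMA backend

    def access(self, page_id):
        if page_id in self.local:           # fast path: no network round trip
            self.local.move_to_end(page_id)
            return self.local[page_id]
        page = self.remote[page_id]         # slow path: remote fetch
        if len(self.local) >= self.slots:   # evict coldest page to remote
            victim, data = self.local.popitem(last=False)
            self.remote[victim] = data
        self.local[page_id] = page
        return page

pager = ToyRackPager(local_slots=2,
                     remote_store={i: bytes(4096) for i in range(8)})
for pid in [0, 1, 0, 5, 7]:                 # mixed fast- and slow-path hits
    pager.access(pid)
```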

    Improved Designs for Application Virtualization

    Get PDF
    We propose solutions for application virtualization that mitigate the performance loss in streaming and browser-based applications. For application streaming, we propose a solution that keeps operating system components and application software at the server and streams them to the client side for execution. This architecture minimizes the components managed at the clients and mitigates platform-level incompatibility. The runtime performance of application streaming is significantly reduced when the required code is not available on the client side in time. To mitigate this issue and boost runtime performance, we propose prefetching, i.e., speculatively delivering code blocks to the clients in advance. The probability model on which our prefetch method is based may be very large. To manage such a probability model and the associated hardware resources, we perform an information-gain analysis and derive two lower bounds on the information gain an attribute set must provide to achieve a given prefetch hit rate. We organize the probability model as a look-up table (LUT). Similar to the memory hierarchy widely used in computing, we separate the single LUT into two-level, hierarchical LUTs. To separate the entries without sorting them all, we propose a fast, entropy-based LUT separation algorithm that uses entropy as an indicator. Since the domain of an attribute can be much larger than the addressable space of a virtual memory system, we need an efficient way to allocate each LUT's entries in a limited memory address space. Instead of using expensive content-addressable memory (CAM), we use a hash function to convert attribute values into addresses, and we propose an improved version of Pearson hashing that reduces the collision rate with little extra complexity. Long interaction delays caused by network latency are a significant drawback of browser-based application virtualization. To address this, we propose a distributed infrastructure arrangement for browser-based application virtualization that reduces the average communication distance between servers and clients, and we investigate a hand-off protocol to handle user mobility. Analyses and simulations of information-based prefetching and of mobile applications are provided to quantify the benefits of the proposed solutions.
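    The abstract names Pearson hashing but does not describe the improved variant, so the sketch below shows only classic Pearson hashing, widened to a multi-byte output with the standard first-byte-rotation trick, as one way to map attribute values into a LUT address space. The permutation table seed, output width, and address-space size are assumptions:

```python
import random

random.seed(42)                # fixed, arbitrary permutation of 0..255
T = list(range(256))
random.shuffle(T)

def pearson_hash(key: bytes, width: int = 2) -> int:
    """Classic Pearson hashing: one table lookup per input byte. The
    output is widened to `width` bytes by re-hashing with a rotated
    first byte (Pearson's standard widening trick); the thesis's
    collision-reducing variant is not specified in the abstract."""
    out = 0
    for i in range(width):
        h = T[(key[0] + i) % 256]          # rotate the first byte per pass
        for b in key[1:]:
            h = T[h ^ b]                   # one table lookup per byte
        out = (out << 8) | h
    return out

addr = pearson_hash(b"attribute-value") % 4096   # map into LUT address space
```

    Because the hash needs only XORs and table lookups, it is cheap enough to replace a CAM while still spreading attribute values across the limited address space.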
    • โ€ฆ
    corecore