FfDL : A Flexible Multi-tenant Deep Learning Platform
Deep learning (DL) is becoming increasingly popular across application
domains and has made new capabilities in computer vision, speech
recognition and synthesis, self-driving automobiles, drug design, and
other areas both feasible and accurate. As a result, large-scale on-premise and
cloud-hosted deep learning platforms have become essential infrastructure in
many organizations. These systems accept, schedule, manage and execute DL
training jobs at scale.
This paper describes the design, implementation and our experiences with
FfDL, a DL platform used at IBM. We describe how our design balances
dependability with scalability, elasticity, flexibility and efficiency. We
examine FfDL qualitatively through a retrospective look at the lessons learned
from building, operating, and supporting FfDL; and quantitatively through a
detailed empirical evaluation of FfDL, including the overheads introduced by
the platform for various deep learning models, the load and performance
observed in a real case study using FfDL within our organization, the frequency
of various faults observed including unanticipated faults, and experiments
demonstrating the benefits of various scheduling policies. FfDL has been
open-sourced.
Comment: MIDDLEWARE 2019
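The abstract mentions experiments on the benefits of various scheduling policies. As an illustrative sketch only (not FfDL's actual scheduler or API), the effect of policy choice can be seen by comparing FIFO against shortest-job-first on a single worker; job durations here are made up:

```python
# Hypothetical sketch: how a scheduling policy changes average job
# completion time on one worker. Not FfDL code; all numbers are invented.

def completion_times(durations):
    """Run jobs back-to-back; return each job's completion time."""
    t, out = 0, []
    for d in durations:
        t += d
        out.append(t)
    return out

jobs = [90, 5, 30, 10]                 # hypothetical training-job minutes

fifo = completion_times(jobs)          # arrival order: 90, 95, 125, 135
sjf = completion_times(sorted(jobs))   # shortest first: 5, 15, 45, 135

avg_fifo = sum(fifo) / len(fifo)       # 111.25
avg_sjf = sum(sjf) / len(sjf)          # 50.0
```

Both policies finish the batch at the same time, but shortest-job-first cuts the average completion time by more than half in this toy example, which is the kind of trade-off such experiments quantify.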
X86 AI Performance Bottlenecks and Resolution Using Memory Prefetching
The proliferation of AI applications across devices such as smartphones, electric cars, and robots has created a need for optimized price-performance solutions. GPUs and specialized ASICs are integrated into these devices to optimize performance by delivering targeted hardware solutions. Although these solutions perform well, they are not cost-effective. In comparison, CPUs are readily available at higher volumes and lower prices, but there has been limited research into the bottlenecks they face when executing AI workloads [1]. The purpose of this research is to identify these CPU bottlenecks.
AI workloads run on many frameworks, including TensorFlow, Caffe, Torch, Theano, Neon, Deep Learning, and CNTK. gem5 is a simulator that provides an environment for running these frameworks on x86 or ARM. gem5 can collect extensive data but takes a significant amount of time to simulate. We can reduce simulation time by pairing it with another simulator, ChampSim. ChampSim is a trace-based simulator used for microarchitecture design; it lacks gem5's extensive data-collection capabilities, but it performs higher-level simulations quickly while maintaining reasonable accuracy. Thus, we combine both simulators to optimize both data collection and simulation time: gem5 is used for data collection, and ChampSim is used for simulation.
In this thesis, the main workload is a deep neural network, the multilayer perceptron, because it is one of the most commonly used models. The gem5 simulator records the performance of the AI workload running the network and transforms the recording into traces readable by the ChampSim simulator. We then install different prefetchers inside ChampSim to improve the performance of the AI workload and compare the improvements to determine which prefetcher performs best.
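The trace-driven comparison described above can be sketched in miniature: replay a memory-address trace through a tiny direct-mapped cache and measure how the hit rate changes when a prefetcher is attached. This is an illustrative toy, not ChampSim's actual trace format or prefetcher interface, and the trace itself is invented:

```python
# Toy trace-driven cache simulation with a pluggable prefetcher.
# All names, sizes, and the trace are illustrative assumptions.

BLOCK = 64          # cache block size in bytes
NUM_SETS = 256      # direct-mapped: one block per set

def simulate(trace, prefetcher=None):
    """Replay a list of byte addresses; return the cache hit rate."""
    cache = [None] * NUM_SETS
    hits = 0
    for addr in trace:
        block = addr // BLOCK
        if cache[block % NUM_SETS] == block:
            hits += 1
        else:
            cache[block % NUM_SETS] = block
        if prefetcher:                          # fetch predicted blocks too
            for pb in prefetcher(block):
                cache[pb % NUM_SETS] = pb
    return hits / len(trace)

def next_line(block):
    """Simplest prefetcher: always fetch the following cache block."""
    return [block + 1]

# Sequential stride-8 accesses, like a dense weight-matrix read.
trace = list(range(0, 64 * 1024, 8))

base = simulate(trace)              # no prefetching
pref = simulate(trace, next_line)   # next-line prefetching
```

On this sequential trace the next-line prefetcher hides almost every cold miss, so its hit rate is noticeably higher than the baseline; comparing such hit rates across candidate prefetchers is the essence of the evaluation, though real prefetchers (stride, stream, correlation-based) and real traces are far more elaborate.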