FfDL : A Flexible Multi-tenant Deep Learning Platform
Deep learning (DL) is becoming increasingly popular across application
domains and has made new capabilities in computer vision, speech
recognition and synthesis, self-driving automobiles, drug design, and
other areas both feasible and accurate. As a result, large-scale on-premise and
cloud-hosted deep learning platforms have become essential infrastructure in
many organizations. These systems accept, schedule, manage and execute DL
training jobs at scale.
This paper describes the design, implementation and our experiences with
FfDL, a DL platform used at IBM. We describe how our design balances
dependability with scalability, elasticity, flexibility and efficiency. We
examine FfDL qualitatively through a retrospective look at the lessons learned
from building, operating, and supporting FfDL; and quantitatively through a
detailed empirical evaluation of FfDL, including the overheads introduced by
the platform for various deep learning models, the load and performance
observed in a real case study using FfDL within our organization, the frequency
of various faults observed including unanticipated faults, and experiments
demonstrating the benefits of various scheduling policies. FfDL has been
open-sourced.
Comment: MIDDLEWARE 2019
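The abstract mentions experiments on the benefits of various scheduling policies. As an illustrative sketch only (not FfDL's actual scheduler or API), the effect of policy choice can be seen by comparing FIFO against shortest-job-first on a single worker; job durations here are made up:

```python
# Hypothetical sketch: how a scheduling policy changes average job
# completion time on one worker. Not FfDL code; all numbers are invented.

def completion_times(durations):
    """Run jobs back-to-back; return each job's completion time."""
    t, out = 0, []
    for d in durations:
        t += d
        out.append(t)
    return out

jobs = [90, 5, 30, 10]                 # hypothetical training-job minutes

fifo = completion_times(jobs)          # arrival order: 90, 95, 125, 135
sjf = completion_times(sorted(jobs))   # shortest first: 5, 15, 45, 135

avg_fifo = sum(fifo) / len(fifo)       # 111.25
avg_sjf = sum(sjf) / len(sjf)          # 50.0
```

Both policies finish the batch at the same time, but shortest-job-first cuts the average completion time by more than half in this toy example, which is the kind of trade-off such experiments quantify.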
X86 AI Performance Bottlenecks and Resolution Using Memory Prefetching
The proliferation of AI applications across devices such as smartphones, electric cars, and robots has created a need for optimized price-performance solutions. GPUs and specialized ASICs are integrated into these devices to optimize performance by delivering targeted hardware solutions. Although these solutions perform well, they are not cost-effective. In comparison, CPUs are readily available at higher volumes and lower prices, but there has been limited research into the bottlenecks they face when executing AI workloads [1]. The purpose of this research is to identify these CPU bottlenecks.
AI workloads run on many frameworks, including TensorFlow, Caffe, Torch, Theano, Neon, Deep Learning, and CNTK. gem5 is a simulator that provides an environment for running these frameworks on x86 or ARM. gem5 can collect extensive data but takes a significant amount of time to simulate. We can reduce simulation time by pairing it with another simulator, ChampSim. ChampSim is a trace-based simulator used for microarchitecture design; it lacks gem5's extensive data-collection capabilities, but it performs higher-level simulations quickly while maintaining reasonable accuracy. Thus, we combine both simulators to optimize both data collection and simulation time: gem5 is used for data collection, and ChampSim is used for simulation.
In this thesis, the main workload is a deep neural network, the multilayer perceptron, because it is one of the most commonly used models. The gem5 simulator records the performance of the AI workload running the network and transforms the recording into traces readable by the ChampSim simulator. We then install different prefetchers inside ChampSim to improve the performance of the AI workload and compare the improvements to determine which prefetcher performs best.
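The trace-driven comparison described above can be sketched in miniature: replay a memory-address trace through a tiny direct-mapped cache and measure how the hit rate changes when a prefetcher is attached. This is an illustrative toy, not ChampSim's actual trace format or prefetcher interface, and the trace itself is invented:

```python
# Toy trace-driven cache simulation with a pluggable prefetcher.
# All names, sizes, and the trace are illustrative assumptions.

BLOCK = 64          # cache block size in bytes
NUM_SETS = 256      # direct-mapped: one block per set

def simulate(trace, prefetcher=None):
    """Replay a list of byte addresses; return the cache hit rate."""
    cache = [None] * NUM_SETS
    hits = 0
    for addr in trace:
        block = addr // BLOCK
        if cache[block % NUM_SETS] == block:
            hits += 1
        else:
            cache[block % NUM_SETS] = block
        if prefetcher:                          # fetch predicted blocks too
            for pb in prefetcher(block):
                cache[pb % NUM_SETS] = pb
    return hits / len(trace)

def next_line(block):
    """Simplest prefetcher: always fetch the following cache block."""
    return [block + 1]

# Sequential stride-8 accesses, like a dense weight-matrix read.
trace = list(range(0, 64 * 1024, 8))

base = simulate(trace)              # no prefetching
pref = simulate(trace, next_line)   # next-line prefetching
```

On this sequential trace the next-line prefetcher hides almost every cold miss, so its hit rate is noticeably higher than the baseline; comparing such hit rates across candidate prefetchers is the essence of the evaluation, though real prefetchers (stride, stream, correlation-based) and real traces are far more elaborate.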