Deep neural networks (DNNs) have proven effective in various computing fields. To provide more efficient computing platforms for DNN applications, it is essential to have evaluation environments that include assorted benchmark workloads. Though a few DNN benchmark suites have recently been released, most of them require installing proprietary DNN libraries or resource-intensive DNN frameworks, which are hard to run on resource-limited mobile platforms or architecture simulators. To provide a more scalable evaluation environment, we propose a new DNN benchmark suite that can run on any platform that supports CUDA and OpenCL. The proposed benchmark suite includes five of the most widely used convolutional neural networks and two recurrent neural networks. We provide architectural statistics of these networks collected while running them on an architecture simulator, a server GPU, a mobile GPU, and a mobile FPGA.
I. INTRODUCTION
Deep neural networks (DNNs) have recently regained huge attention from industry and academia. Thanks to the birth of massively parallel computing platforms such as GPUs and specialized hardware accelerators, complex cognitive applications powered by DNNs provide real-time responses with high prediction accuracy. To design an efficient DNN accelerator, it is essential to have an assortment of DNN workloads that can be used to evaluate various performance aspects of the architecture. Recently, a few DNN benchmark suites have been developed. However, most of them require installing DNN frameworks and proprietary DNN libraries, which are hard to deploy on platforms and architecture simulators that do not support the libraries because of insufficient system resources or compatibility issues.
In this paper, we present a new DNN benchmark suite, Tango, which does not require installing a DNN framework or proprietary library. Tango provides inference code for a set of the most widely used convolutional neural networks (CNNs) and recurrent neural networks (RNNs), written in CUDA C and OpenCL. Therefore, Tango supports any architecture and software simulator that can run CUDA and OpenCL applications. We provide architectural characteristics of the individual networks by running them on a GPU architecture simulator (GPGPU-Sim), a server GPU (NVIDIA GK210), a mobile GPU (NVIDIA TX1), and an FPGA (Xilinx PynQ-Z1).
II. NEURAL NETWORKS IN TANGO
To provide representative DNN workloads, Tango supports the two most widely used DNN types: CNNs and RNNs. CNNs are mainly used for applications that need to extract patterns from image inputs, such as face recognition, while RNNs extract information from time-series inputs, as in stock price forecasting. For each DNN type, we implemented several well-known reference models, which are described below. To evaluate both intra-layer and inter-layer characteristics of the individual networks, we implemented the entire structure of each target network. We used pre-trained model weight files as inputs to the individual layers. Details can be found in the benchmark suite repository.
A. Convolutional Neural Network (CNN)
CifarNet [1] was developed to recognize objects in the CIFAR-10 and CIFAR-100 datasets. It consists of three convolutional layers and two fully-connected layers.
AlexNet [2] is the first CNN that proved the efficiency and accuracy of CNN-based object recognition. AlexNet consists of five convolutional layers and three fully-connected layers.
ResNet [3] was developed by Microsoft and has various versions with different numbers of layers. We implemented ResNet-50, which uses 50 layers.
SqueezeNet [4] is designed to support embedded platforms. To reduce the model size, SqueezeNet defines a fire module that consists of a squeeze convolution layer and an expand layer.
VGGNet [5] uses either 16 or 19 layers, where each convolutional layer uses very small (3 × 3) convolution filters. We implemented the 16-layer VGGNet.
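The convolutional layers shared by the networks above can be illustrated with a minimal sketch of direct 2D convolution using a 3 × 3 filter. This is an illustration only: Tango itself is written in CUDA C and OpenCL, and the function name `conv2d_single_channel`, the single-channel input, stride 1, and the absence of padding are all simplifying assumptions, not details taken from the benchmark code.

```python
def conv2d_single_channel(image, kernel):
    """Direct 2D convolution (computed as cross-correlation, as is common
    in DNN code) with stride 1, no padding, one input channel, one filter."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0.0
            # Each output neuron accumulates a k x k window of the input.
            for ky in range(k):
                for kx in range(k):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            output[y][x] = acc
    return output


# Hypothetical 4 x 4 input holding the values 0..15, and a 3 x 3 box filter.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
box_filter = [[1.0] * 3 for _ in range(3)]
result = conv2d_single_channel(image, box_filter)
print(result)  # [[45.0, 54.0], [81.0, 90.0]]
```

In a GPU implementation, each iteration of the two outer loops would typically map to one thread; the sketch keeps them sequential for readability.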
B. Recurrent Neural Network (RNN)
Long Short-Term Memory (LSTM) [6] is the most widely used RNN. LSTM uses three types of gates (input, output, and forget), which enable the network to forget or remember a cell state.
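The gating described above can be sketched as a single LSTM time step. The sketch follows the standard LSTM formulation rather than Tango's CUDA C or OpenCL code; the names `lstm_step` and `params`, the gate labels, and the weight layout are assumptions made for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dot(weights, values):
    return sum(w * v for w, v in zip(weights, values))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params[name] = (weight_rows, biases); each weight row is applied to the
    concatenation of h_prev and x. The layout is hypothetical.
    """
    z = h_prev + x  # list concatenation: [h_prev, x]

    def gate(name, activation):
        rows, biases = params[name]
        return [activation(dot(row, z) + b) for row, b in zip(rows, biases)]

    i = gate("input", sigmoid)         # how much new information to admit
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # candidate cell update
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


# Hidden size 1, input size 1, all weights and biases zero:
# every sigmoid gate evaluates to 0.5 and the candidate to 0.
zero = ([[0.0, 0.0]], [0.0])
params = {name: zero for name in ("input", "forget", "output", "candidate")}
h, c = lstm_step([1.0], [0.0], [2.0], params)
```

With these zero parameters the forget gate halves the previous cell state, so `c` becomes `[1.0]`; a large positive forget bias would instead preserve it almost unchanged.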
Gated Recurrent Unit (GRU) [7] is a variation of LSTM that aims at solving the vanishing gradient problem. GRU uses two gates, reset and update, which makes it simpler than LSTM.
Among the many available statistics, we present those that are hard to measure with existing benchmark suites, such as the performance impact of various on-chip memory sizes and the performance and energy comparison of different accelerators.
A. Stall Cycle Breakdown
We collected stall cycles by running the nvprof profiler on an NVIDIA GK210 GPU, as shown in Figure 1a. We observed clear patterns that distinguish individual layer types. Fully-connected layers suffer from memory throttling more than the other layers. Convolution and normalization layers encounter more stalls due to unavailable pipelines. Pooling layers show higher stall rates due to data dependency than the other layers. These patterns follow from the characteristics of each layer type. For example, fully-connected layers typically use large amounts of data to compute the activations of all features. They therefore consume more memory resources such as miss status holding registers (MSHRs), and execution is suspended once all available memory resources are used up. Convolution and normalization layers typically use more neurons than the other layers, which keeps the arithmetic pipelines busy, so these layers are throttled by unavailable pipelines. Pooling layers summarize the convolution results with either maximum or average values, which requires repeated comparison of many input values and leads to high data dependency. GRU and LSTM show patterns similar to those of the convolution layers and pooling layers of CNNs, respectively. We believe LSTM encounters more data-dependency stalls than GRU because of its more complex structure.
B. Performance Impact of On-chip Cache
No prior study has shown the performance impact of various L1 data cache (L1D) sizes, because most existing studies were conducted on real GPUs, whose cache sizes are hard to reconfigure. We evaluated the RNNs and CNNs on GPGPU-Sim while varying the L1D size from zero to 4×64KB, as shown in Figure 1b. Note that 64KB is the default L1D size of the NVIDIA Pascal GPU. The total execution times with three different L1D sizes are normalized to the execution time when the L1D is bypassed (marked as No L1). RNNs do not show performance improvement with larger L1Ds. This is expected because RNNs use relatively small input data and little of that data is accessed repeatedly. On the other hand, most of the CNNs show a significant performance improvement with larger on-chip caches. This is also expected because CNNs inherently have many redundant data accesses. For example, the same convolution feature maps are used by all neurons in the same layer, and neighboring neurons use overlapping input data. Driven by the large improvement for CNNs, execution time across the networks is reduced by 10% on average when employing L1Ds of 64KB or larger.
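The overlap between neighboring neurons can be quantified with a small counting sketch. It tallies how many output neurons read each input element of a stride-1 convolution; it does not model a cache or GPGPU-Sim, and the function name `input_reuse` and the 8 × 8 input with a 3 × 3 filter are hypothetical choices for illustration.

```python
def input_reuse(height, width, k, stride=1):
    """Count how many convolution windows read each input element.

    Returns the per-element read counts and the average number of reads
    per element (total reads divided by the number of unique elements).
    """
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1
    reads = [[0] * width for _ in range(height)]
    for y in range(out_h):
        for x in range(out_w):
            for ky in range(k):
                for kx in range(k):
                    reads[y * stride + ky][x * stride + kx] += 1
    total_reads = sum(sum(row) for row in reads)
    return reads, total_reads / (height * width)


reads, avg_reuse = input_reuse(8, 8, 3)
print(reads[0][0], reads[3][3], avg_reuse)  # 1 9 5.0625
```

A corner element is read once, while an interior element is read by nine overlapping windows. This repeated access to the same data is the kind of locality that an on-chip cache can capture.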
C. Power Efficiency of Different Platforms
We evaluated the energy efficiency of different platforms. Figure 1c shows the normalized energy consumption of an embedded GPU (NVIDIA TX1) and an embedded FPGA (Xilinx PynQ) for CifarNet and SqueezeNet. We measured the peak power consumption with a Wattsup power meter and then calculated the energy consumption by multiplying it by the total execution time. TX1 showed 2 to 3× higher peak power consumption than PynQ. This is expected because TX1 is equipped with more hardware resources and runs a general-purpose pipeline, while PynQ's pipeline is programmed specifically for each network. However, the execution times of the two networks on TX1 were almost 2× shorter than on PynQ because PynQ has a slower code loading time and a smaller on-chip memory. As a result, the overall energy consumption of the two networks on TX1 was around 1.5× higher than on PynQ.
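The energy comparison follows from multiplying the power ratio by the execution-time ratio. The sketch below reproduces that arithmetic using only the rounded ratios quoted in the text (2 to 3× peak power, about 2× shorter execution time), not measured values; the function name `relative_energy` is hypothetical.

```python
def relative_energy(power_ratio, time_ratio):
    """Energy of platform A relative to platform B, assuming
    energy = peak power x execution time, as in the text."""
    return power_ratio * time_ratio


# TX1 relative to PynQ: 2-3x the peak power, about half the execution time.
low = relative_energy(2.0, 0.5)
high = relative_energy(3.0, 0.5)
print(low, high)  # 1.0 1.5
```

With these rounded inputs, the relative energy of TX1 falls between 1.0× and 1.5×, and the upper end matches the roughly 1.5× figure reported above. Because peak power rather than average power is used, the result is an upper-bound style estimate of energy.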
