Spectral-domain CNNs have been shown to be more efficient than traditional spatial CNNs in terms of reducing computational complexity. However, they come with a 'kernel explosion' problem that, even after compression (pruning), imposes a high memory burden and off-chip bandwidth requirement for kernel access. This creates a performance gap between the potential acceleration offered by compression and actual FPGA implementation performance, especially for low-latency CNN inference. In this paper, we develop a principled approach to overcoming this performance gap and designing a low-latency, low-bandwidth, spectral sparse CNN accelerator on FPGAs. First, we analyze the bandwidth-storage tradeoff of sparse convolutional layers and locate communication bottlenecks. We then develop a dataflow for flexibly optimizing data reuse in different layers to minimize off-chip communication. Finally, we propose a novel scheduling algorithm to optimally schedule the on-chip memory access of multiple sparse kernels and minimize read conflicts. On a state-of-the-art FPGA platform, our design reduces data transfers by 42% with DSP utilization up to 90% and achieves inference latency of 9 ms for VGG16, compared to the baseline state-of-the-art latency of 68 ms.
INTRODUCTION
Convolutional Neural Networks (CNNs) [11] [19] [21] [8] are a popular choice for FPGA acceleration due to their widespread utility in tasks such as classification, detection [18] and segmentation [13]. Accelerating CNNs on current hardware platforms brings about several major challenges: (1) model weights in the convolutional and fully-connected layers, as well as intermediate activations, impose a large memory overhead; (2) convolution operations involve large-scale floating-point computations, so performance is bounded by available computing resources; (3) large data transfers between on- and off-chip memory impact latency, throughput and total power consumption. These issues become even more critical for edge devices with only limited memory and computing capacity.
ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of the United States government. As such, the United States government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for government purposes only. Session: Deep Learning II, FPGA '20, February 23-25, 2020, Seaside, CA, USA. © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7099-8/20/02...$15.00 https://doi.org/10.1145/3373087.3375302
Given the large number of potentially redundant operations, compression (of the CNN model) can significantly reduce memory and computation overheads, and so is a widely accepted technique for improving efficiency. Among such methods, pruning can deliver more than 20× compression [28], while still maintaining high accuracy. Quantization supports traditional model operations and data access patterns, but could suffer from a large accuracy drop. Another alternative is frequency-domain transformation: [26] [25] convert CNNs to the spectral domain to accelerate computation without sacrificing accuracy. These gains come at a cost: enlarged spectral kernels and complex numbers require vastly more memory and communication bandwidth, causing spectral CNNs to be communication-bound rather than computation-bound. To alleviate these issues in spectral CNNs, [15] uses an ADMM [2] based pruning method to compress spectral convolutional layers and achieve a high (uniform) compression ratio (∼4×) for all kernels.
Given these advances in compression efficiency, the major challenge (for both spatial and spectral CNNs) now becomes addressing the performance gap: how to carry these efficiency gains over to the actual hardware implementation. This requires careful design-space exploration and optimization. For example, while the uniform (across kernels) compression ratio of [15] reduces load imbalance issues in sparse tensor operations, the problem of irregular data access still remains. This performance gap becomes especially critical for CNNs used in real-time machine learning applications on edge devices, such as face recognition [1] and autonomous driving [23]. In particular, computation and communication implementation overheads, unless mitigated through careful design optimizations for effectively utilizing bandwidth and minimizing latency, can severely impact performance under low-latency requirements [3].
Motivated by the challenge of overcoming the performance gap between model compression and efficient low-latency hardware implementation on FPGAs, this paper proposes a more efficient architecture for spectral convolution with sparse kernels to minimize data transfers and required bandwidth, as well as inference latency. We bridge this gap via a novel dataflow coupled with a (bounded, approximately-optimal) scheduling algorithm for irregular data access. We summarize our major contributions below:
• We provide a comprehensive complexity analysis of spectral convolutional layers, focusing on the on-chip storage versus bandwidth tradeoff. This analysis enables us to effectively identify the most critical bottleneck in each layer.
• We propose a flexible dataflow for choosing the optimal data reuse strategy in different spectral convolutional layers that minimizes data transfers and off-chip communication. To support this flexibility, we also design a unified architecture that can adjust dataflow on the fly.
• Leveraging the fact that multiple sparse kernels access the same input, we design an approximate exact-cover based scheduling algorithm to optimally schedule on-chip memory access with minimum read conflicts. For a given number of input image replicas, we use our schedule to perform spectral convolution on each input tile in a minimum number of clock cycles. Our scheduling algorithm assumes no pattern constraints on sparse kernels and is therefore generalizable to any set of sparse kernel inputs.
• We implement our design on a state-of-the-art FPGA platform. Using our flexible dataflow, data transfers are reduced by 42% for VGG16. With our scheduling algorithm, DSP utilization reaches up to 90% with only 10 input replications for 64% parallel kernels (on 4× compressed VGG16).
Compared to the baseline inference latency of 68 ms for processing all sparse spectral convolutional layers [15] (at a bandwidth of 9 GB/s), our design needs only 9 ms in total at a bandwidth of 12 GB/s. When the baseline latency is scaled to match our performance, the required bandwidth becomes 58 GB/s higher, far beyond the capacity of a single DDR. Though our principal focus is on reducing latency, we are still able to achieve high throughput (112 fps).
RELATED WORK
Model pruning [7] [9] [14] [28] aims to reduce kernel storage by removing redundant and insignificant connections in both convolutional and fully-connected layers. These methods either prune single connections [7] [28], or remove a group of connections, such as a whole channel [9] or a whole kernel [14]. Model quantization [6] [5] [10] [12] compresses models by encoding floating-point weights or activations into short-bit representations. Quantization can reduce computational complexity by converting floating-point operations into integer or bit operations, while still retaining a regular dataflow that is friendly to hardware implementation.
Besides model compression, hardware implementation also plays a key role in accelerating CNN deployment. Current hardware platforms either fail to efficiently parallelize large-scale convolution/matrix operations [20], or have a fixed memory and computing hierarchy that cannot fit all CNN models [4]. Given these limitations, domain-specific architectures have emerged as an effective solution for accelerating CNN inference. FPGA-based architectures [17] [27] [22] [26] [25] achieve very high throughput or low latency by tailoring the configurable resources (DSPs/BRAMs/LUTs) to a given model. In this way, each model can be quickly deployed onto FPGA platforms. Together with design space exploration, performance and resources can be well balanced. Among these works, [17] adopts a kernel-shaped PE array that processes all operations in one kernel simultaneously. High throughput on the target model (VGG16) can be obtained, but at the cost of less configurability for models with different kernel sizes. [27] views convolutional and fully-connected layers as a uniform matrix-multiplication process. Accordingly, a unified micro-architecture is proposed, in which the dataflow for convolutional and fully-connected layers can be separately optimized to maximize FPGA computing and bandwidth utilization. Recognizing the abundant data reuse in convolutional layers, [22] proposes a systolic-array architecture that fully exploits data reuse by enabling communication among multiple processing elements (PEs).
Unlike previous works, [26] [25] convert spatial convolution into the spectral domain. By first tiling input images and then applying the Fast Fourier Transform (FFT) to each input tile, they convert computation-intensive convolution operations into lightweight Hadamard operations in the spectral domain. The advantage is twofold. First, the dataflow in convolutional layers is simplified by the Hadamard product. Second, computational complexity is reduced without sacrificing accuracy. For example, for VGG16, computational complexity can be reduced by 3× with an input tile size of 8 × 8. However, these two methods come with the serious issue of kernel memory explosion in the spectral domain. For VGG16, converting a 3 × 3 spatial kernel to an 8 × 8 complex-number kernel in the spectral domain increases storage by almost 15×. This issue becomes even more pressing with a larger FFT size, which causes convolutional layers to be memory-bound instead of computation-bound. As a result, the reduction in computational complexity cannot be fully exploited on hardware platforms due to high memory and communication overhead.
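As a rough check of the storage growth quoted above, the per-kernel ratio can be worked out directly. This is only a sketch under two assumptions that the text does not state: each complex value is stored as two real words of the same width as a spatial weight, and no conjugate symmetry of the spectrum is exploited.

```latex
\[
\frac{\text{spectral storage}}{\text{spatial storage}}
  = \frac{2 \cdot K^{2}}{k^{2}}
  = \frac{2 \cdot 8^{2}}{3^{2}}
  = \frac{128}{9}
  \approx 14.2
\]
```

Under these assumptions the ratio is about 14×, which is of the same order as the figure quoted above.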
BACKGROUND
Spatial CNNs, as we call them, are convolutional neural networks that consist of stacked conventional convolutional layers, each followed by an activation function (such as ReLU), and that usually end with fully-connected layers. Convolutional layers extract features from the inputs and combine them into higher-level representations [24]. Given input activations $X \in \mathbb{R}^{b \times c_{in} \times h_{in} \times w_{in}}$ and kernels $W \in \mathbb{R}^{c_{out} \times c_{in} \times k \times k}$, the outputs $Y \in \mathbb{R}^{b \times c_{out} \times h_{out} \times w_{out}}$ can be calculated as:
$$Y[:, n] = \sum_{m=1}^{c_{in}} X[:, m] * W[n, m], \quad n = 1, \dots, c_{out} \tag{1}$$
where $b$ denotes the batch size, and $c_{in}$ and $c_{out}$ denote the numbers of channels of the input and output activations. $h_{in}$, $w_{in}$, $h_{out}$ and $w_{out}$ indicate the spatial dimensions of the input and output activations. "$*$" is the convolution operator$^{1}$.
Fully-connected layers consist only of ordinary matrix-vector operations on the flattened input batch $X \in \mathbb{R}^{b \times M}$, producing output activations $Y \in \mathbb{R}^{b \times N}$. Given kernel $W \in \mathbb{R}^{M \times N}$, $Y$ is obtained as shown in Eq. (2):
$$Y = XW \tag{2}$$
where $M$ and $N$ denote the lengths of the input and output activations. Compared with fully-connected layers, convolutional layers account for most of the computation. For example, in VGG16, 99% of the computations fall into convolutional layers. On the other hand, convolutional layers hold only 10% of the total parameters, so spatial convolutional layers are bounded by the available computing resources.
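The two layer types described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: stride 1, no padding, and the cross-correlation convention used by most CNN frameworks are assumptions, and the function names `conv_layer` and `fc_layer` are hypothetical.

```python
import numpy as np

def conv_layer(x, w):
    """Spatial convolutional layer.

    x: (b, c_in, h_in, w_in) activations, w: (c_out, c_in, k, k) kernels.
    Assumptions (not from the text): stride 1, no padding,
    cross-correlation convention.
    """
    b, c_in, h_in, w_in = x.shape
    c_out, _, k, _ = w.shape
    h_out, w_out = h_in - k + 1, w_in - k + 1
    y = np.zeros((b, c_out, h_out, w_out))
    for i in range(k):
        for j in range(k):
            # Input window shifted by (i, j): shape (b, c_in, h_out, w_out).
            patch = x[:, :, i:i + h_out, j:j + w_out]
            # Accumulate over input channels m for every output channel n.
            y += np.einsum("bmhw,nm->bnhw", patch, w[:, :, i, j])
    return y

def fc_layer(x, w):
    """Fully-connected layer: x is (b, M), w is (M, N), result is (b, N)."""
    return x @ w
```

The nested loop over the k × k kernel positions makes the multiply-accumulate count explicit, which is why these layers dominate computation while holding few parameters.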
Spectral CNNs are models that convert convolution in the spatial domain into a Hadamard product in the spectral (frequency) domain using the Fast Fourier Transform (FFT) [26]. To reduce the complexity of the FFT and simplify hardware implementation, the inputs $X$ are first broken up into small tiles $X_{t,s} \in \mathbb{R}^{b \times c_{in} \times h'_{in} \times w'_{in}}$, where $t, s$ denote the position of each tile in the whole image.
Then, given window size $K = h'_{in} + k - 1$, the FFT is applied to these small tiles to obtain their spectral-domain counterparts $\tilde{X}_{t,s} \in \mathbb{C}^{b \times c_{in} \times K \times K}$. After applying the spectral kernels $\tilde{W} \in \mathbb{C}^{c_{out} \times c_{in} \times K \times K}$ to each tile and accumulating along the input-channel dimension, we get the spectral-domain output tiles $\tilde{Y}_{t,s} \in \mathbb{C}^{b \times c_{out} \times K \times K}$, as shown in Eq. (3):
$$\tilde{Y}_{t,s}[:, n] = \sum_{m=1}^{c_{in}} \tilde{X}_{t,s}[:, m] \circ \tilde{W}[n, m] \tag{3}$$
where "$\circ$" denotes the Hadamard product.
Finally, we use the Inverse Fast Fourier Transform (IFFT) to convert these tiles back to the spatial domain, getting $Y_{t,s} \in \mathbb{R}^{b \times c_{out} \times K \times K}$. To get spatial-domain activations for the subsequent non-linear operations (such as ReLU and pooling), we need to concatenate these tiles. Due to the sliding-window nature of convolution, the output tiles $Y_{t,s}$ overlap at their boundaries. Hence, an operation called Overlap-and-Add (OaA) is applied to add and merge the overlapping regions [26]:
$$Y = \mathrm{OaA}\left(\{Y_{t,s}\}\right) \tag{4}$$
where $\mathrm{OaA}(\cdot)$ denotes the operations that add and merge output tiles of the same image. Converting CNNs into the spectral domain reduces computational complexity in most convolutional layers. For VGG16, with FFT size K = 8, computational complexity can be reduced by 2× without any accuracy degradation [26]. Furthermore, the Hadamard operation simplifies the dataflow, since sliding a window over each input tile is avoided.
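The tiling, FFT, Hadamard product, IFFT and Overlap-and-Add steps described above can be checked numerically with a short NumPy sketch. It is an illustration under stated assumptions rather than the authors' design: the batch dimension is dropped, the image size is assumed to be a multiple of the tile size, the mathematical (flipped-kernel) full convolution is used as the reference, and the names `direct_full_conv` and `spectral_conv_oaa` are hypothetical.

```python
import numpy as np

def direct_full_conv(x, w):
    """Reference: full 2-D linear convolution, summed over input channels.

    x: (c_in, h, w_), w: (c_out, c_in, k, k) -> (c_out, h + k - 1, w_ + k - 1)
    """
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    y = np.zeros((c_out, h + k - 1, wd + k - 1))
    for n in range(c_out):
        for m in range(c_in):
            for i in range(k):
                for j in range(k):
                    # Each kernel tap adds a shifted, scaled copy of the input.
                    y[n, i:i + h, j:j + wd] += w[n, m, i, j] * x[m]
    return y

def spectral_conv_oaa(x, w, tile):
    """Tile the input, multiply in the spectral domain, then overlap-and-add."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    assert h % tile == 0 and wd % tile == 0, "sketch assumes evenly tiled inputs"
    K = tile + k - 1                          # FFT window size
    w_f = np.fft.fft2(w, s=(K, K))            # spectral kernels, (c_out, c_in, K, K)
    y = np.zeros((c_out, h + k - 1, wd + k - 1))
    for t in range(0, h, tile):
        for s in range(0, wd, tile):
            # Zero-pad each tile to K x K and transform: (c_in, K, K).
            x_f = np.fft.fft2(x[:, t:t + tile, s:s + tile], s=(K, K))
            # Hadamard product, accumulated along input channels.
            y_f = (x_f[None] * w_f).sum(axis=1)
            y_tile = np.fft.ifft2(y_f).real
            # Overlap-and-add: neighbouring K x K output tiles share k - 1 rows/cols.
            y[:, t:t + K, s:s + K] += y_tile
    return y
```

Because the window size K equals the tile size plus k − 1, the circular convolution implied by the FFT coincides with the linear convolution of each tile, so summing the overlapping borders reproduces the direct result.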
FPGA Acceleration for CNNs takes advantage of FPGAs to parallelize computations in CNNs and customize the corresponding dataflow. Thanks to the intrinsically parallel architecture of FPGAs, convolution and matrix operations can be efficiently mapped onto DSP arrays. The main focus of FPGA acceleration is customizing the memory hierarchy so that data movement is minimized between on- and off-chip memory and maximized within local memory buffers. In this way, PEs can reach peak performance during computation. Unlike in ASICs, the required on-chip memory in an FPGA is mapped onto an array of Block RAMs (BRAMs), each of which can store 36 Kb of data (the 36 Kb BRAM is the basic unit in Xilinx devices; for Intel FPGAs, it is 20 Kb). The logic controlling dataflow and computation is mapped onto Look-Up Table (LUT) arrays, and each LUT can be configured as various gate functions. Hence, given the available resources above ($N_{DSP}$ DSPs, $N_{BRAM}$ BRAMs, $N_{LUT}$ LUTs, as well as off-chip bandwidth $BW_{sys}$), FPGA designs aim to optimize a certain objective function $T$ under these resource constraints:
$$\text{optimize } T \quad \text{s.t.} \quad n_{DSP} \le N_{DSP},\; n_{BRAM} \le N_{BRAM},\; n_{LUT} \le N_{LUT},\; bw \le BW_{sys} \tag{5}$$
where $n_{DSP}$, $n_{BRAM}$, $n_{LUT}$ and $bw$ are the resources required by the design. The objective $T$ can be maximizing throughput, minimizing latency, minimizing power consumption, etc. Nowadays, since CNN compression techniques (pruning and quantization) play a key role on edge devices, FPGAs become even more important due to their powerful configurability. For example, pruned models require a dedicated dataflow to address irregular data access and load imbalance, which makes the FPGA a well-suited platform for designing a specialized memory hierarchy that fully exploits the potential of compressed CNNs.
COMPLEXITY ANALYSIS
Given enlarged spectral kernels, memory and communication in spectral CNNs, even compressed ones, follow totally different patterns from those in spatial CNNs. On edge devices, with limited on-chip memory and bandwidth as well as low-latency requirements, the trade-off between buffering data and off-chip communication becomes nontrivial. Hence, a complexity analysis can help locate the primary bottleneck and explore new trade-offs in each spectral convolutional layer.
The complexity analysis is based on a general architecture model, in which data (inputs, outputs, kernels) are originally stored in off-chip memory and then streamed into on-chip memory. For each spectral convolutional layer, the computing engine first fetches tiles of inputs and kernels, then computes the Hadamard product and accumulation, and finally writes output tiles to off-chip memory, as shown in Fig. 1. In the analysis, we assume each $K \times K$ spectral kernel is compressed by a factor of $\alpha$, so that only $K^2/\alpha$ values are non-zero. In this context, we analyze the complexity from two aspects: on-chip storage and communication bandwidth.
On-chip storage
In most cases, on-chip memory resources cannot hold all kernels/activations. The common technique is to tile kernels/activations, calculating partial outputs each time. In the next round, we keep some kernels/activations on chip to be reused and bring in some new data. Hence, different data reuse approaches require different numbers of BRAMs. In spectral convolutional layers, we can choose to reuse input tiles, kernels, or output partial sums.
To reuse input tiles, we fix input activations on chip until they have served all kernels, while kernels are fed into on-chip buffers in groups. Since the kernel buffers are overwritten by each new group of kernels, we have to re-load the kernels when starting operations on new input tiles. Conversely, reusing kernels leads to the opposite dataflow, in which a subset of kernels is kept on chip and input tiles are streamed instead. Reusing output partial sums means keeping partial sums on chip until all input channels of the current tile are finished, which avoids writing and re-loading partial sums back and forth.
To give a general analysis, suppose we parallelize $N'$ kernels, $M'$ input channels, and $P'$ input tiles. Since each BRAM supports only one concurrent access, each parallel line (kernel, input channel, input tile) needs at least one BRAM to support parallel accesses. Besides, we also need to write partial sums to the output buffer in parallel. In addition, if kernels are sparse, we need some replicas ($r$) of each input tile to support accesses to the same input tile from multiple kernels (see Sec. 5.3).
In the case of reusing kernels and output partial sums (Flow #1), we keep kernels and partial sums on chip and stream all input tiles; the required BRAMs will be at least
where "1024" indicates the memory depth of a single BRAM, and "$K^2$" is the output tile size.
On the other hand, if we choose to reuse input tiles and output partial sums while streaming kernels (Flow #2), the required BRAMs will be at least
If we reuse input tiles and kernels while streaming partial sums (Flow #3), $n_{BRAM}$ can be obtained as follows:
where α is the compression ratio.
For some layers, $n_{BRAM}$ can be very large for a given flow, even beyond the system capacity, as detailed in Sec. 5.2. Through this analysis, we can therefore determine accurately whether a given flow is bounded by BRAMs. We can also find out which flow needs the minimum number of BRAMs.
Communication bandwidth
Different dataflows incur different communication overheads between on- and off-chip memory. Suppose that, for a given spectral convolutional layer $i$, the latency is set to $\tau_i$ (in seconds); the lower bound of the required bandwidth is then $bw = \frac{\#\text{Data transfers}}{\tau_i}$. Off-chip communication consists of reading inputs and kernels, writing outputs, and writing and re-loading partial sums. Given the three dataflows above, we formulate the required bandwidth as follows:
In Flow #1, the input buffer is overwritten by new input tiles after each round. When we start computation with a new group of kernels, we have to re-load the corresponding input tiles. Given latency $\tau_i$ in layer $i$, the required bandwidth is:
where "$\frac{N}{N'}$" is the number of times input tiles are re-loaded. In Flow #2, the kernel buffer is overwritten instead. Compared with Flow #1, the main communication cost lies in re-loading kernels.
where "$\frac{h_{in} w_{in}}{P' h'_{in} w'_{in}}$" is the number of times kernels are re-loaded.
Unlike Flow #1 and Flow #2, Flow #3 needs to write partial sums to off-chip memory and re-load them when needed, which stands out as the main communication overhead.
where "$2 \cdot \frac{M}{M'}$" is the number of times partial sums are written/read.
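The bandwidth lower bound and the three repetition factors quoted in this subsection can be evaluated with a small helper. This is a hedged sketch, not the paper's full bandwidth model: it only computes the factors named in the text, the ceilings for non-divisible sizes are an assumption, and the function and argument names are hypothetical.

```python
import math

def bandwidth_lower_bound(bytes_transferred, latency_s):
    """bw = (#data transfers) / tau_i, here in bytes per second."""
    return bytes_transferred / latency_s

def reload_counts(N, N_par, M, M_par, h_in, w_in, P_par, h_tile, w_tile):
    """Repetition factors quoted for the three flows.

    N, M: numbers of kernels and input channels; N_par, M_par, P_par: the
    parallelism N', M', P'. Ceilings are an assumption for sizes that do
    not divide evenly; the text states the plain ratios.
    """
    n_tiles = math.ceil(h_in / h_tile) * math.ceil(w_in / w_tile)
    return {
        "flow1_input_reloads": math.ceil(N / N_par),          # N / N'
        "flow2_kernel_reloads": math.ceil(n_tiles / P_par),   # h_in w_in / (P' h'_in w'_in)
        "flow3_psum_transfers": 2 * math.ceil(M / M_par),     # 2 * M / M'
    }
```

For instance, with 512 kernels processed 64 at a time, input tiles would be re-loaded 8 times in Flow #1, while processing input channels serially (M' = 1) makes the partial-sum traffic of Flow #3 grow linearly with the number of input channels.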
Combining these two analyses exposes each layer's on-chip storage and communication overhead for each dataflow. Knowing the overhead, we can find an optimal dataflow, possibly a hybrid one that combines different flow strategies, to minimize hardware overhead, as detailed in Sec. 5.2.
ARCHITECTURE DESIGN
5.1 Overview
According to the previous analysis, each spectral convolutional layer might prefer a certain dataflow to achieve the best trade-off between memory and communication overhead. To support this flexibility, we design a unified architecture that can adjust the dataflow on the fly, as shown in Fig. 1. The streaming controller decides which data (inputs, kernels, partial sums) should be reused or streamed based on each layer's configuration (Sec. 5.2). In addition, irregular data access is handled by jointly optimizing the scheduling algorithm and the supporting hardware implementation (Sec. 5.3).
The whole process is as follows. All inputs, kernels, and outputs are originally stored in DDR memory. Once the FPGA kernel starts, input tiles and kernels are fed into on-chip buffers before the processing elements (PEs) are enabled. The input tiles are then converted into the spectral domain by a 2D FFT (kernels are already in spectral representation). Next, we start all parallel PEs, calculating the Hadamard product and accumulating partial sums along input channels. After all input channels of the current tiles are done, output tiles are converted back to the spatial domain using a 2D IFFT before being written to DDR memory. To reduce latency, we process multiple input tiles and kernels simultaneously, and process input channels serially ($M' = 1$) so that write conflicts are avoided. In each round, one input tile is reused by all $N'$ parallel kernels, while each kernel is reused by $P'$ tiles.
Flexible dataflow
Keeping a fixed dataflow across all layers cannot guarantee that data transfers are minimized. Based on the analysis in Sec. 4, the required on-chip memory and communication bandwidth differ among dataflows. For example, Fig. 2 shows the required on-chip memory and off-chip bandwidth of the three dataflows for compressed VGG16 (α = 4): Flow #1 (streaming input tiles), Flow #2 (streaming kernels), and Flow #3 (streaming partial sums).
Streaming input tiles (reusing kernels and partial sums) incurs fewer data transfers but requires a huge number of BRAMs. Streaming kernels, on the other hand, needs fewer BRAMs but incurs higher communication overhead. Interestingly, streaming partial sums brings no advantage at all due to its huge amount of data transfers. Similar observations also hold for other compressed networks. Inspired by this, we propose an optimization algorithm that flexibly decides how many kernels should be processed before flushing the corresponding input tiles, and how many input tiles before flushing the corresponding kernels. We call these two parameters streaming parameters: $N_s$ (kernels) and $P_s$ (input tiles). The required number of BRAMs is then at least:
The required bandwidth is at least:
Since different architecture parameters ($P'$, $N'$) also lead to different performance, we design a heuristic optimization method that finds the best settings for both architecture and streaming parameters, as shown in Alg. 1. Given a compressed model, Alg. 1 performs a heuristic search over the architecture parameter space as well as the streaming parameter space. In each layer, we fix the architecture parameters ($P'$, $N'$) and try all possible streaming parameters ($P_s$, $N_s$). For each streaming setting, we choose the flow with the minimum required BRAMs and then calculate its bandwidth $bw$. We update the optimal streaming parameters if the resulting bandwidth $bw$ is less than the previous minimum $bw_{min}$. After iterating over all layers, we record the maximum bandwidth $bw_{max}$ as the bandwidth requirement of the current architecture setting. Finally, we choose the pair ($P'$, $N'$) and ($P_s$, $N_s$) with the minimum $bw_{max}$ for the given network model.
The streaming controller in Fig. 1 is designed to adjust the dataflow in different layers. Streaming options are managed by an internal state machine, as shown in Fig. 3. Each time a spectral convolution is done (DONE CONV), the state machine first checks whether the number of processed kernels has reached $N_s$. If not (!$N_s$), it continues reading kernels (READ KERNEL). Otherwise, when not all input channels of the current kernels have been processed (!$M_s$), two cases can arise: first, $P_s$ input image tiles have been finished, in which case we flush all kernels and input tiles and load input tiles and kernels from a new input channel; second, we are still processing the $P_s$ input tiles of the current input channel, in which case we load new input tiles while the kernels are already loaded. The remaining situation is that $M_s$ input channels for the current $N_s$ kernels and $P_s$ tiles are done; the state machine then starts the IFFT operation (PROC IFFT) to convert the output image tiles back into the spatial domain and writes them to off-chip memory (WRITE OUT). At this point, we check whether all kernels and input tiles are done: if not (!(N&P)), we go back to reading input tiles and kernels; otherwise, we complete the current spectral convolutional layer (DONE).
Algorithm 1
    ...
    for (Ps, Ns) in all possible streaming parameters do
        n_BRAM1 ← #BRAMs needed for Flow #1
        n_BRAM2 ← #BRAMs needed for Flow #2
        n_BRAM3 ← #BRAMs needed for Flow #3
        Choose the flow with minimum #BRAMs
        n_BRAM ← min(n_BRAM1, n_BRAM2, n_BRAM3)
        if n_BRAM < N_BRAM then
            bw ← Calculate bandwidth (Eq. 13)
            if bw < bw_min then            ▷ Current streaming
                bw_min ← bw
                Update streaming parameter (Ps, Ns)
    ...
    if bw_max < bw_arch then               ▷ Current arch
        bw_arch ← bw_max
        Update arch parameter (P′, N′)
    end if
end for
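The search described for Alg. 1 can be sketched in Python as follows. This is a minimal sketch under assumptions, not the authors' code: the BRAM and bandwidth models are passed in as callables because their closed forms are not reproduced here, an architecture setting is skipped when some layer has no feasible streaming setting (an assumption), and all names (`search_dataflow`, `stream_space`, `bram_fns`, `bw_fn`) are hypothetical.

```python
def search_dataflow(layers, arch_space, stream_space, bram_fns, bw_fn, n_bram_max):
    """Return (bw_max, arch, per-layer streaming params) with the smallest bw_max.

    arch_space: iterable of architecture settings, e.g. (P', N') pairs.
    stream_space(layer, arch): candidate streaming settings, e.g. (Ps, Ns).
    bram_fns: one BRAM-count callable per flow, each f(layer, arch, stream).
    bw_fn(layer, arch, stream): required bandwidth for that setting.
    """
    best = None
    for arch in arch_space:
        bw_max, per_layer = 0.0, []
        for layer in layers:
            bw_min, best_stream = float("inf"), None
            for stream in stream_space(layer, arch):
                # Take the flow that needs the fewest BRAMs for this setting.
                n_bram = min(f(layer, arch, stream) for f in bram_fns)
                if n_bram < n_bram_max:
                    bw = bw_fn(layer, arch, stream)
                    if bw < bw_min:
                        bw_min, best_stream = bw, stream
            if best_stream is None:
                break  # no feasible streaming setting for this layer
            # The most demanding layer sets the bandwidth requirement.
            bw_max = max(bw_max, bw_min)
            per_layer.append(best_stream)
        else:
            # Reached only if every layer had a feasible setting.
            if best is None or bw_max < best[0]:
                best = (bw_max, arch, per_layer)
    return best
```

Passing the cost models as callables keeps the sketch independent of any particular BRAM or bandwidth formula; plugging in the paper's expressions would be a matter of supplying those functions.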
Memory access scheduling
Memory access scheduling is needed because parallel sparse kernels have different access patterns to input tiles. Fig. 4 shows a typical case of memory access in our design, in which multiple kernels are processed in parallel while the weight values in each kernel are streamed serially. In each cycle, each PE sends a read address to the input BRAM. After getting the input, the PE multiplies it by the corresponding kernel value and accumulates the product with the value in the partial sums buffer (Fig. 1). Given $N'$ parallel kernels, in the worst case one input BRAM needs to provide $N'$ different values in a single clock cycle, which violates the BRAM's single-access constraint.
To prevent kernels from starving for lack of available inputs, we use $r$ replicas to increase the throughput of the input BRAMs. Since some kernels might access the same location, which can be served by one BRAM, the number of replicas can be smaller than $N'$. Given $N'$ sparse kernels and $r$ replicas of each input tile, we design a novel scheduling algorithm that groups read requests so as to minimize the clock cycles of the Hadamard product on the current $N'$ kernels and $P'$ tiles.
Our scheduling arises from the observation that the order in which the values of a kernel are processed does not matter. As long as the corresponding index accompanies each value to specify where the result should be written, we can rearrange the values in the current kernel group in any order. The scheduling algorithm aims to group access addresses so that the number of distinct addresses in each read cycle does not exceed $r$, while the number of cycles needed to finish the current group of sparse kernels is minimized. Consider $N'$ sparse kernels, each represented in the format $\{(val, index) \mid 0 \le index < K^2\}$. The number of indices in each kernel equals exactly $K^2/\alpha$. Therefore, the schedule can be obtained by solving the following problem: given a matrix $M$ of size $N' \times K^2/\alpha$ where each row stores the indices of one kernel, rearrange the values in each row to minimize
is the number of distinct indices in the $i$-th column. We refrain from rearranging within the columns, as it may result in memory conflicts in the output.
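To make the index matrix concrete, the sketch below builds (val, index) lists and counts the distinct read addresses per column. It is illustrative only: magnitude pruning is used as a simple stand-in for the ADMM-based method of [15], flattened row-major indices are an assumption, and the names `to_sparse` and `distinct_per_column` are hypothetical.

```python
import numpy as np

def to_sparse(kernel, alpha):
    """Keep the K^2/alpha largest-magnitude entries of a K x K kernel.

    Returns (val, index) pairs with flattened indices in ascending order.
    Magnitude pruning is a stand-in here, not the ADMM method of [15].
    """
    flat = kernel.ravel()
    keep = flat.size // alpha
    idx = np.sort(np.argsort(-np.abs(flat), kind="stable")[:keep])
    return [(flat[i], int(i)) for i in idx]

def distinct_per_column(index_matrix):
    """Distinct read addresses in each column (one column = one read cycle).

    index_matrix: N' rows, each holding the K^2/alpha indices of one kernel.
    """
    return [len(set(col)) for col in zip(*index_matrix)]
```

A column whose distinct-address count exceeds the number of replicas cannot be served in a single cycle, which is what reordering values within each row tries to avoid.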
We first transform the representation into a bipartite graph, as shown in Fig. 5, in which each connection between a kernel node and an index node means that the kernel has a non-zero value at the corresponding index $ID_x$.
Based on this bipartite graph, we can construct a unit scheduling set $s_i = \{(KR_{x_i}, ID_{x_i}), \dots, (KR_{y_i}, ID_{y_i})\}$ that meets the following constraints: (C1) the elements of each set have distinct kernel nodes $KR_x$, which reflects the fact that only one value of each kernel can be processed in each cycle; (C2) the number of different indices $ID_x$ cannot exceed $r$, which enables $r$ replicas to feed all selected input tiles. Ideally, we can find all valid sets $S = \{s_1, s_2, ..., s_i, ...\}$. Then, to get an optimal schedule with the minimum number of clock cycles, we need to find the set collection $S^* \subset S$ that has the minimum number of sets. This is exactly an exact cover problem.
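Constraints C1 and C2 translate directly into a small validity check on a candidate unit scheduling set. The sketch below assumes a set is given as (kernel id, index id) edges of the bipartite graph; the function name `is_valid_unit_set` is hypothetical.

```python
def is_valid_unit_set(pairs, r):
    """Check constraints C1 and C2 for one candidate unit scheduling set.

    pairs: iterable of (kernel_id, index_id) edges of the bipartite graph.
    r: number of input replicas available per cycle.
    """
    pairs = list(pairs)
    kernels = [kr for kr, _ in pairs]
    indices = {idx for _, idx in pairs}
    c1 = len(kernels) == len(set(kernels))  # C1: one value per kernel per cycle
    c2 = len(indices) <= r                  # C2: at most r distinct read addresses
    return c1 and c2
```

Three kernels can share one cycle with only two replicas as long as two of them read the same address, which is why fewer than N' replicas can suffice.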
Given the huge number of possible sets $s_i$ and the fact that the exact-cover problem is NP-complete, we use a greedy algorithm to efficiently approximate the optimal solution, as shown in Alg. 2. In each step, we greedily pick a set $s_i$ that leads toward the optimal solution. Two cases come up during the search. First, if there is a set collection $S'$ whose sets cover all kernel nodes, we need to decide which set to choose so that it has minimal adverse effect on the subsequent optimization. We choose to leave as many high-degree index nodes as possible for future search, since it is easier to find a set that covers most kernels if high-degree index nodes are still free. Therefore, we choose the set $s_i$ that has lower-degree index nodes, leaving high-degree nodes untouched. In the second case, if no set covers all kernel nodes, we choose the set $s_i$ that covers as many kernel nodes as possible, which aims to maximize the current PE utilization.
Algorithm 2
    ...
        Find set collection S that meets the above constraints C1, C2
        if there are sets S′ covering all kernel nodes then
            for set s_i in S
            ...
            for set s_i in S do
                Choose s_i which covers most kernel nodes in G(KR, ID)
            end for
        end if
        Push selected set s_i into S*
        Delete edges in selected set s_i from G(KR, ID)
    end while
After getting the scheduling list, to efficiently access values in the kernel and replica buffers, we break $S^*$ into two parts, the INDEX table and the VALUE table. For each $s_i \in S^*$, we collect the unique indices and store them in the INDEX table; we store the weights and the corresponding selection signal sel in the VALUE table, as shown in Fig. 6. In the hardware implementation, we first read the INDEX (rep_0, rep_1) and VALUE tables, then use the indices to read the necessary inputs for the current cycle, and finally use the sel signal in the VALUE table to route inputs to the corresponding PEs. Furthermore, in some cases, some kernels might be inactive because there are too many unique addresses in the current cycle. Therefore, we use a valid signal to indicate whether a given kernel should be fed.
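The scheduling and table-building steps can be illustrated with a simplified Python sketch. Note that this is not Alg. 2 itself: instead of enumerating all valid sets, it uses a plain greedy max-coverage heuristic (pick up to r addresses per cycle, each time the one needed by the most not-yet-served kernels, ties broken toward the smaller address), and it omits the low-degree preference described above. The names `greedy_schedule` and `build_tables`, and the tuple layout of the VALUE table, are assumptions made for illustration.

```python
def greedy_schedule(kernels, r):
    """kernels: list of sets of non-zero indices, one set per kernel.

    Returns a list of cycles; each cycle maps kernel id -> index read.
    Every cycle satisfies C1 (one value per kernel) and C2 (<= r addresses).
    """
    remaining = [set(k) for k in kernels]
    schedule = []
    while any(remaining):
        covered = {}
        for _ in range(r):
            # For each address, count the unserved kernels that still need it.
            gain = {}
            for kr, idxs in enumerate(remaining):
                if kr not in covered:
                    for idx in idxs:
                        gain[idx] = gain.get(idx, 0) + 1
            if not gain:
                break
            best = max(gain, key=lambda i: (gain[i], -i))
            for kr, idxs in enumerate(remaining):
                if kr not in covered and best in idxs:
                    covered[kr] = best
        for kr, idx in covered.items():
            remaining[kr].discard(idx)      # delete the scheduled edges
        schedule.append(covered)
    return schedule

def build_tables(schedule, weights, n_kernels):
    """Split a schedule into an INDEX table and a VALUE table.

    weights[kr][idx] is the value of kernel kr at flattened index idx.
    VALUE rows hold (weight, sel, valid) per kernel; sel picks the replica.
    """
    index_table, value_table = [], []
    for cycle in schedule:
        uniq = sorted(set(cycle.values()))  # one address per replica
        row = []
        for kr in range(n_kernels):
            if kr in cycle:
                idx = cycle[kr]
                row.append((weights[kr][idx], uniq.index(idx), True))
            else:
                row.append((0, 0, False))   # kernel idle in this cycle
        index_table.append(uniq)
        value_table.append(row)
    return index_table, value_table
```

On a toy case with three kernels of two non-zeros each, two replicas let this heuristic finish in two cycles, whereas a single replica needs three, which mirrors the trade-off between replica count and PE utilization discussed in the text.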
DESIGN EVALUATION
The fundamental objective of this paper is to reduce the communication overhead brought by enlarged spectral kernels and to minimize read conflicts when multiple sparse kernels access an input tile in a single BRAM. In this section, we use VGG16 to show: (1) how our dataflow optimization technique achieves an optimal trade-off between bandwidth and BRAM usage; (2) how the exact-cover based scheduling method improves PE utilization. Sec. 6.1 covers detailed results on BRAM usage and communication bandwidth under various dataflow techniques. Then, in Sec. 6.2, we employ kernels with different patterns to show that the exact-cover based scheduling method delivers higher PE utilization. FPGA implementation results are shown in Sec. 6.3. We use a heterogeneous CPU-FPGA platform, Xilinx Alveo U200, in which operations such as OaA [26], ReLU, pooling, and fully-connected layers are offloaded to the CPU, while the FPGA is dedicated to spectral convolutional layers. 16-bit fixed-point (FX) numbers are used during computation. The RTL code is synthesized and implemented in Xilinx Vivado 2018.3, then integrated into an OpenCL environment.
Data transfers
We use the amount of data transferred between on- and off-chip memory as the metric for comparing the communication overhead of different dataflows. The required bandwidth then follows directly from the latency τ_i of each layer.
Table 1 shows the optimal architecture and streaming parameters for VGG16 given compression ratio α = 4. Since the first layer (conv1_1) involves negligible computation, we omit it from the dataflow optimization. Given these parameters for each layer, and based on the complexity analysis in Sec. 4 and Sec. 5.2, we compare the optimized flow Flow opt with Flow #1 (streaming kernels) and Flow #2 (streaming input tiles) in terms of BRAM usage and data transfers. Fig. 7 shows the complexities of the different dataflows for VGG16 with K = 8 and α = 4. Compared with Flow #1 and Flow #2, Flow opt transfers the least data in almost all layers while keeping BRAM usage moderate. Although Flow opt consumes more BRAMs than Flow #1 because more partial sums need to be stored, it significantly reduces the number of times input tiles or kernels are reloaded during computation, which relieves much of the pressure on data transfer. Spectral VGG16 with K = 16 shows similar improvements. However, since the model with 16 × 16 spectral kernels needs 4× more storage for kernels, it still incurs a large communication overhead even with our optimized dataflow. Hence, we choose 8 × 8 spectral kernels for the hardware implementation.
The required bandwidth can be calculated from the latency budget τ_i of each layer i. We first assume that the platform has enough bandwidth during computation, so that τ_i is determined by each layer's computational complexity. Suppose the total latency budget for the convolutional layers is τ; then for each layer $\tau_i = \tau \times \frac{CMP_i}{CMP_{total}}$, where CMP_i denotes the number of multiplications and additions in layer i. Table 2 gives the required bandwidth for a total inference latency of τ = 20 ms. Even with enlarged spectral kernels, the whole design still 
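As a small worked illustration of this budgeting step, the following Python sketch splits a total latency budget across layers in proportion to their operation counts and then divides each layer's transferred data by its budget. The function names and all numbers are made up for illustration; they are not taken from Table 1 or Table 2.

```python
def layer_latency_budgets(total_latency_s, cmp_per_layer):
    """tau_i = tau * CMP_i / CMP_total for every layer i."""
    cmp_total = sum(cmp_per_layer)
    return [total_latency_s * c / cmp_total for c in cmp_per_layer]


def required_bandwidth(bytes_per_layer, budgets_s):
    """Bandwidth (bytes/s) needed to move each layer's data within its budget."""
    return [b / t for b, t in zip(bytes_per_layer, budgets_s)]


# Hypothetical two-layer network with a 20 ms total budget.
cmp_per_layer = [1.0e9, 3.0e9]        # multiplications and additions per layer
bytes_per_layer = [10.0e6, 15.0e6]    # off-chip data transferred per layer

budgets = layer_latency_budgets(0.020, cmp_per_layer)
bandwidths = required_bandwidth(bytes_per_layer, budgets)
peak_bandwidth = max(bandwidths)      # the layer that constrains the platform
```

With these example values the first layer gets a quarter of the budget, and although it moves less data than the second layer it needs the higher bandwidth, which is why the per-layer maximum matters rather than the average.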
PE utilization
To evaluate the scheduling algorithm, we use the PE utilization µ_i, the average fraction of active PEs in each layer i. Given r replicas of each input image tile, the total number of multiplications and additions is fixed, so higher PE utilization means fewer processing cycles in each convolutional layer. We define the PE utilization µ_i of layer i as:
$$\mu_i = \frac{1}{T} \sum_{t=1}^{T} \frac{PE_{on}^{t}}{PE_{total}}$$
where T denotes the total number of convolution cycles, each cycle being a Hadamard product of P′ input tiles and N′ kernels; PE_on^t is the number of working PEs in cycle t, and PE_total is the total number of PEs on chip.
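The utilization metric described above, the ratio of working PEs to total PEs averaged over the T cycles of a layer, can be computed as in the following sketch. The function name is hypothetical, and taking PE_total = P′ × N′ = 9 × 64 = 576 is an assumption made only for this example.

```python
def pe_utilization(active_per_cycle, pe_total):
    """Average over all cycles of (working PEs in cycle t) / (total PEs)."""
    num_cycles = len(active_per_cycle)
    if num_cycles == 0:
        raise ValueError("a layer needs at least one cycle")
    return sum(a / pe_total for a in active_per_cycle) / num_cycles


# Assumed PE count for illustration: P' = 9 input tiles times N' = 64 kernels.
PE_TOTAL = 9 * 64

# Made-up trace of working PEs over three cycles.
trace = [576, 288, 432]
mu = pe_utilization(trace, PE_TOTAL)
```

Because the total work is fixed, a schedule that finishes the same layer in fewer cycles necessarily shows a higher value of this metric.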
For comparison, we also implement two baseline methods: random scheduling and lowest-index first scheduling [15]. Random scheduling randomly chooses a kernel and a non-zero weight index in that kernel, then continues to choose other kernels and indices at random until either all kernels are included or the number of unique indices reaches r. Lowest-index first scheduling, in contrast, always picks the kernels with the lowest index in the current group, and repeats this until the same stopping condition is triggered. Fig. 8 shows the PE utilization in each layer of VGG16 for the three scheduling methods, with r = 8 replicas and N′ = 64. The exact-cover based scheduling method achieves the highest PE utilization. Moreover, its performance is consistent across all convolutional layers, whereas lowest-index first scheduling relies heavily on the indices in different kernels being close to each other, as is the case for the kernels in layers conv5_2 and conv5_3.
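The two baselines can be sketched as follows. This is one reading of the description, not the implementation from the paper or from [15]: each kernel is modeled as a set of pending non-zero indices, at most one weight per kernel is served per cycle, and a cycle stops admitting kernels once r unique indices have been chosen. All function names are hypothetical.

```python
import random


def random_cycle(remaining, r, rng):
    """Serve kernels in random order, each with a random pending index."""
    kernels = [k for k, idxs in remaining.items() if idxs]
    rng.shuffle(kernels)
    chosen, unique = {}, set()
    for k in kernels:
        if len(unique) >= r:          # replica budget reached
            break
        idx = rng.choice(sorted(remaining[k]))
        unique.add(idx)
        chosen[k] = idx
    return chosen


def lowest_index_first_cycle(remaining, r):
    """Serve kernels in ascending order of their lowest pending index."""
    heads = sorted((min(idxs), k) for k, idxs in remaining.items() if idxs)
    chosen, unique = {}, set()
    for idx, k in heads:
        if len(unique) >= r:          # same stopping condition as above
            break
        unique.add(idx)
        chosen[k] = idx
    return chosen


def run_schedule(kernels, pick):
    """Run cycles until all weights are served; return (cycles, utilization)."""
    remaining = {k: set(v) for k, v in kernels.items()}
    total_weights = sum(len(v) for v in remaining.values())
    cycles = 0
    while any(remaining.values()):
        for k, idx in pick(remaining).items():
            remaining[k].discard(idx)
        cycles += 1
    return cycles, total_weights / (cycles * len(remaining))


# Made-up example: three kernels, two replicas per input tile.
kernels = {0: {0, 1}, 1: {0, 1}, 2: {5}}
rng = random.Random(0)
lif_cycles, lif_util = run_schedule(
    kernels, lambda rem: lowest_index_first_cycle(rem, 2))
rnd_cycles, rnd_util = run_schedule(
    kernels, lambda rem: random_cycle(rem, 2, rng))
```

In this toy case the kernels share low indices, so lowest-index first finishes in the minimum of two cycles; when indices are scattered, the same rule exhausts the replica budget early, which is the weakness the text attributes to it.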
To show how PE utilization changes with the number of replicas r under different scheduling methods, we vary r from 4 to 20. Fig. 9 shows the average PE utilization under different compression ratios, where the average is obtained by weighting each layer's utilization by its total computation. The scheduling method in this paper reaches very high PE utilization (> 80%) with fewer replicas than the lowest-index first method. Even under ×8 compression, in which the indices are widely scattered from 0 to K² − 1, exact-cover based scheduling still achieves > 80% PE utilization with 10 replicas of each input tile. The lowest-index first method, in contrast, needs 16 replicas to reach comparable performance.

Apart from the sparse kernels generated by the algorithm in [15], robustness to more general sparsity patterns is also critical. We generate sparse kernels from the uncompressed spectral model by randomly choosing K²/α non-zero weights and zeroing the other values. We try various compression ratios as well as different numbers of replicas r. Fig. 10 shows the average PE utilization for the three scheduling methods. The exact-cover based method always outperforms lowest-index first scheduling. In addition, even with a random sparsity pattern, the exact-cover based method achieves PE utilization comparable to that obtained with the ADMM-based pruned kernels when α = 4.
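The random sparsity pattern and the computation-weighted average used in these experiments can be sketched as below. Both function names are hypothetical, the sketch only picks index positions rather than real weights, and it assumes that α divides K² evenly.

```python
import random


def random_sparse_indices(K, alpha, rng):
    """Choose K*K/alpha non-zero positions in a K x K spectral kernel."""
    nnz = (K * K) // alpha
    return sorted(rng.sample(range(K * K), nnz))


def weighted_average_utilization(utilizations, computations):
    """Average per-layer utilization, weighted by each layer's computation."""
    total = sum(computations)
    return sum(u * c for u, c in zip(utilizations, computations)) / total


# An 8 x 8 kernel at compression ratio 4 keeps 16 of its 64 weights.
rng = random.Random(0)
nonzero_positions = random_sparse_indices(8, 4, rng)

# Made-up per-layer utilizations and operation counts.
avg_util = weighted_average_utilization([0.9, 0.5], [3.0e9, 1.0e9])
```

Weighting by computation means that a poorly utilized but small layer has little effect on the reported average, while a large layer dominates it.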
FPGA implementation
For the implementation, we choose VGG16 as the target model, fed with 224 × 224 images, with compression ratio α = 4; the corresponding accuracy is 95.0%. The architecture parameters P′ and N′ are set to 9 and 64, and the streaming parameters of each layer are the same as in Table 1. Based on the scheduling analysis, we set the number of replicas r to 10 to obtain > 90% PE utilization. We use a state-of-the-art FPGA, the Xilinx Alveo U200, as the target platform; it has 6840 DSPs and 2160 BRAMs. URAM (288 Kb each) was introduced recently as a global on-chip memory and is not considered in our analysis. However, our dataflow analysis can still be generalized to this memory by treating off-chip data transfers as communication between global buffers and local buffers. Table 3 shows the implementation results of the whole architecture, together with a performance comparison with other works. Using 40% of the DSPs and 68% of the BRAMs, we still achieve a 200 MHz clock frequency. Given N′ = 64 and P′ = 9, inference over all convolutional layers completes within 9 ms, while the required bandwidth is lowered to 12 GB/s. [26] and [25] accelerate uncompressed spectral CNNs on FPGAs. In terms of latency, we reduce the inference time by 46% after simply scaling the resource utilization. The design in [15] has higher throughput but can only process images in batches. Its single-image inference latency is 68 ms, which is 7.5× higher than that of the design in this paper. If we scale its latency to 9 ms, the required bandwidth explodes to almost 70 GB/s, far beyond the bandwidth of a single DDR memory. [16] is a design for sparse spatial CNNs. Since it uses an older device, for a fair comparison we assume that it can be deployed on the Alveo U200 with access to the same resources. At 200 MHz, we still obtain a latency improvement of nearly 40%.

Figure 9: Average PE utilization for VGG16 given different numbers of replicas
Figure 10: Average PE utilization for random non-zeros given different numbers of replicas
To give an intuitive view of resource utilization on the FPGA device, Fig. 11 shows the FPGA footprint after implementation. Our design takes up almost 70% of the total area and most of the routing resources, which explains why we do not further increase BRAM and DSP usage.
CONCLUSION
Pruned spectral CNNs can reduce computation complexity without accuracy loss but suffer from a kernel storage explosion problem. We develop a principled design approach to fully exploit the potential of sparse spectral CNNs and to close the performance gap between pruning and FPGA implementation. First, we analyze the bandwidth-storage tradeoff of sparse convolutional layers and locate the bottlenecks. We then adopt a flexible dataflow for different layers to minimize communication overhead. To mitigate the performance degradation caused by irregular on-chip memory accesses of sparse kernels, we design an approximate exact-cover based scheduling algorithm that feeds the PEs efficiently with few replicas. Finally, we implement a unified hardware architecture that can adjust the dataflow on the fly and maximize PE utilization. On a Xilinx Alveo U200 platform, with 64 parallel kernels and 9 parallel input tiles, inference takes 9 ms while requiring only 12 GB/s of off-chip bandwidth.
ACKNOWLEDGEMENTS
This work was supported in part by the National Science Foundation under grants CNS-1643351 and CCF-1919117.
