# fpgaHART: A toolflow for throughput-oriented acceleration of 3D CNNs for HAR onto FPGAs

Petros Toupas<sup>1,2</sup>

Christos-Savvas Bouganis<sup>1</sup>

Dimitrios Tzovaras<sup>2</sup>

<sup>1</sup> Dpt. of Electrical and Electronic Engineering Imperial College London Email: {p.toupas21,christos-savvas.bouganis}@imperial.ac.uk <sup>2</sup> Information Technologies Institute Centre of Research and Technology Hellas Email: {ptoupas,dimitrios.tzovaras}@iti.gr

Abstract-Surveillance systems, autonomous vehicles, human monitoring systems, and video retrieval are just a few of the many applications in which 3D Convolutional Neural Networks are exploited. However, their extensive use is restricted by their high computational and memory requirements, especially when integrated into systems with limited resources. This study proposes a toolflow that optimises the mapping of 3D CNN models for Human Action Recognition onto FPGA devices, taking into account FPGA resources and off-chip memory characteristics. The proposed system employs Synchronous Dataflow (SDF) graphs to model the designs and introduces transformations to expand and explore the design space, resulting in high-throughput designs. A variety of 3D CNN models were evaluated using the proposed toolflow on multiple FPGA devices, demonstrating its potential to deliver competitive performance compared to earlier hand-tuned and model-specific designs.

Index Terms—FPGA, Toolflow, 3D CNNs, Human Action Recognition

#### I. INTRODUCTION

Two-dimensional CNNs have excelled in image-related tasks in recent years. The increasing importance and number of applications arising from video-related tasks, such as video surveillance, autonomous driving, and elderly monitoring, have demanded the development of algorithms that incorporate and account for the temporal domain. Three-dimensional CNNs are one of the most common approaches used to deal with video and volumetric data. With the addition of a new dimension, such as time or depth, 3D CNNs augment their capability to learn by extracting information related to the newly added dimension.

3D CNNs have exhibited outstanding performance, particularly in the task of Human Action Recognition (HAR). The use of 3D CNNs allows the interpretation of human motion across video frames, enabling the detection of a wide range of human actions without the need for specific time-domain approaches such as LSTMs. As can be seen in Figure 1, 3D CNNs dominate the Pareto front in one of the most widely used HAR benchmarks, Kinetics-400. The recent emergence of vision transformers has also begun to drive some designs to the Pareto front; however, such networks require orders of magnitude more GFLOPs to operate.

While 3D CNNs are capable of capturing time- or depth-related features, the additional dimension of the input frequently results in greater workloads and higher computational and memory requirements compared to 2D CNNs. Numerous hardware devices, including GPUs, FPGAs, and ASICs, have been used to mitigate the high processing requirements of 3D CNNs and provide high-performing systems. The current work aims to design systems that can be deployed to FPGA devices, due to their flexibility in adapting to the requirements of such an evolving field as well as their potential for achieving high performance and low power consumption.

Fig. 1: The Kinetics-400 Pareto front is dominated by 3D CNNs for small numbers of parameters, demonstrating the deployability of 3D CNNs on edge devices with limited resources.

In HAR, given a single input video clip, N new clips are generated by shifting a (fixed) time window throughout the original clip's duration, and M new clips are generated by cropping an area (for each image in the clip). The final evaluation of the original clip is acquired by passing each of the  $N \times M$  generated clips through the HAR model and averaging their predictions. As such, upon deployment of such models, it is necessary to process the input video segment multiple times to maintain the desired performance. Therefore, throughput-oriented designs and solutions are of high interest. The key contributions of this paper are the following:

- The introduction of fpgaHART, a throughput-oriented toolflow for optimising and mapping 3D CNNs to FPGAs, supporting a variety of models and devices, while taking into account the model characteristics, available platform resources, and memory bandwidth characteristics.

- The expansion of the SDF graph model used for capturing performance requirements in CNN mapping to streaming architectures to explicitly handle irregular blocks with branching, which are commonly utilised in modern 3D CNN HAR models.
- A comprehensive evaluation, utilising various devices and models, including cutting-edge 3D CNN HAR models that have yet to be explored. The findings lay the groundwork for the computation of HAR models on FPGAs for throughput-oriented applications.
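The multi-clip evaluation protocol described in the introduction, which motivates the throughput-oriented focus, can be sketched as follows. This is a minimal illustration only: `model` and `make_clip` are hypothetical callables standing in for a HAR network and a clip sampler, and are not part of the toolflow.

```python
def predict_video(model, video, n_temporal, m_spatial, make_clip):
    """Average the class scores of the N x M clips derived from one video.

    `model` and `make_clip` are hypothetical stand-ins: `make_clip(video, t, c)`
    returns the clip for temporal window t and spatial crop c, and `model(clip)`
    returns a list of per-class scores.
    """
    scores = [model(make_clip(video, t, c))
              for t in range(n_temporal)
              for c in range(m_spatial)]
    num_clips = len(scores)  # equals N * M
    # Column-wise mean over all clips gives the final prediction.
    return [sum(col) / num_clips for col in zip(*scores)]


# Toy usage: the "model" simply echoes a 2-class score vector built from (t, c).
toy_model = lambda clip: clip
toy_sampler = lambda video, t, c: [float(t), float(c)]
print(predict_video(toy_model, None, 2, 3, toy_sampler))  # [0.5, 1.0]
```

Each video therefore costs N × M forward passes, which is why clips per second is the metric of interest.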

#### II. BACKGROUND

Although 3D CNNs have been around for a while, there have only been a few papers aimed at their acceleration on FPGAs. The majority of these works focus on relatively old 3D CNNs, such as the C3D [1] model, whose performance falls short of state-of-the-art models. Fan et al. introduced a series of works on 3D CNN acceleration for HAR on FPGA systems, [2]–[4]. In their initial work [2], they proposed the F-C3D hardware architecture for the acceleration of C3D [1], which is capable of supporting multiple 3D convolutional layers and design strategies for overcoming the challenges associated with 3D CNNs while also allowing their design to be ported to other FPGA devices. In their subsequent work [3], they proposed an analytical model and a tool for optimising the hardware architecture based on the device specification and accuracy requirements, as well as the use of block floating point (BFP) arithmetic precision to minimise accuracy loss and the need for retraining the model. In their most recent work, [4], they proposed E3DNet, an efficient 3D CNN based on their proposed 3D-1 bottleneck building block. Their hardware implementation of E3DNet, named F-E3D, is capable of real-time performance at the execution time of 35.3 ms per clip<sup>1</sup>, while achieving an accuracy of  $85.1\%^2$  on the UCF101 benchmark.

Liu et al. [6] proposed a unified hardware architecture for 2D and 3D CNN acceleration based on the observation that the computing patterns of 2D and 3D CNNs are similar. They convert CNN convolutions to matrix multiplication operations, paying close attention to memory optimisations in order to overcome the difficulties of feature map replications. Additionally, they employed an analytical model to configure the accelerators for optimal resource use. They targeted and evaluated their design on the C3D model. Shen et al. [7] followed a similar approach, developing a unified template-based architecture based on the Winograd algorithm capable of handling both 2D and 3D CNNs. Additionally, they developed an analytical technique for efficiently exploring the design space for mapping 2D and 3D CNNs on FPGA accelerators. The authors targeted the C3D model for the evaluation of their proposed design. Sun et al. [8] used a blockwise pruning approach to apply weight pruning to two distinct 3D CNN architectures, namely C3D and R(2+1)D [9]. Their hardware design, which is based on the Alternating Direction Method of Multipliers (ADMM), together with the suggested pruning approach, enables the acceleration of 3D CNNs with low accuracy loss compared to the unpruned version. Toupas et al. [10] recently proposed a throughput-oriented hardware design for X3D, a modern and state-of-the-art 3D CNN, with an emphasis on automating the management of model branches. Additionally, they have recently introduced a toolflow named HARFLOW3D [11] that simplifies the mapping and optimisation of 3D CNN models on FPGA devices, delivering promising results on latency-focused applications.

The majority of research has focused on the C3D [1] model for HAR, which was introduced in 2013. The model's architecture is rather simplistic, consisting of only sixteen consecutive layers, and it performs poorly in terms of accuracy when compared to modern SoA models in HAR (85.2% on UCF101 compared to 98.6%, which is the current SoA). In terms of design complexity, it is comparable to LeNet or AlexNet in the three-dimensional space. Because the aforementioned approaches are essentially dedicated to the design of the target model, it is unclear how they may be extended to, or how they would perform on, the more complicated architectures of modern state-of-the-art HAR models. This study also focuses on supporting more recent 3D CNNs, which have a significantly larger number of layers and deviate from the sequential approach of early networks by containing branching within ResNet-like blocks.

#### III. HARDWARE-LEVEL INTERPRETATION

This section discusses hardware-level 3D CNN model interpretation and modelling. The work is inspired by fpgaConvNet [12], a framework that automatically maps 2D CNN models to FPGA platforms, and extends it in significant ways, as outlined below. The proposed framework extracts the parameters of each layer in a Directed Acyclic Graph (DAG) and the connections between layers from a high-level description of a 3D CNN model. The network's supported layers are mapped to parametrisable hardware building blocks that implement their functionality. Subsequently, the framework generates the network's Synchronous Data-Flow Graph (SDFG) by mapping the DAG nodes to their hardware equivalent blocks and adding them as nodes and arcs in the SDFG. Finally, using the SDF computation model, a network configuration's SDFG node performance is estimated. The sections below describe the proposed tool's components.

##### A. 3D CNN layers as DAG nodes

The description of a neural network model supplied by high-level frameworks such as PyTorch and ONNX comprises three main parts: first, the layers and their connections, which define the model's structure and flow; second, each layer's specific attributes and configuration; and finally, the actual values of the learnable parameters associated with each layer (if any). A dedicated model parser is developed, parsing the above descriptions to build a DAG containing all of the relevant information of the neural network. The DAG structure is faithful to the original, retaining just the essential information from the layers' specific attributes and configuration. Additionally, the parser stores the model parameters/weights for future use during inference. Table I summarises the symbols used to denote the parameters of DAG nodes that represent and characterise the layers of the models.

<sup>1</sup> A clip is defined as a stacked sequence of frames that are meant to be the input of the 3D CNN.

<sup>2</sup> H. Duan et al. [5] currently hold the SoA results on UCF101, achieving 98.6% accuracy.

TABLE I: DAG nodes parameters symbols.

| Symbols              | Definitions                                                                               |
|----------------------|-------------------------------------------------------------------------------------------|
| $Sz_i$               | size dimensions of the input feature map                                                  |
| $Sz_o$               | size dimensions of the output feature map                                                 |
| $K_h, K_w, K_d$      | height, width and depth of convolution kernel                                             |
| $St_h, St_w, St_d$   | stride value on height, width and depth dimensions                                        |
| $Pd_h, Pd_w, Pd_d$   | padding value on height, width and depth dimensions                                       |
| $Gp$                 | number of groups in which the input is split along the channel axis on convolution layers |
| $T$                  | type of activation or element-wise function                                               |
| $M$                  | mode of element-wise operation (normal/broadcasting)                                      |

The following data structures are utilised by the tool to capture the layers of the 3D CNN models:

- **3D Convolutional and Pooling Layers** The following types of convolutional/pooling layers are supported:

   (a) spatial convolution/pooling $K_h \times K_w \times 1$,
   (b) temporal convolution/pooling $1 \times 1 \times K_d$,
   (c) depth-wise convolution/pooling,
   (d) point-wise convolution/pooling.

  The configuration of the layer as stored in a DAG node is $\langle Sz_i, Sz_o, K, St, Pd, Gp \rangle$, where:
  - **K** is a 3-value vector $[K_h, K_w, K_d]$ specifying the height, width and depth of the 3D convolution window.
  - **St** is a 3-value vector $[St_h, St_w, St_d]$ specifying the strides of the convolution along each dimension.
  - **Pd** is a 3-value vector $[Pd_h, Pd_w, Pd_d]$ denoting the amount of padding applied to each dimension.
- **3D Activation Layers** The supported activation functions are: (a) ReLU, (b) Sigmoid, and (c) Swish, which is expressed as $y = x \cdot \mathrm{sigmoid}(x)$. The DAG layer configuration is $\langle Sz_i, Sz_o, T \rangle$.
- **3D Element-wise Layers** Element-wise layers combine (add, mul) data from several branches. They merge several inputs into a single output, where the shapes of the inputs may or may not be identical, resulting in different functionality (normal vs broadcasting). The DAG layer configuration is $\langle Sz_{i1}, ..., Sz_{iN}, Sz_o, T, M \rangle$.
- **3D Global Average Pooling Layer** While the standard pooling operation samples patches of the input feature map to decrease its size, GAP reduces the whole feature map of each channel to a single value, creating an output vector whose length equals the number of channels. The DAG layer configuration is $\langle Sz_i, Sz_o \rangle$.
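The DAG node configuration for a convolutional layer can be sketched as a small data structure. The snippet below is an illustrative assumption rather than the tool's actual parser output: the class name `Conv3dNode`, the field names, and the (channels, height, width, depth) ordering are hypothetical, and the output size follows standard convolution arithmetic.

```python
from dataclasses import dataclass
from typing import Tuple

# (channels, height, width, depth); this ordering is an assumption.
Shape = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Conv3dNode:
    """Hypothetical DAG node mirroring <Sz_i, Sz_o, K, St, Pd, Gp>."""
    sz_i: Shape
    filters: int
    k: Tuple[int, int, int]    # [K_h, K_w, K_d]
    st: Tuple[int, int, int]   # [St_h, St_w, St_d]
    pd: Tuple[int, int, int]   # [Pd_h, Pd_w, Pd_d]
    gp: int = 1

    @property
    def sz_o(self) -> Shape:
        # Standard convolution arithmetic per dimension:
        # out = floor((in + 2 * pad - kernel) / stride) + 1
        _, *spatial = self.sz_i
        out = tuple((n + 2 * p - k) // s + 1
                    for n, k, s, p in zip(spatial, self.k, self.st, self.pd))
        return (self.filters, *out)


# Spatial 3x3x1 convolution on a 16-frame 112x112 RGB clip.
node = Conv3dNode(sz_i=(3, 112, 112, 16), filters=64,
                  k=(3, 3, 1), st=(1, 1, 1), pd=(1, 1, 0))
print(node.sz_o)  # (64, 112, 112, 16)
```

Deriving $Sz_o$ from the other fields, instead of storing it, keeps the node consistent if a parameter changes.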

##### B. SDFG representation with branch support

To take advantage of the SDF model's capabilities, the tool maps DAG nodes into their associated hardware building blocks, which implement the functionality of each layer in the underlying hardware. Using SDF theory, the SDFG may be represented as a topology matrix $\Gamma$. The nodes are represented by the columns of this matrix, while the arcs that link the nodes are represented by the rows. The data consumption/production rate of each node on each arc can be read from the element at the (node, arc) position in the $\Gamma$ matrix. By convention, positive values denote data production, whereas negative values denote data consumption. The element $\Gamma(n, a) = -1$, for example, indicates that node *n* consumes data from arc *a* at a rate of one.

The $\Gamma$ matrix is decomposed into several matrices (as shown in Eq. 1), allowing a more in-depth examination of each one separately and finer control overall. The initial decomposition of the $\Gamma$ matrix yields three distinct matrices:

- i) The stream matrix **S**. Each element of this matrix stores the number of parallel streams arriving at a node's input or leaving its output.
- ii) The rate matrix **R**. Its elements hold the normalised data production and consumption rates of each node at each arc (number of elements produced/consumed per cycle). The values in this matrix range from 0 to 1.
- iii) The data matrix **C**. Its elements store the width of each individual stream from the **S** matrix. Since all of the streams are assumed to have the same bit width of 16, this matrix is not taken into consideration in this study.

$$\Gamma = S \times R \tag{1}$$
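As an illustration of Eq. 1, the sketch below builds the topology matrix of a three-node chain. It assumes that the product is taken element-wise and that the production/consumption sign is carried by a separate incidence pattern `A`. `A` is a hypothetical helper introduced here because **R** is stated to hold values between 0 and 1; all numbers are illustrative.

```python
# Rows are arcs, columns are nodes.
# Chain: node0 -> arc0 -> node1 -> arc1 -> node2
A = [[1, -1, 0],
     [0, 1, -1]]           # hypothetical sign pattern: +1 produces, -1 consumes
S = [[4, 4, 0],
     [0, 8, 8]]            # parallel streams at each arc endpoint
R = [[1.0, 1.0, 0.0],
     [0.0, 0.5, 0.5]]      # normalised rates in [0, 1]


def elementwise(*matrices):
    """Element-wise product of equally shaped matrices (assumed meaning of Eq. 1)."""
    result = []
    for rows in zip(*matrices):
        out_row = []
        for values in zip(*rows):
            prod = 1
            for v in values:
                prod = prod * v
            out_row.append(prod)
        result.append(out_row)
    return result


gamma = elementwise(A, S, R)
print(gamma)  # [[4.0, -4.0, 0.0], [0.0, 4.0, -4.0]]
```

Each row sums to zero in this example, meaning that production and consumption are balanced on every arc.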

The upper bi-diagonal structure of the  $\Gamma$  matrix prevents the modelling of branching behaviours, i.e. graphs with nodes receiving multiple incoming arcs and nodes with many outgoing arcs. This work proposes and implements modifications to the SDFG structure to ease the building of graphs with several incoming or outgoing arcs at nodes, hence supporting branching models without the need to explicitly define them with static predefined layers. The depth of each side of a branch is computed to incorporate some extra buffering for the streams that are combined at the merge points in order to ensure the flow of data across the design's streams as well as to equalise the rates at the merge points.
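One simple way to picture the buffering at merge points is to pad every branch up to the pipeline depth of the deepest one. The sketch below assumes exactly that policy, which may differ from the sizing rule used by the tool; the per-layer depths are hypothetical values.

```python
def branch_buffer_depths(branch_depths):
    """Extra buffering (in cycles) per branch so all branches align at the merge.

    Assumed policy: pad each branch up to the depth of the deepest branch.
    """
    longest = max(branch_depths)
    return [longest - depth for depth in branch_depths]


# Residual block: a main path of three layers against an identity shortcut.
main_path = sum([120, 340, 95])   # hypothetical per-layer pipeline depths
shortcut = 0
print(branch_buffer_depths([main_path, shortcut]))  # [0, 555]
```

The shortcut stream must hold 555 cycles of data here; without that buffer the element-wise merge would stall the upstream fork.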

##### C. 3D CNN layers as hardware building blocks

The hardware building blocks are the major components used to construct the SDFG, which is subsequently used to estimate the network's performance. The configuration of these blocks, in conjunction with the network's topology, is used to automatically generate the design's synthesisable Vitis HLS code. The representation of the supported hardware building blocks comprises the following:

- i) **DAG parameters**. A set of parameters that originated from the layer's settings as a DAG node. These settings are the layers' structural configuration that cannot be changed.
- ii) **SDFG parameters**. An additional set of parameters which have an impact on the layer's performance and are the ones that the optimisation algorithm searches for during the design space exploration phase.

TABLE II: SDFG nodes parameters symbols.

| Symbols     | Definitions                                                                        |
|-------------|------------------------------------------------------------------------------------|
| $s_i$       | number of streams at the layer's input channels                                    |
| $s_o$       | number of streams at the layer's output filters                                    |
| $r_i$       | consumption rate of the layer                                                      |
| $r_o$       | production rate of the layer                                                       |
| $p_{mac}$   | number of parallel multiply and accumulate (MAC) operations in a convolution layer |
The hardware building block representation of the 3D CNN layers is described below:

• 3D Convolutional/Pooling Layers:

$$\langle DAG_{params}, s_i, s_o, r_i, r_o, p_{mac} \rangle$$

The $s_i$, $s_o$, and $p_{mac}$ parameters are altered during the DSE and affect the final performance of the layer. Meanwhile, $r_i$ and $r_o$ depend on $p_{mac}$, which means they are implicitly altered during the DSE as well. A more detailed analysis of the convolution layer and its sub-modules is provided in fpgaConvNet [12].

• 3D Activation, 3D Global Average Pooling Layers:

$$\langle DAG_{params}, s_i, s_o, r_i, r_o \rangle$$

The $r_i$ and $r_o$ of these layers can achieve consumption/production rates of 1 (if not constrained by previous layers or the memory rates), due to their element-wise functionality and the simplicity of their operations. The only exception is the 3D Global Average Pooling layer, in which $r_o = \frac{1}{H \times W \times D}$, where *H* is the height, *W* is the width, and *D* is the depth dimension of the input feature map.

• 3D Element-wise Layers:

$$\langle DAG_{params}, s_{i1}, s_{i2}, s_o, r_{i1}, r_{i2}, r_o \rangle$$

The $r_{i1}$, $r_{i2}$ and $r_o$ of this layer can achieve consumption/production rates of 1 (if not constrained by previous layers or the memory rates), due to its element-wise functionality and the simplicity of its operations. It should be noted that when the rate of either input is restricted, owing to a lower production rate of a previous layer or to memory constraints, the layer's input rates are equalised to the lower of the two consumption rates.
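The rate rules above can be summarised in a few lines. The sketch below is illustrative only: scaling the GAP output rate by a restricted input rate, and setting the element-wise output rate to the equalised input rate, are assumptions that go slightly beyond what the text states.

```python
def gap_output_rate(h, w, d, r_i=1.0):
    """Production rate of 3D Global Average Pooling: one output per H*W*D inputs.

    Scaling by r_i for a restricted input rate is an assumption.
    """
    return r_i / (h * w * d)


def elementwise_rates(r_i1, r_i2):
    """Equalise both input rates to the slower one.

    Returning the same value for r_o is an assumption.
    """
    r = min(r_i1, r_i2)
    return r, r, r  # (r_i1, r_i2, r_o)


print(gap_output_rate(7, 7, 4))       # one output every 196 cycles
print(elementwise_rates(1.0, 0.25))   # (0.25, 0.25, 0.25)
```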

#### IV. DESIGN SPACE EXPLORATION

The hardware mapping of the SDFG assumes a final streaming architecture to be inferred. Each design point in the design space has a specific combination of the involved layers' tunable parameters, as described in Section III-C. A set of transformations operating on the SDFG alters these parameters while the design space is explored by simulated annealing, the heuristic optimisation algorithm used in this study.

##### A. 3D CNN Model Partitioning

CNN hardware architecture design incorporates two distinct approaches. Single computation engines implement a timeshared processing unit and a scheduler, while streaming architectures like the one presented employ a hardware block for each CNN layer to better exploit per layer parallelism. When trying to fit all the layers into a single design without reconfiguring the FPGA, the more layers a CNN has or the larger the input, the more FPGA resources are used, limiting each layer's parallelism.

By utilising the FPGA's reconfiguration capabilities, network execution can be split into smaller partitions to solve this problem. Producing a unique architecture and delivering a bitstream for each partition allows the design of more finely tuned architectures that better fit each layer. This approach also drastically reduces off-chip memory access to only the design's input and output streams, allowing on-chip memory to be used for data reuse. This strategy requires reconfiguration every time a new partition is loaded; however, increasing the batch size can amortise this cost.

Beginning with a random partitioning of the model's layers by introducing L initial reconfiguration points, the optimisation process gradually modifies these partitions. The alterations to the partitions are focused on two key concepts:

- The optimiser detects partitions that limit the performance of the model because they are memory constrained, have fully exploited the parallelism of their layers, or do not have sufficient resources to exploit enough parallelism from their layers.
- Out of the candidate partitions to be modified, the optimiser selects the partitions with the lowest performance and moves layers from or to adjacent partitions with a goal of improving their performance.

Between stages that modify existing partitions, the optimiser independently executes a series of partition-specific optimisation steps based on coarse and fine transformations as detailed below.

##### B. Partition-Specific Optimisations

Each partition layer's configurable parameters leverage its parallelism based on two factors:

• The number of parallel executions of coarse operations in each layer, which depends on the input feature map's channels. The primary operations of each layer can be performed in parallel by deploying multiple processing blocks up to the number of channels. 3D convolutional layers can exploit both input channels and output filters coarse-level parallelism. The  $s_i$  and  $s_o$  parameters of the hardware building block configuration are updated and searched for optimal values throughout the DSE to realise this parallelism. These variables affect the stream matrix **S**, which affects the topology matrix  $\Gamma$ , which determines design performance.

• The parallelism of the dot product operation performed when the kernel is convolved with a given piece of the input volume in 3D convolutional layers. This parallelism determines the number of parallel multipliers and the depth of the adder tree for additions. A completely unrolled design uses N multipliers and N - 1 adders, yielding 1 dot product per cycle, whereas restricting the setup to a single multiplier and adder yields 1/N dot products per cycle, where N is the number of elements in the kernel (and in the matching input window). There is thus a trade-off between performance and resource utilisation. The DSE optimises the $p_{mac}$ parameter of the 3D convolutional layer hardware building block configuration to exploit this parallelism. These variables affect the rate matrix **R**, which affects the topology matrix $\Gamma$, which in turn determines the estimated design performance.
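The fine-grained trade-off can be made concrete with a small sketch. The two end points (fully unrolled and fully serial) come from the text; the linear interpolation `p_mac / n` for intermediate folding factors, and the adder count in between, are assumptions made for illustration.

```python
def dot_product_unit(n, p_mac):
    """Resources and rate of an n-element dot product with p_mac parallel MACs.

    Returns (multipliers, adders, dot products per cycle).
    Intermediate points are an assumed interpolation between the two
    configurations described in the text.
    """
    if not 1 <= p_mac <= n:
        raise ValueError("p_mac must lie between 1 and n")
    multipliers = p_mac
    # Adder tree over p_mac products, plus one accumulator when folded.
    adders = (p_mac - 1) + (1 if p_mac < n else 0)
    rate = p_mac / n
    return multipliers, adders, rate


# 3x3x3 kernel: N = 27 elements per window.
for p_mac in (1, 3, 9, 27):
    print(p_mac, dot_product_unit(27, p_mac))
```

With `p_mac = 27` the unit uses 27 multipliers and 26 adders at one dot product per cycle; with `p_mac = 1` it uses one of each at 1/27 dot products per cycle.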

##### C. Performance Modelling

To describe the performance of a given design based on the topology matrix $\Gamma$, an additional matrix reflecting the workload of each layer is introduced. The topology matrix provides the throughput of each layer at its input (in consumptions/cycle) and at its output (in productions/cycle). Constructing a matrix with the total workload of each layer, i.e. the total number of elements to be consumed and produced, therefore allows the generation of a new matrix that provides the number of cycles each layer requires to consume its workload. More specifically, the workload matrix **W** has the same structure as the topology matrix $\Gamma$. By dividing the **W** matrix element-wise by the $\Gamma$ matrix, the final *II* matrix is calculated as shown below:

$$II = W / \Gamma \tag{2}$$

*II* is the initiation interval matrix, and its entries represent the total number of cycles required by each layer to consume its workload completely. The maximum value of the *II* matrix, denoted by $II_{max}$, determines the initiation interval of the whole SDFG. The total execution time of a partition with batch size B is given by the following equation:

$$t(B,\Gamma) = \frac{1}{\text{clock rate}} \cdot (D + II_{max} \cdot (B-1)) \tag{3}$$

where D is the total number of cycles needed to fill the pipeline depth of the whole design. It is calculated by adding the depths of each layer and the depth added by the extra buffering that handles the branches in the design.
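Eqs. 2 and 3 can be evaluated with a short sketch. The workload values, the pipeline depth and the use of absolute values to discard the production/consumption sign are assumptions made for illustration; only the 160 MHz clock is taken from the evaluation setup.

```python
def initiation_interval(workload, gamma):
    """Eq. 2: element-wise W / Gamma over the non-zero entries.

    Absolute values are taken so that the sign convention of Gamma does not
    produce negative cycle counts (an assumption). Returns (II, II_max).
    """
    ii = [[abs(w / g) if g != 0 else 0.0 for w, g in zip(w_row, g_row)]
          for w_row, g_row in zip(workload, gamma)]
    return ii, max(max(row) for row in ii)


def partition_time(batch, ii_max, depth, clock_hz):
    """Eq. 3: t(B, Gamma) = (D + II_max * (B - 1)) / clock rate, in seconds."""
    return (depth + ii_max * (batch - 1)) / clock_hz


# Three-node chain with illustrative rates and workloads (elements per arc).
gamma = [[4.0, -4.0, 0.0],
         [0.0, 4.0, -4.0]]
workload = [[802816, 802816, 0],
            [0, 401408, 401408]]

ii, ii_max = initiation_interval(workload, gamma)
print(ii_max)                                        # 200704.0 cycles
print(partition_time(8, ii_max, 250_000, 160e6))     # seconds for a batch of 8
```

The slowest arc sets $II_{max}$, so raising parallelism anywhere else does not reduce the execution time.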

In order to capture the model's overall execution time, the execution times of each individual partition are summed up with the addition of the total reconfiguration time:

$$t_{total}(B, \Gamma) = \sum_{n=1}^{N_p} t_n(B, \Gamma_n) + (N_p - 1) \cdot t_{reconfig} \tag{4}$$

where $N_p$ is the total number of partitions of the model, and $t_{reconfig}$ is the reconfiguration time for loading a partition onto the FPGA. As can be seen from Eq. 4, the extra overhead caused by the device reconfiguration is proportional to the number of partitions of the final solution and is independent of the batch size. By increasing the batch size processed by the model, the first term dominates the execution time and the cost of reconfiguration is amortised.

TABLE III: 3D CNN models characteristics

|                             | C3D              | Slowonly         | R(2+1)D-18       | R(2+1)D-34       | X3D              |
|-----------------------------|------------------|------------------|------------------|------------------|------------------|
| GFLOPs $^{\dagger}$         | 38.61            | 54.81            | 8.52             | 12.91            | 6.97             |
| Parameters (M)              | 78.41            | 32.51            | 33.41            | 63.72            | 3.82             |
| Layers                      | 27               | 174              | 82               | 154              | 396              |
| Conv layers                 | 8                | 53               | 37               | 69               | 115              |
| Spatial size                | $112 \times 112$ | $256 \times 256$ | $112 \times 112$ | $112 \times 112$ | $256 \times 256$ |
| Frames per clip             | 16               | 8                | 16               | 16               | 16               |
| Accuracy (%)                | 83.2             | 94.54            | 88.66            | 92.27            | 96.52            |

$^{\dagger}$ FLOPs are reported as MAC operations.

Finally, the overall throughput of the proposed architecture is obtained by dividing the total workload of the model in GOps (Giga Operations), multiplied by the batch size, by the total execution time:

$$\text{Throughput}(B) = \frac{Workload_{model} \cdot B}{t_{total}(B, \Gamma)} \tag{5}$$

The design space exploration for each partition is formulated as an optimisation problem with the following objective: $\min t(B, \Gamma), \ \text{s.t.}\ rsc(\Gamma) \leq rsc_{avail}$. As this is a non-convex optimisation problem, it is solved using the simulated annealing heuristic, which attempts to maximise the design's throughput (i.e. minimise its execution time) while ensuring that FPGA resource use does not exceed the available resources.
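A generic simulated annealing loop for such a constrained problem might look like the sketch below. It is not the tool's optimiser: the cooling schedule, the move set, the toy cost model (latency inversely proportional to $s_i \cdot s_o$) and the DSP budget are all assumptions chosen to keep the example small.

```python
import math
import random


def simulated_annealing(initial, neighbour, cost, feasible,
                        t_start=1.0, t_end=1e-3, cooling=0.95, seed=0):
    """Minimise `cost` over feasible points with a geometric cooling schedule."""
    rng = random.Random(seed)
    current = best = initial
    temp = t_start
    while temp > t_end:
        candidate = neighbour(current, rng)
        if feasible(candidate):  # reject points that exceed the resources
            delta = cost(candidate) - cost(current)
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = candidate
                if cost(current) < cost(best):
                    best = current
        temp *= cooling
    return best


# Toy design point: (s_i, s_o) stream counts of a single convolution layer.
def neighbour(point, rng):
    s_i, s_o = point
    factor = rng.choice([0.5, 2])
    if rng.random() < 0.5:
        s_i = int(min(64, max(1, s_i * factor)))
    else:
        s_o = int(min(64, max(1, s_o * factor)))
    return (s_i, s_o)


cost = lambda p: 1.0 / (p[0] * p[1])      # normalised execution time
feasible = lambda p: p[0] * p[1] <= 256   # hypothetical DSP budget

best = simulated_annealing((1, 1), neighbour, cost, feasible)
print(best, cost(best))
```

Infeasible candidates are discarded here; a penalty term in the cost function would be an alternative way to enforce the resource constraint.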

#### V. EVALUATION

To evaluate the performance of the tool, four state-of-the-art 3D CNN HAR models were selected, namely Slowonly, R(2+1)D-18, R(2+1)D-34, and X3D (as shown in Table III), alongside two FPGA platforms, ZCU104 and ZCU102, to demonstrate the ability of the tool to target multiple 3D CNNs with different workloads and network parameters on a variety of platforms. The C3D model was also included to provide direct comparisons with existing works, the majority of which are hand-tuned, model-specific architectures and not toolflows. Vitis HLS and Vivado Design Suite (v21.2) were used, and the reported resource results are after place and route at a 160 MHz clock frequency. The arithmetic precision used was 16-bit fixed-point arithmetic in Q8.8 format. The accuracy of the HAR models is evaluated on UCF-101, following the same strategy as prior studies [9], [13].
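For readers unfamiliar with the notation, the sketch below illustrates Q8.8 quantisation, assuming a signed 16-bit word with 8 integer bits (including the sign) and 8 fractional bits, round-to-nearest and saturation on overflow. The rounding and saturation behaviour are assumptions; the hardware may truncate or wrap instead.

```python
def to_q8_8(x):
    """Quantise a real value to a signed 16-bit Q8.8 integer (saturating)."""
    raw = round(x * 256)                  # 8 fractional bits: scale by 2**8
    return max(-32768, min(32767, raw))   # clamp to the int16 range


def from_q8_8(raw):
    """Convert a Q8.8 integer back to a real value."""
    return raw / 256


print(from_q8_8(to_q8_8(3.14159)))   # 3.140625
print(from_q8_8(to_q8_8(1000.0)))    # 127.99609375 (saturated)
```

The representable range is [-128, 127.99609375] with a resolution of 1/256.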

##### A. Modeling Accuracy Evaluation

To evaluate the quality of the performance predictor a series of experiments was conducted. Four partitions were chosen to cover the variety of produced graph structures (i.e. branch, sequential, multi-inputs, multi-outputs). The relative error was used to measure the difference between the predicted and actual latency. The relative errors for the four aforementioned types are 12.89%, 5.03%, 11.92%, and 17.32% respectively, giving a geometric mean relative error of 10.75%<sup>3</sup>. We found

 $<sup>^3</sup>$ geometric mean was used as these structures can be found simultaneously in a single 3D CNN

|                           | H. Fan [2] | H. Fan [3] | Z. Liu [6] | J. Shen [7] <sup>‡</sup> | J. Shen [7] <sup>‡</sup> | M. Sun [8] | M. Sun [8] | H. Fan [4]  | Ours   | Ours     | Ours       | Ours       | Ours   |
|---------------------------|------------|------------|------------|--------------------------|--------------------------|------------|------------|-------------|--------|----------|------------|------------|--------|
| Model                     | C3D        | C3D        | C3D        | C3D                      | C3D                      | C3D        | R(2+1)D-18 | E3D         | C3D    | Slowonly | R(2+1)D-18 | R(2+1)D-34 | X3D    |
| GFLOPs*                   | 38.61      | 38.61      | 38.61      | -                        | -                        | 38.61      | 8.52       | 6.1         | 38.61  | 54.9     | 8.52       | 12.91      | 6.97   |
| Accuracy (%)              | 79.87      | 81.99      | 83.2       | 83.2                     | 83.2                     | 83.2       | 88.66      | 85.17       | 83.2   | 94.54    | 88.66      | 92.27      | 96.52  |
| FPGA                      | ZC706      | ZC706      | VC709      | VC709                    | VUS440                   | ZCU102     | ZCU102     | Intel SX660 | ZCU102 | ZCU102   | ZCU102     | ZCU102     | ZCU102 |
| clips/s <sup>†</sup>      | 1.84       | 2.09       | 8.65       | 11.18                    | 20.36                    | 2.05       | 4.11       | 28.32       | 3.38   | 2.54     | 4.62       | 2.63       | 13.44  |
| GOps/s <sup>†</sup>       | 70.41      | 80.12      | 330.74     | 427.29                   | 778                      | 78.44      | 111.71     | 172.8       | 130.84 | 144.44   | 39.59      | 34.26      | 85.96  |
| GOps/s/DSP <sup>†</sup>   | 0.087      | 0.103      | 0.092      | 0.281                    | 0.511                    | 0.065      | 0.092      | 0.109       | 0.052  | 0.057    | 0.015      | 0.013      | 0.034  |
| Op/DSP/cycle <sup>†</sup> | 0.511      | 0.519      | 0.774      | 1.874                    | 2.559                    | 0.435      | 0.613      | 0.727       | 0.325  | 0.358    | 0.098      | 0.084      | 0.213  |
| Frequency (MHz)           | 172        | 200        | 120        | 150                      | 200                      | 150        | 150        | 150         | 160    | 160      | 160        | 160        | 160    |
| Precision                 | fp-16      | BFP        | fp-16      | fp-16                    | fp-16                    | fp-16      | fp-16      | float-32    | fp-16  | fp-16    | fp-16      | fp-16      | fp-16  |
| DSP (%)                   | 90         | 86.6       | 99.8       | 42                       | 53                       | 48         | 48         | 93.3        | 51.49  | 63.77    | 66.21      | 66.46      | 84.43  |
| BRAM (%)                  | 86.6       | 88.1       | 26.6       | 52                       | 30                       | 100        | 100        | -           | 91.49  | 78.22    | 78.09      | 84.07      | 52.71  |

TABLE IV: Comparison with existing works on 3D CNN HAR models

\* FLOPs are reported as MAC operations. <sup>†</sup> Favourable batch size of 100. <sup>‡</sup> The C3D model used is a different, smaller version of the original one [1].



Fig. 2: Throughput (GOPs/s) of fpgaHART-generated designs on 3D CNN HAR models delivering high-throughput results on a variety of FPGA devices

that the above errors are small enough to lead to meaningful design space exploration.
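The aggregate error quoted above can be reproduced with a short calculation. The sketch below is only an illustration: the dictionary keys are labels chosen for this example, and the values are the four relative errors reported in the text.

```python
import math

# Relative latency-prediction errors (%) reported for each graph structure.
relative_errors = {
    "branch": 12.89,
    "sequential": 5.03,
    "multi-input": 11.92,
    "multi-output": 17.32,
}


def geometric_mean(values):
    """Geometric mean computed in the log domain."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    gm = geometric_mean(relative_errors.values())
    am = arithmetic_mean(relative_errors.values())
    print(f"geometric mean:  {gm:.2f}%")
    print(f"arithmetic mean: {am:.2f}%")
```

The geometric mean comes out at approximately 10.75%, in line with the reported figure, and is somewhat lower than the arithmetic mean of the same four values.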

#### B. Performance Comparison

fpgaHART has been evaluated on a number of different FPGA platforms, such as the ZC706, the ZCU102, the VC709, and the VUS440. Figure 2 displays the performance in GOPs/s (with a favourable batch size of 100) of the fpgaHART-generated designs for the 3D CNN models of Table III, which details their characteristics, on a variety of FPGA devices. Such batch sizes are frequently encountered in practice, when multiple views and clips are generated over time and their predictions are averaged to improve accuracy. Even larger batch sizes may be required for multi-person HAR systems that evaluate each person's actions independently, as well as for large-scale systems that analyse several videos simultaneously.
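The multi-view averaging mentioned above is what makes a large batch natural: every clip or view of a video is one batch element, and the per-clip class scores are combined afterwards. The sketch below illustrates this under simple assumptions: scores are plain lists of per-class values, the combination is an unweighted mean, and the numbers are hypothetical rather than taken from any model in this paper.

```python
def average_clip_scores(clip_scores):
    """Average per-class scores over all clips/views of one video."""
    n_clips = len(clip_scores)
    n_classes = len(clip_scores[0])
    return [sum(s[c] for s in clip_scores) / n_clips for c in range(n_classes)]


def predict(clip_scores):
    """Return the index of the class with the highest averaged score."""
    avg = average_clip_scores(clip_scores)
    return max(range(len(avg)), key=avg.__getitem__)


if __name__ == "__main__":
    # Hypothetical scores for 3 clips of one video over 3 action classes.
    scores = [
        [0.6, 0.3, 0.1],
        [0.2, 0.5, 0.3],
        [0.1, 0.6, 0.3],
    ]
    print("first clip only:", predict(scores[:1]))
    print("all clips:      ", predict(scores))
```

In this toy case the first clip alone favours class 0, whereas averaging over all three clips selects class 1.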

Table IV compares fpgaHART with existing works, with the fpgaHART results reported on the ZCU102 platform. The table shows that fpgaHART delivers competitive performance on several 3D CNNs that have not been previously addressed and that span a broad range of workloads and network parameters.
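The two DSP-efficiency rows of Table IV appear to be related through the clock frequency: dividing GOps/s/DSP by the frequency gives operations per DSP per cycle. This relation is inferred from the units and is not stated explicitly in the text; the sketch below applies it to a few table entries, where it reproduces the reported values to within rounding.

```python
def ops_per_dsp_per_cycle(gops_per_s_per_dsp: float, freq_mhz: float) -> float:
    """Convert GOps/s/DSP to Op/DSP/cycle at the given clock frequency."""
    ops_per_s_per_dsp = gops_per_s_per_dsp * 1e9
    cycles_per_s = freq_mhz * 1e6
    return ops_per_s_per_dsp / cycles_per_s


if __name__ == "__main__":
    # (label, GOps/s/DSP, frequency in MHz, reported Op/DSP/cycle) from Table IV.
    rows = [
        ("fpgaHART C3D", 0.052, 160, 0.325),
        ("fpgaHART X3D", 0.034, 160, 0.213),
        ("H. Fan [3] C3D", 0.103, 200, 0.519),
    ]
    for label, eff, freq, reported in rows:
        derived = ops_per_dsp_per_cycle(eff, freq)
        print(f"{label:<16} derived {derived:.3f}  reported {reported:.3f}")
```

Normalising by frequency makes designs clocked at different rates comparable in terms of how well each DSP is utilised per cycle.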

Figure 3 presents the current state of the Pareto front in terms of accuracy versus throughput (clips/s), where the fpgaHART-generated designs target the VC709 FPGA platform. The results show that the fpgaHART designs push the Pareto front, delivering solutions with both high throughput and high accuracy.



Fig. 3: Pareto front on 3D CNNs: Clips/s over Accuracy. The fpgaHART results were taken using the VC709 FPGA platform, delivering solutions on the Pareto front.
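For readers unfamiliar with how such a front is extracted, the following sketch finds the non-dominated designs when both throughput and accuracy are to be maximised. It is an illustration only: the points are a subset of the clips/s and accuracy values from Table IV (where the fpgaHART entries are ZCU102 results), not the VC709 data plotted in Fig. 3, and the function name is hypothetical.

```python
def pareto_front(designs):
    """Return names of designs not dominated on (throughput, accuracy).

    A design is dominated if another design is at least as good on both
    metrics and strictly better on at least one.
    """
    front = []
    for name, tput, acc in designs:
        dominated = any(
            t >= tput and a >= acc and (t > tput or a > acc)
            for _, t, a in designs
        )
        if not dominated:
            front.append(name)
    return front


# (label, clips/s, accuracy %) taken from Table IV.
designs = [
    ("H. Fan [2] C3D", 1.84, 79.87),
    ("Z. Liu [6] C3D", 8.65, 83.2),
    ("M. Sun [8] R(2+1)D-18", 4.11, 88.66),
    ("H. Fan [4] E3D", 28.32, 85.17),
    ("fpgaHART Slowonly", 2.54, 94.54),
    ("fpgaHART R(2+1)D-18", 4.62, 88.66),
    ("fpgaHART X3D", 13.44, 96.52),
]

if __name__ == "__main__":
    print(pareto_front(designs))
```

On this subset, only the E3D design of [4] and the fpgaHART X3D design remain on the front: the former has the highest throughput and the latter the highest accuracy.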

Comparing the results on C3D (batch size 30, targeting the ZCU102) with an Nvidia RTX 3090, a server-grade GPU with 10496 CUDA cores and a 1.7 GHz clock speed, the proposed architecture achieves a throughput of 4.42 clips/s against the 281.87 clips/s delivered by the GPU. The proposed solution, however, consumes only 26 W compared to the GPU's 298.6 W (excluding the power consumption of the CPU that a GPU system requires), which corresponds to 0.17 clips/s/W for the FPGA and 0.94 clips/s/W for the GPU.
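The energy-efficiency figures above follow directly from dividing throughput by power. A minimal sketch of that arithmetic, using only the numbers quoted in the text (the function name is chosen for this example):

```python
def clips_per_second_per_watt(clips_per_s: float, power_w: float) -> float:
    """Energy efficiency as throughput divided by power draw."""
    return clips_per_s / power_w


if __name__ == "__main__":
    fpga = clips_per_second_per_watt(4.42, 26.0)      # ZCU102, C3D, batch size 30
    gpu = clips_per_second_per_watt(281.87, 298.6)    # RTX 3090, GPU power only
    print(f"FPGA: {fpga:.2f} clips/s/W")
    print(f"GPU:  {gpu:.2f} clips/s/W")
    print(f"power ratio (GPU/FPGA): {298.6 / 26.0:.1f}x")
```

Since clips/s/W is equivalent to clips per joule, the same numbers can also be read as the number of clips classified per joule of energy.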

#### VI. CONCLUSION

This paper proposes an automated toolflow for the deployment and mapping of 3D CNN models for HAR onto FPGA devices. The proposed method employs SDF theory to describe and map 3D CNNs to hardware architectures. We demonstrate that the tool supports a pool of 3D CNNs for HAR on a variety of FPGA devices, while exhibiting throughput comparable to hand-tuned designs. Future work may involve expanding the design space with additional SDFG transformations and extending the tool to support latency-driven optimisation.

#### REFERENCES

- [1] S. Ji, W. Xu, M. Yang, and K. Yu, "3D Convolutional neural networks for human action recognition," *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 35, no. 1, pp. 221–231, 2013.
- [2] H. Fan, X. Niu, Q. Liu, and W. Luk, "F-C3D: FPGA-based 3-dimensional convolutional neural network," in *2017 27th International Conference on Field Programmable Logic and Applications, FPL 2017*, 2017, pp. 2–5.
- [3] H. Fan, H. C. Ng, S. Liu, Z. Que, X. Niu, and W. Luk, "Reconfigurable acceleration of 3D-CNNs for human action recognition with block floating-point representation," in *Proceedings - 2018 International Conference on Field-Programmable Logic and Applications, FPL 2018*, 2018, pp. 287–294.
- [4] H. Fan, C. Luo, C. Zeng, M. Ferianc, Z. Que, S. Liu, X. Niu, and W. Luk, "F-E3D: FPGA-based acceleration of an efficient 3D convolutional neural network for human action recognition," in *Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors*, vol. 2019-July, 2019, pp. 1–8.
- [5] H. Duan, Y. Zhao, Y. Xiong, W. Liu, and D. Lin, "Omni-sourced Webly-supervised Learning for Video Recognition," 2020. [Online]. Available: http://arxiv.org/abs/2003.13042
- [6] Z. Liu, P. Chow, J. Xu, J. Jiang, Y. Dou, and J. Zhou, "A uniform architecture design for accelerating 2D and 3D CNNs on FPGAs," *Electronics (Switzerland)*, vol. 8, no. 1, 1 2019.
- [7] J. Shen, Y. Huang, Z. Wang, Y. Qiao, M. Wen, and C. Zhang, "Towards a uniform template-based architecture for accelerating 2d and 3D CNNs on FPGA," in *FPGA 2018 - Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, vol. 2018-Febru, 2018, pp. 97–106.
- [8] M. Sun, P. Zhao, M. Gungor, M. Pedram, M. Leeser, and X. Lin, "3D CNN acceleration on FPGA using hardware-aware pruning," in *Proceedings - Design Automation Conference*, vol. 2020-July. IEEE, 2020.
- [9] D. Tran, H. Wang, L. Torresani, J. Ray, Y. Lecun, and M. Paluri, "A Closer Look at Spatiotemporal Convolutions for Action Recognition," *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, pp. 6450–6459, 2018.
- [10] P. Toupas, C.-S. Bouganis, and D. Tzovaras, "FMM-X3D: FPGA-based modeling and mapping of X3D for Human Action Recognition," in *IEEE 34th International Conference on Application-specific Systems, Architectures and Processors (ASAP)*, 5 2023. [Online]. Available: http://arxiv.org/abs/2305.18479
- [11] P. Toupas, A. Montgomerie-Corcoran, C.-S. Bouganis, and D. Tzovaras, "HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on FPGA Devices," in *Proceedings - 2023 International Symposium on Field-Programmable Custom Computing Machines*, *FCCM 2023*, 3. [Online]. Available: http://arxiv.org/abs/2303.17218
- [12] S. I. Venieris and C. S. Bouganis, "FpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs," *IEEE Transactions on Neural Networks and Learning Systems*, vol. 30, no. 2, pp. 326–342, 2019.
- [13] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild," Tech. Rep., 12 2012. [Online]. Available: http://arxiv.org/abs/1212.0402