Three-dimensional deconvolution is widely used in many computer vision applications. However, most previous works have only focused on accelerating 2D deconvolutional neural networks (DCNNs) on FPGAs, while the acceleration of 3D DCNNs has not been studied in depth, as they have higher computational complexity and sparsity than 2D DCNNs. In this paper, we focus on the acceleration of both 2D and 3D DCNNs on FPGAs by proposing efficient schemes for mapping 2D and 3D DCNNs on a uniform architecture. By implementing our design on the Xilinx VC709 platform for four real-life 2D and 3D DCNNs, we can achieve up to 3.0 TOPS with high hardware efficiency. Comparisons with CPU and GPU solutions demonstrate that we can achieve an improvement of up to 63.3× in throughput relative to a CPU solution and an improvement of up to 8.3× in energy efficiency compared to a GPU solution.
I. INTRODUCTION
Recently, deconvolution has become widely used in computer vision fields such as semantic segmentation [1], generative models [2], and high-resolution imaging [3]. Because 3D images exist in most medical data used in clinical practice [4], 3D deconvolution has proven to be a better method than 2D deconvolution in some applications. Although the computational patterns of 2D and 3D deconvolutions are very similar, the computational complexity and memory requirements of 3D deconvolution are much higher than those of 2D deconvolution, making it challenging to design efficient accelerators for them. In addition, deconvolution must insert zeros into the input image before performing convolution operations, which makes the input image sparse and introduces invalid operations (i.e., multiplications by zero). According to our study, the sparsity of the input features of 3D deconvolution layers is higher than that of 2D deconvolution layers. As shown in Fig. 1, the sparsity of the deconvolutional layers in an example 3D deconvolutional neural network (DCNN) (i.e., 3D-GAN [5]) is clearly higher than that of a 2D DCNN (i.e., DCGAN [2]). This sparsity therefore contributes to the processing engine (PE) workload imbalance [6].
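The effect of zero insertion on sparsity can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes a dense input with N activations per dimension, a hypothetical stride S that inserts S - 1 zeros between neighbouring activations, and no border padding. The function names are illustrative only.

```python
def zero_inserted_size(n, stride):
    """Length of one dimension after inserting (stride - 1) zeros
    between every pair of neighbouring activations (no border padding)."""
    return (n - 1) * stride + 1


def inserted_sparsity(n, stride, dims):
    """Fraction of zeros in a dense n**dims input after zero insertion.

    Assumption: every original activation is non-zero, so all zeros
    come from the insertion step itself.
    """
    total = zero_inserted_size(n, stride) ** dims
    nonzero = n ** dims
    return 1.0 - nonzero / total


if __name__ == "__main__":
    # Compare a 2D and a 3D input of the same per-dimension size.
    for dims in (2, 3):
        print(f"{dims}D sparsity:", inserted_sparsity(4, 2, dims))
```

Under these assumptions the non-zero fraction shrinks by roughly a factor of S for every extra dimension, which agrees with the observation that 3D deconvolution layers are sparser than 2D ones.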
Many studies [7], [8], [9] have primarily focused on accelerating convolutional neural networks (CNNs) on Field-Programmable Gate Arrays (FPGAs), due to the high performance and energy efficiency of FPGAs. However, to the best of our knowledge, little attention has been given to accelerating DCNNs, especially 3D deconvolution. Given the similarity in the computational patterns of 2D and 3D deconvolutions, this work focuses on accelerating both of them on FPGA with a uniform architecture. The contributions of this work are summarized as follows:
1) We propose a uniform architecture for efficient implementation of 2D and 3D DCNNs on FPGA.
2) We propose a mapping scheme of 2D and 3D DCNNs on the uniform architecture, which can efficiently improve the parallel computational ability and computational efficiency of the accelerator.
3) As a case study, we implement our design on a Xilinx VC709 board for four state-of-the-art 2D and 3D DCNNs: DCGAN, GP-GAN [10], V-Net [4] and 3D-GAN. Experimental results show that our implementation achieves an improvement of up to 63.3× and 291.4× in throughput and energy efficiency relative to CPU, and an 8.3× energy efficiency gain over GPU.
II. RELATED WORK

Few works have focused on accelerating deconvolutions [6, 11, 12]. In [11, 12], the researchers addressed the acceleration of deconvolution in generative adversarial networks (GANs). Yazdanbakhsh et al. [11] introduced a new architecture to alleviate the sources of inefficiency associated with the acceleration of GANs using conventional convolution accelerators by reorganizing the output computations. In [12], an end-to-end solution was devised to generate an optimized synthesizable FPGA accelerator from a high-level GAN specification, alleviating the challenges of inefficiency and resource underutilization faced by conventional convolutional accelerators. Yan et al. [6] proposed a novel mapping method
deconvolution is similar to that of 2D deconvolution. Zeros are first inserted between the rows and columns of the 2D data tiles of the original image, which is identical to 2D deconvolution. In addition, it is also necessary to insert 'zero' planes (i.e., the M1 plane) between every two 2D planes (i.e., the M2 plane), and a K × K × K kernel then performs convolutions with the inserted feature map to generate an R' × C' × Z' output map.
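The zero-insertion procedure described above can be sketched in a few lines of Python. This is an illustrative model rather than the paper's implementation: it assumes a cubic N × N × N input, a hypothetical stride S, a border padding of K - 1 zeros, and a flipped kernel, which is the standard way to express deconvolution as a convolution. For comparison, it also computes the same result in a scatter form in which each activation is multiplied by the whole kernel and overlapping products are accumulated, so no zero is ever multiplied. All names are hypothetical.

```python
from itertools import product


def deconv3d_scatter(x, w, n, k, s):
    """Scatter form: multiply each input activation by the K*K*K kernel and
    accumulate overlapping products; zeros are never multiplied."""
    out_n = (n - 1) * s + k
    out = {idx: 0 for idx in product(range(out_n), repeat=3)}
    for i, j, l in product(range(n), repeat=3):
        for a, b, c in product(range(k), repeat=3):
            out[(i * s + a, j * s + b, l * s + c)] += x[(i, j, l)] * w[(a, b, c)]
    return out


def deconv3d_zero_insert(x, w, n, k, s):
    """Zero-insertion form: place activations s apart (zeros between rows,
    columns and planes), pad the border with k - 1 zeros, then slide the
    flipped kernel. Also counts multiplications whose operand is a zero."""
    out_n = (n - 1) * s + k
    pad = k - 1
    # Sparse map of the inserted volume: a missing key is an inserted/padded zero.
    inserted = {
        (i * s + pad, j * s + pad, l * s + pad): x[(i, j, l)]
        for i, j, l in product(range(n), repeat=3)
    }
    out, zero_mults = {}, 0
    for o1, o2, o3 in product(range(out_n), repeat=3):
        acc = 0
        for a, b, c in product(range(k), repeat=3):
            v = inserted.get((o1 + a, o2 + b, o3 + c), 0)
            if v == 0:
                zero_mults += 1  # invalid operation: multiplication by zero
            acc += v * w[(k - 1 - a, k - 1 - b, k - 1 - c)]
        out[(o1, o2, o3)] = acc
    return out, zero_mults


if __name__ == "__main__":
    n, k, s = 2, 3, 2  # assumed sizes for illustration only
    x = {(i, j, l): 1 + i + 2 * j + 4 * l for i, j, l in product(range(n), repeat=3)}
    w = {(a, b, c): 1 + a + 3 * b + 9 * c for a, b, c in product(range(k), repeat=3)}
    dense, zero_mults = deconv3d_zero_insert(x, w, n, k, s)
    print("outputs match:", dense == deconv3d_scatter(x, w, n, k, s))
    print("multiplications by zero:", zero_mults)
```

With all input activations non-zero, the counter reports how many multiplications in the zero-insertion form are wasted on inserted or padded zeros. The scatter form performs only the N³ · K³ useful ones.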
IV. THE PROPOSED ARCHITECTURE
A. Architecture Overview
Fig. 2 presents an overview of our proposed architecture for accelerating 2D and 3D deconvolutions.
overlaps produced by PEg,l ∼ PEg,l are then sent to their FIFO-Hs, and the overlaps produced by PE?,O ∼ PEg,l are sent to their FIFO-Vs. At stage 2, activations I(0,2,0,0) ∼ I(
V. EXPERIMENTAL RESULTS
As a case study, we evaluate our design using four representative DCNN models: DCGAN, GP-GAN, 3D-GAN and V-Net. All the deconvolutional layers of the selected DCNNs have uniform 3 × 3 and 3 × 3 × 3 filters.
We quantitatively compare our FPGA implementation of 2D and 3D DCNNs with two other platforms: (1) a ten-core Intel E5 CPU (2.8 GHz) and (2) a GTX 1080 GPU.
VI. CONCLUSION
In this paper, we proposed a 2D and 3D deconvolution accelerator based on a uniform architecture on FPGA. We employed a mapping scheme of 2D and 3D deconvolutions on this architecture. To the best of our knowledge, this is the first work to implement 2D and 3D DCNNs on FPGA. By exploiting data transfer between adjacent PEs without invalid operations, our design achieves an acceleration of 63.3× compared with a CPU implementation, and an energy efficiency improvement of 8.3× compared with designs running on a GTX 1080 GPU.
