1,065 research outputs found
Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL
Recent technological advances have increased the available computing
power, memory, and speed of modern Central Processing Units (CPUs), Graphics
Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs).
Consequently, the performance and complexity of Artificial Neural Networks
(ANNs) are burgeoning. While GPU-accelerated Deep Neural Networks (DNNs)
currently offer state-of-the-art performance, they consume large amounts of
power. Training such networks on CPUs is inefficient, as data throughput and
parallel computation are limited. FPGAs are considered a suitable candidate for
performance-critical, low-power systems, e.g., Internet of Things (IoT) edge
devices. Using the Xilinx SDAccel or Intel FPGA SDK for OpenCL development
environment, networks described using the high-level OpenCL framework can be
accelerated on heterogeneous platforms. Moreover, the resource utilization and
power consumption of DNNs can be further improved by utilizing regularization
techniques that binarize network weights. In this paper, we introduce, to the
best of our knowledge, the first FPGA-accelerated stochastically binarized DNN
implementations, and compare them to implementations accelerated using both
GPUs and FPGAs. Our developed networks are trained and benchmarked using the
popular MNIST and CIFAR-10 datasets, and achieve near state-of-the-art
performance, while offering a >16-fold improvement in power consumption,
compared to conventional GPU-accelerated networks. Both our FPGA-accelerated
deterministic and stochastic BNNs reduce inference times on MNIST and CIFAR-10
by >9.89x and >9.91x, respectively.
Comment: 4 pages, 3 figures, 1 table
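The abstract above distinguishes deterministic from stochastic weight binarization without giving formulas. The following Python sketch shows one common formulation, assumed here rather than taken from the paper: the deterministic variant takes the sign of each weight, while the stochastic variant draws +1 with probability given by a hard sigmoid of the weight. The function names, the example weights, and the choice to map a weight of exactly zero to +1 are illustrative assumptions.

```python
import random


def hard_sigmoid(x):
    """Clip (x + 1) / 2 into [0, 1]; used as the probability of drawing +1."""
    return min(1.0, max(0.0, (x + 1.0) / 2.0))


def binarize_deterministic(weights):
    """Map each weight to +1 if it is non-negative, else -1 (assumed convention)."""
    return [1 if w >= 0 else -1 for w in weights]


def binarize_stochastic(weights, rng):
    """Map each weight to +1 with probability hard_sigmoid(w), else -1."""
    return [1 if rng.random() < hard_sigmoid(w) else -1 for w in weights]


# Hypothetical real-valued weights, chosen only for illustration.
weights = [-1.5, -0.2, 0.0, 0.3, 2.0]

det = binarize_deterministic(weights)
sto = binarize_stochastic(weights, random.Random(0))

print("deterministic:", det)
print("stochastic:   ", sto)
```

Under this formulation, weights at or below -1 always binarize to -1 and weights at or above +1 always binarize to +1, so the two variants differ only for weights inside (-1, 1).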
FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software
A deep-learning inference accelerator is synthesized from a C-language
software program parallelized with Pthreads. The software implementation uses
the well-known producer/consumer model with parallel threads interconnected by
FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes threads into
parallel FPGA hardware, translating software parallelism into spatial
parallelism. A complete system is generated where convolution, pooling and
padding are realized in the synthesized accelerator, with remaining tasks
executing on an embedded ARM processor. The accelerator incorporates reduced
precision and a novel approach to zero-weight skipping in convolution. On a
mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective
GOPS
- …
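The producer/consumer model with FIFO-connected threads mentioned in the abstract above can be sketched as follows. The paper's implementation is C software parallelized with Pthreads and synthesized by LegUp; this sketch uses Python's `threading` and `queue` modules purely to illustrate the structure, and the stage function, queue depth, and sentinel convention are all assumptions, not details from the paper.

```python
import queue
import threading

SENTINEL = None  # assumed end-of-stream marker; items themselves must not be None


def producer(items, out_q):
    """Push every input item into the first FIFO, then signal end of stream."""
    for item in items:
        out_q.put(item)
    out_q.put(SENTINEL)


def stage(fn, in_q, out_q):
    """Consume from one FIFO, apply fn, and produce into the next FIFO."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)  # forward the end-of-stream marker downstream
            break
        out_q.put(fn(item))


def consumer(in_q, results):
    """Drain the final FIFO into a results list until end of stream."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item)


def run_pipeline(items, fn):
    """Run producer -> stage -> consumer, each in its own thread."""
    q1 = queue.Queue(maxsize=4)  # bounded FIFOs apply back-pressure
    q2 = queue.Queue(maxsize=4)
    results = []
    threads = [
        threading.Thread(target=producer, args=(items, q1)),
        threading.Thread(target=stage, args=(fn, q1, q2)),
        threading.Thread(target=consumer, args=(q2, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical stage standing in for a real layer such as convolution.
doubled = run_pipeline([1, 2, 3], lambda x: x * 2)
print(doubled)
```

Because each stage runs in a single thread and the queues are first-in, first-out, results leave the pipeline in input order. In CPython the threads here do not execute compute-bound work in parallel; the sketch shows only how stages are decoupled by queues, which is the property an HLS tool can map to concurrently running hardware.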