ABSTRACT Deep neural networks (DNNs) have recently achieved remarkable performance in a myriad of applications, ranging from image recognition to language processing. Training such networks on graphics processing units (GPUs) currently offers unmatched levels of performance; however, GPUs have large power requirements. With recent advancements in high-level synthesis (HLS) techniques, new methods for accelerating deep networks using field programmable gate arrays (FPGAs) are emerging. FPGA-based DNNs present substantial advantages in energy efficiency over conventional CPU- and GPU-accelerated networks. Using the Intel FPGA software development kit (SDK) for OpenCL development environment, networks described using the high-level OpenCL framework can be accelerated targeting heterogeneous platforms including CPUs, GPUs, and FPGAs. These networks, if properly customized on GPUs and FPGAs, can be ideal candidates for learning and inference in resource-constrained portable devices such as robots and Internet of Things (IoT) edge devices, where power is limited and performance is critical. Here, we introduce GPU- and FPGA-accelerated deterministically binarized DNNs, tailored toward weed species classification for robotic weed control. Our developed networks are trained and benchmarked using a publicly available weed species dataset, named DeepWeeds, which includes close to 18 000 weed images. We demonstrate that our FPGA-accelerated binarized networks significantly outperform their GPU-accelerated counterparts, achieving a >7-fold decrease in power consumption, while performing inference on weed images 2.86 times faster compared to our best performing baseline full-precision GPU implementation. These significant benefits are gained whilst losing only 1.17% of validation accuracy.
This is a significant step toward enabling deep inference and learning on IoT edge devices and smart portable machines such as agricultural robots, which are the target application of this paper.
I. INTRODUCTION
The promise of robotic weed control to provide a step change in the productivity of primary industry is widely coveted [1], [2]. The rationale is clear: replace human involvement in this time-consuming and labor-intensive undertaking with more efficient autonomous machines. In addition, improving the efficacy of weed control would have enormous economic impact [3]. The majority of current works in this arena focus on the four core technologies of robotic weed control: detection, mapping, guidance, and control [4]. The robust and efficient detection of weed species remains a major obstacle to the widespread uptake of robotic weed control technologies [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Xiaochun Cheng.
The use of DNNs specifically tasked for plant classification has demonstrated strong performance in recent works [5]-[7]. Using the Intel FPGA SDK for OpenCL development environment, networks described using the high-level OpenCL framework can be accelerated by targeting heterogeneous platforms with CPUs, GPUs, and FPGAs. These developed networks are ideal candidates for edge computing applications, where low power consumption and high performance are critical.
[Figure: Block diagram of a complete robotic weed management system, in which our developed weed classification inference engine performs low-power, real-time weed classification and can trigger herbicide sprayers for the detected and classified weed species.]
In this work, we investigate the acceleration of binarized DNNs on GPUs and FPGAs using the high-level OpenCL framework for weed species classification targeted toward robotic weed control, as depicted in Figure (1). We demonstrate that our FPGA implementations, employing an Intel DE1-SoC FPGA development board customized for edge processing, exhibit comparable performance to state-of-the-art hardware implementations employing an NVIDIA Titan V GPU and an AMD Ryzen 2700X @ 4.10 GHz Overclocked (OC) CPU, which are typically used for conventional desktop-based processing. Our specific contributions are five-fold:
• We implement and present the first FPGA-accelerated binarized DNN specifically tasked for weed species classification.
• Our complete FPGA-accelerated DNN runs on a standalone System On a Chip (SoC), requiring no host computer or extra device for partial computation.
• We investigate the effect of down-sampling input images on the DNN classification accuracy, and demonstrate that significantly reducing the image resolution has a marginal effect on accuracy.
• We thoroughly compare our efficient implementations on GPU and FPGA platforms and demonstrate that our new binarized FPGA-accelerated DNNs offer significantly reduced power usage while lowering per-image inference times compared to their conventional GPU-accelerated counterparts.
• We make our software code publicly available, to provide the community with the opportunity to replicate our experimental results and to adapt our techniques to their own applications.
The paper is structured as follows: Section II presents related works. Section III presents an overview of the algorithms and methods used in our designed networks. Section IV presents our new labeled version of the DeepWeeds dataset [8], DeepWeedsX. Our image pre-processing techniques are detailed in Section V. Section VI presents our developed software and hardware network architectures. Section VII reports the effect of image down-sampling on the network performance. Section VIII presents and discusses our software and hardware results, while Section IX provides further discussions on classification and real-time performance of the proposed design. The paper is concluded in Section X.
II. RELATED WORKS
A variety of techniques have been explored to automatically detect and classify target plant life. Means of detection can be categorised into one of three representations of the light spectrum: image-based [5] - [7] , [9] - [15] , spectrum-based [16] , [17] and spectral image-based [18] , [19] .
In [8] , we have achieved and demonstrated 95.7% weed classification accuracy on our large multiclass weed species image dataset, named DeepWeeds, using the ResNet-50 [20] Convolutional Neural Network (CNN) architecture on a single NVIDIA GTX 1080Ti GPU.
Performing real-time learning and inference, targeted for plant classification, on high-performance GPUs such as the NVIDIA GTX 1080Ti is expected to consume large amounts of power and hence is ill-suited to deployment in portable resource-constrained smart devices and robotic systems, which are becoming commonplace. Consequently, devising methods and hardware synthesis techniques for reducing the power consumption and improving the throughput of DNN hardware becomes essential.
While GPU-based implementations targeted towards image classification tasks are plentiful [21] - [26] , only one recent work [27] has detailed the implementation of custom hardware accelerators specifically tasked for agricultural purposes. Reference [27] demonstrates real-time performance by implementing an FPGA-based DNN on a Terasic DE1-SoC for plant detection in organic farming. The system developed in [27] can classify a target dataset with accuracy matching that of a state-of-the-art GPU while running at up to 42 frames per second with only 4 W of power consumption, which is 45 times lower than an NVIDIA GTX 1080Ti.
One solution offered by developers such as NVIDIA is the embedded GPU for mobile applications. The Jetson family of compute modules and the NVIDIA TensorRT library provide a power-efficient platform for accelerating deep learning architectures, some of which have already started being used in agriculture [8], [28]-[30]. Although embedded GPUs offer substantial power improvements over conventional GPUs, they are still relatively power hungry when compared to FPGAs. Moreover, they are usually cost prohibitive and are not suitable for developing products that require mass production, such as inference engines on IoT edge devices [31], [32]. Functionally verified HDL implementations developed for FPGA platforms can easily be synthesized to target Application Specific Integrated Circuits (ASICs), yielding considerable reductions in production costs. With recent advancements in HLS techniques, the development of FPGA-accelerated DNNs has been greatly expedited. FPGA-based DNNs present substantial advantages in energy efficiency over conventional GPU-accelerated networks, while exhibiting marginal performance degradation [33]-[39].
III. PRELIMINARIES
This section briefly reviews the algorithms and methods used in our developed networks for the DeepWeeds classification benchmark.
A. SOFTMAX REGRESSION & CROSS ENTROPY LOSS
The Softmax model is commonly used to apply logistic regression to multinomial problems. Softmax regression [40] determines a discrete probability distribution over the classes, ρ_i, for the output of the final layer in our developed deep networks using Equation (1):
$$\rho_i = \frac{e^{y_i}}{\sum_{k=1}^{N} e^{y_k}}, \qquad (1)$$
where y_i represents the predicted class (the final-layer output for class i). In addition,
$$\sum_{i=1}^{N} \rho_i = 1,$$
where N is the number of classes to be distinguished. Softmax regression is commonly used in tandem with cross-entropy loss, presented in Equation (2), to enable the network to learn different classes during backward propagation:
$$L = -\sum_{i=1}^{N} y_i \log(\rho_i), \qquad (2)$$
where, in Equation (2), y_i represents the class label rather than the network output of Equation (1).
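As an illustration of how Softmax regression and cross-entropy loss fit together, the following is a minimal sketch in plain Python. The function names and the example scores are hypothetical and chosen only for illustration; the max-subtraction step is a standard numerical-stability trick and is an assumption here, not something stated in the text.

```python
import math


def softmax(scores):
    """Map final-layer outputs to a discrete probability distribution."""
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label_index):
    """Cross-entropy loss for a one-hot class label at label_index."""
    return -math.log(probs[label_index])


# Hypothetical final-layer outputs for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probs, 0)
```

The loss is small when the probability assigned to the labeled class is close to 1 and grows as that probability shrinks, which is what drives learning during backward propagation.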
B. BINARY WEIGHT REGULARIZATION
Since the solution space of DNNs is very broad, networks adopting gradient descent optimization algorithms, such as stochastic gradient descent with momentum, are susceptible to overfitting to training data. Overfitting significantly affects the generalization ability of the network and can lead to poor performance on test data. Regularization is any modification made to a learning algorithm to reduce its generalization error, but not its training error. Binary weight regularization, as proposed in [41], constrains network weights to either +1 or −1 during forward and backward propagations. The binarization operation transforms the full-precision weights into binary values. Deterministic binarization is based on the sign function presented in Equation (3):
$$w_b = \operatorname{sign}(w) = \begin{cases} +1 & \text{if } w \geq 0, \\ -1 & \text{otherwise,} \end{cases} \qquad (3)$$
where w_b is the binarized weight and w is the real-valued full-precision weight.
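Deterministic binarization can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and mapping a weight of exactly zero to +1 is an assumed convention.

```python
def binarize(w):
    """Deterministic binarization of a single full-precision weight.

    ASSUMPTION: a weight of exactly zero maps to +1.
    """
    return 1.0 if w >= 0 else -1.0


def binarize_weights(weights):
    """Binarize a list of full-precision weights."""
    return [binarize(w) for w in weights]


binary = binarize_weights([0.3, -0.2, 0.0, -1.7])
```

In binarized training schemes of this kind, the full-precision weights are typically kept for the parameter update, and the binarized copies are used during forward and backward propagation.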
C. L2 AND BINARYNET REGULARIZATION LOSS TERMS
Regularization is usually added as a term to the learning loss function, to introduce another degree of control over the growth of the network parameters and to help avoid over- and under-fitting to the training data. Several different regularization techniques have been proposed in the literature, such as L1 and L2 regularization [42], dropout [43], and data augmentation [44].
Here, we utilize L2 regularization for implementations employing full resolution weights, and BinaryNet-regularization for our implementations employing binarized weights. The L2-regularization term is defined in Equation (4).
Here, J(w, b) denotes the overall loss function that includes a task-related loss term, L(w, b), summed with a regularization term, in which D determines the relative significance of the regularization, N denotes the total number of trainable network parameters, w represents a weight, and b is a bias. This regularization technique penalizes large network weights by adding the sum of their squares to the loss function, which helps prevent over-fitting. However, as Equation (4) is differentiable at w_i = 0, L2-regularization has a non-sparse solution, and does not perform feature selection. BinaryNet-regularization [45] can be defined as Equation (5),
where all the parameters are similar to Equation (4). Using BinaryNet-regularization, the weights are driven toward +1 or −1, rather than toward 0 as in L2-regularization, which is suitable for binary networks. In the binarized training procedure, the sign function implements Equation (3), and clip() clips values between −1 and +1.
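The difference between the two regularization terms can be illustrated with a short Python sketch. The exact forms of Equations (4) and (5) are not reproduced in this text, so both penalty forms below are assumptions: the L2 term is taken as D times the sum of squared weights, and the binary term as D times the sum of |1 − |w_i||, one possible penalty that vanishes at ±1. All function names are hypothetical.

```python
def l2_penalty(weights, d):
    """ASSUMED L2 form: D times the sum of squared weights (minimum at 0)."""
    return d * sum(w * w for w in weights)


def binary_penalty(weights, d):
    """ASSUMED binary form: D times the sum of |1 - |w|| (minimum at +1/-1)."""
    return d * sum(abs(1.0 - abs(w)) for w in weights)


def total_loss(task_loss, weights, d, penalty):
    """Overall loss: task-related loss plus the chosen regularization term."""
    return task_loss + penalty(weights, d)


weights = [0.9, -1.1, 0.05]
j_l2 = total_loss(0.5, weights, 1e-5, l2_penalty)
j_bin = total_loss(0.5, weights, 5e-7, binary_penalty)
```

Under these assumed forms, a weight near 0 contributes almost nothing to the L2 term but close to the maximum per-weight cost to the binary term, which is the behaviour the text describes: one penalty pulls weights toward 0, the other toward +1 or −1.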
IV. DEEPWEEDSX DATASET
The images used to construct the DeepWeedsX dataset have previously been made openly accessible [8]; however, they have not been labeled as test and training images. Instead, they were used in a 5-fold cross-validation configuration for training and validation in [8]. Here, we present a labeled variant of DeepWeeds, DeepWeedsX, with clearly defined training and test datasets. We use a splitting ratio of 6:1 (train:test), similar to the popular MNIST [46], CIFAR-10 [47], and CIFAR-100 [47] image classification datasets. The class distribution of the DeepWeedsX dataset is presented in Table (1). A validation subset may be constructed for parameter optimization using a subset of the labeled training data. In order to facilitate future comparisons to this work, DeepWeedsX, including all its labels and images, is made publicly available. In addition, we have developed data-loaders for PyTorch and TensorFlow, which are intended to assist its use in new deep learning experiments. 1 It is worth noting that one image was removed from the original DeepWeeds dataset [8] to ensure the training and test subsets have the same class distributions.
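A class-preserving 6:1 split of the kind described here could be sketched as follows. This is not the procedure used to build DeepWeedsX, which is not detailed in the text; the function name, the seed, and the toy label list are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_parts=6, test_parts=1, seed=0):
    """Split sample indices per class so both subsets share class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # One part in (train_parts + test_parts) of each class goes to test.
        n_test = len(idxs) * test_parts // (train_parts + test_parts)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Toy labels: three classes with 14, 7, and 21 samples.
labels = [0] * 14 + [1] * 7 + [2] * 21
train_idx, test_idx = stratified_split(labels)
```

A validation subset can be carved out of the training indices in the same way, by calling the function again on the training labels only.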
V. IMAGE PRE-PROCESSING
Images available from our DeepWeedsX dataset have a resolution of 256×256 pixels. In order to enhance the testing accuracy, several pre-processing steps can be undertaken before the images are presented to the network. All of our implementations adopt one of two image pre-processing approaches, denoted Image Pre-processing Techniques (IPT) and Further Image Pre-processing Techniques (FIPT).
A. IMAGE PRE-PROCESSING TECHNIQUES (IPT)
IPT down-samples input images from the native resolution of 256×256 to a resolution of 224×224, 64×64, or 32×32 pixels. No further image pre-processing techniques are performed.
B. FURTHER IMAGE PRE-PROCESSING TECHNIQUES (FIPT)
FIPT, similarly to [22], randomly crops all images from a resolution of 256×256 to a resolution of 224×224 pixels. This ensures that the input images have the same size as the images in well-known datasets such as ImageNet [22], which allows networks developed for those datasets to be deployed on our DeepWeeds images. After all images are cropped, they are down-sampled to a resolution of 64×64 or 32×32 pixels. They are then randomly rotated by between −15 and +15 degrees. Their brightness, contrast, and saturation are also randomly varied by 10%, and pixel values are normalized to between 0 and 1. Finally, the color channels of each image are normalized using distinct mean and standard deviation values, as seen in Table (2), which have demonstrated strong performance on the ImageNet dataset.
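The random parameters that FIPT draws for each image can be sketched as follows. This sketch only samples the augmentation parameters and normalizes a single pixel; it does not process images. Several details are assumptions: "varied by 10%" is read as a multiplicative factor in [0.9, 1.1], and the channel statistics are the commonly used ImageNet values, which may differ from those in Table (2). All names are hypothetical.

```python
import random

# ASSUMPTION: commonly used ImageNet channel statistics (R, G, B).
CHANNEL_MEAN = (0.485, 0.456, 0.406)
CHANNEL_STD = (0.229, 0.224, 0.225)


def sample_fipt_params(rng, src=256, crop=224):
    """Draw one set of random FIPT augmentation parameters."""
    return {
        "crop_left": rng.randint(0, src - crop),  # top-left corner of the crop
        "crop_top": rng.randint(0, src - crop),
        "angle_deg": rng.uniform(-15.0, 15.0),    # random rotation
        "brightness": rng.uniform(0.9, 1.1),      # ASSUMED +/-10% factors
        "contrast": rng.uniform(0.9, 1.1),
        "saturation": rng.uniform(0.9, 1.1),
    }


def normalize_pixel(rgb, mean=CHANNEL_MEAN, std=CHANNEL_STD):
    """Normalize one RGB pixel (values already scaled to [0, 1]) per channel."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


params = sample_fipt_params(random.Random(0))
pixel = normalize_pixel((0.5, 0.5, 0.5))
```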
VI. NETWORK ARCHITECTURE
The complete structure of the implemented networks consists of two main components: the software and the hardware. The software architecture defines the targeted neural network structures, which are described in C++ and OpenCL kernels. The hardware architecture describes the integration of the hardware used to run the OpenCL kernels, and a host controller, which is the program executed on a host processor to launch OpenCL kernels and to manage available memory.
A. SOFTWARE ARCHITECTURE
Three popular deep network architectures, i.e. VGG-16 [21], DenseNet-128-32 [48], with 128 layers and a growth rate of 32, and Wide Residual Network (WRN)-28-10 [49], with 28 layers and a widening factor of 10, were trained using full resolution and deterministically binarized weights for comparison. Although networks with tuned hyper-parameters would be expected to achieve higher validation accuracies, we present baseline implementations on our labeled dataset without any hyper-parameter tuning. The mentioned networks are chosen as they demonstrate strong performance on the ImageNet dataset, which we believe to be indicative of high performance on the DeepWeedsX dataset. In favour of reproducible research, we have made the specific code-level implementations of all these networks publicly available through a GitHub repository. 1 More details on our developed software network architectures are as follows. The output of each network's last layer is fed through a Softmax activation layer [40], and each network's loss is minimized using cross-entropy. Stochastic gradient descent with momentum [50] is used to optimize network parameters. For all networks, the momentum variable, m, is set to 0.8. In our implementations, binarized neural networks (BNNs) employ a smaller initial learning rate, η[0] = 0.001, to avoid frequent sign changes, while conventional networks using full resolution weights use a larger initial learning rate of η[0] = 0.01 to speed up learning convergence. In addition, the regularization coefficient is D = 5e-7 for our BinaryNet-regularization terms and D = 1e-5 for our L2-regularization terms. For all BNNs, the ReLU activation functions used in conventional networks are replaced with the hyperbolic tangent function, tanh.
Furthermore, in order to maximize the networks' performance, a decaying learning rate is used during training for all networks. This learning rate, η, is decayed by a factor of ten when learning stagnates, i.e. when no improvement is observed for a period of 10 epochs. Finally, to accelerate learning convergence, the full resolution weights used in the parameter update stage are initialized from a random uniform distribution using the He initialization technique presented in [51].
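The learning-rate decay rule and the weight initialization can be sketched in plain Python as follows. The class and function names are hypothetical. Two details are assumptions: the monitored quantity is taken to be a metric where higher is better (for example validation accuracy), and the uniform He bound is taken as sqrt(6 / fan_in), the usual form for ReLU-style layers.

```python
import math
import random


class PlateauDecay:
    """Divide the learning rate by ten after `patience` epochs without improvement."""

    def __init__(self, lr, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Call once per epoch with the monitored metric (ASSUMED higher is better)."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


def he_uniform(fan_in, count, rng):
    """Draw `count` weights uniformly from [-b, b] with b = sqrt(6 / fan_in)."""
    bound = math.sqrt(6.0 / fan_in)
    return [rng.uniform(-bound, bound) for _ in range(count)]


scheduler = PlateauDecay(lr=0.01)  # initial rate for full resolution weights
weights = he_uniform(fan_in=27, count=8, rng=random.Random(0))
```

With the initial rates given in the text, a stagnating full resolution run would step from 0.01 to 0.001 after ten epochs without improvement, and a BNN run from 0.001 to 0.0001.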
B. HARDWARE ARCHITECTURE
After functionally verifying our implementations using the PyTorch [52] ML library with a state-of-the-art Titan V GPU and an AMD Ryzen 2700X @ 4.10 GHz Overclocked (OC) CPU, we developed hardware architectures consisting of C++ host controllers and multiple OpenCL kernels, which were accelerated using FPGAs and GPUs. For x86-based systems, OpenCL kernels accelerated using FPGAs typically reside on an FPGA development board, which is connected to a separate independent host system through the PCI Express (PCIe) interface [35]. For ARM-based systems, the FPGA is typically connected to a Hard Processor System (HPS) on a SoC through specialized bridges, as in the case of the Intel DE1-SoC development board used herein. This allowed our proposed FPGA-accelerated networks to run completely independently using the SoC, without using a separate device for computation.
Our OpenCL implementations originate from the publicly available DeepCL OpenCL ML library. 2 All convolutional, inner-product, activation, pooling, regularization, and batch-normalization operations are described using single work-item kernels with an NDRange size of (1, 1, 1). Multi-mode 3D NDRange kernels are used to fetch and store data to and from the global memory for all computation pipelines, similarly to [35]. Consequently, our implementations operate with minimal controller computation. These efforts work toward an eventual implementation that avoids controller computation completely, with the aim of making our future designs applicable to both GPUs and non-SoC FPGAs while avoiding the power overhead of the host controller that the OpenCL programming model requires. In the following subsections, we detail the different synthesis constructs, such as the loop unroll factor (#pragma unroll for GPU) and the Single-Instruction-Multiple-Data (SIMD) vectorization factor, used for our developed single work-item kernels targeted for acceleration on heterogeneous platforms.
1) CONVOLUTIONAL KERNEL
Convolutional operations were implemented by mapping 3-D convolutions as matrix multiplication operations, by flattening and rearranging the input features, similarly to [53] . Each work-item performs either fused multiply and accumulate (MAC) operations, or accumulate operations, on the local memory data, depending on whether weights are quantized to binary states or not. This process is accomplished using the loop unrolling technique, and is repeated by sliding the convolutional window to get the corresponding elements in the product matrix. We further detail the XNOR kernels, described using Register Transfer Level (RTL) modules, utilized in our FPGA BNN implementations for convolutional and FC kernels in Section VI-C.
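The idea of mapping a convolution to a matrix multiplication can be sketched in plain Python. This is a software illustration only, not the OpenCL kernel itself; it assumes unit stride and no padding, and the function names are hypothetical.

```python
def im2col(x, kh, kw):
    """Flatten every kh x kw window of x ([C][H][W] nested lists) into a row."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    rows = []
    for i in range(h - kh + 1):          # slide the window vertically
        for j in range(w - kw + 1):      # slide the window horizontally
            rows.append([x[ch][i + di][j + dj]
                         for ch in range(c)
                         for di in range(kh)
                         for dj in range(kw)])
    return rows


def conv_as_matmul(x, filters):
    """Convolve x with filters ([K][C][kh][kw]) via a matrix product.

    Returns [K][out_h * out_w]; each entry is a sequence of
    multiply-accumulate operations over one flattened window.
    """
    kh, kw = len(filters[0][0]), len(filters[0][0][0])
    patches = im2col(x, kh, kw)
    flat_filters = [[v for ch in f for row in ch for v in row] for f in filters]
    return [[sum(a * b for a, b in zip(f, p)) for p in patches]
            for f in flat_filters]


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one channel, 3 x 3
kernel = [[[[1, 0], [0, 1]]]]                 # one filter, 2 x 2
result = conv_as_matmul(image, kernel)
```

Because each output element is an independent dot product over a flattened window, the inner loop is a natural candidate for loop unrolling in a hardware kernel.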
2) INNER-PRODUCT KERNEL
Inner-product operations were implemented for Fully Connected (FC) layers using single work-item kernels, where, similarly to our convolutional kernel, fused MAC, or accumulate operations are performed, depending on whether weights are quantized to binary states or not.
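The distinction between the two modes can be illustrated with a small Python sketch: with full resolution weights each term needs a multiply and an accumulate, whereas with weights restricted to +1 or −1 the multiply reduces to choosing between an addition and a subtraction. The function names are hypothetical and the sketch is not the OpenCL kernel itself.

```python
def fc_mac(inputs, weights):
    """Inner product with full resolution weights: multiply-accumulate."""
    acc = 0.0
    for x, w in zip(inputs, weights):
        acc += x * w
    return acc


def fc_accumulate(inputs, binary_weights):
    """Inner product with weights in {-1, +1}: accumulate only, no multiplies."""
    acc = 0.0
    for x, wb in zip(inputs, binary_weights):
        if wb > 0:
            acc += x
        else:
            acc -= x
    return acc


inputs = [0.5, -1.0, 2.0]
binary_weights = [1.0, -1.0, -1.0]
out_mac = fc_mac(inputs, binary_weights)
out_acc = fc_accumulate(inputs, binary_weights)
```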
3) ACTIVATION KERNEL
All activation functions were computed at the output of convolution and inner product implementations using single work-item kernels.
4) POOLING, REGULARIZATION, & BATCH-NORMALIZATION KERNELS
Pooling, regularization, optimization, and batch normalization operations were implemented using single work-item kernels, where acceleration is achieved by unrolling the loop to generate parallel outputs in a single cycle.
C. PLATFORM SPECIFIC COMPILATION & PERFORMANCE ENHANCING TECHNIQUES
In this section we detail the platform specific compilation details, libraries, and resources required for compilation and deployment of our developed host controller and OpenCL kernels, for FPGA and GPU platforms.
Both our GPU and FPGA implementations of BNNs are accelerated using XNOR kernels, which enable Single Instruction Multiple Data (SIMD) within a register (SWAR) [41]. Here, weights that would otherwise each occupy a 32-bit floating-point value are binarized and concatenated in groups of 32 into 32-bit registers, resulting in a 32-fold speed-up on bitwise operations.
1) FPGA
To compile OpenCL kernels for FPGA, the Intel FPGA SDK for OpenCL Offline Compiler (IOC) was used, as part of the Intel FPGA SDK for OpenCL and Quartus Prime Design Suite 18.1. IOC fully supports version 1.0 of the OpenCL specification, and has some preliminary support for newer features from version 2.0. Figure (2) presents the compilation flow for the IOC. Here, the inputs are a set of OpenCL kernels, and the output is a single .aocx file containing the FPGA image and the information needed to program the FPGA at runtime. The host application loads this file to create program objects, which are used to program the target FPGA, as required for all kernel launch operations.
SWAR was also implemented using a digital logic approach, similarly to [54] , to accelerate our FPGA BNN implementations. Figure (3) illustrates SIMD using 32-bit registers. Using RTL modules, the given size of each register to store binary weights is arbitrary and not constrained by the computer architecture. These instructions, therefore, take only four clock cycles on FPGAs.
We integrate our developed XNOR RTL module into the Intel FPGA SDK for OpenCL Pipeline using the IOC. Our RTL module has a balanced latency, where its threads match the number of pipeline stages in our design. This allows the threads of the RTL module to execute without stalling the SDK's pipeline and bottlenecking operational performance. 
2) GPU
In addition to accelerating the targeted DeepWeedsX recognition networks on the Intel DE1-SoC FPGA development board, the networks were also accelerated on a state-of-the-art Titan V GPU to execute OpenCL kernels and an AMD Ryzen 2700X @ 4.10 GHz Overclocked (OC) CPU to drive the host controller. We use version 419.35 of the Titan V GPU driver to launch compute kernels. Using SWAR OpenCL kernels on GPUs, it is possible to evaluate 32 network connections with only 3 instructions (accumulation, popcount, and xnor), as described in Equation (6):
$$a_l \mathrel{+}= \operatorname{popcount}\left(\operatorname{xnor}\left(a^{32b}_{l-1},\, w^{32b}_{l}\right)\right). \qquad (6)$$
Here, l denotes the layer number, a_l is the resulting weighted sum, and a^{32b}_{l−1} and w^{32b}_l are the concatenated inputs and weights. These instructions take 6 clock cycles overall, including 1 for accumulation, 4 for popcount, and 1 for xnor, on the NVIDIA Titan V GPU used [55].
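The xnor, popcount, and accumulate sequence can be illustrated in plain Python, using integers as stand-ins for 32-bit registers. This is a software sketch rather than the OpenCL or RTL implementation; the function names and the bit encoding (bit 1 for +1, bit 0 for −1) are assumptions made for illustration.

```python
def pack_bits(values):
    """Pack values in {-1, +1} into one integer word; bit i is 1 when values[i] is +1."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word


def xnor_popcount(a_word, w_word, n_bits=32):
    """Count the bit positions where the two packed words agree."""
    mask = (1 << n_bits) - 1
    agree = ~(a_word ^ w_word) & mask   # xnor, limited to n_bits
    return bin(agree).count("1")        # popcount


def binary_dot(a_vals, w_vals):
    """Dot product of two {-1, +1} vectors using xnor and popcount only."""
    n = len(a_vals)
    matches = xnor_popcount(pack_bits(a_vals), pack_bits(w_vals), n)
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n


activations = [1, -1, 1, 1]
weights = [1, 1, -1, 1]
weighted_sum = binary_dot(activations, weights)
```

The conversion `2 * matches - n` recovers the signed dot product from the raw popcount; an implementation may instead fold this affine correction into a later stage.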
While NVIDIA's CUDA compiler is presently much more efficient and mature than their OpenCL compiler, we use OpenCL implementations across FPGA and GPU platforms to ensure a fair comparison. We note that enhanced timing performance is expected on NVIDIA GPUs using CUDA alongside NVIDIA's Deep Neural Network library (cuDNN).
VII. INVESTIGATING THE EFFECT OF IMAGE DOWN-SAMPLING AND PREPROCESSING ON PERFORMANCE
Before gathering implementation results, VGG-16 [21], DenseNet-128-32 [48], and WRN-28-10 [49] were trained using full resolution weights with a batch size of 32, for input images down-sampled to three different sizes: 32×32, 64×64, and the original size of 224×224. This was performed to investigate the effect of down-sampling on the validation accuracy of our chosen deep networks. We restrict the batch size to 32 and below due to the harsh real-time constraints presented by our specific edge computing use case for robotic weed control, which is discussed further in Section IX-B. Figures (4), (5), and (6) show the validation accuracy of all networks over 200 training epochs. While other works report results over a 500-epoch training routine, we observed that learning on the training set saturated at around 150 epochs and therefore report results over 200 epochs. The maximum validation accuracy achieved for each implementation is presented in Table (3). It is worth noting that, for the networks shown in Table (3), no hyper-parameter optimization is performed. In addition, for the FIPT cases, we replicated the image pre-processing steps originally proposed in our previous work [8], which demonstrated strong performance for input images with a resolution of 224×224. It is expected that hyper-parameter optimization, different types or amounts of image pre-processing, or a combination of them could improve validation accuracy.
From Figures (4), (5), and (6) it can be observed that down-sampling input images to (3, 32, 32) leads to performance degradation, compared to using images with (3, 64, 64) or those with a native resolution of (3, 224, 224). In addition, the figures show that for low-resolution images, i.e. (3, 32, 32) and (3, 64, 64), performing FIPT lowers accuracy. FIPT is useful in the case of the native resolution images and leads to the highest accuracy achievable.
Interestingly, it is possible to achieve >85% validation accuracy using down-sampled input images with IPT at (3, 32, 32), which can significantly improve processing speed and reduce the total required memory utilization during inference, compared to the use of higher resolution images. Therefore, we down-sample all images to (3, 32, 32) to obtain all our implementation results reported in the following sections. This is expected to drastically reduce the training time required, as well as the resource utilization for all networks, which has previously been demonstrated to be governed by the total number of trainable network parameters, input image size, and network configuration [56]. Table (4) presents the total trainable network parameters for each network architecture used to obtain our implementation results. For VGG-16 and DenseNet-128-32 architectures, additional parameters are required for larger resolution input images to implement additional max pooling and fully connected layers. However, the WRN-28-10 network parameter size shows no dependency on the input image resolution, which could be attributed to its wide (not deep) structure.
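The reduction in input size that down-sampling provides can be checked with a short calculation. The sketch below counts the values in one input tensor at each resolution; the assumption that each value is stored as a 32-bit float is made here only to give a byte figure.

```python
def num_values(shape):
    """Number of scalar values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n


BYTES_PER_VALUE = 4  # ASSUMPTION: 32-bit floating-point storage

small = num_values((3, 32, 32))     # 3 * 32 * 32   = 3072 values
large = num_values((3, 224, 224))   # 3 * 224 * 224 = 150528 values
ratio = large // small              # (224 / 32) ** 2 = 49
small_bytes = small * BYTES_PER_VALUE
large_bytes = large * BYTES_PER_VALUE
```

A (3, 32, 32) input therefore carries 49 times fewer values than a (3, 224, 224) input, before any effect on the network parameters themselves is considered.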
VIII. IMPLEMENTATION RESULTS
Initially, conventional networks trained using full resolution weights were all implemented on the GPU platform (see Table (3)). This provided a baseline for comparing validation accuracy against both previous works and our binarized implementations. Direct comparison to the relevant previous work [8] is not possible using the given labeled dataset (DeepWeedsX), because in [8] five-fold cross-validation is used to report accuracy, while here we obtained accuracy using a train-test dataset split. Nonetheless, the best validation accuracy achieved here (95.85%) using WRN-28-10 with FIPT is marginally better than the best accuracy previously achieved using ResNet-50 with FIPT (95.7%), as reported in [8].
In order to validate and investigate the performance of the proposed FPGA- and GPU-accelerated BNN architectures on the DeepWeedsX dataset, the validation error rate, power consumption, and Inference Time Per Image (ITPI) were analyzed. On both FPGA and GPU platforms, all aforementioned performance metrics were determined during inference, i.e. after importing trained weights, for batch sizes of 4, 8, 16, and 32, to investigate the effect of the batch size on inference performance. As discussed earlier, we restrict the batch size to between 4 and 32 due to the harsh real-time constraints presented by our specific edge computing use case for robotic weed control (see Section IX-B). For FPGA implementations we also report the resource utilization, which is an important parameter in identifying hardware cost.
The total kernel power usages were determined using the Post Place & Route Estimator for FPGA post-synthesis, and NVIDIA-SMI for GPU. To ensure all reported power usage readings are accurate for our GPU implementations, we artificially elongate kernel execution times when measuring GPU kernel power utilization.
It is worth noting that there is no need to consider the time required for image pre-processing, because for all (3, 32, 32) cases, networks employing IPT outperform their FIPT counterparts (see Table (3)). It is expected that during real-time inference, images are fed directly to the inference engine after undergoing pipelined real-time image resizing similar to [57], hence requiring no separate down-sampling step (IPT). There is also no need for FIPT, because FIPT is only useful for (3, 224, 224) images, whereas we use images of size (3, 32, 32).
A. BASELINE IMPLEMENTATIONS
Baseline implementations of VGG-16 [21], DenseNet-128-32 [48], and WRN-28-10 [49] were trained using full resolution weights with a batch size of 32, on GPU, for images of size (3, 32, 32). The collected performance metrics are reported in Table (5). It can be observed that DenseNet L=128, K=32 achieves the highest validation accuracy of 90.08%. It requires 0.64 ms less inference time per image than VGG-16, while consuming the same amount of power and yielding an increase of 3.36% in validation accuracy. Compared to the WRN-28-10 network, DenseNet L=128, K=32 achieves a 1.2% improvement in accuracy, while being 1.86 ms slower in image inference but consuming 3.67 W less power.
Furthermore, in order to determine class specific performance, a confusion matrix for the top performing baseline implementation, DenseNet L=128, K=32, is presented in Figure (8) . It can be observed that the individual class accuracy is weakly correlated to the class distribution in the DeepWeeds dataset. Further species-specific performance discussion is presented in Section IX-A.
B. BINARY IMPLEMENTATIONS
We implement and train new binary versions of the selected deep architectures across GPU and FPGA platforms. The validation accuracy for each epoch during training for each implementation is presented in Figure (7). As all networks are trained using the same NVIDIA Titan V GPU, only one plot for each network is presented, and the validation accuracies and corresponding confusion matrices are identical across platforms.
From Figure (7) , it can be observed that, similar to our baseline implementations adopting full resolution weights, binary DenseNet L=128, K=32 achieves the highest validation accuracy, with a degradation of only 1.17% compared to its full-resolution counterpart.
In addition, Table (6) reports the total kernel power usages and inference times for both GPU and FPGA implementations for batch sizes of 4, 8, 16, and 32. Table (6) demonstrates that as the batch size is increased, the inference time per image is notably decreased. We believe this is a direct result of increased parallelism.
For all networks, the FPGA implementations require less time to perform inference despite operating at a much lower operational frequency. Our DenseNet L=128, K=32 implementation on FPGA reduces the inference time per image compared to its GPU counterpart by 0.89 ms for a batch size of 32 while achieving the same validation accuracy, and by 2.86 ms compared to its baseline implementation time reported in Table (5). These results and their implications are discussed in Section IX-B.
Furthermore, to determine the class-specific performance of binarized networks on GPU and FPGA, a confusion matrix for the top-performing implementation, DenseNet L=128, K=32, is presented in Figure (9). Interestingly, the class-specific performance of binarized DenseNet varies significantly across weed species, while overall accuracy degrades by only 1.17%. Section IX-A provides further discussion of species-specific performance.
In addition, to investigate the hardware complexity of the implemented binarized networks on FPGA and compare it to the achieved inference and power consumption figures, the device utilization of all FPGA-accelerated networks was measured and is presented in Table (7).
From Table (7) it can be observed that the device utilization for each network architecture is weakly correlated with the batch size during inference. For all our implementations, as the batch size is increased, the Flip Flops, ALMs, and DSPs required for synthesis are increased and the maximum synthesizable frequency is decreased. Our best performing implementation, DenseNet L=128, with a batch size of 32, requires 46.58% more Flip Flops, 57.7% more ALMs, and 34.08% more DSPs than its counterpart with a batch size of 4.
TABLE 7. Device utilization (%) comparison for the implemented FPGA-accelerated BNNs adopting IPT. All reported hardware utilization numbers are expressed as percentages of the total available resources on the FPGA. Footnote 1: The maximum frequency for each implementation was extracted from the acl_quartus_report.txt report, generated by the Quartus Design Studio.
IX. FURTHER DISCUSSION
Here we provide further discussion of our implementation results and how they benefit the application of robotic weed control.
A. CLASSIFICATION PERFORMANCE
Tables (5) and (6) show that our baseline and binary implementations of the DenseNet architecture consistently outperform VGG-16 and WRN-28-10. With images drastically down-sampled to (3, 32, 32), our full-precision baseline DenseNet-128-32 implementation offers a validation accuracy of 90.1% on the DeepWeedsX dataset. This compares well with the ResNet-50 architecture used in [8], which achieves 95.7% accuracy on much larger images, 224 × 224 pixels in size.
However, validation accuracy is an unreliable sole metric due to the imbalanced nature of the DeepWeeds and DeepWeedsX datasets. In the application of robotic weed control, the goal is to maximize coverage of weed targets while limiting collateral off-target damage. As such, it is vitally important to consider the performance on each individual species. The confusion matrix of the baseline implementation of DenseNet, presented in Figure (8), is analogous to the classification performance in [8]. The species with the highest recall, ranging from 87-95%, are Lantana, Parkinsonia, Parthenium, Prickly acacia, Rubber vine, Siam weed and negative plant life. The species presenting the most difficult challenge for the network, with recall ranging from 80-82%, are Chinee apple and Snake weed. Because the model confuses 9% of Chinee apple as Snake weed, and 5% vice versa, we reason that their poor performance is due to the inherent similarity in the two plants' image features. The similarity between these two classes is made evident in Figure (10), which presents an example image of each class.
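To make the point about imbalance concrete, the following minimal Python sketch computes per-class recall and overall accuracy from a confusion matrix. The class names and counts are invented toy values, not the DeepWeeds results, and the convention that rows are true classes and columns are predictions is an assumption of this sketch.

```python
def per_class_recall(confusion, labels):
    """Return {label: recall}; rows are true classes, columns are predictions."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        recalls[label] = confusion[i][i] / row_total if row_total else float("nan")
    return recalls


def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical toy counts (NOT from the paper): a large negative class
# dominates two much smaller weed classes.
labels = ["weed_a", "weed_b", "negative"]
confusion = [
    [80, 15, 5],     # true weed_a
    [10, 85, 5],     # true weed_b
    [20, 30, 950],   # true negative
]

recalls = per_class_recall(confusion, labels)
accuracy = overall_accuracy(confusion)
print(recalls, accuracy)
```

In this toy case the overall accuracy is higher than the recall of either weed class, because the large negative class dominates the total. This is why a single accuracy figure can hide weak per-species performance.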
Fortunately, confusing one weed target for another is inconsequential in the application of robotic weed control. However, missed targets (i.e. false negatives) and off-target damage (i.e. false positives) are of great consequence, so we must examine the performance on the negative class. The baseline DenseNet architecture confuses between 3-11% of each species with the negative class. This constitutes a significant number of missed targets. Also, 5% of the negative class is falsely classified as positive species. The existence of false positives is assumed to be the result of the massive variation of plant life in the negative class.
Table (6) reveals that the best performing binarized implementation is again the DenseNet-128-32 architecture, which offers 88.9% average classification accuracy. Interestingly, the inter-species performance is vastly different from that of the baseline implementation, as shown by the confusion matrix in Figure (9). Our binarized implementation of the DenseNet architecture appears to have generalized the classification performance of the network across species. Confusions between species are more scattered and are no longer attributable to visibly discernible characteristics. The species offering the best performance, ranging from 83-88%, are Lantana, Prickly acacia, Siam weed and Snake weed, while the species presenting the most difficult challenge, ranging from 74-81%, are Chinee apple, Parkinsonia, Parthenium and Rubber vine. We believe that the extreme quantization of weights to binary states prevented less-distinctive features from being extracted, causing a degradation in validation accuracy for weed species with a large number of features. Species with distinctive features, or lack thereof, demonstrated similar accuracies to our baseline implementations. For example, in both our best performing implementations the negative class is strongly classified at 95% with 5% false positives.
Tables (5) and (6) also show that the performance of all binarized implementations is slightly worse than that of their full-precision baseline counterparts. This result confirms what is seen in the literature. However, the real-time performance of these binarized networks is far greater than that of their full-precision counterparts and presents a fruitful tradeoff for the application of robotic weed control, as discussed below.
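Since confusions between weed species do not matter for spraying, the missed-target and off-target rates discussed above can be obtained by collapsing a multi-class confusion matrix into a target versus non-target decision. The Python sketch below illustrates this under stated assumptions: rows are true classes, columns are predictions, and the counts are invented toy values rather than the paper's results. The function name `spray_error_rates` is hypothetical.

```python
def spray_error_rates(confusion, negative_index):
    """Collapse a multi-class confusion matrix into spray-decision error rates.

    Returns (missed_target_rate, off_target_rate):
      missed_target_rate: fraction of true weeds predicted as the negative class
      off_target_rate:    fraction of true negatives predicted as any weed class
    Weed-to-weed confusions are ignored, since either way the plant is sprayed.
    """
    weed_total = 0
    missed = 0
    for i, row in enumerate(confusion):
        if i == negative_index:
            continue
        weed_total += sum(row)
        missed += row[negative_index]

    negative_row = confusion[negative_index]
    negative_total = sum(negative_row)
    off_target = negative_total - negative_row[negative_index]

    return missed / weed_total, off_target / negative_total


# Hypothetical toy counts (NOT from the paper); the last class is "negative".
confusion = [
    [80, 15, 5],
    [10, 85, 5],
    [20, 30, 950],
]
missed_rate, off_target_rate = spray_error_rates(confusion, negative_index=2)
print(missed_rate, off_target_rate)
```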
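As a reminder of where this accuracy and speed tradeoff comes from, the following Python sketch shows deterministic binarization in its commonly used sign-based form. It is an illustrative sketch only: mapping zero to +1 and the function names are assumptions here, and the networks evaluated in this paper may differ in detail.

```python
def binarize_deterministic(weights):
    """Map each real-valued weight to +1 or -1 by its sign (zero maps to +1)."""
    return [1 if w >= 0 else -1 for w in weights]


def binary_dot(activations, binary_weights):
    """Dot product with +/-1 weights: only additions and subtractions are needed."""
    total = 0.0
    for a, wb in zip(activations, binary_weights):
        total += a if wb == 1 else -a
    return total


weights = [0.42, -0.07, 0.0, -1.3]                # hypothetical full-precision weights
binary_weights = binarize_deterministic(weights)  # [1, -1, 1, -1]
output = binary_dot([0.5, 2.0, 1.0, 0.25], binary_weights)
print(binary_weights, output)
```

The magnitude of each weight is discarded, which is consistent with the loss of less-distinctive features described above. At the same time, the multiply-accumulate operations reduce to additions and subtractions, which is what makes such networks attractive for FPGA implementation.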
B. REAL TIME PERFORMANCE
Tables (5) and (6) show that every novel binarized implementation presented here significantly outperforms its GPU baseline implementation in terms of inference time and required power, with only a slight degradation in validation accuracy. The WRN architecture offers the fastest inference engine, with a 1.018 ms inference time when implemented on an FPGA. DenseNet-128-32, our most accurate architecture, also offers a significant increase in speed, with an average inference time of 1.539 ms per image.
In addition to binarizing the full-precision architectures and accelerating them on FPGA hardware, two further methods of improving real-time performance were investigated: presenting a smaller image size to the network, and reducing the amount of pre-processing or image augmentation performed. Table (3) reveals that down-sampling the image size from (3, 224, 224) to (3, 32, 32) only slightly decreases the average validation accuracy, from 94.24% to 90.1% for DenseNet, 93.04% to 86.7% for VGG-16 and 95.85% to 88.9% for WRN. The significantly smaller image size of (3, 32, 32) is a major reason these networks perform inference faster than the (3, 224, 224) shaped architectures in [8]. Table (3) also shows that applying further image preprocessing techniques offers no advantage in classification performance for the down-sampled images used here. In fact, we observed that validation accuracy is degraded by an average of over 4% for the binary implementations when images are down-sampled below (3, 224, 224) and further image augmentations are applied. This suggests that our FIPT are ill-suited to low-resolution images and can only be beneficial to high-resolution images, which we do not use here. However, as discussed in Section VII, these results are expected to improve significantly if further investigation of image preprocessing techniques and/or hyper-parameter optimization is performed for each network and each input image size.
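The absolute and relative speed figures quoted in this paper can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that the 1.539 ms DenseNet-128-32 FPGA time is the same configuration to which the reported 0.89 ms and 2.86 ms reductions apply. That assumption is ours; the authoritative per-configuration values are those in Tables (5) and (6).

```python
# Figures quoted in the text (DenseNet-128-32).
fpga_ms = 1.539                  # FPGA inference time per image
saving_vs_binary_gpu_ms = 0.89   # reduction relative to the binarized GPU implementation
saving_vs_baseline_ms = 2.86     # reduction relative to the full-precision GPU baseline

# Implied times under the assumption stated above.
binary_gpu_ms = fpga_ms + saving_vs_binary_gpu_ms
baseline_ms = fpga_ms + saving_vs_baseline_ms

# Relative speedup of the FPGA implementation over the full-precision baseline.
speedup = baseline_ms / fpga_ms

print(f"implied binarized GPU time: {binary_gpu_ms:.3f} ms")
print(f"implied baseline GPU time:  {baseline_ms:.3f} ms")
print(f"speedup over baseline:      {speedup:.2f}x")
```

Under this assumption, a 2.86 ms reduction corresponds to a speedup of roughly 2.86 times, so the absolute and relative figures quoted in the paper agree with one another.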
Let us consider the use case of the prototype agricultural robotic spot-sprayer, AutoWeed, first introduced in [8]. Its optical system comprises four FLIR Blackfly 23S6C high-resolution cameras, each providing a 450 × 280 mm field of view with a maximum data rate of 41 fps. This allows a threshold of at most 100 ms total processing time per image per camera for the selective spot sprayer to operate at the target speed of 10 km/hr. This imposes a required frame rate of at least 10 fps per camera to achieve target real-time performance. With simultaneous image acquisition from four cameras, batch sizes of between 4 and 32 were considered for our above implementations. The lower limit of 4 corresponds to processing one frame at a time from each camera, while the upper limit of 32 corresponds to waiting for the acquisition of up to 8 frames from each camera before processing them in a batch to obtain the apparent increase in per-image inference speed.
With an average inference time of 1.539 ms when implemented on an FPGA, the 450 × 280 mm field of view of our specific edge device can be processed fast enough to achieve a frame rate of over 600 fps, far exceeding the real-time requirement for one camera at 10 km/hr. This efficiency would also allow the robot to operate at a much higher vehicle speed to yield more efficient performance in the agricultural domain.
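The 100 ms budget and the 10 fps requirement follow from the field of view and the vehicle speed. The short Python sketch below reproduces that arithmetic, assuming that the 280 mm dimension of the field of view lies along the direction of travel and that consecutive frames should not leave gaps on the ground. Both are assumptions of this sketch, and the function name is hypothetical.

```python
def processing_budget_ms(fov_along_travel_mm, speed_kmh):
    """Time for the vehicle to travel one field-of-view length, in milliseconds."""
    speed_mm_per_ms = speed_kmh * 1e6 / 3.6e6   # km/h to mm/ms (numerically equal to m/s)
    return fov_along_travel_mm / speed_mm_per_ms


budget_ms = processing_budget_ms(fov_along_travel_mm=280, speed_kmh=10)
required_fps = 1000.0 / budget_ms

print(f"time budget per image: {budget_ms:.1f} ms")    # about 100.8 ms
print(f"required frame rate:   {required_fps:.2f} fps")  # about 9.92 fps
```

Rounding these values gives the threshold of roughly 100 ms per image and the requirement of approximately 10 fps per camera stated above.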
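The headroom implied by these numbers can be estimated as follows. This Python sketch considers inference time only and assumes that frames from the four cameras are processed one after another at the quoted 1.539 ms per image. In practice, the per-image time depends on the batch size, and image acquisition and actuation add further latency.

```python
inference_ms = 1.539   # average FPGA inference time per image (from the text)
budget_ms = 100.0      # per-image processing budget at 10 km/hr (from the text)
cameras = 4            # number of cameras on the robot (from the text)

# Images per second if the accelerator did nothing but inference.
throughput_fps = 1000.0 / inference_ms

# Time to classify one frame from every camera, and how many times
# that fits into the real-time budget.
four_camera_ms = cameras * inference_ms
headroom = budget_ms / four_camera_ms

print(f"throughput:           {throughput_fps:.0f} fps")
print(f"one frame per camera: {four_camera_ms:.3f} ms")
print(f"headroom vs. budget:  {headroom:.1f}x")
```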
Compared to existing full-precision architectures and their power-hungry GPU implementation [8], the low-power and high-speed inference engines presented here offer an attractive tradeoff, exchanging slightly worse classification performance for greatly increased speed and power efficiency. This tradeoff will allow researchers and developers to solve the speed and power inefficiencies in applications of precision agriculture, like robotic weed control, with software instead of hardware. The proposed binarized solutions can also be generalized to improve the efficiency of edge computing more broadly, where small amounts of accuracy can be traded for large improvements in speed and power.
X. CONCLUSION
In this paper, we are the first to investigate the acceleration of binarized DNNs on GPUs and FPGAs using the high-level OpenCL framework for weed species classification targeted toward edge computing applications and robotic weed control. We investigated the performance degradation exhibited when down-sampling input images, and demonstrated that significantly reducing the image resolution has a marginal effect on validation accuracy. After thoroughly comparing efficient implementations on GPU and FPGA platforms, we were able to achieve a >7-fold decrease in power consumption and perform inference on weed images 2.86 times faster, while degrading validation accuracy by only 1.17% on our newly labeled and publicly available dataset. Finally, we provided further discussion pertaining to species-specific classification performance and real-time performance implications for robotic weed control. The implemented networks demonstrated here represent ideal candidates for future implementations in edge computing devices and precision agricultural robots.
