Abstract: Recent advances in photonics and imaging technology allow the development of cutting-edge, lightweight hyperspectral sensors, both push-broom/line-scanning and snapshot/frame. At the same time, emerging applications in robotics, food inspection, medicine and earth observation pose critical challenges for real-time processing and computational efficiency, in terms of both accuracy and power consumption. In this direction, we accelerate hyperspectral processing kernels on FPGAs, specifically the Zynq-7000 SoC, to perform similarity-based matching of spectral signatures. We propose a custom HW architecture based on multi-level parallelization, modularity, and parametric VHDL coding, which allows for in-depth design space exploration and trade-off analysis. Depending on configuration, our implementation processes 22−107 Megapixels per second, providing an acceleration of 40−355x vs Intel-i3 CPU and 360−10^4x vs the embedded ARM Cortex A9, whereas the overall detection quality ranges from 56% to 97% when evaluated with multiple objects and images of 285 spectral channels.
I. INTRODUCTION
Hyperspectral imaging has gained ground in applications ranging from satellite or airborne remote sensing, to industrial quality control, quality assessment and food inspection. The key advantage of this technology lies in the rich information content of the acquired images, which improves the ability to classify objects in a scene based on their spectral properties. The reflectance, transmittance, and emittance properties of materials are measured and observed using spectral response curves, which depict proportions of reflected, transmitted, or emitted electromagnetic radiation as a function of wavelength. The unique characteristics of these curves are used as signatures/fingerprints to identify features remotely.
Currently, cutting-edge hyperspectral imaging sensors onboard satellite, aerial, UAV, or terrestrial platforms are generating nearly continual streams of high-dimensional and high-resolution data. They enable the acquisition of hundreds of images in contiguous narrow spectral bands (bandwidth <30nm), typically in the visible (VIS), near-infrared (NIR), shortwave (SWIR), mid (MID) and longwave (LWIR) infrared. The number of recorded channels per pixel varies from 3 (as in conventional RGB images) to a few hundred (e.g., 285 in APEX [1], or more). In order to exploit these huge datasets efficiently and address real-time requirements, advanced onboard processing algorithms and platforms are a prerequisite.
To this end, hardware accelerators like Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) are the main tools for significantly increasing the computational performance [2]. However, when high-speed applications are considered onboard satellites (e.g., micro/nano constellations), lightweight drones, UAVs and terrestrial robotic platforms, where the power consumption and the weight/size of the processing system impose critical constraints, FPGAs are employed due to their comparative advantages [2], [3].
FPGAs prevail in embedded applications due to their relatively low power dissipation and increased performance per Watt. Their advantages are due to their internal structure, i.e., to a huge network of small processing/storage elements (such as LUTs and DFFs, considerably smaller than GPU cores and VLIW units), which can be configured/connected together arbitrarily to create bigger components to execute complex algorithms. This fine-grain structure allows for very deep pipelining and parallelization at multiple levels. In addition, the FPGA's I/O flexibility allows for custom, high-rate communication with sensors recording large amounts of data. Therefore, the FPGA is ideal for examining on-the-fly numerous hyperspectral pixels, against a plethora of spectral signatures, to search for multiple objects/materials in parallel. Furthermore, today's System-on-Chip devices, like Zynq [4], facilitate the HW/SW co-design of more sophisticated solutions by tightly coupling an embedded processor (PS) with conventional FPGA fabric (PL), which allows us to run lightweight SW code next to custom HW accelerators.
In the current paper, we exploit the inherent parallelization of hyperspectral pixel matching and Zynq FPGAs to accelerate straightforward matching kernels by 40−355x vs Intel-i3 CPU and up to 10^4x vs the embedded ARM Cortex A9. Such immense speedup factors can prove of utmost importance in applications relying on fast decision making, e.g., in automated food inspection, simultaneous localization and mapping, or obstacle avoidance onboard UAVs; in such cases, the hyperspectral sensors record millions of pixels and feed algorithms striving to detect distinct materials at very high speeds. To achieve such high throughput, in the area of tens to hundreds of Megapixels per second, we propose a parametric HW architecture with parallelism at multiple levels: pixel, signature, channel, and deep pipelining. The architecture continuously inputs new pixels and compares them to several stored spectral signatures, e.g., 16 to 128. Our modular design supports the use of various matching metrics, e.g., L1, L2, or χ², while our parametric VHDL allows for customization and design space exploration to benefit from the underlying accuracy-cost-speed tradeoffs.
II. RELATED WORK
Due to its non-destructive nature, hyperspectral imaging has proved very useful in the food industry to facilitate effective inspection of quality and safety [5]; it improved the detection accuracy for a wide range of food products, e.g., 89%-100% for meat. In such systems, advancements in image processing facilitate classification at more than 100fps for a 256-channel camera [6]. In medical applications, hyperspectral imaging has contributed significantly to disease diagnosis and surgery guidance during the last decade, mainly in the ultraviolet (UV), visible (VIS), and near-infrared (NIR) regions [7]. Overall, various processing steps are employed, including data compression, radiometric and geometric calibration, denoising, pansharpening, etc. [2], [3], [5], or even reduction of the data set size or unmixing (because typical reflectance datasets/hypercubes contain millions of similar spectra).
FPGA acceleration proves to be a key solution in the literature towards increasing the capabilities of an image processing system. The resulting speedup factors are impressive and essential to the application, i.e., 1 order of magnitude faster execution when compared to high-end desktop CPUs and 2 or more orders when compared to embedded CPUs, whereas the power dissipation is 1 order of magnitude lower than desktop CPUs and GPUs. In particular, the authors in [8] propose a CPU-FPGA implementation for high-performance face detection with conventional images; they utilize an Altera Stratix-V A7 FPGA (∼55% Logic Elements and ∼50% M20K RAM) to achieve 30x speedup compared to an Intel Core i5-4590 CPU. The authors in [9] develop a pipelined architecture for traffic sign classification; it processes 241.7 Megapixel/sec (3-channel pixels, RGB) at 241.7 MHz by utilizing 8.2K LUTs and 11.8K DFFs on a Xilinx Zynq ZC706, whereas it achieves 106x speedup vs SW execution on Intel Core i5 CPU.
In image processing applications involving more channels, i.e., multi- or hyper-spectral, FPGAs are also gaining ground. The authors in [10] have developed a system for automatically detecting targets in remotely sensed data; a parallel implementation on a Virtex-7 FPGA utilizes 133K LUTs (31%) and 2874 DSPs (80%) to detect 19-30 target pixels with 224 channels, for two different images, and achieves a speedup of up to 4.6x compared even to a high-performance many-core CPU like the 14-core Intel Xeon E5-2695. For 55-channel pixels and straightforward matching techniques, like cross-correlation, the authors in [11] utilize a Virtex5 to achieve 145.8 Mpixel/sec per signature and ∼600x speedup against MATLAB on an Intel Pentium Dual Core CPU. In [12], the detection system achieves 0.2 Mpix/sec on XC7K325T FPGA and 63x speedup vs MATLAB on Intel Core2 Quad CPU.
III. PROPOSED ARCHITECTURE
The first step towards developing an efficient HW/SW architecture is to identify the most computationally intensive part of the hyperspectral algorithm and accelerate it on HW. In the current paper, we assume a generic algorithmic approach, which is based on a highly repetitive pixel-by-pixel matching of the input image to a set of a-priori known signatures. Most often, such a matching kernel is 1-3 orders of magnitude more intensive than other functions of the algorithm, like decubing, denoising, or simplification, which build on top of the matches and operate on a significantly smaller set of data. Hence, we propose a HW architecture to accelerate a generic class of techniques for matching. The class is characterized by a straightforward approach to calculate some form of distance/similarity between pixels according to their channel-by-channel difference/correlation, e.g., L1, L2, χ², normalized cross-correlation, spectral angle, etc. As representative examples, to account for various complexities, we focus on the three metrics of Eq. 1 (where x denotes a pixel, y is a signature, both of C total channels), namely L1, L2, and χ².
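Since the three metrics recur throughout the paper, a minimal software sketch of them may help. The forms below are the commonly used definitions of L1, L2 (kept squared, since the square root does not change which signature wins) and χ²; they are assumptions and not a transcription of Eq. 1, and all function names are hypothetical.

```python
# Sketch of the three matching metrics under assumed, commonly used definitions.
# x is a pixel and y is a signature, both sequences of C channel values.

def l1(x, y):
    """Sum of absolute channel-by-channel differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    """Sum of squared differences (square root omitted: same argmin)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def chi2(x, y):
    """Chi-square distance; channels with a + b == 0 are skipped (assumption)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b != 0)

def best_match(pixel, signatures, metric=l1):
    """Index of the stored signature that minimizes the chosen metric."""
    return min(range(len(signatures)), key=lambda s: metric(pixel, signatures[s]))
```

For example, `best_match([10, 20, 30], [[0, 0, 0], [9, 21, 30], [30, 20, 10]])` selects the second signature (index 1) under L1. The multiplier in L2 and the divider in χ² are what drive the different hardware costs discussed in Section IV.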
A. HW Optimization and Multi-level parallelization
The component developed in HW inputs pixels successively (e.g., C channels per clock cycle) and outputs an ID signifying the best matched signature for each pixel. The best match is detected internally when one of the functions of Eq. 1 is minimized while comparing to a set of signatures stored in ROM. To exploit the highly parallel structure of FPGAs, we develop a HW architecture parallelized at 3 distinct levels:
1) pixel-parallelism, i.e., process multiple pixels x, concurrently, by instantiating multiple identical engines (1 per pixel).
2) signature-parallelism, i.e., compare one pixel x to multiple signatures y, concurrently, instead of searching sequentially.
3) channel-parallelism, i.e., calculate multiple partial sums of Eq. 1 concurrently, by parallelizing over multiple (x_i, y_i) pairs.
In a bottom-up approach, in level 3 we focus on Eq. 1 to design a fast "metric calculator", whereas level 2 builds on top of 3 to replicate our "metric calculator" in multiple, coordinated, instances (Fig. 1). Placed at the core of the architecture, the "metric calculator" is designed as a fine-grained pipeline, which can be replaced in a modular fashion by any function of Eq. 1. The parallelism of the calculator is achieved by instantiating multiple, P_C, individual (x_i, y_i) pair processing units (each of pipeline depth 5 to 28 depending on the utilized adder/multiplier/divider chain), which feed an adder tree of depth log2(P_C) feeding the output accumulator.
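The channel-parallel structure described above can be mimicked in software as a behavioural sketch. It assumes the L1 metric, a power-of-two number of pair units, and C divisible by that number; it models only the dataflow (pair units, adder tree, accumulator), not pipeline registers or timing, and the function names are hypothetical.

```python
# Behavioural model of a channel-parallel "metric calculator" (L1 assumed).

def adder_tree(values):
    """Reduce a power-of-two-length list pairwise; return (sum, tree depth)."""
    n = len(values)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    depth = 0
    while len(values) > 1:
        # One tree level: add neighbouring pairs.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        depth += 1
    return values[0], depth

def metric_calculator(x, y, p_c):
    """Return (metric value, number of P_C-wide steps) for pixel x, signature y."""
    assert len(x) == len(y) and len(x) % p_c == 0
    acc, steps = 0, 0
    for start in range(0, len(x), p_c):
        # p_c pair units work on p_c channel pairs at once.
        terms = [abs(a - b) for a, b in zip(x[start:start + p_c], y[start:start + p_c])]
        partial, _ = adder_tree(terms)
        acc += partial          # output accumulator
        steps += 1
    return acc, steps
```

With C = 8 channels and 4 pair units, the model takes C/4 = 2 steps per signature, which matches the C/P_C cycle count used in the text, and the adder tree has depth log2(4) = 2.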
With signature parallelism, Fig. 1 depicts P_S identical "metric calculators" connected to P_S independent ROMs storing disjoint subsets of our S signatures. Each pixel entering the engine is broadcast to all P_S calculators, each of which outputs a new metric value every C/P_C cycles (one for each signature examined from its ROM). In a synchronized fashion (via simple counter-based control logic), the results enter a "decision tree" to select the best match after C×S/(P_C×P_S) cycles. As shown in Fig. 2 for P_S = 8, the new metrics enter a tree of comparators augmented with a tree of control registers to keep track of the winning candidate. More specifically, at each level of the tree, the left/right winner of each comparison is marked by 0/1 and forwarded to a register at the next node. After the final node, we compare the current winner of 8 to the best candidate so far. In case of a new best, we concatenate the value of a running counter (practically, the current ROM address) to the 3 spatial marks (practically, the ROM module of the current winner) to form a final pointer to the best match. This time-space multiplexed indexing adapts to C, S, P_C, P_S and provides an ID, which is consistent with our signature list.
Fig. 1. Signature-level parallelism in a Pixel-Signature matching engine
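To illustrate the time-space indexing, the following sketch models the comparator tree and the pointer formation in software. Several details are assumptions made only for illustration: signatures are interleaved across the ROMs (signature k sits in ROM k mod P_S at address k div P_S), the counter forms the high bits and the spatial marks the low bits of the ID, ties favour the left input, L1 is the metric, and P_S is a power of two. All names are hypothetical.

```python
# Behavioural model of the decision tree and best-match pointer (assumed layout).

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def comparator_tree(metrics):
    """Return (minimum value, spatial marks); marks encode left=0 / right=1 per level."""
    candidates = [(m, 0) for m in metrics]
    level = 0
    while len(candidates) > 1:
        nxt = []
        for i in range(0, len(candidates), 2):
            (lv, lm), (rv, rm) = candidates[i], candidates[i + 1]
            # Right winner sets this level's mark bit; left winner keeps it at 0.
            nxt.append((rv, (1 << level) | rm) if rv < lv else (lv, lm))
        candidates = nxt
        level += 1
    return candidates[0]

def build_roms(signatures, p_s):
    """Distribute signatures over p_s ROMs: signature k -> ROM k % p_s, address k // p_s."""
    return [signatures[m::p_s] for m in range(p_s)]

def match_pixel(pixel, roms, metric=l1):
    """Scan all ROM addresses; return the ID of the best-matching signature."""
    mark_bits = len(roms).bit_length() - 1      # log2(P_S), e.g. 3 for P_S = 8
    best_val, best_id = None, None
    for addr in range(len(roms[0])):            # running counter = ROM address
        val, marks = comparator_tree([metric(pixel, rom[addr]) for rom in roms])
        if best_val is None or val < best_val:
            best_val, best_id = val, (addr << mark_bits) | marks
    return best_id
```

Under this layout, the concatenated pointer equals the position of the winning signature in the original list: with 16 signatures over 8 ROMs, a winner at address 1 of ROM 5 yields ID (1 << 3) | 5 = 13.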
On top of the aforementioned architecture, we apply optimizations at implementation-level (both area- and time-wise):
• to increase frequency, we custom place registers along all paths and balance our deep pipelines (e.g., assuming 20-bit numbers, we use 4 stages for adders/comparators, 5 for multipliers, 22 for dividers, and place DFFs between all tree levels).
• careful programming of the control priorities of registers (reset, set, enable) to minimize the logic levels (LUT chains), as well as remove redundant controls (reset) from replicated processing units to decrease resources (up to 50%, combined).
• manual tuning of synthesis/implementation directives (Xilinx ISE/Vivado, e.g., register duplication to handle large fan-out).
B. Parameterization/Exploration and System Integration
The aforementioned architecture was developed with parametric VHDL to facilitate design space exploration with respect to: parallelism (pixels, P_C, P_S), employed metric, C, S, and datapath bits (internal accuracy). Such parameterization provides a multitude of configurations allowing the designer to adapt to the application needs and/or utilize all available resources (see Section IV). The implemented kernel can be integrated in a HW/SW system by employing the AXI4 channels of Zynq to communicate with the CPU running the SW parts of the algorithm. The maximum PS-PL bandwidth over AXI4 on Zynq-7000 is 12 Gbps [4]. Notice that, to avoid under-utilizing the HW resources, we should limit our HW parallelism so that the kernel throughput matches this communication bottleneck: e.g., for 10-bit 64-channel pixels, the actual throughput cannot exceed 18.7 Megapixel/sec.
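The communication bound quoted above follows from dividing the link bandwidth by the number of bits per pixel. A small sketch of that arithmetic is given below; it assumes no protocol overhead on the link, and the function name is hypothetical.

```python
# Upper bound on pixel rate imposed by the PS-PL link (protocol overhead ignored).

def max_pixel_rate(bandwidth_bps, bits_per_channel, channels):
    """Pixels per second that the link can deliver for the given pixel format."""
    return bandwidth_bps / (bits_per_channel * channels)

# 12 Gbps link, 10-bit channels, 64 channels per pixel.
rate = max_pixel_rate(12e9, 10, 64)
print(f"{rate / 1e6:.2f} Megapixel/sec")
```

For 12 Gbps and 640 bits per pixel this gives 18.75 Megapixel/sec, in line with the 18.7 Megapixel/sec limit stated in the text.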
IV. RESULTS AND TRADE-OFF ANALYSIS
We evaluate the proposed HW architecture and perform a design space exploration considering cost, acceleration, and matching accuracy. For benchmarking, we use the dataset of APEX (Airborne Prism Experiment, developed for ESA) [1], which was acquired with a high resolution camera of 285 spectral channels covering a range of 413nm - 2421nm (Fig. 3). We test 6 sub-images, each containing a distinct object of interest marked by hand as groundtruth (roads, water, etc.). The HW architecture was implemented on Zynq 7Z045-1 FPGA, whereas the SW execution was tested on ARM Cortex A9 embedded CPU (667MHz) and Intel Core i3-3110M CPU.
Fig. 3. APEX dataset and example objects (water, courts, field, trees)
A. FPGA Performance Evaluation
The most indicative results of our design space exploration on FPGA are summarized in Table I, which includes 8 HW kernels of distinct input and configuration. The 10-bit channels vary from C=16 to C=256, per pixel, and the signatures vary from S=32 to S=128. Channel parallelism P_C varies from 16 to 64, signature parallelism P_S varies from 8 to 32, and pixel parallelism is 1. Overall, the logic resource utilization depends mostly on the product P_C×P_S and the metric (L1, L2, or χ²). For each metric, separately, the resources increase almost proportionally to P_C×P_S. Compared to L1, the use of L2 or χ² increases the DSP utilization by P_C×P_S (1 DSP per MULT), while χ² also increases the LUT-DFF cost by more than 5x due to the pipelined dividers. Such a huge cost increase severely limits the room for parallelism in the FPGA (e.g., to only P_C×P_S = 32×8 for χ²).
The achieved throughput (Megapixels/sec, col. 8, Table I) depends on the maximum clock frequency (col. 7) and the ratio of requested work over processing units, i.e., C×S/(P_C×P_S) (the number of cycles between new pixels entering our pipelined kernel). The achieved Mpix/sec ranges from 22 to 107 and can increase further via full parallelization, depending on the application. We compare total execution time to SW execution on CPU, e.g., for a 1-Megapixel image; col. 9, Table I, shows that the speedup of our HW kernel, alone, ranges from 40x to 355x vs Intel Core-i3. Put into perspective, for C×S = 16×64, while Zynq completes an image in 13 msec (with realistic 12 Gbps PS-PL bandwidth), the single-threaded SW (gcc -O3) requires 0.75 to 3.59 sec on Intel Core-i3 and 6.6 to 109 sec on ARM Cortex-A9 (time depends on metric; the speedup against A9 is in the area of 600x and increases even to 10^4x for χ²).
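The throughput relation described above can be written out as a short calculation. The sketch below assumes a fully pipelined kernel that accepts one pixel every C×S/(P_C×P_S) cycles and ignores any I/O limit; the function names are hypothetical, and the example values are those of the χ² configuration quoted in Section IV-C (C×S = 64×128, parallelism 32×8, 357 MHz).

```python
# Throughput model: clock frequency divided by the cycles between consecutive pixels.

def cycles_between_pixels(c, s, p_c, p_s):
    """Requested work (C*S channel-signature pairs) over available pair units."""
    return (c * s) / (p_c * p_s)

def throughput_mpix(f_mhz, c, s, p_c, p_s):
    """Megapixels per second sustained by the kernel at clock f_mhz."""
    return f_mhz / cycles_between_pixels(c, s, p_c, p_s)

cycles = cycles_between_pixels(64, 128, 32, 8)      # 8192 pairs over 256 units
print(cycles, round(throughput_mpix(357, 64, 128, 32, 8), 1))
```

This configuration needs 32 cycles per pixel, so 357 MHz yields roughly 11.2 Megapixel/sec, which agrees with the rate reported for that configuration.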
B. Quantitative Evaluation of the Similarity-based Matching
The accuracy of HW matching was evaluated with the standard measures [13] of correctness, completeness, overall quality, and a weighted measure (like completeness, but false positives decrease the numerator). Table II reports the most representative of our results for all three matching metrics and two signature ROM scenarios. In the first scenario, we store a signature by selecting a pixel of the original image to represent our object. In the second, we store 4 pixel-signatures per object, and matching to any one of them signifies detection. The results show that examining multiple signatures per object consistently improves the overall quality, even compared to a mean signature (the average of multiple pixels, which can also be included in the set of 4). This finding justifies the concept of exploiting FPGA parallelization to examine numerous signatures on-the-fly (multiple signatures, for multiple objects). Furthermore, the accuracy improves with χ²; however, the gain is smaller than expected, especially if we consider a post-processing phase masking small errors (e.g., majority voting, class aggregation).
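For readers unfamiliar with these measures, the sketch below computes them from pixel counts of true positives (TP), false positives (FP) and false negatives (FN). The first three follow the definitions commonly used in the object-extraction literature; the weighted measure is only one possible reading of the description above (completeness with FP subtracted from the numerator) and should be treated as an assumption. The function name is hypothetical.

```python
# Detection measures from pixel counts; the "weighted" formula is an assumed reading.

def detection_scores(tp, fp, fn):
    """Return correctness, completeness, quality and the assumed weighted measure."""
    return {
        "correctness": tp / (tp + fp),        # share of detections that are right
        "completeness": tp / (tp + fn),       # share of the object that is found
        "quality": tp / (tp + fp + fn),       # penalizes both error types
        "weighted": (tp - fp) / (tp + fn),    # assumption: FP reduce the numerator
    }

scores = detection_scores(tp=80, fp=10, fn=20)
print({name: round(value, 3) for name, value in scores.items()})
```

With 80 true positives, 10 false positives and 20 false negatives, completeness is 0.8 while overall quality drops to about 0.727, which shows why the quality measure is the stricter summary.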
C. Implementation Trade-offs and Comparison to Prior Works
Combining the above results leads to useful conclusions. On one hand, employing more complicated metrics requires 3-8x more logic resources than L1 (assuming LUT-only implementations) to achieve similar throughput; however, the accuracy gain is questionable, especially for L2. On the other hand, using 4 signatures per object requires 4x more resources/parallelism to achieve high throughput; however, the accuracy gain is consistent, in almost all cases, and reaches up to 14%. Therefore, multi-signature L1 configurations should be preferred (with S varying among objects, e.g., S=1 for water).
Compared to similar works in the literature, the proposed architecture proves to be more hardware-efficient and fast. In particular, comparing to [11] and assuming a configuration with χ², C×S = 64×128, and P_C×P_S = 32×8, we achieve a processing rate of 11.2 Mpixel/sec on Zynq (with 81K LUTs at 357MHz, and 7.2 Gbps PS-PL bandwidth), whereas [11] achieves only 1.14 Mpixel/sec on Virtex5 (to loop over all signatures, with 13.4K LUTs at 142MHz). To make the comparison even more fair/accurate, we reduce our clock frequency to Virtex5 technology (251MHz) and compute the hardware efficiency as the ratio of throughput over LUT6 resources: our efficiency is still 14% higher than [11] (97 vs 85) and increases by approx. 4x when employing the L1 kernel after our trade-off analysis (with limited loss of accuracy). Furthermore, compared to [12], we achieve approx. 100x higher throughput (at P_C×P_S = 64×8 and, for fairness, C×S = 128×64 with L2) while using FPGAs of similar size.
V. CONCLUSION
We proposed a highly-parallel parametric architecture to detect on-the-fly numerous hyperspectral signatures/pixels via similarity-based matching. Implemented on Zynq XC7Z045 FPGA, the HW throughput increases almost proportionally to cost to sustain 22−107 Megapixels per second with various matching metrics, e.g., L1, L2, χ², by consuming 8−82K LUTs. Depending on configuration, the HW speedup is 40−355x vs SW execution on Intel Core-i3 and 360−10^4x vs ARM Cortex-A9. Evaluated with the APEX dataset, the FPGA provides overall detection quality of 56−97% depending on image and object. Our accuracy-speed-cost exploration showed that the most efficient metric for straightforward matching is L1 and, also, that using multiple signatures per object consistently improves the L1 detection accuracy by up to 12%.
