Remotely sensed hyperspectral images provide data of the earth's surface components. The data provided is collected through airborne devices such as satellites with the capability to collect large amounts of data to be sent to ground stations for processing. The main disadvantage of this scenario is the limited bandwidth connection between the airborne devices and the ground station on Earth which affects the information sending and real time processing. A possible solution is to include an on-board data processor. Field-Programmable Gate Arrays (FPGAs) are excellent target platform that allows the design reconfigurability, powerful computing and high performance levels. One of the most commonly used techniques in hyperspectral data analysis is linear spectral unmixing. In the last decade, L 1/2 sparsity constrained Nonnegative Matrix Factorization (NMF), a linear spectral unmixing algorithm, and its extensions have been heavily studied to unmix the hyperspectral images and recover their material spectra. L 1/2 regularizer is proven to have much better results in terms of sparsity and accuracy than other regularizers yet, to the best of our knowledge, has not been implemented. In this paper, we present an FPGA design for the L 1/2 sparsity constrained NMF (L 1/2 -NMF) algorithm. The proposed design is tested on both synthetic and real data sets and implemented on Altera Family FPGAs. Implementation results show that the proposed design successfully unmixes the data with maximum frequency of 52.6 MHz and a speedup factor of 3.9 for the synthetic data set and a frequency of 104.32 MHz and a speedup factor of 1.14 for the real data set. The implementation results are compared to the simulation results and ground truth signatures using Spectral Angular Distance (SAD) measure. Calculations show that the implementation results have comparable SAD values to the simulation results.
I. INTRODUCTION
The launch of earth observation satellites has enhanced the field of remote sensing considerably in the past few years. Remotely sensed images cover many applications such as mineral exploration, military surveillance and environmental monitoring [1] . A common problem that surfaces is the existence of mixed pixels in the captured image. This problem occurs due to the low spatial resolution of the sensor and the captured surface variability when capturing a hyperspectral image [2] . This problem is handled via spectral unmixing The associate editor coordinating the review of this manuscript and approving it for publication was Xian Sun . algorithms. The spectral unmixing algorithms aim to decompose a mixed pixel into a collection of material spectra named end-members and each end-member's contribution in the pixel which is named the fractional abundance [3] . The idea of spectral unmixing is explained more in figure 1 [4] .
Mainly, two types of spectral mixture models are used to describe the mixture in a pixel: linear and non linear mixing models [5] . In the literature, most of the existing approaches that solve the unmixing problem are based on the linear mixing model [4] . The linear mixing model based approaches are classified into geometrical and statistical approaches [6] . The geometrical approaches are based on the fact that the pixel observations of a hyperspectral image are confined within VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ FIGURE 1. Explanation of spectral unmixing concept [4] .
a simplex and its vertices represent the end-members [7] . Finding the simplex vertices is the solution to the unmixing problem.
In [8] , the authors introduce the Pixel Purity Index (PPI) algorithm which aim to find the purest pixel through repeated projection onto a random unit vector and finding the endmembers through automatic unmixing [9] . In [10] , the author introduces the N-FINDR algorithm which targets finding the maximum volume simplex inscribed within the data set through non-linear inversion. In [11] , the authors propose Automated Morphological End-member Extraction (AMEE) which tries to find the pure pixels in the data through combining the spatial and spectral information of the data. The authors then use mathematical morphology, a technique applied to the spectral domain while reserving the spatial characteristics of the data. The authors provide Spectral Angular Distance (SAD) results that are comparable to the SAD values of the PPI and N-FINDR algorithm. In [12] , the authors introduce the Vertex Component Analysis (VCA) algorithm. VCA algorithm, like other geometrical based approaches, assumes the presence of pure pixels in the data and repeatedly projects the data onto a a subspace orthogonal to the subspace spanned by the already determined endmembers. The algorithm stops when all end-members are found. The VCA algorithm provides results better than PPI algorithm and better than or similar to N-FINDR algorithm. VCA algorithm has less complexity than PPI and N-FINDR algorithm. In [13] , the authors introduce the Orthogonal Bases Algorithm (OBA) which utilize the same concepts of the N-FINDR algorithm while replacing the volume matrix calculation with Gram-Schmidt iterative orthogonalization. The authors provide SAD results that are comparable to PPI, N-FINDR and VCA algorithms. Despite providing good results in extracting end-members from the hyperspectral image, the geometrical based approaches depend on the assumption of having a pure pixel in the data. This assumption can not always be true due to the low resolution of the data and the highly mixed featured of the target image [1] .
The statistical based approaches overcome the weakness of the geometrical based approaches. The statistical based approaches do not assume the presence of a pure pixel of each end-member in the data. The statistical based algorithms utilize the statistical features of the data. One of the most common algorithms based on the statistical approach is Iterated Constrained End-member (ICE) [14] . The authors in [14] handle the unmixing problem as an optimization problem and propose an objective function to be minimized. Most recently, the authors in [15] proposed an unmixing method based on Gaussian Mixture Model (GMM), super-pixel segmentation (SS) and low-rank representation (LRR). The authors provide SAD results comparable to SAD results in [16] - [18] . In [19] , the authors provide a comparison of VCA, PPI and Optical Real-time Adaptive Spectral Identification System (ORASIS) algorithm [20] for mineralogical unmixing of hyperspectral data. ORASIS algorithm is a collection of stepwise algorithms working together to produce a set of end-members that may not be included in the data set. The four stages of ORASIS algorithm are prescreener, basis selection, end-member selection and unmixing. The authors use a highly mixed synthetic data with maximum purity of 0.48%, AVIRIS dataset of Cuprite, Nevada and Hyperion dataste of the Dost-Bayli area, Ardabil. The authors show that ORASIS outperforms both VCA and PPI algorithms.
One of the most widely used statistical approaches to solve the unmixing problem is Non-negative Matrix Factorization (NMF) [21] , [22] . NMF approximates the data as the multiplication between two non-negative matrices. The first matrix contains the end-members and the second matrix contains the fractional abundances [23] . NMF algorithm results lack high accuracy because of the non-convexity of its objective function. Researchers in the literature add regularizing terms to the objective function to force certain constraints and improve the accuracy of the NMF algorithm. In [1] , the authors introduce the Minimum Volume Constrained NMF algorithm (MVC-NMF) which exploits the fact that the simplex volume determined by the end-members is the minimum among all possible simplexes that circumscribe the data scatter space. The authors compare their SAD results to VCA algorithm and prove that MVC-NMF algorithm outperforms VCA algorithm. In [24] , the authors propose a modified version of the MVC-NMF algorithm where they apply the first stage of (ORASIS) algorithm in order to reduce the large data set size. This modification overcomes the shortcoming of MVC-NMF in dealing with large data. The authors test the modified technique on both a set of noisy synthetic data and Hyperion image of Dost-Bayli located in the Ardabil province in northwestern Iran. In [25] , the authors introduce Constrained NMF (CNMF). The authors add a regularization term to enforce smoothness constraint over the endmembers matrix. The authors compare CNMF SAD results to that of NMF and CNMF has better SAD values. In [26] , the authors introduce Piecewise Smooth NMF with Sparseness Constraint (PSNMFSC). Piecewise smoothness corresponds to smooth variation of the data while sparseness is a property of hyperspectral data where each pixel is a mixture of some end-members of the total number of end-members in the scene. The authors compare the SAD results of PSN-MFSC to VCA and MVC-NMF algorithms. The experiments show that PSNMFSC has the best results compared to VCA and MVC-NMF. In [6] , the authors introduce Abundance Separation and Smoothness Constrained NMF (ASSNMF). The abundance separation constraint minimizes the mutual information between the abundance distributions of different end-members while the abundance smoothness constraint is added based on the fact that minimum abrupt changes happen and variations occur within the image. The authors compare ASSNMF algorithm with four other algorithms namely CNMF, PSNMFSC, MVCNMF and VCA. The authors find ASSNMF to have superior SAD results. In [27] , the authors introduce double abundance characteristics constrained NMF (DAC 2 NMF). The authors measure the smoothness levels of each pixel pair according to the similarities between them by taking advantage of the spectral information of the data. The authors also avoid incorrect smoothness constraints by assigning zero smoothness level to the pixels not similar to the observed pixel. The authors also add a separation constrain to prevent over-smooth results. The authors compare DAC 2 NMF algorithm to MVC-NMF, ASSNMF, L 1/2 -NMF [3] and GLNMF [28] . The SAD results of DAC 2 NMF algorithm are comparable to the four aforementioned algorithms.
Many authors focus on the sparsity based constraints and its extensions. Sparseness refers to using a few end-members out of the whole present number of end-members in the scene to represent the pixel. In [29] , the author add a sparseness constraint based on L 1 norm to utilize the sparse distribution of data. In [30] , the authors propose a new method to calculate the sparseness measure and integrate it in the NMF cost function named NMF-SMC. The authors compare their SAD results to those of VCA and MVC-NMF. The results show that NMF-SMC outperforms both VCA and MVC-NMF algorithms. In [31] , the authors introduce NMF with Data-Guided Constraint (DGC-NMF). The authors propose estimating the abundance map of the data by using unconstrained NMF as a first step. The authors use both L 1/2 and L 2 regularizers to control the sparsity over the abundances. The authors use the L 1/2 regularizer to control sparsity over pixels with high sparsity levels. The authors use the L 2 regularizer to control sparsity over pixels with low sparsity levels. The authors compare the DGC-NMF algorithm to VCA, unconstrained NMF, L 2 -NMF and L 1/2 -NMF algorithms. The authors show that DGC-NMF has comparable SAD results to the previously mentioned algorithms. In [32] , the authors introduce a linear hyperspectral unmixing method based on L 1 -L 2 sparsity and Total Variation (TV) regularization (L 1 -L 2 SUnSAL-TV). This algorithm forces strong sparsity through calculating the difference between the first and second norms. The algorithm also incorporates spatial correlation information by adding a TV constraint to force spatial smoothness. The authors compare (L 1 -L 2 SUnSAL-TV) to six other methods among them are Sparse Unmixing model via variable Splitting and Augmented Lagrangian (SUnSAL), collaborative SUnSAL (CLSUnSAL) and Sparse unmixing using Spectral a Priori Information (SUnSPI). The authors show that (L 1 -L 2 SUnSAL-TV) outperforms the other algorithms in comparison. In [33] , the authors introduce bilateral filter regularized L 2 sparse NMF (BF-L 2 SNMF).The authors add a L 2 regularizer to enforce sparseness. The authors also add a bilateral filter regularizer to utilize the correlation information between the abundance vectors. The authors show that BF-L 2 SNMF SAD results are comparable and sometimes better than VCA, L 1/2 -NMF and GLNMF algorithms. L 1/2 regularizer is better in sparsity representation than L 1 and L 2 regularizers [3] , [34] . In [3] , the authors introduce the L 1/2 -NMF algorithm. The authors add an L 1/2 regularizer to the objective function to enforce sparsity over the abundances. The authors compare the L 1/2 -NMF algorithm SAD results to VCA, MVC-NMF, L 1 -NMF and PSNMFSC algorithms. The authors show that L 1/2 -NMF outperforms the algorithms in comparison. In [28] , the authors introduce Graph-regularized L 1/2 -NMF algorithm (GLNMF). The authors add graph regularization constrain where a nearest neighbour graph is built to incorporate the internal structure information of the data. The authors compare the SAD results of GLNMF algorithm to VCA, MVC-NMF and L 1/2 -NMF. The authors show that GLNMF results are superior and more noise tolerant. In [2] , the authors introduce iterative half-thresholding L 1/2 -NMF (HTL 1/2 -NMF). The HTL 1/2 -NMF takes advantage of the a priori known sparseness information and integrate it in optimization of the algorithm to adaptively adjust the regularization parameter through while the algorithm is iterating. The authors compare the SAD results of HTL 1/2 -NMF to VCA, L 1 -NMF and L 1/2 -NMF algorithms. The authors show that HTL 1/2 -NMF provides more accurate and much sparser results.
A problem that arises with hyperspectral unmixing is the limited bandwidth of the channel between the satellite and the earth station [35] . One solution to this problem is the addition of an on-board processing resource to reduce the size of the data transmitted to the Earth station [36] . An advantage of on-board processing is instead of sending the whole hyperspectral image, the resulting features (end-members and abundances), which are much smaller in size, are transmitted to the ground station. In recent years, many researchers contributed to the implementation of the hyperspectral unmixing algorithms on different target platforms. In [37] , the authors implement the VCA algorithm on NVIDIA Fermi GPU. The authors state that the design is faster than the MATLAB and C versions of the algorithm. GPUs have a significant disadvantage in on-board processing as they are limited in temperature, memory and energy consumption [37] . In [38] the authors implement a modified version of the VCA algorithm (MVCA) on Nvidia GPU using Open Computing Language (OpenCL). The proposed MVCA implementation has a speedup factor of 115 when compared to a sequential implementation of the VCA algorithm. In [39] , the authors use a systolic array-based FPGA implementation of the PPI algorithm on a Xilinx Virtex-II XC2V6000-6 FPGA. In [40] , the authors implement the VCA algorithm on a low cost Xilinx Zynq board with a Zynq-7020 System on Chip (SoC) FPGA based on the Artix-7 FPGA programmable logic. The authors in [36] implement the NFINDR algorithm on Virtex-4 XC4VFX60 FPGA. The authors state that the implementation of the NFINDR algorithm is superior in computational time compared to an equivalent C software version. In [41] the authors implement the Hysime algorithm used for estimating the number of end-members in the hyperspectral image [42] . The Hysime algorithm is implemented on a Virtex-7 XC7VX690T FPGA. In [43] the authors implement Fast Automatic Target Detection algorithm (Fast-ATGP) on a Xilinx Virtex-7 XC7VX690T FPGA.
Field-Programmable Gate Arrays (FPGAs) are the most common and most suitable resource for on-board processing of hyperspectral unmixing algorithms [41] . FPGAs have smaller size and weight while having also lower power consumption compared to other platforms like Graphical Processing Units (GPUs) and multicore processors [44] , [45] . FPGAs are also suitable for being used in satellite payload as we currently have FPGAs with increased resistance to space ionizing radiations [46] . FPGAs also provide the ability of changing their functionality through reconfiguration [41] , [47] .
Despite outperforming the PPI, NFINDR and VCA algorithms [3] , L 1/2 NMF algorithm is not yet implemented on any target platform. This paper presents, to the best of our knowledge, the first FPGA based architecture L 1/2 NMF algorithm. The paper provides real-time synthesis results, comparison with MATLAB simulation results and ground truth spectra.
The rest of the paper is organized as follows: in section II we discuss the mathematical approaches of the linear mixing model (LMM) and the L 1/2 -NMF algorithm. In section III we introduce the proposed FPGA implementation of the L 1/2 -NMF algorithm. In section V we discuss the implementation and simulation results. Finally we conclude our paper in section VI.
II. MIXING MODELS AND L 1/2 -NMF ALGORITHM A. MIXING MODELS
Mixing models are classified to linear and non linear models [5] . The linear mixing model (LMM) is valid when end-members follow a discrete distribution where each endmember does not interfere with other end-members [48] . Under the linear mixing model, it is assumed that the endmembers are mixed linearly as stated in equation 1
where X ∈ R L×N is the hyperspectral image, L is the number of bands and N is the number of pixels. A ∈ R L×K is the endmembers signature matrix and K is the number of estimated end-members in the image. Each column of the matrix A represents an end-member spectrum. S ∈ R K ×N is the fractional abundances matrix. On the other hand, the non linear mixing models add a non linear part to the linear mixing model in equation 1 as follows;
The term µ(.) defines an additive non linear term that depends on the end-member matrix A, the abundance coefficients in S and additional non linearity coefficients b which adjusts the amount of non linearity in the pixel.
The L 1/2 -NMF algorithm follows the LMM. The only known term in equation 1 is the hyperspectral image X. In order to find A and S we perform L 1/2 -NMF to minimize the distance between X and AS. Many methods can be used to measure such difference. The algorithm uses the Euclidean distance method and it is given as follows:
where
and
S 1/2 represents the sum of the element-wise square root of the elements of the matrix. λ is a factor that weighs the contribution of the L 1/2 regularization term. In order to solve the cost function in equation 3 multiplicative iterative algorithm is used to estimate A and S. When applied, the values in A and S are updated as follows:
where (.) T represents matrix transpose and.* and./ represent element-wise multiplication and division, respectively. S − 1 2 is the element-wise inverse square root of S. In order to ensure the full additivity constraint for the fractional abundances is achieved, both matrices X and A are augmented with a row of constants as follows:
where δ is a positive factor that controls the achievement of full additivity constraint. δ value is in the range between 10 and 20 to balance between convergence rate and accuracy of the estimation. Another important problem is dividing by zero in the term S − 1 2 . Any zero value is replaced by a small value to avoid trivial results [3] .
The algorithm stops the optimization process if one of two conditions is met. The first condition is a set number of iterations. The second condition is the distance between the current iteration and the previous one is less than a certain value . Figure 2 shows the flow chart of the L 1/2 -NMF algorithm. The L 1/2 -NMF algorithm can be summarized as follows [3] 
III. PROPOSED FPGA IMPLEMENTATION OF THE L 1/2 -NMF ALGORITHM
This section describes the detailed implementation of the L 1/2 -NMF Algorithm. As shown in Figure 3 , the proposed hardware architecture of the L 1/2 -NMF Algorithm contains an off-chip memory to save the augmented hyperspectral image, memory controller to select the data from the required addresses and the system main modules namely Cost Function Calculation Module, Update A Module and Update S Module implementing equations 3, 6 and 7 respectively. Figure 4 shows the architecture of the Cost Function Calculation Module. We divide the cost function to two main subblocks. The first sub-block is represented by X f − A f S 2 2 . The second sub-block represents the L 1/2 norm calculation of the abundances matrix S represented by S 1/2 . After calculating the first term, we multiply the resulting value by half then add the result to the multiplication of the second term by λ. The value of λ is calculated on an external processor and passed to the FPGA. The resulting value of the cost function calculation is saved in a shift register. While the algorithm is iterating, the value of the previous iteration is shifted to the second location inside the register. We take the difference between the current iteration and the previous one then pass the result to a comparator. The comparator compares the subtraction result with the predefined value to check the first stopping condition. If the subtraction value is greater than the value of then the iterations counter is incremented and the algorithm starts the next iteration. If the subtraction value is less than or equal to the value of then the algorithm stops iterating.
A. COST FUNCTION CALCULATION MODULE
As shown in Figure 5 , we start the calculation of the cost function by implementing the multiplication of the augmented matrix A f by the abundances matrix S. Both matrices are initialized and implemented as RAM blocks. Two control blocks are used to control the read,write processes, the addresses generation and the multiplication process. Control block 1 controls the writing addresses generation and the read and write enables. Control block 2 is used to generate the required reading addresses, the multiplication enable and the accumulation enable. The multiplication process here is a matrix multiplication process which is divided into a multiplier and an accumulator in our proposed design. We perform the element-by-element multiplication and pass the result to the accumulator until the vector-by-vector multiplication is completed. Figure 6 shows the next stage of calculating the cost function. The output of stage 1 is subtracted from the corresponding value in the augmented matrix X f . The resulting subtraction value is then squared and passed to an accumulator. This process continues until all values are covered to calculate the first term of the cost function equation X f −A f S 2 2 . In addition to the previously mentioned modules, we have two control blocks. The first block controls and synchronizes the subtraction, squaring, and accumulation processes with each output of the first stage. It also has the read enable sent to the off-chip memory. The second block generates the required addresses to read from the off-chip memory.
The second term of the cost function equation S 1/2 is implemented as shown in Figure 7 . The L 1/2 term is implemented as in equation 4. The square root module is implemented in three blocks as shown in Figure 8 . Altera floating point square root IP is used along with two converters from fixed point to floating point and vice versa. The square root module needs 40 clocks to perform the operation.
B. UPDATE A MODULE
In this section, we illustrate the hardware implementation of the module given by equation 6. As shown in Figure 9 , equation 6 can be broken down to numerator and denominator terms. Each of the terms is handled separately then the element-by-element division and multiplication operations are performed at the end. The numerator term is divided by the denominator term and the output is multiplied using and element-by-element multiplier by the end-members matrix A. Two control blocks are used for generation of required reading addresses, read enable, division and multiplication enables. The numerator term values are divided by the denominator term values in an element-by-element operation. Next, the division result is multiplied by the corresponding value in the end-members matrix A. The resulting values are saved into A RAM block with the same dimensions of the end-members matrix A.
We start by implementing the matrix multiplication (SS T ) as shown in Figure 10 . We use two RAM blocks representing the abundances Matrix and its transpose. In addition, a control block is used for read and write control signals, write addresses and read enable. Another control block is used to generate the read addresses, multiplication and accumulation enables.
Next, we multiply the result of stage 1 by the matrix A as done is stage 1. The result of stage 1 is saved in a memory element and accessed through control signals. Both matrix multiplication processes are broken down to a multiplier and an accumulator as shown in Figure 11 . As shown in Figure 12 , The numerator term is implemented in the same manner as the denominator using the off-chip memory, a randomly initialized memory element, a control block with control signals for multiplier, accumulator, read and write processes and another control block for reading addresses.
Next we implement the element-by-element division of the numerator and the denominator terms. For this module we use a floating point divider as the fixed point divider outputs are quotient and remainder while the floating point divider output is a rational number. As shown in Figure 13 denominator terms as shown in Figure 14 . The resulting values of the numerator and denominator terms are passed to the element-by-element divider. The divider result is multiplied by the abundances matrix S and the result is saved into a RAM block with the same dimensions as the abundances matrix S. Two control blocks are used for generation of required reading addresses, read enable, division and multiplication enables. The resulting value are saved into A RAM block with the same dimensions of the abundances matrix S.
We start by implementing the matrix multiplication between the transpose of the augmented end-members matrix A T f and the augmented matrix X f as shown in Figure 15 . As previously mentioned, the matrix multiplication is broken down to a multiplier and an accumulator.
The denominator has two terms. The matrix multiplication A T f A f S, is implemented through two stages. Figure 16 shows the first stage implementing the matrix multiplication A T f A f . The end-members are saved in two RAM blocks with dimensions L by K and K by L. Two control blocks are used for required addresses generation and control signals generation. The matrix multiplication is broken down to a multiplier and an accumulator as in the previous modules.
The output of stage 1 is saved into a RAM block and multiplied by the abundances matrix S. As shown in Figure 17 , the matrix multiplication is performed on two stages as a multiplier and an accumulator. Control blocks are needed to generate control signals for the multiplier and accumulator enables, read and write enables and generating reading addresses.
The second term of the denominator is implemented through two stages. In the first stage, we calculate the element-wise inverse square root of the abundances matrix S. Before calculating the inverse square root, the input value is tested. If the value is zero then it is replaced by a very small value. The inverse square root module is implemented in 3 blocks. Altera floating point inverse square root IP is used along with two converters from fixed point to floating point and from floating point to fixed point. The inverse square root module needs 49 clocks to perform the operation as shown in Figure 18 .
Finally, the inverse square root module output is multiplied by the value λ 2 then added to the first term of the denominator. The aforementioned multiplication and addition processes are element-by-element operations. Figure 19 shows the implementation of the denominator term. Control blocks are added for control signals generation to synchronously enable the multiplier and the adder, read and write enables and required addresses generation.
IV. COMPLEXITY ANALYSIS
In this section, we discuss the computational complexity of the L 1/2 -NMF algorithm relative to some other algorithms present in the literature. As we mentioned in section I, the L 1/2 -NMF algorithm outperforms NFINDR, PPI and VCA algorithms [3] . However, the L 1/2 -NMF algorithm has higher complexity than NFINDR, PPI and VCA algorithms. This is due to the fact that the L 1/2 -NMF algorithm computes both the end-members and the abundances matrices while NFINDR, PPI and VCA algorithms compute only the end-members matrix. In general, statistical based approaches have higher complexity than geometrical based approaches [4] . Table 1 shows the computational complexity of the NFINDR,PPI, VCA and L 1/2 -NMF algorithms.
where L is the number of bands, K is the number of endmembers to be detected and N is the number of pixels present in the scene, 2.3 < η < 2.9 and s is the number of skewers for the PPI algorithm.
V. EXPERIMENTAL RESULTS

A. SIMULATION AND IMPLEMENTATION TOOLS
The hardware implementation of the L 1/2 -NMF algorithm described in section III is designed using Mentor Graphics FPGAdv 8.1. For the synthetic data set, the simulations are performed using ModelSim SE PLUS 6.3a tool. The proposed design is synthesized using Intel Quartus II 15.0 web edition. Altera Cyclone V device number 5CGXFC9E7F35C8 [49] is chosen as the target hardware platform. For the real data set, the simulations are performed using ModelSim 10.1c tool and the proposed design is synthesized using Intel Quartus Prime 15.1 standard pro edition. Altera Arria 10 device number 10AS048E1F29E1SG [50] is chosen as the target hardware platform. This is due to the limitation of the previously used target platform for the synthetic data set experiment. The chosen FPGA for the real data set experiment has more resources and is of a newer technology. The Arria 10 FPGA uses 20 nm technology while Cyclone V FPGA uses 28 nm technology. The newer technology allows the FPGA to operate at higher frequency and to have more resources which we show later in section V-C.5. The implementation results are compared with simulation results performed on MATLAB R2018a with Windows 10 operating system on Intel Core TM i7-4790 CPU@3.6 GHZ and 16 GB memory.
B. DATA SETS 1) SYNTHETIC HYPERSPECTRAL DATA SET
We test the proposed implementation of the L 1/2 -NMF algorithm on synthetic data prepared as mentioned in [12] . A simulated scene of dimensions 188 × 400 is generated satisfying equation 1. In section III, we show that the proposed architecture of the L 1/2 -NMF algorithm depends on memory blocks in many steps of the design. Since this is the first implementation of the L 1/2 -NMF algorithm, we choose to test our proposed design on a smaller scene with smaller number of end-members to avoid memory overflow and validate our results compared to our simulation results. Three spectral signatures are selected from the U.S. Geological Survey (USGS) digital spectral library [52] . The selected spectra are Carnallite NMNH98011, Ammonioalunite NMNH145596 and Biotite HS28.3B as shown in Figure 20 .
Each value in the dataset is converted in MATLAB to binary representation. The binary value is of length 64 bits with 15 bits representing the integer part and 48 bits representing the fraction part. The choice of 64 bits representation is to provide more accuracy during calculations especially while truncating the outputs of the multipliers and dividers. Inputs and outputs data type is a standard logic vector 64 bits. Intermediate signals and RAM addresses are of an unsigned type.
2) REAL HYPERSPECTRAL DATA SET
In this section, we use real hyperspectral data set to test the proposed FPGA hardware implementation of the VOLUME 8, 2020 Figure 21 shows the region of 95 × 95 pixels of the Samson hyperspectral data set.
This data set consists of three end-members representing water, soil, and tree. Figure 22 shows the spectra of the three end-members. The image consists of 156 channels covering the wavelengths from 401 nm to 889 nm.
As previously mentioned for the synthetic data set, each value in the dataset is converted in MATLAB to binary representation. The binary value is of length 64 bits with 15 bits representing the integer part and 48 bits representing the fraction part. The choice of 64 bits representation is to provide more accuracy during calculations especially while truncating the outputs of the multipliers and dividers. Inputs and outputs data type is a standard logic vector 64 bits. Intermediate signals and RAM addresses are of an unsigned type. Figure 23 shows the implementation of the Cost Function Calculation module. We calculate the first term in equation 3 X f − A f S 2 2 in the first unit ''Xf_AfS''. The second term 1 Available here: http://lesun.weebly.com/hyperspectral-data-set.html S 1/2 is calculated in the second unit ''S_half''. The resulting values from the first and second units are passed to the ''objfunc_adder'' unit. The first value is multiplied by half and the second value is multiplied by λ then the values are added to calculate the cost function value. The cost function value is stored inside the shift register ''register_2''. While the algorithm is iterating, the value is shifted inside the register and the new iteration value is stored in its place. Both values are subtracted and the result is compared with the threshold value in the ''diff_comp'' unit. The ''diff_comp'' unit also has a counter for the number of iterations to which counts up to the maximum number of iterations in the second stopping condition. The ''control_objfunc'' controls the save and shift processes in the register. Figure 24 shows the simulation results of the Cost Function Calculation module. The cost function value ''objvalue'' is calculated and stored in the register. The cost function value is then subtracted from the previous value and compared to the threshold value . The result is the ''en'' signal indicating the need for more iterations or the end of the optimization. Figure 25 shows the implementation of the Update A module. As previously explained in section III-B, The numerator term is calculated in ''XSTsaved'' unit and the denominator term is calculated in the ''ASSTsaved'' unit. Each element in the numerator term is divided by its corresponding element in the denominator term inside the ''A_Divider'' unit. The division result is then multiplied by the corresponding element in the end-members matrix A inside the ''mult_by_A'' unit and stored in a RAM block. Figure 26 shows the simulation results of the Update A module. As previously mentioned, after updating the endmembers matrix values, the resulting values are stored inside a RAM block. These resulting values are passed to the other two main modules; Cost Function Calculation module and Update S module; to complete the optimization steps. The updated end-members matrix also replaces the old one in the next iteration. After the algorithm completes the optimization, the values of the end-members matrix are exported to a text file and read inside MATLAB software to compare the implementation and simulation results. Figure 27 shows the implementation of the Update S module. As previously explained in section III-C, the denominator terms are calculated in ''AfTAfS_saved'' and ''S_neg_ half_saved'' units. Each value from the ''S_neg_half_saved'' unit is multiplied by λ 2 and added to the corresponding resulting value from the ''AfTAfS_saved'' unit inside the ''S_adder'' unit. The resulting value from ''AfTXf_saved'' is divided by the corresponding value resulting from the ''S_adder'' unit using the ''S_divider'' unit. Finally, the resulting value from the ''S_divider'' unit is multiplied by the corresponding value in the abundances matrix stored in the ''RAM_3_400_controlled'' unit. The final values are stored in a RAM block for updating and reuse. Figure 28 shows the simulation results of the Update S module. As previously mentioned, the denominator terms are calculated while the numerator term is calculated. Next, the numerator term is divided by the denominator term then multiplied by the abundances matrix. The resulting values are stored in a RAM block to update the old values in the abundances matrix and to be passed to the other modules to continue the optimization process.
C. IMPLEMENTATION RESULTS
1) COST FUNCTION CALCULATION MODULE IMPLEMENTATION RESULTS
2) UPDATE A MODULE IMPLEMENTATION RESULTS
3) UPDATE S MODULE IMPLEMENTATION RESULTS
4) SYNTHESIS RESULTS
The proposed hardware implementation of the L 1/2 -NMF algorithm for the synthetic data set is synthesized using Intel Quartus II 15.0 web edition. For the real data, the proposed design is synthesized using Quartus Prime 15.1 standard pro edition which supports newer FPGA families. Table 2 summarizes the resources used for the proposed hardware implementation of the L 1/2 -NMF algorithm for both synthetic and real data sets.
We implement our FPGA design for the synthetic data set on Altera Cyclone V 5CGXFC9E7F35C8 FPGA. The proposed FPGA design of the L 1/2 -NMF algorithm occupies 9127 logic elements, 17133 registers, around 1.26 Mb of memory and 192 DSP blocks. The proposed design operates at 52.6 MHz maximum frequency. Our tests show that the simulation run time for the algorithm on MATLAB is 52.11 seconds while the implementation run time on FPGA is 13.33 seconds which results in a speedup factor of 3.9. In Table 2 , we show the run time of each module per iteration since the algorithm can stop optimizing at different iterations according to the stopping conditions. For the real data set, we implement our FPGA design on Altera Arria 10 10AS048E1F29E1SGX FPGA. The proposed FPGA design of the L 1/2 -NMF algorithm for the real data set occupies 124474 logic elements, 258479 registers, around 17 Mb of memory and 190 DSP blocks. Our tests show that the simulation run time for the algorithm on MATLAB is 152.89 seconds while the implementation run time on FPGA is 135.46 seconds wich results in a speedup factor of 1.14.
5) RESULTS COMPARISON
The implementation results are compared to MATLAB simulation results and the ground truth signatures [52] using Spectral Angular Distance given by:
Figures 29 and 30 show the results comparison between MATLAB simulation, FPGA results and ground truth signatures of the three selected end-members. The simulated and the implemented results are plotted according to the calculated SAD values. We calculate the SAD values for each detected end-member against the ground truth signatures of of all end-members. We plot the signature of the detected end-member with the lowest SAD value against the ground truth signature of the same end-member. In our implementation, we use δ = 15, small value = 10 −6 , max. iterations value = 3000 and = 10 −3 . Tables 3 and 4 show the SAD values calculated between MATLAB simulation resulting end-members versus their corresponding ground truth signatures and FPGA implementation resulting end-members versus the corresponding ground truth signatures. From the tables, it can be seen that the implementation results have higher SAD values than simulation results. This is due to truncations and conversions that occur within the implemented architecture of the L 1/2 -NMF algorithm. This comes from the fact that after each multiplication or division process, the resulting values are truncated to have the same data length in all the stages of the hardware implementation. In addition, the approximation that results from the conversion from fixed point to floating point representations and vice versa affects the accuracy of the resulting values.
VI. CONCLUSION
In this paper, we presented the first FPGA hardware implementation of the L 1/2 -NMF algorithm. Although L 1/2 -NMF algorithm outperforms many implemented algorithms such as NFINDR, PPI and VCA, it has not been yet implemented in the literature. The proposed design is tested using both synthetic and real hyperspectral data sets. In the proposed design, the mathematical operations are broken down to element-byelement operations. Matrix multiplication is implemented as a multiplier and an accumulator. Each stage of the design has its own control blocks. Two types of control blocks are used in the design. The first type is used to generate read enable, write enable, write addresses, multiplier, adders and subtracter enables. The second control block type is focused on generating read addresses synchronized with blocks outputs. The design is implemented using Mentor Graphics FPGAdv 8.1 and synthesized using Altera Quartus II 15.0 web edition for the synthetic data set and Altera Quartus Prime 15.1 standard pro edition for the real data set. The experimental results were performed on Altera Cyclone V 5CGXFC9E7F35C8 FPGA for the synthetic data set and Altera Arria 10 10AS048E1F29E1SG FPGA for the real data set. The hardware results are compared to the MATLAB simulation results using the same data set and ground truth signatures via SAD calculation. The experimental results show the proposed hardware design successfully unmixes the data set with SAD results comparable to the simulation results. The maximum MATLAB SAD value was 0.1239 while the maximum FPGA SAD value was 0.2663 for the synthetic data set. On the other hand, the maximum MATLAB SAD value was 0.3327 while the maximum FPGA SAD value was 0.3308 for the real data set. The proposed design of the L 1/2 -NMF algorithm works with a maximum speed of 52.6 MHz for the synthetic data set and 104.32 for the real data set. When compared to the MATLAB simulation, the FPGA hardware design has higher speed which resulted in speed up factors of 3.9 and 1.14 for the synthetic and real data sets respectively. In the future, we will exploit pipelined and parallel implementation of matrix multiplication to decrease the run time of the implemented algorithm as well as to better utilize the available resources.
