Abstract: In this study, a scalable coprocessor for accelerating the Differential Evolution (DE) algorithm is presented. The coprocessor is interfaced with the PowerPC embedded processor of a Xilinx Virtex-5 FX70T field programmable gate array (FPGA). In the proposed design, the DE algorithm module is tightly coupled with the fitness function module to reduce communication and control overhead. The fixed point DE algorithm is implemented in the coprocessor, whereas both fixed and floating point DE are implemented in the embedded processor. The performance of the coprocessor is evaluated by optimising benchmark functions of different complexities. The implementation results show that the coprocessor is 73.14-160.2× and 2.19-27.63× faster than the software execution of the floating and fixed point algorithms, respectively. As a case study, the spectrum allocation problem of a cognitive radio network is evaluated with the coprocessor. Results show an acceleration of 76.79-105× and 5.19-6.91× with respect to floating and fixed point DE in the embedded processor. It is also observed that the application occupies 56% of BRAM, 54% of DSP48E and 16% of slice LUTs, with a maximum operating frequency of 63.55 MHz, in a Virtex-5 FPGA. This type of coprocessor is suitable for embedded applications where the fitness function remains unchanged.
Introduction
The differential evolution (DE) algorithm is an evolutionary computation method that has been applied in diverse domains of science and engineering [1]. DE finds optimal values for a set of parameters by repeatedly making pseudo-random changes to their values. The number of parameters is referred to as the dimension of the problem. After making changes, the algorithm evaluates the fitness of the solution. DE became a popular evolutionary algorithm because (i) it is simple to implement, (ii) it performs better than other evolutionary algorithms (EA) and (iii) it has fewer control parameters and lower space complexity [2]. Most evolutionary techniques have been implemented on high end desktop computers/processors to solve optimisation problems. Applications such as motion estimation [3], pole-placement design of infinite-impulse-response filters [4] and future generation evolvable machines [5] use evolutionary algorithms to derive optimal solutions. These applications generally use low-performance microprocessors with limited computational resources, rather than high-performance desktop personal computers/processors, to execute the evolutionary algorithms. The time consuming evolution process leads to slow execution of the algorithms on an embedded processor, which limits the use of evolutionary algorithms in embedded applications. To meet the real time execution speed requirement, one can either parallelise the algorithm or implement the design in hardware. Several hardware platforms, such as microcontrollers (μC), digital signal processors (DSP), field programmable gate arrays (FPGA) and application specific integrated circuits (ASIC), are used for developing embedded systems. Platforms like μC and DSP revolve around firmware development using software methodologies rather than the development of hardware for the application [6].
FPGA development platforms support both a hardware-based approach (system developed entirely in the hardware) and a processor-based approach (system developed entirely in the firmware). An FPGA has the flexibility to customise the hardware design by adding any combination of peripherals and controllers, which is not available in a microcontroller or DSP processor-based system. For these reasons, evolutionary algorithms such as particle swarm optimisation (PSO) and the genetic algorithm have recently been implemented in FPGAs [7, 8]. The execution time of the DE algorithm increases with the complexity of the function to be optimised. As a result, the DE algorithm is not suitable for implementation on low end processors for real-time/online applications involving complex optimisation. Thus there is an increasing demand to define an architecture and implement the algorithm in the FPGA to meet the real-time execution speed requirement. The DE algorithm can be implemented in both an embedded processor and hardware using either fixed point or floating point arithmetic. Floating point DE gives better accuracy, but at the expense of a high computation cost. In an embedded processor, floating point DE reduces the execution speed by approximately 5-40×.
The objective of this work is to implement the DE algorithm on the FPGA platform to accelerate the optimisation. This paper focuses only on reducing the optimisation time and not on improving the quality of the solutions. For improving the quality of the solution, different variants of the DE algorithm can be explored. The DE algorithm has three major computational operations: (i) random number generation (RNG), (ii) objective function evaluation and (iii) updating the solution. For optimising a simple test bench function (Rosenbrock), the profiling result of the DE algorithm shows that RNG and the optimisation algorithm (except the fitness function evaluation) consume 90% of the total execution time. However, for complex functions of higher dimension this might change; for example, 60% of the execution time is spent evaluating the Shifted Schwefel's fitness function. Li et al. [8] have suggested a hardware software co-design method for implementing the PSO algorithm. We have observed that while optimising a fitness function like Shifted Schwefel's on the co-design platform, where the fitness function is evaluated in software and the remaining part of the algorithm is implemented in hardware, the bus communication time is ∼106 ms. In contrast, if the complete DE algorithm along with the fitness function is implemented in hardware, it takes ∼128 ms. This indicates that the bus communication overhead dominates (i.e. 82.8% of) the overall hardware execution time. This was observed with the embedded processor operating at 200 MHz. To reduce the bus communication overhead, the complete DE algorithm including the fitness function evaluation can be implemented in the embedded processor of the FPGA. This approach may result in a marginal acceleration of the optimisation time. Neither of these approaches gives any additional improvement in execution speed.
So the alternative is to implement both the algorithm and the fitness function evaluation in hardware and use it as a dedicated coprocessor. Dedicated coprocessor-based accelerators have been used for accelerating computationally intensive operations [9].
In this work, a dedicated coprocessor for the DE algorithm is developed and integrated with the embedded processor (PowerPC 440) to solve optimisation problems including spectrum allocation (SA) in a cognitive radio network [10]. The proposed coprocessor is scalable in terms of the optimisation parameters: maximum number of iterations (G_MAX), population size (NP) and dimension (D). In the proposed design, both the fitness function evaluation and the DE algorithm are in a single module rather than in two different modules. The software execution times of the fixed and floating point DE algorithm are evaluated and compared with the hardware execution time of the algorithm.
The rest of the paper is organised as follows. Section 2 presents the existing literature on FPGA implementation of evolutionary algorithms. Section 3 gives a brief introduction to the DE algorithm, and its software profiling is described in Section 4. The proposed hardware architecture for the DE algorithm is described in Section 5. Section 6 presents the system on chip implementation of the DE coprocessor with the auxiliary processor unit (APU) interface details. Section 7 presents the experimental setup. Section 8 describes the results and analysis of the DE coprocessor, and Section 9 presents SA in cognitive radio as a case study of a real-time application of the DE coprocessor, followed by conclusions in Section 10.
Related works
In the literature, different evolutionary algorithms have been implemented using FPGAs, as shown in Table 1. A customised intellectual property (IP) core of the genetic algorithm was implemented in a Xilinx FPGA and integrated with a PowerPC 405 processor based system on chip (SoC), and a speed enhancement of up to 5.16× was achieved on a Virtex-II Pro development kit [7]. A modular co-design architecture was developed for the PSO algorithm [8], in which particle positions were updated in hardware whereas the fitness function was evaluated on a Nios-II embedded processor. Owing to this approach, the design had the flexibility to modify the fitness function in software depending on the application, so various embedded applications can be developed simply by changing the objective function. This design achieved a speedup of 20× on an Altera development kit. A hardware architecture of pipelined PSO (PPSO) was developed along with a parallel PSO (pPSO) framework, which consists of multiple Nios-II processors using the system-on-a-programmable-chip (SOPC) methodology, and resulted in a speedup of 98× compared to the software implementation of the PSO algorithm on an Altera development kit [11]. A modular, flexible and reusable multi-swarm PSO parallel hardware architecture was proposed to overcome the drawbacks of software implementation of the PSO algorithm using a Freescale microcontroller and a Xilinx MicroBlaze soft processor core [12]. A hardware accelerator for the pPSO algorithm was reported, and its performance was validated by optimising test bench functions on a MicroBlaze processor-based SoC in a Virtex-6 development kit [13]. Apart from the above works, different variants of PSO algorithms were implemented on FPGAs without addressing the acceleration of execution speed [14] [15] [16] [17] [18] [19].
Recently the authors have proposed a floating point implementation of the DE algorithm in the SoC, and reported that the DE IP core running at 50 MHz accelerates the execution speed by approximately 200× compared with its equivalent software implementation on a PowerPC 440 processor [20]. However, no work has been reported in the literature which implements the fixed point DE algorithm as a coprocessor suitable for embedded applications. In this work, we have developed a coprocessor for the DE algorithm, interfaced it with the PowerPC embedded processor and tested its performance by solving mathematical test bench functions and a practical SA problem.
DE algorithm
The basic DE algorithm has three major steps: (i) reading control parameter values, (ii) initialisation of the population and (iii) the mutation, crossover and selection process. The complete pseudo-code of the DE algorithm is given in Algorithm 1 (see Fig. 1). The performance of the DE coprocessor is tested by optimising a set of numerical test bench functions, as tabulated in Table 4. From this table, it is observed that for Fun1 the floating point DE algorithm takes 4721 ms, in contrast to fixed point DE, which takes 70 ms to execute. This is because of the complexity of floating point arithmetic. In the embedded processor, the floating point unit (FPU) is used for all the floating point arithmetic involved in the algorithm. From Table 4, it is also observed that all the test functions (except Fun6) take less time for evaluation. The value inside the parentheses refers to the percentage of the total execution time that the particular module requires during execution.
Hardware architecture of DE algorithm
The architecture of the DE algorithm is shown in Fig. 2. It has seven modules: memory initialisation, mutation, crossover, selection, random number generator, fitness evaluation and a control finite state machine (FSM) module. The FSM is shown in Fig. 3. It has idle, initialisation, operation, waiting and reading states. In the idle state, all the modules are in the reset condition. In the initialisation state, the FSM enables the memory module when inputs such as the maximum number of generations (G_MAX), population size (NP) and dimension (D) are made available. During the operation state, the control FSM enables the internal modules according to the different stages of the algorithm, that is, mutation, crossover and selection. The FSM stays in the waiting state until the execution of the current module is completed, and then moves on to the next module. In the reading state, the FSM reads the fitness value and writes it into the output register.
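The state sequence described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `State`, `STAGES` and `run_control_fsm` are hypothetical, the operation/waiting pair is assumed to repeat once per stage and per generation, and the return to the idle state after reading is assumed rather than taken from Fig. 3.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    INIT = auto()
    OPERATION = auto()
    WAIT = auto()
    READ = auto()


# Assumed stage order inside one generation.
STAGES = ("mutation", "crossover", "selection")


def run_control_fsm(g_max, inputs_valid=True):
    """Return the list of states visited for g_max generations."""
    trace = [State.IDLE]              # all modules held in reset
    if not inputs_valid:
        return trace                  # stay idle until G_MAX, NP and D arrive
    trace.append(State.INIT)          # memory module enabled
    for _generation in range(g_max):
        for _stage in STAGES:
            trace.append(State.OPERATION)  # enable the stage's module
            trace.append(State.WAIT)       # wait for that module to finish
    trace.append(State.READ)          # fitness value written to output register
    trace.append(State.IDLE)          # assumed: return to idle afterwards
    return trace


if __name__ == "__main__":
    print([s.name for s in run_control_fsm(1)])
```

The model only captures ordering; it says nothing about cycle counts, which depend on the hardware modules themselves.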
Initialisation module
The architecture of the initialisation module is shown in Fig. 4. The memory module has two separate memories: one for storing the population values (population memory) and the other for storing their corresponding fitness function values (fitness memory). During the initialisation state, the population values of all the particles (i.e. of size NP × D) are randomly generated within the range [X_min, X_max] and stored in the population memory of size 4 kbytes. The population values are accessed from the population memory by using a 12 bit address. Each population member is of 
Mutation module
After the population is initialised, the mutation operation is performed by the mutation module. A mutant vector is generated for every target vector from the current population. In this module a mutant vector of size 128 bytes is generated for each population member. Three distinct vector indices r_1, r_2 and r_3 are generated in the range of 1 to NP by comparing the counter value with the value of the multiplier. These indices are connected to the select lines of a multiplexer (MUX) unit. Three distinct target vectors, each of size 1 kbits, are obtained from the MUX unit as shown in Fig. 5. The mutation operation then takes the difference of any two of these three selected vectors, scales it by a factor F and adds it to the third one to obtain the mutant vector of size 256 bytes. A mutant vector is generated for all the population members over all dimensions.
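The mutation step can be sketched in software as follows. This is an illustration only, not the hardware data path: it uses floating point lists and Python's `random` module instead of the fixed point MUX-based design, the function names are hypothetical, and excluding the target index from r_1, r_2 and r_3 is an assumption borrowed from the classic DE formulation.

```python
import random


def pick_distinct_indices(np_size, target, rng):
    """Pick three mutually distinct indices, none equal to the target index."""
    candidates = [i for i in range(np_size) if i != target]
    return rng.sample(candidates, 3)


def mutate(population, target, f_scale, rng):
    """Build the mutant vector x[r1] + F * (x[r2] - x[r3])."""
    r1, r2, r3 = pick_distinct_indices(len(population), target, rng)
    x1, x2, x3 = population[r1], population[r2], population[r3]
    return [a + f_scale * (b - c) for a, b, c in zip(x1, x2, x3)]


if __name__ == "__main__":
    rng = random.Random(42)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(4)] for _ in range(8)]
    print(mutate(pop, target=0, f_scale=0.5, rng=rng))
```

At least four population members are needed for the index selection to succeed, which is satisfied by the population sizes (8, 16, 32) used in the experiments.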
Crossover module
The crossover operation is mainly responsible for increasing the diversity among the mutant vectors. A trial vector is generated at the output of the crossover module with a crossover probability CR, as shown in Fig. 6. This crossover rate controls the diversity of the population and helps the algorithm to escape from local optima [1, 2]. The module also ensures that the trial vector obtains at least one component from the mutant vector. The register Reg1 has a random number stored in it. The output of Reg1 and CR are input to the comparator 2 module. The multiplier output and the index of the population member are input to comparator 1. The outputs of comparators 1 and 2 are input to a logic OR gate. The output of the crossover module is either the mutant vector or the population vector, as selected by the MUX unit.
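A software analogue of the comparator, OR gate and MUX arrangement is the usual binomial crossover, sketched below. The mapping of the two comparators onto the two conditions is an interpretation of the description above, and the names `crossover` and `j_rand` are hypothetical.

```python
import random


def crossover(target_vec, mutant_vec, cr, rng):
    """Binomial crossover: pick each component from the mutant or the target."""
    dim = len(target_vec)
    j_rand = rng.randrange(dim)       # component forced to come from the mutant
    trial = []
    for j in range(dim):
        below_cr = rng.random() < cr  # analogue of the random number vs CR compare
        forced = (j == j_rand)        # analogue of the index compare
        # OR of the two conditions drives the selection, like the MUX select line.
        trial.append(mutant_vec[j] if (below_cr or forced) else target_vec[j])
    return trial


if __name__ == "__main__":
    rng = random.Random(7)
    print(crossover([0.0] * 6, [1.0] * 6, cr=0.5, rng=rng))
```

With CR = 0 exactly one component is taken from the mutant, which is the guarantee mentioned in the text; with CR = 1 the trial vector equals the mutant.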
Selection module
The outputs of the crossover module are the trial vectors. These are input to the selection module, as shown in Fig. 7. The fitness value of the trial vector is evaluated by the fitness evaluation module. If it is less than the fitness of the current population member, the trial vector is selected; otherwise the current population member is retained as the new population member. The output of the MUX is the updated value written to the population memory. This process is repeated for all the iterations to improve the fitness of the individuals, and it stops when the maximum number of generations is reached.
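The selection rule for a minimisation problem can be written in a few lines. In this sketch the `sphere` function is only a hypothetical stand-in for the fitness evaluation module, and `select` is an illustrative name.

```python
def sphere(vec):
    """Stand-in fitness function: sum of squares (lower is better)."""
    return sum(v * v for v in vec)


def select(current, current_fitness, trial, fitness_fn):
    """Keep the trial vector only if its fitness is strictly lower."""
    trial_fitness = fitness_fn(trial)
    if trial_fitness < current_fitness:
        return trial, trial_fitness
    return current, current_fitness


if __name__ == "__main__":
    member = [2.0, 2.0]
    print(select(member, sphere(member), [1.0, 1.0], sphere))
```

Returning the fitness together with the vector mirrors the separate population and fitness memories, so the fitness of a surviving member does not have to be recomputed.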
Fitness evaluation module
This module evaluates the fitness of each individual. The fitness of every member of the complete population is evaluated and stored in the fitness memory. For different functions/applications, only the fitness module is modified.
RNG module
The RNG module is of great importance for the proper operation of DE. Here, a linear feedback shift register (LFSR) is used for generating random numbers, as it is easy to implement and produces fairly good pseudo-randomness. This module generates random numbers for the initial population, selection, crossover and mutation modules. The seed for the random number generator is programmable and is initialised to a non-zero value. If an all-zero value appears in the seed, the XOR operations continue to generate zeros and the output is always zero. The architecture of the 32 bit LFSR with the maximum length polynomial $X^{32} + X^{22} + X^{2} + X + 1$ is shown in Fig. 8. This module generates a sequence of $2^{32} - 1$ random numbers.
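A bit-level model helps to see why a zero seed locks the generator. The sketch below assumes a Fibonacci (external XOR) structure that shifts left and feeds back taps 32, 22, 2 and 1; the shift direction, bit numbering and function names are assumptions and may differ from the circuit in Fig. 8.

```python
MASK32 = 0xFFFFFFFF


def lfsr32_step(state):
    """Advance the 32 bit LFSR by one shift (taps 32, 22, 2 and 1)."""
    # Tap k corresponds to bit k - 1 of the state word.
    feedback = ((state >> 31) ^ (state >> 21) ^ (state >> 1) ^ state) & 1
    return ((state << 1) | feedback) & MASK32


def lfsr32_stream(seed, count):
    """Return `count` successive states, rejecting the all-zero seed."""
    if seed & MASK32 == 0:
        raise ValueError("seed must be non-zero")
    state = seed & MASK32
    out = []
    for _ in range(count):
        state = lfsr32_step(state)
        out.append(state)
    return out


if __name__ == "__main__":
    print([hex(v) for v in lfsr32_stream(0xACE1, 5)])
```

Because the feedback is a pure XOR of state bits, an all-zero state maps to itself, which is why the seed must be non-zero.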
Programmable system on chip (PSoC) implementation of the DE algorithm
PSoC is a programmable integrated system that has configurable processors, peripherals, memories and custom IP peripherals on a single FPGA. The proposed PSoC platform for implementing the DE algorithm is shown in Fig. 9. The PowerPC 440 (PPC440) processor communicates with external peripherals such as double data rate synchronous dynamic random-access memory (DDR2 SDRAM), block RAM (BRAM) controllers, universal asynchronous receiver/transmitter (UART) (RS-232), timer and interrupt controllers, joint test action group (JTAG) controller and clock generator via the processor local bus (PLB). PPC440 is preferred over the MicroBlaze processor because of its high speed of operation and efficient resource utilisation. The DDR2 and BRAM controllers are used for storing the heap and stack of the program and data. The UART is used for serial data transfer between the end user and the processor. The timer and interrupt controllers are used for profiling the application.
The clock generator provides the necessary clock signals to all the modules and peripherals. The USB JTAG controller is used to download the bitstream from the host computer to the FPGA board. PPC440 is directly coupled to the APU controller, which provides a flexible high-bandwidth interface to the DE coprocessor via the fabric coprocessor bus (FCB). The coprocessor operates as an extension to the PowerPC. The APU interface details are shown in Fig. 10.
DE coprocessor with APU interface
The APU interface allows the coprocessor to execute an extended instruction set concurrently with the PowerPC 440 embedded processor instruction set. It supports various coprocessor functions, such as a fully compliant PowerPC floating-point unit [23], or other custom functions implementing algorithms appropriate for specific applications, such as the DE and PSO algorithms. The APU controller interface along with the fabric coprocessor module behaves as a coprocessor for PPC440. Since the APU is independent of the processor to peripheral interface, it does not add extra load to the PLB. The PPC440 supports three primary types of instructions to be used for the APU [23]. In this work, load/store instructions are used for accessing the APU, in which a maximum of 128 bits of data can be transferred in a single clock cycle, or it can be transferred as four sets of 32 bits. The details of interfacing the DE IP core with the embedded processor are shown in Fig. 10. The FCB is specifically targeted to host the DE coprocessor without intervention of the processor instructions. The DE core frequency is adjusted by the clock generator and set to 33 MHz. However, it can be increased up to the maximum frequency of the IP core, subject to maintaining the desired clock ratio of processor to APU controller. Fig. 10 shows two asynchronous first in, first out (FIFO) buffers (depth of four and width of 32 bits) interfaced at the input and output of the DE core. The input signal is processed as a stream; each stream has four samples, three of which are used for G_MAX, NP and D. The remaining sample is used for checking whether the FIFO is 50% full or not. In this architecture, 'Output_Data' and 'Input_Data' are two 32 bit wide data buses for data input and output of the IP core, respectively. The working principle of the DE IP core is described below.
1. PowerPC writes the input data G_MAX, NP and D in three clock cycles. The IP core receives data from the PowerPC till the FIFO is full. This is ensured by the control signal 'Input_EoD'. When the FIFO is 50% full, 'Input_EoD' becomes logical high.
2. When the FIFO is 50% full, it sets 'DE_Input_En' to logical high, and when the IP core is ready for processing it gives a handshaking signal 'DE_Input_Rdy' as logical high. The FIFO sends the data to the IP core till 'DE_Input_EoD' is logical high.
3. Once the IP core has processed a single sample of the stream, it sets 'DE_Output_En' to logical high, and this is acknowledged by the output FIFO with the handshaking signal 'DE_Output_Rdy'. When this signal is high, the IP core sends the processed samples to the output FIFO till 'DE_Output_EoD' is high.
4. When the output FIFO is full, the FIFO sends the data back to the APU of the PowerPC processor.
The APU wrapper contains two different modules, namely IP_APU and APU_IP. The APU_IP module receives data from the processor and sends it to the DE IP, whereas the IP_APU module receives the final solution from the DE IP core and sends it to the processor (PPC440). The APU_IP receives a 128 bit signal, but the DE IP has only a 32 bit wide input, so the IP receives a full set of data in four clock cycles. Similarly, the IP_APU module receives 128 bits of data from the IP in four clock cycles. The APU wrapper is interfaced with the IP core using six control signals: 'Input_Data_En', 'Input_Data_Rdy', 'Input_EoD', 'Output_Data_En', 'Output_Data_Rdy' and 'Output_EoD'. An FSM with five states, that is, load, load_valid, store, store_valid and idle, controls the data flow between the processor, IP_APU and APU_IP.
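The width conversion between the 128 bit APU transfer and the 32 bit IP port amounts to slicing one word into four beats. The sketch below illustrates this; the least-significant-word-first ordering, the placeholder fourth word and the names `split_128` and `join_128` are assumptions, not details of the wrapper itself.

```python
WORD_BITS = 32
WORDS_PER_TRANSFER = 4
WORD_MASK = (1 << WORD_BITS) - 1


def split_128(value):
    """Slice a 128 bit value into four 32 bit words, lowest word first."""
    return [(value >> (WORD_BITS * i)) & WORD_MASK
            for i in range(WORDS_PER_TRANSFER)]


def join_128(words):
    """Reassemble four 32 bit words into one 128 bit value."""
    value = 0
    for i, word in enumerate(words):
        value |= (word & WORD_MASK) << (WORD_BITS * i)
    return value


if __name__ == "__main__":
    # Hypothetical stream: G_MAX = 100, NP = 8, D = 4, plus one spare sample.
    packed = join_128([100, 8, 4, 0])
    print(split_128(packed))
```

Each list element corresponds to one clock cycle on the 32 bit side, which is why a full set of data needs four cycles.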
Experimental setup
In this work, the basic DE algorithm is considered for coprocessor implementation. The DE algorithmic parameters are tabulated in Table 3. The DE software code is ported to the PPC440 processor using 32 bit fixed and floating point C code; the algorithm is then coded in Verilog for implementation in hardware. An IP core for the DE algorithm is developed and simulated using Xilinx ISE 10.1, then a synthesisable IP core is developed and subsequently a coprocessor is designed for accelerating the DE algorithm. For functional verification, the wrapper logic and the DE core are simulated using a test bench with code coverage of 99.9%, and the simulation results are shown in Fig. 11 for Fun6 with G_MAX = 1, NP = 8 and D = 4. When the DE_Output_Rdy signal is logic high, the resultant fitness value is available at the DE_Output_Data port in fixed point format. After logic high on the DE_Output_Rdy signal, DE_Input_Rdy goes high because of scheduling for the next set of G_MAX, NP and D values. From the results it is observed that the IP core consistently gives the same results.
The IP core frequency is set to 33 MHz and the core is connected to the PPC440 of the Xilinx Virtex-5 FPGA using the tightly coupled APU controller interface. The performance of the coprocessor is evaluated by optimising six numerical benchmark functions used in the CEC 2005 and 2010 competitions [21, 22]. Owing to the empirical nature of the DE algorithm, evolution parameters are subject to modification. In the proposed coprocessor, the population size (NP), number of generations (G_MAX) and dimension (D) can be modified by the users through the embedded processor without redesigning the hardware.
Results and analysis

Timing results
Initially, the complete DE optimisation algorithm is ported to the PowerPC processor of the Xilinx Virtex-5 FPGA for software implementation; then the complete DE algorithm is executed using the DE coprocessor. The execution time of the DE algorithm for different population sizes (8, 16, 32) and for three different numbers of generations (1, 50, 100) is evaluated over 20 independent runs. The average execution time of the algorithm using the coprocessor is tabulated in Table 6 and is referred to as the hardware (HW) time. The acceleration factors (AF) of the coprocessor with respect to the software floating and fixed point execution times are tabulated as AF (float) and AF (fixed), respectively. The values in parentheses refer to the percentage standard deviation of the execution time. From this table, it is observed that the coprocessor is 73.14-160.20× faster than the software execution of the floating point DE algorithm. In contrast, it is only 2.19-27.63× faster than the fixed point DE algorithm. Further, it is observed that for lower dimension functions the coprocessor acceleration AF (fixed) is small compared with higher dimension functions. The table also reveals that the execution time of the HW coprocessor for different functions scales up with the population size and the maximum number of generations. A comparison of the average speedup of floating to fixed, floating to hardware and fixed to hardware implementations for different benchmark functions (G_MAX = 100 and NP = 8) is illustrated in Fig. 12.
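The two reported quantities can be computed from raw run times as sketched below. This assumes that the acceleration factor is the ratio of mean software time to mean hardware time and that the percentage standard deviation is the population standard deviation relative to the mean; both conventions, the function names and the timings in the usage example are hypothetical.

```python
from statistics import mean, pstdev


def acceleration_factor(sw_times_ms, hw_times_ms):
    """Ratio of mean software time to mean hardware time."""
    return mean(sw_times_ms) / mean(hw_times_ms)


def percent_std(times_ms):
    """Standard deviation expressed as a percentage of the mean."""
    return 100.0 * pstdev(times_ms) / mean(times_ms)


if __name__ == "__main__":
    # Made-up timings for illustration only, not measured values.
    sw_runs = [70.2, 69.8, 70.1, 69.9]
    hw_runs = [10.1, 9.9, 10.0, 10.0]
    print(round(acceleration_factor(sw_runs, hw_runs), 2))
    print(round(percent_std(hw_runs), 2))
```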
Synthesis results
The hardware IP is designed with multiple modules in Verilog, and the code size is ∼1000 lines. It is parameterised in terms of DE population size (NP), dimension (D) and maximum number of generations (G_MAX). Table 7 shows the XST (Xilinx Synthesis Tool) synthesis results (resource utilisation) for optimising different benchmark functions with population size NP = 32. The targeted FPGA is the Xilinx Virtex-5 XC5VFX70T. It has several device primitives such as BRAM, DSP48E, slices and LUTs. Each BRAM is 36 kbits in size and can be configured as two separate memories of 18 kbits each. The DSP48E slice is a digital signal processing logic element that can act as a multiply-accumulator, a multiply-adder or a one or n-step counter, and can perform logic operations such as AND, OR and XOR. Slices are combinations of LUTs and flip flops, used for implementing the digital logic of the desired IP. For higher dimensional test bench functions 6% Block RAM (BRAM) is utilised compared with other functions. The resource utilisation for Fun2 (60% of DSP48E, 7% of slice registers, 16% of LUTs and 22% of slices) is high compared with the other functions because of its computational complexity.
Real-time application as a case study: SA in cognitive radio
In the current wireless communication domain, spectrum scarcity arises from the rigid licensing policy [24]. Dynamic SA is an alternative that overcomes this problem. Cognitive radio is the future technology which supports dynamic SA [25]. In the cognitive radio domain there are two different types of users: (a) the primary or licensed user and (b) the secondary or unlicensed user. A primary user has priority in using an allotted spectrum band; however, in the absence of the primary user, a secondary user can access the same band till a primary user demands it. In the distributed network architecture, each secondary user determines the spectrum availability and allocates the desired spectrum. In this scheme, a secondary user considers the locally available information from the neighbouring users and decides the spectrum assignment.
As each secondary user implicitly has an embedded computing platform, the SA task can be performed by it. However, running the SA on an embedded processor consumes most of the platform resources, thereby degrading the performance of other applications running on it. Hence, there is a requirement for a dedicated hardware peripheral for performing the SA task. This is the motivation for choosing this application as a case study in this work. This problem is posed in [10] and has been solved by using genetic, quantum genetic and PSO algorithms. In this paper, the same problem is solved by using the developed DE hardware coprocessor and evaluated in terms of execution speedup and acceleration factor. The general SA model consists of a channel availability matrix $L = \{l_{n,m} \mid l_{n,m} \in \{0, 1\}\}_{N \times M}$, where $l_{n,m} = 1$ if and only if channel m is available to user n, else $l_{n,m} = 0$; a channel reward matrix $B = \{b_{n,m}\}_{N \times M}$, where $b_{n,m}$ represents the reward that can be obtained by user n using channel m; and an interference constraint matrix $C = \{c_{n,p,m} \mid c_{n,p,m} \in \{0, 1\}\}_{N \times N \times M}$ representing the interference constraints among the secondary users n and p, where $c_{n,p,m} = 1$ if secondary users n and p would interfere when they use channel m simultaneously, else $c_{n,p,m} = 0$ [10]. The required solution is a conflict free channel assignment matrix $A = \{a_{n,m} \mid a_{n,m} \in \{0, 1\}\}_{N \times M}$, where $a_{n,m} = 1$ if channel m is allocated to secondary user n, else $a_{n,m} = 0$ [10, 26]. In this work, the reward matrix and constraint matrix are initialised as in [26].
In real time applications, users perform the network-wide SA operation faster than the spectrum environment changes. In this work, the assumption is that the location, available spectrum etc. are static, thus L, B and C remain constant during a particular allocation period. As the SA model can inherently be seen as an optimisation problem, the DE algorithm is proposed to solve the allocation problem. The proposed architecture for the DE algorithm is exploited to select the appropriate channel for secondary users from the available channels without interfering with the primary users. The conflict free spectrum assignment matrix A must satisfy the interference constraints defined by C:
$a_{n,m} \cdot a_{p,m} = 0$, if $c_{n,p,m} = 1$ (1)
The above equation states that if the constraint $c_{n,p,m} = 1$, then only one of the secondary users n and p can use channel m, depending on the reward value of the user. If user n has more reward than user p, then channel m will be used by user n, and vice versa. For the given L and C, the objective of SA is to obtain the conflict free channel assignment matrix by maximising the reward sum U(R). Thus the optimal conflict free channel assignment matrix A* is selected from the set of conflict free channel assignments for a given set of N users, M spectrum bands and constraints C, as shown in (2)
For improving the efficiency of SA, one or more fitness functions need to be optimised. In this work, maximum sum reward (MSR) is considered as the fitness function to validate the hardware framework. MSR is defined as [10] MSR:
In the proposed SA scheme, each population member specifies a possible conflict free channel assignment matrix. To decrease the search space, we propose to encode only the elements that correspond to $l_{n,m} = 1$. The length of a population member is equal to the number of elements equal to 1 in L. The value of every element in the population is randomly generated such that it satisfies the interference constraints C. The proposed DE-based SA algorithm proceeds as follows:
3. Map the population $x_{d,i}$ to $a_{n,m}$, where (n, m) is the dth element of $L_1$, for all d ∈ {1, …, D} and i ∈ {1, …, NP}. The complete A matrix should satisfy the constraint matrix C; if there are any violations, then one of the users will get channel m depending on their reward value, and the corresponding element of the matrix A is set to 1 or 0.
4. Compute the fitness of each individual of the current population.
5. Carry out the mutation, crossover and selection, and update the population as defined in Algorithm 1 (see Fig. 1).
6. If the predefined maximum generation is reached, then derive the assignment matrix as mentioned in step 3 and stop the process; else go to step 3 and continue.
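Steps 3 and 4 can be sketched in software as follows. Several points are assumptions of this sketch rather than details from the text: the individual is taken to be already binarised, the available entries of L are enumerated in row-major order, a conflict is resolved in favour of the user with the higher reward $b_{n,m}$ (ties go to the lower user index), and the fitness is taken to be the sum of the rewards over all assigned entries. All function names are hypothetical.

```python
def decode_assignment(x, L):
    """Map a binary individual onto A, one bit per entry of L equal to 1."""
    n_users, n_chan = len(L), len(L[0])
    A = [[0] * n_chan for _ in range(n_users)]
    positions = [(n, m) for n in range(n_users)
                 for m in range(n_chan) if L[n][m] == 1]
    for bit, (n, m) in zip(x, positions):
        A[n][m] = bit
    return A


def resolve_conflicts(A, B, C):
    """Enforce a[n][m] * a[p][m] == 0 wherever C[n][p][m] == 1."""
    n_users, n_chan = len(A), len(A[0])
    for m in range(n_chan):
        for n in range(n_users):
            for p in range(n + 1, n_users):
                if C[n][p][m] == 1 and A[n][m] == 1 and A[p][m] == 1:
                    # The user with the larger reward keeps the channel.
                    if B[n][m] >= B[p][m]:
                        A[p][m] = 0
                    else:
                        A[n][m] = 0
    return A


def sum_reward(A, B):
    """Assumed fitness: total reward over all assigned (user, channel) pairs."""
    return sum(a * b for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))


if __name__ == "__main__":
    L = [[1, 1], [1, 0]]
    B = [[5, 2], [7, 0]]
    C = [[[0, 0], [1, 0]], [[1, 0], [0, 0]]]  # users 0 and 1 clash on channel 0
    A = resolve_conflicts(decode_assignment([1, 1, 1], L), B, C)
    print(A, sum_reward(A, B))
```

Encoding only the available entries keeps the individual length equal to the number of ones in L, which is the search space reduction described above.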
Both the fitness function and the DE algorithm are evaluated in the coprocessor. Here, the algorithmic parameters (G_MAX, NP) and the SA parameters, such as the number of secondary users N, number of channels M and number of primary users K, are parameterised and can be changed through the embedded processor. The execution time for evaluating the SA both in the embedded processor (software) and in the coprocessor is measured over 20 independent runs. Table 8 shows the software execution time and acceleration factor for both arithmetic implementations. Table 9 shows the coprocessor execution time (HW) and the acceleration factor with respect to both arithmetic versions of the algorithm executed in the processor. The value in parentheses refers to the percentage standard deviation. In these tables, MSR (N × M × K) corresponds to the maximum sum reward for N secondary users, M channels and K primary users.
From Table 8, it is observed that the fixed point software implementation gains an acceleration of 11.43-19.64× over the floating point implementation. Table 9 shows that the proposed DE coprocessor is ∼5.19-6.91× faster than the fixed point software implementation and 76.79-105× faster than the floating point software implementation in the embedded processor (PPC440). Table 10 tabulates the execution time in terms of mega clock cycles for optimising the MSR (5 × 5 × 5) objective function. Fig. 13 shows the comparison of the average speedup of floating to fixed, floating to hardware and fixed to hardware implementations of the SA problem with G_MAX = 300 and NP = 32 for three MSR (N × M × K) objective functions. It is observed that the AF for floating to fixed is ∼11.53-15.8× because of the high computational complexity in both arithmetic types, but the AF for the coprocessor is 79.75-98.96× over floating and 5.89-6.91× over fixed arithmetic because of its faster execution speed. The algorithm is run for 20 independent runs with N = 10, M = 10 and K = 10, and the convergence graph is shown in Fig. 14. In this graph, SW means the result obtained using the PowerPC processor and HW means the result obtained using the DE coprocessor. Initially there is some difference between the SW and HW results because of the random number generation in the hardware, but after some iterations both attain almost the same value. The curve shows that with a higher reward value, the user will be allotted a fair spectrum band. Table 11 tabulates the minimum, maximum, average, standard deviation and percentage standard deviation of the fitness value.
Conclusions
In this paper, we have proposed a scalable coprocessor with an APU interface for accelerating the execution speed of the DE algorithm, and it was implemented in a Xilinx Virtex-5 FPGA. To avoid the bus overhead, the complete DE algorithm with the fitness function was implemented in hardware instead of partitioning the design into software and hardware. To validate the performance of the coprocessor, six test bench functions were first optimised, and then a practical SA problem was solved using the coprocessor. For validation of the proposed framework, the execution times of the fixed point and floating point software implementations of the DE algorithm were compared while optimising the test bench functions and the SA problem. The experimental results revealed that the software implementation of the fixed point DE algorithm accelerated the execution speed by approximately 43.19-45.69× while optimising a less complex test function (Fun4) and by 4.96-5.67× while optimising the 32 dimension test function (Fun6), compared with the floating point DE algorithm implemented in the embedded processor. The fixed point DE algorithm, along with the fitness evaluation, was also implemented in the coprocessor, and the experimental results showed that an acceleration of approximately 25-27.63× and 135.79-147.39× is attained while optimising the complex 32 dimension test function Fun6, compared with the fixed and floating point software implementations, respectively. For optimising less complex fitness functions like Fun1, the coprocessor attained a speedup of approximately 2.43-3.94× over the fixed point and 82.09-98.20× over the floating point software implementation. It was also observed that for the SA problem, the coprocessor attained an acceleration of ∼76.79-105× and 5.19-6.91× compared with the floating point and fixed point implementations of the algorithm in the embedded processor, respectively.
The proposed framework can be extended for accelerating other evolutionary techniques and can be used for designing Evolvable Hardware. 
