Abstract--A Genetic Algorithm (GA) is an intelligent search strategy supported by operations inspired by biological evolution. Although a GA is able to find very good solutions for a variety of applications, it typically requires many computations and iterations to be effective, and these consume a large amount of time. Thus, software implementations of GAs applied to increasingly complex problems and large search spaces can cause unacceptable delays. An alternative is to implement GAs in hardware, which can achieve a substantial speedup over software counterparts by exploiting the inherent parallelism of the GA paradigm. This paper presents the design of libraries of hardware modules for a GA system: one library in the Verilog Hardware Description Language (Verilog HDL) and one in VHDL. Each library is based on a widely used MATLAB library.
I. INTRODUCTION
Genetic Algorithms (GAs) are robust search and optimization algorithms that mimic the theory of evolution and natural selection. GAs were originally developed by John Holland in 1975 [1] and have since been applied to many problems, such as the Traveling Salesman Problem (TSP) [2] and circuit partitioning [3].
GAs are extensively used in many applications across a large and growing number of disciplines. Although a GA is able to find very good solutions for a variety of applications, the amount of time consumed for large computations and iterations is enormous. Hence, software implementation of GAs for increasingly complex applications can cause unacceptable delays. Also, GAs lend themselves easily to pipelining and parallelization. All these factors make GAs good candidates for hardware implementation.
Since GA operations are relatively simple, GAs are good candidates for FPGAs and reconfigurable systems. This paper presents the design of a library of modules to support hardware implementation of GAs, providing a rich set of modules and an example architecture which users can easily customize for specific applications. The various modules of the GA system are described in the Hardware Description Languages (HDLs) Verilog HDL and VHDL. These HDL modules are incorporated into an example architecture in order to exploit the features of pipelining and parallelization. Also, the HDL modules can be used by other researchers to implement GAs in hardware and can also be incorporated into efficient architectures for specific problems of interest. These hardware modules can be further synthesized and mapped to Application Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs).
II. PREVIOUS WORK IN HARDWARE GAS
The past several years have seen a number of Genetic Algorithms implemented in hardware. This section gives a brief overview of the various GA implementations by different researchers in the last few years.
Koonar et al. developed a reconfigurable GA for VLSI CAD design. This work described a GA architecture for circuit partitioning in VLSI physical design automation. The design, implemented in VHDL, used 6 modules along with three external memories and was synthesized for a Xilinx Virtex xcv50e part using Xilinx ISE 4.1. This hardware implementation achieved a speedup of 100x over its software counterpart [3].
Tommiska et al. realized a general purpose GA in Altera HDL (AHDL). The design was simulated with programmable logic devices of Altera's Flex 10k Field Programmable Gate Array (FPGA) family. The GA was run in a pipelined fashion that enabled it to be 212 times faster than the software solution [4].
Graham et al. implemented a Splash 2 Parallel GA (SPGA) in VHDL for optimizing symmetric traveling salesman problems. Each processor in the SPGA consisted of four Xilinx 4010 FPGAs and associated memories. This system, which was based on the Simple GA scheme, was found to be 6 to 10 times faster than the equivalent software version [2, 5].
Scott et al. described a general purpose Hardware Genetic Algorithm (HGA) to be used in applications where conventional GAs were too slow. This design was realized in VHDL and implemented on a BORG prototyping board which consisted of five Xilinx XC4000 FPGAs. Various HGA implementations were described which exploited parallelism and coarse-grained pipelining [6]. The HGA was based on the Simple GA, or total replacement, scheme.
Shackleford et al. developed a survival-based, steady state GA that was aimed at achieving high performance. The prototype GA machine was described in VHDL, designed using the Tsutsuji logic synthesis system and implemented on an Aptix AXB-MP3 Field Programmable Circuit Board (FPCB) populated with six FPGAs [7].
In contrast to the simple GA, Aporntewan et al. implemented a compact GA in Verilog HDL, as they claimed this was more suitable for hardware implementation. For the 32-bit one-max problem, this system was found to be 1000x faster than the software version [8].
III. GAS IMPLEMENTED IN SOFTWARE
There are many implementations of GAs in software. Here we concentrate on one widely used implementation, the Genetic Algorithm Optimization Toolbox (GAOT), implemented in MATLAB [9], which is therefore a good basis for defining our hardware module library. The GAOT is a Genetic Algorithm system that is employed for function optimization using binary and real representations. MATLAB was used for the implementation because it is well suited to numerical computation and provides many built-in auxiliary functions useful for function optimization. The toolbox consists of a library of GA modules such as Roulette Wheel, Tournament selection and Normalized Geometric selection for the selection process; One-point, Arithmetic and Heuristic methods for crossover; and various mutation methods. The Normalized Geometric selection and the Heuristic crossover method are used for real representation and are not implemented in this work. Floating point representations are not addressed in this work.
The toolbox has been tested on a series of non-linear, non-convex and multi-modal functions. The results show that the algorithm finds better solutions with fewer function evaluations than simulated annealing [9].
IV. MODULES AND ARCHITECTURES IMPLEMENTED
This section describes the various modules of the Genetic Algorithm system. These GA modules are implemented in Verilog HDL and VHDL and consist of different types of Random Number Generators (RNGs), a Selection module (SM), Crossover modules (CMs), a Mutation module and Fitness modules. This section also describes one architecture that the modules can be incorporated into, which exploits the features of pipelining and parallelism.
The Linear Feedback Shift Register (LFSR) and the multiple LFSR based RNGs were implemented in this work. The SM employed for all the GA architectures was Roulette Wheel Selection, as it is the most widely used selection method. The CMs that were implemented are the Arithmetic Crossover, the Uniform Crossover, the One-point Crossover and the Two-point Crossover. A single-bit mutation module was used for all implementations. The fitness module is application specific and is designed according to the function to be optimized or evaluated. In this work, two separate functions are presented for optimization; these are based on functions implemented by Scott [6] in his thesis work. The two mathematical functions are:
$f(x) = 2x$
$f(x) = 2x^3 - 45x^2 + 300x$
These functions are optimized by using different combinations of modules from the library developed in this work.
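As a software reference for the LFSR based RNG, the following Python sketch models an 8-bit Fibonacci LFSR. The feedback polynomial is not stated in the text, so the tap positions used here (corresponding to the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1), the function names and the seed are assumptions chosen only for illustration. The multiple LFSR based RNG is not modelled.

```python
def lfsr_step(state, width=8, taps=(0, 2, 3, 4)):
    """Advance a Fibonacci LFSR by one clock and return the new state.

    `taps` are the bit positions XORed together to form the feedback bit.
    The default taps are an assumption (x^8 + x^4 + x^3 + x^2 + 1).
    """
    feedback = 0
    for position in taps:
        feedback ^= (state >> position) & 1
    # Shift right and insert the feedback bit at the most significant position.
    return (state >> 1) | (feedback << (width - 1))


def lfsr_sequence(seed, count, width=8, taps=(0, 2, 3, 4)):
    """Return `count` successive LFSR states generated from a non-zero seed."""
    if seed == 0:
        raise ValueError("an all-zero seed locks an XOR-feedback LFSR at zero")
    values = []
    state = seed
    for _ in range(count):
        state = lfsr_step(state, width, taps)
        values.append(state)
    return values


if __name__ == "__main__":
    # Hypothetical seed; 16 members matches the population size quoted later.
    initial_population = lfsr_sequence(seed=0x5A, count=16)
    print(initial_population)
```

With a primitive feedback polynomial the register cycles through all 255 non-zero 8-bit states before repeating, which is why the all-zero seed is rejected.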
The various GA modules were connected together according to the architecture illustrated in Fig. 1. The initial population is generated by the Random Number Generator (RNG). This population is then fed into the first Fitness Module (FM), which evaluates the fitness of the entire population and then passes the population members along with their fitness values to the Selection Module (SM). The SM selects two parents based on the Roulette Wheel selection method and sends them to the Crossover Module (CM), which can be any one of the four methods implemented in this work (one-point, two-point, uniform and arithmetic crossover). The resultant offspring from the CM are then mutated before being sent to the second FM, which evaluates the fitness of the final members. This new population then replaces the old population. The architecture is based on the Simple GA, as it replaces the old population entirely.
Once the first set of parents is selected and passed to the CM, the SM works on selecting the next set of parents while crossover, mutation and fitness evaluation are being carried out. This pipelining increases the overall speed of the system. The system also has two fitness modules, one to evaluate the fitness of the original population and another to evaluate the fitness of the final population.
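The data flow described above can be sketched as a small software reference model. This is only an illustrative sketch, not the hardware design: Python's `random` module stands in for the RNG, the fitness function is f(x) = 2x, and the mutation rate, generation count, seed and all function names are assumptions.

```python
import random


def fitness(x):
    """Fitness module for f(x) = 2x."""
    return 2 * x


def roulette_select(population, fitnesses, rng):
    """Pick one member with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total == 0:
        return rng.choice(population)
    spin = rng.random() * total
    running = 0
    for member, score in zip(population, fitnesses):
        running += score
        if spin < running:
            return member
    return population[-1]  # guard against floating-point round-off


def one_point_crossover(a, b, rng, width=8):
    """Exchange the bits below a random cut point between two parents."""
    point = rng.randint(1, width - 1)
    low_mask = (1 << point) - 1
    high_mask = ((1 << width) - 1) ^ low_mask
    return (a & high_mask) | (b & low_mask), (b & high_mask) | (a & low_mask)


def mutate(x, rng, mutation_rate, width=8):
    """Flip a single randomly chosen bit with probability mutation_rate."""
    if rng.random() < mutation_rate:
        x ^= 1 << rng.randrange(width)
    return x


def next_generation(population, rng, mutation_rate=0.05, width=8):
    """Selection, crossover and mutation with total replacement (Simple GA)."""
    fitnesses = [fitness(x) for x in population]  # first fitness module
    offspring = []
    while len(offspring) < len(population):
        parent1 = roulette_select(population, fitnesses, rng)
        parent2 = roulette_select(population, fitnesses, rng)
        for child in one_point_crossover(parent1, parent2, rng, width):
            offspring.append(mutate(child, rng, mutation_rate, width))
    return offspring[:len(population)]


def run(generations=32, size=16, seed=1, width=8):
    """Return the best fitness seen over all generations."""
    rng = random.Random(seed)
    population = [rng.randrange(1 << width) for _ in range(size)]
    best = max(fitness(x) for x in population)
    for _ in range(generations):
        population = next_generation(population, rng, width=width)
        # Second fitness module: evaluate the replacement population.
        best = max(best, max(fitness(x) for x in population))
    return best


if __name__ == "__main__":
    print(run())
```

The model is sequential. In the hardware architecture, selection of the next parent pair overlaps with crossover, mutation and fitness evaluation of the previous pair.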
V. PERFORMANCE ANALYSIS
In this work, various GA modules have been implemented in both Verilog HDL and VHDL. A complete list of the modules implemented in hardware is given in Table I. This table also lists the modules available in the GAOT toolbox implemented in MATLAB [9]. Only the modules for binary representation from the GAOT toolbox have been included in Table I. The MATLAB implementation has other modules for real number representation, but these have not been included in this work. The VHDL and Verilog HDL modules have only been implemented in binary. Various combinations of the modules are incorporated into the GA architecture and the results from these implementations are compared to each other. The performance of the hardware implementations is also compared to that of the software GA implementation. The modules from Table I were used interchangeably in the architecture shown in Fig. 1. Both 8-bit and 16-bit versions are implemented.
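To illustrate how crossover modules can be swapped freely within one architecture, the sketch below gives the four crossover methods a common interface parameterized by chromosome width. It is a software illustration under stated assumptions: the function names, the `CROSSOVERS` table and the integer form of arithmetic crossover (a weighted average that preserves the sum of the parents) are hypothetical and are not taken from the HDL libraries or from GAOT.

```python
import random


def _swap_bits(a, b, swap_mask):
    """Exchange the bits selected by swap_mask between a and b."""
    diff = (a ^ b) & swap_mask
    return a ^ diff, b ^ diff


def one_point(a, b, rng, width):
    """Swap all bits below one random cut point."""
    point = rng.randint(1, width - 1)
    return _swap_bits(a, b, (1 << point) - 1)


def two_point(a, b, rng, width):
    """Swap the bits lying between two distinct interior cut points."""
    low, high = sorted(rng.sample(range(1, width), 2))
    return _swap_bits(a, b, ((1 << high) - 1) ^ ((1 << low) - 1))


def uniform(a, b, rng, width):
    """Swap each bit position independently with probability 1/2."""
    return _swap_bits(a, b, rng.getrandbits(width))


def arithmetic(a, b, rng, width):
    """Integer weighted average of the parents (assumed formulation)."""
    weight = rng.random()
    child1 = round(weight * a + (1 - weight) * b)
    return child1, a + b - child1


# Every operator shares the signature (a, b, rng, width) -> (child1, child2),
# so any of them can be plugged into the same generation loop.
CROSSOVERS = {
    "one_point": one_point,
    "two_point": two_point,
    "uniform": uniform,
    "arithmetic": arithmetic,
}


if __name__ == "__main__":
    rng = random.Random(7)
    for name, operator in CROSSOVERS.items():
        for width in (8, 16):
            a, b = rng.getrandbits(width), rng.getrandbits(width)
            print(name, width, operator(a, b, rng, width))
```

Keeping the width as a parameter mirrors the 8-bit and 16-bit versions mentioned above; the three bit-swapping operators differ only in how the swap mask is generated.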
The GA architecture illustrated in Fig. 1 was simulated to verify correct functionality and to analyze the performance. The performance analyses consisted of a detailed comparison between the results obtained for the different GA system implementations. The GA modules were functionally verified in two steps. The first step involved the functional verification of each module individually, where the modules were tested to operate correctly under all possible conditions. In the second step, the modules were connected together in the configuration given in Fig. 1 and simulated for the two fitness functions. During these simulations, each module and the system as a whole were checked for functionality. The successful completion of these two steps ensured the proper working of the modules and the entire system.
The Verilog HDL and VHDL simulations were carried out with Cadence NCSIM version 5.10, and SIMVISION was used to view the output waveforms. The Cadence software was run under UNIX on a Sun Ultra10 workstation running Solaris. MATLAB version 7.0.1 was used for the simulation of the GAOT toolbox.
TABLE I
The different GA modules implemented in hardware and the version of the MATLAB implementation from [9] which is compared to them.
In the first simulation run for the fitness function $f(x) = 2x$, an LFSR based RNG, Roulette Wheel selection, One-point crossover and a mutation module were employed. This implementation was then simulated for each of the other Crossover Modules. The next implementation consisted of a multiple LFSR based RNG, Roulette Wheel selection, One-point crossover and a mutation module. Again, this setup was repeated for the different crossover modules. Next, the same process was repeated for the second fitness function $f(x) = 2x^3 - 45x^2 + 300x$. This entire procedure was carried out for an 8-bit GA system and a 16-bit GA system.

The maximum value obtained by each GA system gives a measure of the performance of the system. Each of these implementations was simulated for 16 runs, with a unique seed for each run. The maximum fitness value for each of these runs was observed, and the maximum, minimum and average values for the set of maxima were tabulated. The standard deviation and the 95% confidence interval were also calculated. These results for the fitness function $f(x) = 2x$, implemented using the LFSR based RNG, are tabulated in Table II. The fitness functions were maximized over the domain 0 ≤ x ≤ 255 for the 8-bit GA system; hence the maximum attainable fitness value for $f(x) = 2x$ is 510. The performance results for $f(x) = 2x$ implemented using the multiple LFSR based RNG are tabulated in Table III.

TABLE III
Performance results for $f(x) = 2x$ using multiple LFSR based RNG

A comparison of the simulation results from the LFSR based RNG GA implementations, the multiple LFSR based RNG systems and the MATLAB implementation is given in Fig. 2. A similar simulation setup was run for the fitness function $f(x) = 2x^3 - 45x^2 + 300x$ and the performance analysis is given in Table IV. For this function, where 0 ≤ x ≤ 255, the maximum fitness value that can be attained is 30313125.
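The two maximum attainable fitness values quoted above, and the summary statistics computed over the 16 runs, can be checked with a short Python sketch. The text does not say how the 95% confidence interval was computed; the Student t interval used here (critical value 2.131 for 15 degrees of freedom) is an assumption, as are the function names.

```python
import math
import statistics


def f_linear(x):
    """First fitness function, f(x) = 2x."""
    return 2 * x


def f_cubic(x):
    """Second fitness function, f(x) = 2x^3 - 45x^2 + 300x."""
    return 2 * x**3 - 45 * x**2 + 300 * x


def summarize(maxima, t_critical=2.131):
    """Summarize the per-run maxima of one GA configuration.

    t_critical is the two-sided 95% Student t value for 15 degrees of
    freedom, i.e. 16 runs; using a t interval is an assumption.
    """
    mean = statistics.mean(maxima)
    stdev = statistics.stdev(maxima)
    half_width = t_critical * stdev / math.sqrt(len(maxima))
    return {
        "max": max(maxima),
        "min": min(maxima),
        "mean": mean,
        "stdev": stdev,
        "ci95": (mean - half_width, mean + half_width),
    }


if __name__ == "__main__":
    domain = range(256)  # 0 <= x <= 255 for the 8-bit system
    print(max(f_linear(x) for x in domain))  # 510
    print(max(f_cubic(x) for x in domain))   # 30313125
```

Both functions reach their maximum at x = 255 on this domain. The cubic also has a local maximum at x = 5 (value 625), so a search can stall there before finding the global maximum.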
The comparison graph for the two implementations for the fitness function $f(x) = 2x^3 - 45x^2 + 300x$, one with the LFSR based RNG and another with the multiple LFSR based RNG, is shown in Fig. 3.
VI. CONCLUSIONS
From the performance analysis of the different modules, we conclude the following:
- the initial population generated by the multiple LFSR based RNG is more suitable for use with GA systems than that generated by the LFSR based RNG. In the simulation results, the fitness values for the population generated by the multiple LFSR based RNG were higher than those for the simple LFSR based RNG. This is in agreement with results on hardware-based RNGs reported in [10].
- the type of crossover operator used in the GA system did not have a major effect on the performance of the system, except in the Uniform Crossover GA implementation using the LFSR based RNG. This configuration did not perform as expected, probably because of the initial population or the random bits from the two parent chromosomes that formed the offspring. Apart from this, any crossover module discussed in this work can be used to implement an efficient GA system.
- the performance results for the GA system maximizing the function $f(x) = 2x$ and the function $f(x) = 2x^3 - 45x^2 + 300x$ were good. Both systems were able to find the maximum attainable value of the function in most cases. This demonstrates that GAs are efficient systems for optimizing specific functions.
- generating the new population (for a population size of 16) took 74 clock cycles. On a 1.2 GHz system, this corresponds to a delay of approximately 61.7 ns (61.42 ns if the clock period is rounded to 0.83 ns). This delay could be further decreased by using parallelized GA systems, each producing a subset of the population members.
- the performance results obtained by the GA system for both the 8-bit and 16-bit Verilog and VHDL implementations are comparable to the MATLAB results.
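The generation delay quoted in the conclusions follows from the cycle count and the clock frequency. The short Python sketch below works through the arithmetic; the function and constant names are hypothetical. With the exact clock period the delay is about 61.67 ns, while a period rounded to 0.83 ns gives 61.42 ns.

```python
def generation_delay_ns(cycles, clock_hz):
    """Return the time in nanoseconds taken by `cycles` clock cycles."""
    return cycles / clock_hz * 1e9


CYCLES_PER_GENERATION = 74  # from the text, for a population size of 16
CLOCK_HZ = 1.2e9            # 1.2 GHz, from the text

# Delay using the exact clock period (1 / 1.2 GHz, about 0.8333 ns).
exact_ns = generation_delay_ns(CYCLES_PER_GENERATION, CLOCK_HZ)

# Delay using a clock period rounded to two decimal places (0.83 ns).
rounded_period_ns = CYCLES_PER_GENERATION * 0.83

print(f"exact period:   {exact_ns:.2f} ns")
print(f"rounded period: {rounded_period_ns:.2f} ns")
```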
The modules implemented in this work only deal with binary representations. This work can be extended to include real number representations, which would allow the GA system to be used for a wider range of applications. Such floating point implementations can be realized in accordance with the IEEE 754 floating point standard.
The Verilog and VHDL modules implemented in this work can be synthesized to obtain the gate-level netlist and meet constraints like area, power and speed. These modules can also be used by other researchers for specific applications and further mapped to Application Specific Integrated Circuits and Field Programmable Gate Arrays. 
