A hardware/software platform for intrinsic evolvable hardware is designed and evaluated for digital circuit design and repair on Xilinx Virtex II Pro Field Programmable Gate Arrays (FPGAs). Dynamic bitstream compilation for mutation and crossover operators is achieved by directly manipulating the bitstream using a layered framework. Experimental results on a case study show that benchmark circuit evolution from an unseeded initial population, as well as complete recovery from a stuck-at fault, is achievable using this platform. An average of 0.47 microseconds is required to perform the genetic mutation, 4.2 microseconds to perform the single-point conventional crossover, 3.1 microseconds to perform Partial Match Crossover (PMX) as well as Order Crossover (OX), 2.8 microseconds to perform Cycle Crossover (CX), and 1.1 milliseconds for the intrinsic evaluation of one input pattern. These figures represent a performance advantage of three orders of magnitude over the JBits software framework and more than seven orders of magnitude over the Xilinx design-tool-driven flow for realizing intrinsic genetic operators on a Virtex II Pro device.
INTRODUCTION
Intrinsic evolutionary approaches such as Genetic Algorithms (GAs) appear throughout the literature as a means to realize design and repair strategies on hardware-in-the-loop FPGA-based digital systems [1] [2] [3]. GAs realize search strategies based on Darwinian evolution principles by performing genetic operations such as mutation and crossover. Several variations of GAs have been introduced to enhance the performance and speed of convergence to a solution for FPGA-based systems [4]; however, many of these realizations employ software-in-the-loop simulations rather than intrinsic implementations in the FPGA fabric. Challenges of realizing practical intrinsic evolutionary strategies include the mapping of the genotype in the GA onto its corresponding phenotype on the fabric, and the limited control over automating the alteration and download of safe bitstreams onto the device. These issues are exacerbated when critical portions of the bitstream representation are proprietary.
Only a handful of intrinsic evolution platforms have been proposed throughout the literature.
However, these platforms are still inadequate: they either support only coarse-granularity evolution, which limits capability and flexibility, or they entail a large resource overhead to work around the reconfiguration limitations, which leads to relatively high area and power budgets that might not be tolerable in highly constrained applications such as space mission systems.
In this paper, an approach is presented that provides a fast hardware/software interface between the GA and the FPGA device via a straightforward data structure and Application Programming Interfaces (APIs). A layered design is used to perform mapping operations at the finest granularity directly on the bitstream to modify LookUp Table (LUT) [...]

The remainder of the paper is organized as follows: Section 2 provides an overview of related work. Section 3 introduces the platform design. Section 4 discusses the experimental design and results, and Section 5 concludes the paper and suggests a direction for future work.
PREVIOUS WORK
There are two paradigms for implementing GAs in reconfigurable applications: Extrinsic Evolution via functional models that abstract the physical aspects of the real device, and Intrinsic Evolution on the actual devices. Extrinsic approaches simplify the evolution process because they operate on software models of the FPGAs. However, for applications such as in-situ fault handling on deep space missions, not all fault types can be readily accommodated with software models.
Additionally, abstracting the physical aspects of the target device complicates rendering the final designs into actual on-board circuits. For instance, properties such as the routability of the design cannot be ensured until the final stages of the configuration process. Furthermore, fitness evaluation on hardware usually requires less time than software simulation, so intrinsic evolution is mostly favored for its higher performance and scalability as an efficient approach to realizing physical designs in critical systems.
Several previous research efforts have addressed intrinsic evolution. A successful attempt on Field Programmable Transistor Array (FPTA) chips was carried out in [3]. The authors proposed new ideas for long-term hardware reliability using evolvable hardware techniques via an evolutionary design tool named EHWPack, which facilitates intrinsic evolution by combining the PGAPack genetic engine with a LabVIEW test-bed running on a UNIX workstation. They were able to intrinsically evolve a digital XNOR gate on two connected FPTA boards. In this paper, we target FPGAs rather than FPTAs, specifically the popular Xilinx Virtex II Pro device.
Miller, Thomson, and Fogarty [2] previously addressed the importance of direct evolution on Xilinx 6216 FPGA devices; the research explored the effect of the device's physical constraints on evolving digital circuits. A mapping between the representation genotype and the device phenotype was proposed; however, no implementation details were presented. Hollingworth, Smith, and Tyrrell developed an intrinsic evolution platform for a 2-bit adder on a Xilinx FPGA with partial reconfiguration to improve evolution time [5]. However, they used the JBits interface for run-time reconfiguration. JBits is Java-based and interpreted, so it can face scalability and performance issues; it is also no longer supported.
Another way to achieve online reconfigurability is proposed by Upegui, Peña-Reyes, and Sanchez in [6]. In this approach, the system is divided into sub-modules, and several different partial reconfiguration bitstreams are generated in advance for each module using the Xilinx Module Based Partial Reconfiguration flow. A GA then combines the partial bitstreams that perform the required task best, whether optimally or sub-optimally. This simulated approach is constrained by the limited number of possible combinations generated beforehand. Furthermore, its coarse granularity makes it suitable only for applications where the system can be divided into well-defined modules with fixed interfaces, such as the neural network use case discussed by the authors.
A promising technique called the Virtual Reconfigurable Circuit (VRC) method was proposed by Sekanina in [7] and [8] and also in a similar work by Glette and Torresen [9] . This method does not change the bitstream of the FPGA itself, but rather changes the register values of a reconfigurable circuit already implemented on the FPGA, and obtains virtual reconfigurability.
Although this method provides online reconfigurability, it incurs a very high area and power overhead and, from a fault tolerance point of view, could increase the number of elements that can fail. Moreover, these schemes implement phenotype abstraction by predefining several functions that can be performed by a computational cell. Although this abstraction has shown a benefit in convergence time in some cases [10], it incurs mapping overhead and constrains flexibility, which limits the search space and does not fully exploit the hardware capability.
In several previous works [11] [12] [13], methodologies are proposed to enable runtime FPGA reconfiguration while keeping the Xilinx CAD tools out of the loop to achieve smaller reconfiguration delays. Such approaches can be used as platforms for achieving tractable intrinsic evolution.
In a previous work, a Multilayer Runtime Reconfiguration Architecture (MRRA) was developed for Autonomous Runtime Partial Reconfiguration of FPGA devices [14]. The tool comprises three layers, namely the Logic, Translation, and Reconfiguration layers, with well-defined interfaces for modularity and reuse. In addition, a standard set of Application Programming Interfaces (APIs) was utilized for communication with the target device. Results showed the ability of the framework to support autonomous and dynamic reconfiguration operations. The MRRA was subsequently extended to support two basic genetic operators [15]. In this paper, it is extended further to support five genetic operators, namely single-point conventional crossover, Partial Match Crossover (PMX) [16], Order Crossover (OX) [16, 17], Cycle Crossover (CX) [16, 18], and genetic mutation, in order to directly realize intrinsic evolution on Xilinx Virtex II Pro devices. All five genetic operators are evaluated experimentally, and the results are compared for their ability to achieve fault repair in a number of fault handling scenarios, as discussed in the following sections.
JTAG-DRIVEN PLATFORM
The developed platform consists of MRRA components that reside on the FPGA chip and software components on the host PC. These are, however, developed as layered modules that can be readily migrated to the on-chip PowerPC in later phases of this research. The main elements of the platform are shown in Fig. 1, containing components as follows:

PerformCrossover: Performs a probability-driven single-point genetic crossover on the two parent chromosomes. The crossover point is assigned for both parents by a random number generator. The resulting offspring is returned to the calling GA Engine.
PerformPMX: Performs a probability-driven two-point genetic partially matched crossover (PMX) on the two parent chromosomes. Crossover points are randomly assigned for both parents according to a random number generator. The offspring inherits the chromosomal section between the two crossover points (Matching Section) from one parent and the rest of the chromosomal content is inherited from the other parent. The inheritance from the second parent is done in such a way that prevents any duplication of the same genetic material as shown in the example in Fig. 2 .
In this example, the rectangles in each chromosome represent the individual fields of the FPGA LUTs, and the number inside each rectangle denotes the logic configuration (the bit content that the LUT holds) assigned to that LUT. This number is assigned to the initial configuration of each LUT in order to keep track of that configuration during the evolution process and avoid its duplication.
The PMX operator was originally designed for solving permutation problems such as the well-known Traveling Salesman Problem (TSP) [18, 20]. The cities in the TSP are analogous to the LUTs in this problem. Hence, the PMX operator reorders the different configurations among the LUTs without duplicating the same configuration on multiple LUTs. This operator preserves more of the chromosome's genetic material than the conventional crossover, and therefore may restore functionality faster by simply assigning the original configuration of a faulty LUT to another unused one. This is especially true when a higher routing capability is achieved. The resulting offspring is returned to the calling GA Engine.

EvaluateInput: Receives the test input pattern from the GA Engine. The input pattern is then applied to the on-chip circuit via the GNAT module. Once the output is evaluated, the Chromosome Manipulator module reads it and sends it back to the GA Engine for fitness assessment.
In summary, the Chromosome Manipulator layer provides a logical abstraction of genetic operators to the GA Engine module. This facilitates the integration of any GA at the top layer by making the hardware implementation details transparent.
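The two crossover styles described above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the function names, the list-of-labels chromosome encoding, and the explicit cut-point arguments are assumptions made for illustration, and the PMX variant shown is the common mapping-based textbook formulation.

```python
import random


def single_point_crossover(parent_a, parent_b, cut):
    """Swap the tails of two equal-length chromosomes at index `cut`."""
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def pmx(parent_a, parent_b, lo, hi):
    """Partially matched crossover on permutations of LUT labels.

    The child inherits the matching section parent_a[lo:hi] unchanged and
    fills every other position from parent_b, following the section's
    label mapping so that no label appears twice.
    """
    section = set(parent_a[lo:hi])
    # Maps a label in parent_a's section to the label parent_b holds there.
    mapping = {parent_a[i]: parent_b[i] for i in range(lo, hi)}
    child = list(parent_a)
    for i in list(range(lo)) + list(range(hi, len(parent_a))):
        gene = parent_b[i]
        while gene in section:  # already placed by the matching section
            gene = mapping[gene]
        child[i] = gene
    return child


def random_pmx(parent_a, parent_b, rng):
    """Pick two distinct random crossover points, then apply PMX."""
    lo, hi = sorted(rng.sample(range(len(parent_a) + 1), 2))
    return pmx(parent_a, parent_b, lo, hi)


if __name__ == "__main__":
    rng = random.Random(1)
    a = list(range(8))          # hypothetical labels for 8 LUT configurations
    b = [7, 6, 5, 4, 3, 2, 1, 0]
    print(single_point_crossover(a, b, rng.randrange(1, len(a))))
    print(random_pmx(a, b, rng))
```

Single-point crossover may copy the same label into an offspring twice, whereas PMX always yields a permutation of the parents' labels, which is the property the text relies on when it moves a configuration to an unused LUT.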
MRRA: This platform, developed by our team, is a Multilayer Runtime Reconfiguration Architecture for Autonomous Runtime Partial Reconfiguration of FPGA devices [17].

MRRA Bitstream File: In the developed platform, an initial pre-compiled bitstream is generated using the Xilinx CAD tools. It contains the interconnected LUTs to be configured by the platform, either to evolve and realize an original circuit (Design) or to restore the functionality sought (Repair). The platform then manipulates this bitstream file to carry out the physical mapping of the crossover or mutation performed on the genotype representation.
The task-flow of the platform is divided into three phases:
Initialization:
The initialization process aims at obtaining the configuration from the baseline bitstream file, which has been manually designed using the Xilinx CAD tools. As depicted in [...]

On the other hand, the Pattern Evaluation phase starts by sending the input patterns serially to the FPGA chip via the JTAG port at the JTAG clock frequency. The GNAT module then regroups the serial bits of each input and applies them to the corresponding input ports of the circuit. Once the circuit's output is available at the output ports, the GNAT sends it back to the MRRA via JTAG, which then passes it to the GA via the Chromosome Manipulator layer.
Since the genetic operations in the developed platform are performed directly on the binary content of the bitstream, the mutation operation takes only about 0.4 microseconds and the crossover operation about 4 microseconds. Hence, the main performance bottleneck is the communication channel with the FPGA chip.
INTRINSIC EVOLUTION CASE STUDY

Experimental Design
The circuit used to demonstrate the platform workflow is a 4-bit adder. It provides a tractable circuit for the GA to evolve while exhibiting characteristics of larger arithmetic circuits, including a variable amount of redundancy and combinational logic behavior. The GA parameters used throughout the experiments are shown in Table 1. A total of 8 LUTs were used in the design experiments. This number was increased to 13 LUTs in the repair experiment to add a redundancy margin for the GA to evolve within. All GA parameters were obtained by running extrinsic evolution of the GA and identifying the best-performing values. The table shows the range of tested values for each parameter along with the optimal one. Population sizes between 5 and 20 were evaluated, and the best results were achieved with a population size of 10. Crossover rates in the range of 30% to 90% in increments of 10% were evaluated, indicating that the GA performed well when the value was near 60%. Therefore, a rate of 60% was used for the four types of crossover in the experiments: single-point crossover, PMX, OX, and CX. Similar analysis was used to determine baseline values for the other parameters summarized in Table 1. Three types of experiments were performed, as follows:
Unseeded Design: In this experiment, the GA evolved the 4-bit adder circuit from a randomly generated initial population only. The purpose of this experiment is to demonstrate the capability to intrinsically evolve 100% functional circuits starting from a random bitstream. A baseline bitstream was generated manually using Xilinx ISE Project Navigator. This bitstream contains the 8 interconnected LUTs on which the circuit is to be evolved, along with the GNAT core connected to the JTAG component.
Seeded Design:
In this experiment, the GA evolved the 4-bit adder circuit starting with a population of partially functional seeded individuals in addition to completely random ones. The partially functional seeds were originally fully functional designs that were tampered with by deliberately applying the mutation operator to them. This arrangement emulates a fault scenario in real-life avionics in which the configuration bitstream is partially affected by a Single Event Upset (SEU) due to a radiation burst or any other severe environmental event. Typically, scrubbing is used to replace the bitstream with an intact version kept in nonvolatile storage. However, the approach in this experiment could operate even in the event of permanent damage to the underlying fabric.
Repair: A single stuck-at fault was adopted as a case study to show the capability of the platform to repair a faulty circuit. Two issues should be highlighted here:
I. Fault Injection:
Since an actual fault cannot be readily or precisely introduced into the device, the circuit is configured to behave as if the fault actually exists. This is complicated by the fact that the platform allows only functional logic manipulation, without the possibility of altering the device interconnects. Hence, the bitstream was processed directly, before configuring the device, to modify the contents of one LUT so that it behaves as if a stuck-at fault were present.
The LUT in the Virtex II Pro chip is a 16-bit lookup table with four inputs and one output. The fault behavior is induced by programming the content of the 16 bits in a way that yields the faulty behavior being emulated. For example, if the least significant bit (LSB) input pin is stuck-at zero, only the memory locations matching the pattern (XXX0)₂, where X denotes a don't-care bit, will be accessible. This behavior can be achieved by copying the content of the memory locations matching (XXX0)₂ into those matching (XXX1)₂, overwriting their old values, as shown in Fig. 7. Likewise, if the second LSB input pin is stuck-at one, then by the same analysis any reference to (XX0X)₂ should be directed to (XX1X)₂. The same concept extends to the remaining two inputs: the location of the fault determines the stride between the memory locations to copy, and the value of the stuck-at condition (zero or one) determines the direction of the copy operation (left or right), as shown in Fig. 7. Moreover, for faults on the LUT output pin, all LUT memory locations should be filled with zero for stuck-at zero faults and with one for stuck-at one faults.
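The copy rule above can be sketched in a few lines of Python. This is an illustrative model, not the platform's bitstream code: the function names are hypothetical, and the LUT is assumed to be held as a 16-bit integer in which bit i is the output for input address i (the actual bit ordering inside a Virtex II Pro bitstream may differ).

```python
def inject_input_stuck_at(lut, pin, stuck_value):
    """Return LUT contents that emulate input `pin` stuck at `stuck_value`.

    `lut` is a 16-bit integer; bit i is the output for input address i
    (an assumed ordering). `pin` is 0 for the LSB input, 3 for the MSB.
    """
    faulty = 0
    for addr in range(16):
        # Address the LUT would really decode if the pin were stuck.
        if stuck_value:
            seen = addr | (1 << pin)
        else:
            seen = addr & ~(1 << pin)
        # Copy the reachable entry over the unreachable one.
        faulty |= ((lut >> seen) & 1) << addr
    return faulty


def inject_output_stuck_at(stuck_value):
    """A stuck output pin is emulated by filling all 16 entries."""
    return 0xFFFF if stuck_value else 0x0000


if __name__ == "__main__":
    and4 = 0x8000  # 4-input AND: only address 1111 outputs 1
    print(hex(inject_input_stuck_at(and4, pin=0, stuck_value=0)))
    print(hex(inject_input_stuck_at(and4, pin=1, stuck_value=1)))
```

The pin index sets the stride (1 << pin) between source and destination entries, and the stuck value selects which half of each pair is the source, matching the stride and direction described in the text.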
To analyze this fault injection paradigm, let 
II. Degree of Redundancy:
During the initial runs, the GA failed to achieve complete repair.
It turned out that the search space given to the GA was exceedingly narrow; consequently, the GA could not avoid the faulty resource by constructing alternative paths. To remedy this limitation, redundancy was introduced by adding extra unused LUTs to the original design. This was handled within the standard partial reconfiguration flow presented by Xilinx [21], which has module-level granularity and requires each module to be arranged at the slice-column level with a four-slice boundary requirement. A bus macro is also required to establish communication among modules. Besides the restricted flexibility due to the coarse granularity, this module-based partial reconfiguration flow can only be controlled at a very high level during design time. Hence, it depends mostly on the Xilinx tool sets to carry out placement and routing, which may produce erroneous implementations, especially when the size of the partial configuration module requires extensive routing resources.
Direct bitstream manipulation, on the other hand, is intuitively superior to the module-based partial reconfiguration flow in the sense that it is not bound by the aforementioned constraints.
Not only does direct bitstream manipulation provide precise control over the mapping between the genotype and phenotype representations, but it also avoids the lengthy delay introduced by the Xilinx tool sets.
Results
For each of the four crossover operators mentioned above, combined with the mutation operator, five intrinsic evolution runs were completed on the presented platform for each of the three experiments: unseeded design, seeded design, and repair. The GA parameters listed in Table 1 were used. The following aspects were measured to quantify the capability of the platform to carry out the evolution process:

G : This is the total number of generations evolved during the run.
Timing Information: This is the timing information for each run and is divided into four metrics:
: This is the time elapsed to perform the GA crossover and mutation during the entire run.
FE : This is the time elapsed to apply the input patterns and read back the corresponding outputs for all the fitness evaluations during the entire run.
C : This is the average time taken by a single genetic crossover for a certain GA run. The crossover could be a single point conventional crossover, PMX, OX, or CX.
M : This is the average time taken by a single genetic mutation for a certain GA run.
Experimental results are listed in Tables 2 through 5. The results show that the time spent in the GA operators is small compared to the fitness measurement time, which is around one millisecond per pattern evaluation. The JTAG serial port used in this paper imposes a substantial delay: configuring the entire device takes up to 22 seconds using the Xilinx Parallel Cable III, which can be reduced to 1 second using the Xilinx Parallel Cable IV. This overhead can be considerably reduced if other interfaces are used, such as the SelectMap parallel port or the Internal Configuration Access Port (ICAP) in a System on a Chip implementation using the PowerPC.
Device programming time is high for two main reasons. The first is that the JTAG port was used to download the bitstream to the chip. Theoretically, the JTAG interface with the Parallel Cable III has a maximum download speed of 300 Kbps [22]. The data transfer rate measured using JTAG in our experiments was 205 Kbps because of the data transfer overhead between the host PC and the board. With the Parallel Cable IV, which has a maximum download speed of 5 Mbps [22], a data transfer rate of 4.28 Mbps was measured in our experiments, again due to the data transfer overhead between the PC and the board. Alternatively, the SelectMap interface on the Virtex II Pro can operate at a maximum clock speed of 66 MHz, loading one byte per clock cycle, i.e. 528 Mbps [23]. Hence, the device programming time can be as low as 8 milliseconds if SelectMap is used.
The second reason is the large size of the full bitstream file, 548 Kbytes. The partial configuration bitstream file for the 4-bit adder circuit along with the GNAT component is only 80 Kbyte. When this file is used instead of the full configuration bitstream, the device programming time is drastically reduced to 16 milliseconds using JTAG with the Xilinx Parallel Cable IV and to 150 microseconds using the SelectMap interface.
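As a rough cross-check of the full-bitstream figures quoted above, the programming time can be estimated as bitstream size divided by link rate. The sketch below is only back-of-the-envelope arithmetic: it assumes 1 Kbyte = 1000 bytes and ignores any protocol overhead beyond what the measured rates already include, and the names are illustrative.

```python
# Full configuration bitstream size from the text, assuming 1 Kbyte = 1000 bytes.
FULL_BITSTREAM_BITS = 548 * 1000 * 8

# Link rates in bits per second, taken from the text.
RATES_BPS = {
    "JTAG, Parallel Cable III (measured)": 205e3,
    "JTAG, Parallel Cable IV (measured)": 4.28e6,
    "SelectMap (66 MHz x 8 bits, peak)": 66e6 * 8,
}


def programming_time_s(size_bits, rate_bps):
    """Ideal transfer time in seconds for `size_bits` at `rate_bps`."""
    return size_bits / rate_bps


if __name__ == "__main__":
    for name, rate in RATES_BPS.items():
        t = programming_time_s(FULL_BITSTREAM_BITS, rate)
        print(f"{name}: {t:.4f} s")
```

Under these assumptions the estimates come out at roughly 21 seconds, 1 second, and 8 milliseconds, in line with the full-device programming times reported in the text.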
Table 2 lists the timing measurements of the probability-driven single-point crossover and mutation operators for each run. Similarly, Table 3 lists the experimental results of the probability-driven PMX and mutation operators, Table 4 those of the probability-driven OX and mutation operators, and Table 5 those of the probability-driven CX and mutation operators.

The conventional single-point crossover favors the genetic material that yields high fitness and seeks higher-fitness offspring by propagating this material to the next generation regardless of its chromosomal position. The ordering crossover operators (PMX, OX, and CX), in contrast, favor combinations of genetic material placed at particular chromosomal positions that yield high fitness, and seek higher-fitness individuals by propagating those combinations to the offspring. This behavior leads to a finer-grained search, which may increase the time the GA takes to converge to a solution, as can be seen from the experimental results. It has more potential, however, to find higher-quality solutions than the conventional crossover. This explains why the conventional crossover required the fewest generations to reach full fitness among the crossover types in all three experiments: unseeded design, seeded design, and repair.
In order to estimate the robustness and overall performance of each candidate chromosome, fitness evaluation needs to be carried out at the end of each generation. The size of a full set of hardware test vectors is directly related to the total number of input bits of the module under test.
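To make the size of such a test set concrete, the following Python sketch exhaustively scores a candidate 4-bit adder. The fitness definition (fraction of matching output bits), the absence of a carry-in, and all names are assumptions made for illustration; the paper's own fitness function is not reproduced here. The 1.1 ms per-pattern figure is the one quoted in the abstract.

```python
def num_patterns(n_input_bits):
    """Number of distinct input vectors for n binary inputs."""
    return 2 ** n_input_bits


def adder_fitness(candidate, width=4):
    """Fraction of output bits that match a reference adder.

    `candidate(a, b)` models the circuit under test and returns an
    integer sum. Every pair of `width`-bit operands is applied, and
    the width + 1 output bits (sum plus carry-out) are compared.
    """
    correct = 0
    total = 0
    for a in range(2 ** width):
        for b in range(2 ** width):
            expected = a + b
            got = candidate(a, b)
            for bit in range(width + 1):
                correct += ((expected >> bit) & 1) == ((got >> bit) & 1)
                total += 1
    return correct / total


if __name__ == "__main__":
    n = num_patterns(8)           # two 4-bit operands, no carry-in assumed
    print(n, "patterns")
    print(n * 1.1e-3, "s per candidate at 1.1 ms per pattern")
    print(adder_fitness(lambda a, b: a + b))   # ideal adder
    print(adder_fitness(lambda a, b: a ^ b))   # adder with broken carries
```

Because the pattern count doubles with every added input bit, exhaustive intrinsic evaluation is practical for a circuit of this size but would quickly dominate run time for wider datapaths.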
Since each hardware input bit is always either '0' or '1', the total number of possible input vector combinations is 2^n, where n is the number of input bits.

To measure the exact time that the mutation and crossover operations take, another experiment was carried out with the mutation and crossover rates set to 100%, ensuring that the operators are always performed. This allowed the time for each operation to be measured individually.
The results of this experiment, together with those of similar experiments using the Xilinx design-tool-driven flow and JBits, are listed in Table 6. The developed platform achieves an improvement of more than seven orders of magnitude over the Xilinx design-tool-driven flow and three orders of magnitude over JBits. The results also show that the conventional single-point crossover takes the longest among the crossover types, around 4.2 microseconds. The PMX and the OX require equal time, around 3.1 microseconds, while the CX requires the least, around 2.8 microseconds. It is intuitive that the CX operator takes less time than the others, as it has no crossover points to choose and consequently has only one [...]

[...] Chip based version using the PowerPC to execute the genetic algorithm. This is expected to significantly reduce the data transfer time relative to the genetic operator time.
REFERENCES
