This paper describes a new FPGA implementation of a system for evolutionary image filter design. Three parallel search algorithms are compared. An optimal mutation rate and the quality of three pseudo-random number generators are investigated. The efficiency of the proposed system is demonstrated on the problem of removing salt-and-pepper noise with intensity of 5%, 10% and 20% and on designing an edge detector which works with input images corrupted by salt-and-pepper noise.
Introduction
The image filter design problem is often approached by means of evolutionary design techniques. In addition to an optimization of filter coefficients (see, for example, [1] ), evolutionary approaches are applied to find a complete structure of image filters. Sekanina evolved Gaussian noise filters using a variant of Cartesian Genetic Programming in which target filters were composed of simple digital components such as logic gates, adders and comparators [9] . Later, image filters for other types of noise and edge detectors were evolved using the same technique [10] .
In order to speed up the evolutionary design process, an FPGA-based accelerator was proposed [6] . The accelerator uses the so-called virtual reconfigurable circuit to quickly evaluate candidate circuits. The accelerator implements a complete evolvable system in a single FPGA, i.e. the search engine, virtual reconfigurable circuit and fitness calculation unit are implemented as digital circuits using user logic available in the FPGA. This approach has been further developed by many authors [16, 4, 3] .
For applications in the area of embedded systems, Xilinx has introduced PowerPC processors into the families Virtex 2, Virtex 4 and Virtex 5. As illustrated by Glette and Torresen for a two-bit multiplier design problem [2] , the PowerPC processor can be used to implement a flexible search algorithm which can then be more sophisticated and efficient than a hardwired search algorithm. A recent paper [14] described a new FPGA implementation of a system for evolutionary image filter design in which genetic operations are carried out in the PowerPC processor. The main benefit of this architecture is that it allows the user to easily tune the search algorithm for a given problem while keeping the process of evolution on a single chip, i.e. very fast in comparison with a common PC.
This paper deals with an analysis of suitable parameters of various parallel search algorithms (random search, hill climbing and genetic algorithm), an optimal mutation rate and the quality of pseudo-random number generators for this new platform. The problems of interest are (1) removing salt-and-pepper noise with intensity of 5%, 10% and 20% and (2) designing an edge detector which is able to deal with input images corrupted by salt-and-pepper noise. Except for the 5% salt-and-pepper noise, these problems have not so far been approached by means of evolutionary design techniques in the literature.
Evolvable systems in FPGAs
In order to produce an evolvable system, three main components have to be implemented: a genetic unit, an array of reconfigurable elements and a fitness calculation unit.
The FPGA-based implementations of evolvable hardware systems can be divided into two groups: (1) The FPGA serves for the fitness calculation only. The evolutionary algorithm (which is usually executed on a personal computer) sends configuration bitstreams representing candidate circuits to the FPGA in order to obtain their fitness values. (2) The entire evolvable system is implemented in an FPGA. Instead of using ICAP, virtual reconfigurable circuits (VRC) have been used for evolvable hardware in recent years [3, 2, 6, 8] . The VRC is a second configurable layer developed on top of an FPGA in order to obtain a fast reconfiguration scheme and application-specific programmable elements. While its implementation cost is relatively high, it allows the chromosome of the evolutionary algorithm (EA) to be connected directly to the configuration memory of the reconfigurable array.
The problem domain determines the type and number of reconfigurable elements. In some cases the evolutionary design is performed directly with reconfigurable cells of an FPGA [11, 13] ; in other cases a kind of VRC is applied [3, 16, 2, 6, 8] . An evolutionary optimization of coefficients stored in registers represents the simplest example [12] . The EA and fitness calculation unit can be implemented either as an application-specific circuit [12, 6, 8] or as a program. This program runs either in a personal computer [11] or in an embedded processor which is integrated into the FPGA. The embedded processor is typically available as a hard core (e.g., PowerPC in Virtex II Pro FPGA [2] ) or as a soft core (e.g., the MicroBlaze core [13] ).
Proposed architecture

Image filters in VRC
The proposed architecture was described in [14] . Every image operator is considered as a digital circuit with nine 8-bit inputs and a single 8-bit output, which processes grayscale (8 bits/pixel) images (see Fig. 1 ). Fig. 2 shows a corresponding VRC which consists of 2-input Configurable Logic Blocks (CFBs), denoted as E_i, placed in a grid of 8 columns and 4 rows. Any input of each CFB may be connected either to a primary circuit input or to the output of a CFB which is placed anywhere in the preceding column. Any CFB can be programmed to implement one of the functions given in Table 1 . All these functions operate on 8-bit operands and produce 8-bit results. These functions were recognized as useful for this task in [10] . The reconfiguration is performed column by column.
Figure 2. VRC for image filter evolution
The computation is pipelined; a column of CFBs represents a stage of the pipeline. Registers (denoted D) are inserted between the columns in order to synchronize the input pixels with CFB outputs. The configuration bitstream of the VRC, which is stored in a register array conf reg, consists of 384 bits. A single CFB is configured by 12 bits: 4 bits select the connection of each of its two inputs and 4 bits select one of the 16 functions (32 CFBs × 12 bits = 384 bits). The evolutionary algorithm operates directly on configurations of the VRC; a configuration is simply considered as a chromosome.
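The chromosome layout described above can be illustrated with a short Python sketch. The ordering of the fields inside each 12-bit group (first input, second input, function code), the placement of CFB k at bits 12k to 12k+11, and the names `decode`, `in_a`, `in_b` and `func` are assumptions made for illustration only; the source does not specify the bit order.

```python
import random

COLS, ROWS = 8, 4                  # VRC grid dimensions given in the text
BITS_PER_CFB = 12                  # 4 + 4 bits for the two inputs, 4 bits for the function
CHROMOSOME_BITS = COLS * ROWS * BITS_PER_CFB   # 384 bits in total


def decode(chromosome):
    """Split a 384-bit integer into one (in_a, in_b, func) triple per CFB.

    Assumed layout (not from the paper): CFB k occupies bits [12k, 12k + 12),
    with in_a in the lowest nibble, then in_b, then the function code.
    """
    cfbs = []
    for k in range(COLS * ROWS):
        field = (chromosome >> (k * BITS_PER_CFB)) & 0xFFF
        in_a = field & 0xF
        in_b = (field >> 4) & 0xF
        func = (field >> 8) & 0xF
        cfbs.append((in_a, in_b, func))
    return cfbs


# A random candidate configuration and its decoded genes
chromosome = random.getrandbits(CHROMOSOME_BITS)
genes = decode(chromosome)
```

With nine primary inputs and four CFBs in the preceding column, each CFB input has at most 13 possible sources, which fits into the 4-bit selector.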
Table 1. Functions implemented in each CFB
code function description code function description 0 255
Search Algorithm
The proposed system allows the use of various parallel search algorithms. The algorithms that we tested are described in Section 5. These algorithms utilize a population of candidate solutions and a single genetic operator, mutation, which inverts k bits of the chromosome (i.e. of the configuration). No crossover operator is employed because it is currently unknown how to design one that is more efficient than the mutation operator. The PowerPC processor implements the genetic operations.
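As a minimal sketch of the mutation operator, the following Python function inverts k bits of a 384-bit chromosome held as an integer. Choosing k distinct positions uniformly at random is an assumption; the source only states that k bits are inverted. The name `mutate` is hypothetical.

```python
import random

CHROMOSOME_BITS = 384   # length of the VRC configuration


def mutate(chromosome, k, rng=random):
    """Return a copy of the chromosome with exactly k distinct bits inverted.

    Assumption: bit positions are drawn uniformly without repetition.
    """
    for pos in rng.sample(range(CHROMOSOME_BITS), k):
        chromosome ^= 1 << pos
    return chromosome


parent = random.getrandbits(CHROMOSOME_BITS)
child = mutate(parent, 20)
```

Because the positions are distinct, the parent and the child differ in exactly k bits, so the Hamming distance between them equals the mutation parameter.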
Fitness Calculation
The fitness calculation is carried out by the Fitness Unit (FU). The pixels of the corrupted image u are loaded from the external SRAM1 memory and forwarded to the inputs of the VRC. Pixels of the filtered image v are sent back to the Fitness Unit, where they are compared with the pixels of the original image w, which is stored in another external memory, SRAM2. The filtered image is simultaneously stored into the third external memory, SRAM3. The design objective is to minimize the difference between the filtered image and the original image, i.e. the fitness value is calculated for an M × N-pixel image (note that border pixels are ignored) as
$$\mathit{fitness} = \sum_{i=1}^{M-2} \sum_{j=1}^{N-2} |v(i,j) - w(i,j)|.$$
As the 3 × 3 pixels of the image window are not stored at neighboring addresses of the SRAM, the hardware implementation of the fitness unit utilizes three first-in-first-out row buffers, special addressing circuits and comparators to extract the filtering window from memory. The FU can be considered as an extension of the VRC pipeline. Hence, in each clock cycle, a temporary fitness value is updated by a new pixel difference.
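A software model of the fitness calculation may help to make the objective concrete. The sketch below assumes that the "difference" is the sum of absolute pixel differences over all non-border pixels, and that the mean difference per pixel (mdpp) used later in the paper is this sum divided by the number of non-border pixels; both are assumptions, as are the function names.

```python
def fitness(v, w):
    """Sum of absolute differences between filtered image v and original w.

    Images are lists of rows of 8-bit integers; the one-pixel border is
    skipped because a 3 x 3 window is undefined there.
    """
    rows, cols = len(w), len(w[0])
    return sum(abs(v[i][j] - w[i][j])
               for i in range(1, rows - 1)
               for j in range(1, cols - 1))


def mdpp(v, w):
    """Mean difference per pixel over the interior (assumed normalisation)."""
    rows, cols = len(w), len(w[0])
    return fitness(v, w) / ((rows - 2) * (cols - 2))


# Tiny 4 x 4 example: one interior pixel differs by 3, one border pixel differs
w = [[10] * 4 for _ in range(4)]
v = [[10] * 4 for _ in range(4)]
v[1][1] = 13    # interior pixel, counted
v[0][0] = 255   # border pixel, ignored
```

In hardware the same sum is accumulated one pixel difference per clock cycle, whereas this model simply iterates over the stored images.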
Top level entity
As Fig. 3 shows, the proposed architecture (except the SRAM memories) is completely implemented in a single FPGA. All components (except the VRC) are connected to the LocalBus which is attached to the FPGA via a PCI bus. Now it remains to describe the Control Unit (CU), Processor and Memory Interface (PMI) and the PowerPC integration into the system.
In order to maximize the overall performance, the CU plays the role of master and controls the entire system. In particular, it starts/stops the evolution, determines the number of generations and other parameters of the search algorithm and generates control signals for the remaining components. The PowerPC generates a new candidate individual only upon request; otherwise it is idle in its main loop. The instruction memory of the PowerPC is implemented using on-chip Block RAM (BRAM) memories and connected to the LocalBus in order to send/read programs to/from an external PC. However, since our program is short, it can be stored completely in the instruction cache.
The population of candidate configurations is stored in on-chip BRAM memories. The population memory is divided into banks; each of them contains a single configuration bitstream of VRC. An additional bit (associated with every bank) determines data validity; only valid configurations can be evaluated. In order to overlap the evaluation of a candidate configuration with generating a new candidate configuration, at least two memory banks have to be utilized. While a circuit is evaluated, a new candidate configuration is generated. A new configuration is used immediately after completing the evaluation of the previous circuit. If b banks are utilized, the PowerPC processor has b-times more time to generate a new candidate circuit (i.e. EA can be more complicated). The proposed implementation utilizes eight banks.
The PMI component consists of two subcomponents working concurrently. The first subcomponent, controlled by the CU, reconfigures the VRC using configurations stored in the population memory. The second subcomponent is responsible for sending the fitness value to the PowerPC processor. This process is controlled by the FU. The PMI component also provides an interface to the population memory via LocalBus. The evaluation works as follows:
1. When a valid configuration is available, the CU initiates the reconfiguration of VRC. This process is controlled by PMI.
2. As soon as the first column of CFBs has been reconfigured, CU initiates the fitness calculation process performed by the FU.
3. When the last column of CFBs has been reconfigured, a corresponding memory bank is invalidated and the bank counter is incremented.
4. Three clock cycles before the end of evaluation the FU indicates the forthcoming end of evaluation.
5. The CU initiates a new configuration of VRC and repeats the sequence 1-4 again.
6. As soon as the fitness value is valid, it is sent (together with a corresponding bank number) to the PowerPC. An interrupt (IRQ) is generated to activate a service routine of the PowerPC. In this routine, a new candidate configuration is generated for the given bank. The PowerPC processor acknowledges the interrupt (IRQACK) and sets up the validity bit.
These steps are pipelined in such a manner that there are no idle clock cycles. Therefore, the time of evolution can be expressed as
$$t = \frac{Q \cdot N \cdot M}{f},$$
where Q is the number of evaluations, N × M is the number of pixels and f is the operation frequency.
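The timing relation can be checked numerically. The sketch below assumes that one pixel is processed per clock cycle with no idle cycles, so that Q evaluations of an N × M image take Q·N·M/f seconds; the function name `evolution_time` is hypothetical.

```python
def evolution_time(q, n, m, f_hz):
    """Seconds needed for q evaluations of an n x m image.

    Assumption: one pixel per clock cycle and no idle cycles.
    """
    return q * n * m / f_hz


F_VRC = 50e6     # clock of the VRC and fitness unit reported in the paper (50 MHz)
N = M = 128      # training image size used in the experiments

evals_per_second = 1 / evolution_time(1, N, M, F_VRC)
time_for_100k = evolution_time(100_000, N, M, F_VRC)
```

Under these assumptions a 128 × 128 image at 50 MHz yields roughly 3,050 evaluations per second, which agrees with the approximately 3,000 evaluations per second reported for the implementation, and 100,000 evaluations take about 33 seconds.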
Results of synthesis
In order to implement the proposed system, we used a COMBO6X card equipped with Virtex II Pro 2VP50ff1517 FPGA [5] . Results of synthesis are summarized in Table 2 . While the PowerPC works at 300 MHz, the logic supporting the PowerPC works at 150 MHz. The remaining FPGA logic (including VRC and FU) works at 50 MHz. Experimental results show that approximately 3,000 candidate filters can be evaluated per second (N = M = 128). 
Description of search algorithms
Three parallel search algorithms are evaluated: a random search, a hill-climbing algorithm and a genetic algorithm.
Random search (RS): This algorithm operates with p individuals that are generated randomly at the beginning of the evolution. Then, from each parent, an offspring is created using a bit-mutation operator and evaluated. If the offspring is equal to or better than its parent, the offspring replaces the parent in the new population. In fact, p standard random search algorithms run in parallel. This algorithm was implemented in [6] as a special circuit. Fig. 4 shows concurrent operations of several processes running in hardware and the PowerPC processor (including the configuration of the VRC, evaluation of candidate filters and generation of candidate configurations). These processes are synchronized in such a way that no clock cycle is lost while waiting for resources. Note that only two banks are considered in this example.
Hill Climbing search (HC): This algorithm operates with p individuals that are generated randomly at the beginning of the evolution. After their evaluation, r offspring configurations are generated for each parent using a bit-mutation operator. The best of the r offspring replaces the corresponding parent, but only if its fitness value is equal to or better than the parent's fitness value. Again, in fact, p standard hill climbing algorithms run in parallel.
Genetic algorithm (GA): The initial population of p individuals is generated randomly. Then, r offspring are generated from each parent using a bit-mutation operator. A new population consisting of p individuals is formed from the p parents and their p·r offspring. We used a deterministic selection in which the p best-scoring individuals are selected as new parents.
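The three algorithms can be sketched in software as follows. This is an illustrative model only, not the PowerPC implementation: the image-based fitness is replaced by a toy fitness (Hamming distance to a hypothetical target configuration `TARGET`), and all function names are assumptions. RS is written as HC with r = 1, which matches the description above.

```python
import random

BITS = 384
TARGET = random.Random(0).getrandbits(BITS)   # hypothetical optimum for the toy fitness


def toy_fitness(ch):
    """Stand-in for the image fitness: Hamming distance to TARGET (lower is better)."""
    return bin(ch ^ TARGET).count("1")


def mutate(ch, k, rng):
    """Invert k distinct, randomly chosen bits."""
    for pos in rng.sample(range(BITS), k):
        ch ^= 1 << pos
    return ch


def hill_climbing(fit, p, r, k, generations, rng):
    """p independent climbers; the best of r offspring replaces its parent if not worse."""
    pop = [rng.getrandbits(BITS) for _ in range(p)]
    score = [fit(c) for c in pop]
    for _ in range(generations):
        for i in range(p):
            kids = [mutate(pop[i], k, rng) for _ in range(r)]
            s, best = min((fit(c), c) for c in kids)
            if s <= score[i]:
                pop[i], score[i] = best, s
    return min(score)


def random_search(fit, p, k, generations, rng):
    """RS as described in the text: one offspring per parent, i.e. HC with r = 1."""
    return hill_climbing(fit, p, 1, k, generations, rng)


def genetic_algorithm(fit, p, r, k, generations, rng):
    """Mutation-only GA: the p best of the parents and their p*r offspring survive."""
    pop = [(fit(c), c) for c in [rng.getrandbits(BITS) for _ in range(p)]]
    for _ in range(generations):
        offspring = []
        for _, parent in pop:
            for _ in range(r):
                child = mutate(parent, k, rng)
                offspring.append((fit(child), child))
        pop = sorted(pop + offspring)[:p]   # deterministic truncation selection
    return min(s for s, _ in pop)


# Same evaluation budget for all three: RS runs twice as many generations
budget = 4000
rs = random_search(toy_fitness, 8, 20, budget // 8, random.Random(1))
hc = hill_climbing(toy_fitness, 8, 2, 20, budget // 16, random.Random(1))
ga = genetic_algorithm(toy_fitness, 8, 2, 20, budget // 16, random.Random(1))
```

The only structural difference between HC and the GA in this sketch is the scope of selection: HC compares each parent with its own offspring, while the GA selects across the whole population, so a single good lineage can take over all p slots.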
Experimental results
Experiments were arranged to find a suitable mutation rate and an efficient pseudo-random number generator. We also compared the three search algorithms. The objective was to (1) remove salt-and-pepper noise with intensity of 5%, 10% and 20% from real-world images and (2) design an edge detector which is able to deal with input images corrupted by salt-and-pepper noise. The visual quality of filtered images is expressed in mdpp, which stands for the mean difference per pixel between the filtered image and the original image.
The mutation rate
Our strategy is to estimate a suitable mutation rate using a small number of evaluations (fewer than 100,000) and then to utilize the discovered mutation rate in long-running experiments. Figure 5 shows the average mdpp calculated from the best mdpp values at the end of 32 independent runs of the RS algorithm (p = 8) for each of k = 1 to 127 inverted bits in the chromosome. Two methods are used: either exactly k bits are always inverted (denoted as "fix" in Fig. 5 ), or a randomly chosen number of bits, limited by k, is inverted (denoted as "rnd" in Fig. 5 ).
We can observe in Figure 5 that the mutation rate which minimizes the mdpp is usually 20 bits per chromosome, i.e. 5.2%. It is also more efficient to invert exactly 20 bits than to randomly generate a number from the interval 1 to 20.
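The reported rate follows from the chromosome length: 20 of 384 bits is about 5.2%. The sketch below also models the "rnd" variant, assuming that the number of inverted bits is drawn uniformly from 1 to k; the distribution and the name `mutate_rnd` are assumptions, not details from the paper.

```python
import random

BITS = 384


def mutate_rnd(ch, k, rng):
    """'rnd' variant: invert a randomly chosen number of bits between 1 and k.

    Assumption: the count is uniform on 1..k and positions are distinct.
    """
    n = rng.randint(1, k)
    for pos in rng.sample(range(BITS), n):
        ch ^= 1 << pos
    return ch


rate_percent = 100 * 20 / BITS          # mutation rate for k = 20
mean_bits_rnd = (1 + 20) / 2            # expected flips of "rnd" under the uniform assumption

rng = random.Random(7)
flips = bin(mutate_rnd(0, 20, rng)).count("1")
```

Under the uniform assumption, the "rnd" method with k = 20 inverts 10.5 bits on average, so at the same k the two methods apply noticeably different effective mutation rates.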
Pseudorandom number generators
As the outputs of pseudorandom number generators (PRNG) only approximate some of the properties of random numbers, we have to determine a suitable one for the proposed architecture. The following three PRNGs were evaluated:
Linear congruential generators are among the oldest and best-known pseudorandom number generator algorithms. It is, however, well known that the properties of this class of generators are far from ideal. The applied linear congruential generator operates according to the general recurrence
$$x_{n+1} = (a\,x_n + c) \bmod m$$
with multiplier a, increment c and modulus m.
Linear Feedback Shift Register (LFSR) is a shift register whose input bit is driven by the exclusive-or (xor) of some bits of the overall shift register value. As the output bits of this PRNG are also known to be poorly distributed, we used a parallel LFSR consisting of 32 independent and different LFSRs seeded identically.
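The first two generator types can be sketched in a few lines of Python. The constants below are illustrative and are not taken from the paper: the linear congruential generator uses a widely published 32-bit parameter set (a = 1664525, c = 1013904223, m = 2^32), and the LFSR is a single 16-bit Fibonacci register with taps 16, 14, 13 and 11, a commonly cited maximal-length configuration. The paper's parallel LFSR, built from 32 different registers, is not reproduced here.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Linear congruential generator x <- (a*x + c) mod m.

    The constants are illustrative defaults, not the ones used in the paper.
    """
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def lfsr16(state=0xACE1):
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (period 2**16 - 1).

    The new input bit is the xor of the tapped bits; the register shifts right.
    """
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state


gen = lcg(0)
first = next(gen)            # (a*0 + c) mod m, i.e. the increment c

# Measure the LFSR period by waiting for the seed state to reappear
period = 0
for s in lfsr16():
    period += 1
    if s == 0xACE1:
        break
```

Python's standard `random` module is itself based on the Mersenne Twister, so it can serve as a software reference when comparing against the third generator type.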
Mersenne Twister algorithm is a twisted generalized feedback shift register that avoids many of the problems with earlier generators. It has the colossal period of 2^19937 − 1 iterations and is proven to be equidistributed in (up to) 623 dimensions (for 32-bit values). A standard implementation of Mersenne Twister was utilized [7] .
Figure 6 shows the average mdpp and corresponding standard deviations obtained from 40 independent runs (after 12,288 evaluations in each run) using the RS algorithm (p = 8, "fix" mutation applied on 20 bits). The three generators are compared on two problems: removing 10% salt-and-pepper noise from the Lena image (the first image shows mdpp, the second image shows a standard deviation) and edge detector design (the third image shows mdpp, the fourth image shows a standard deviation). Surprisingly, there are no significant differences in the quality of the obtained results.
Comparison of search algorithms
Table 3 provides the parameters of the experiments arranged in order to compare the three algorithms: RS (p = 8), HC (p = 8, r = 2) and GA (p = 8, r = 2). As a training image we used a 128 × 128-pixel version of the Lena image (XLena) which contains a given type of noise in some regions. Table 4 summarizes the obtained results. We can observe that while the best average mdpp is always obtained by means of the RS algorithm, the GA always produces the filter with the smallest mdpp overall. Recall that the number of evaluations is identical; however, RS always produces more generations than the GA. Figures 7 and 8 give examples of images filtered using the best-evolved filters. Table 5 compares the mdpp of the best-evolved filters and conventional filters (median and Sobel operator) on a set of 256 × 256-pixel test images.
Discussion
The proposed FPGA implementation of image filter evolution evaluates about 3,000 candidate filters per second and can generate a solution approx. 22 times faster than a PC with a 2.4 GHz Celeron. This allows us to perform a detailed evaluation of various aspects of the algorithms used in a reasonable time. Experimental results show that the use of a parallel RS algorithm is a good choice in this case; the RS produces the best results on average. However, we compared the algorithms at an equal number of evaluations. If the number of generations were compared instead, the GA would find a suitable solution much faster than the RS. Therefore, there is a tradeoff between the number of generations and the number of evaluations. If an average solution is required, it is better to run the RS, which provides an average filter quickly. However, if a perfect filter is a must and more generations can be produced, the GA should be utilized. In comparison with [6] , the proposed implementation requires an almost identical amount of logic on the chip. In addition, the PowerPC processor is employed. However, the proposed solution offers the possibility to easily change the search algorithm, which is impossible in the former one.
The images filtered by evolved filters are not as smudged as the images filtered by median filters. Moreover, evolved filters occupy only approx. 70% of the area needed to implement the median filter on the same FPGA. The use of various pseudo-random generators has no effect on the quality of evolved filters and speed of evolution. This result is also surprising. Is there any relation to the fact that the parallel random search exhibits the best performance? This is an open question for future research.
Conclusions
We evolved image filters for three types of noise which were not approached by means of evolutionary design so far: the salt-and-pepper noise with intensity of 10% and 20% and 5%-salt-and-pepper noise existing in the image in which edges should be detected. Evolved filters are at least comparable with a conventional solution which is based on the median filter. As Fig. 8bf shows, in contrast to evolved filters, images filtered by the median filter are smudged. The proposed platform can be considered as an efficient "designer" of image filters which can be utilized in sophisticated filtering schemes for real-world applications. 
