Abstract
Introduction
Today, System-on-Chip (SOC) devices target high performance applications in which fast time to market is of prime importance. For this reason SOC platforms are used, which trade off performance, flexibility and time to market. A typical platform combines a processor, memory and peripherals around a standard bus architecture such as AMBA. Specific IP cores can be added to form derivatives targeting specific applications. This procedure reduces design time and increases flexibility.
The correct positioning of Integrated Circuits (ICs) is a major issue for developers of Automatic Optical Inspection systems used for diagnostics of Printed Circuit Boards. Integrated Circuits will not function correctly unless they are placed on the board with a high degree of accuracy [1]. In this work the problem of detecting rotations and displacements of ICs is addressed through the use of a Genetic Algorithm (GA) for Image Registration. The GA encodes a transform for multiple ICs in a chromosome and then performs a simultaneous search for every IC in the image. This is a registration procedure in which the real (captured) image is compared with the reference image: a reference golden board is processed first, and faults are then found on further captured boards. The mechanics of a GA are not described here; see [2] and [3].
The system presented here is an Evolvable Hardware system, with adaptation occurring within software using a GA (this is adaptation at the algorithmic level as opposed to the circuit level). The adaptive algorithm runs on a Unix Solaris host and is targeted at an embedded ARM7 System-on-Chip. Existing work on Evolvable Hardware concentrates on adaptation at the circuit level; an overview of this work is given in [4]. Evolvable hardware is designed to adapt to changes in task requirements or changes in the environment. The application presented in [4] that is closest to the work presented here is the data compression chip for electrophotographic printing. In that application a GA is used to search for a set of optimal templates which are used to reconfigure a hardware prediction mechanism.
The rest of this paper is organised as follows. Section 2 presents the GA and optimisation stages. Section 3 presents execution time statistics for the target implementation. Section 4 presents an enhanced architecture for the system. Section 5 draws some conclusions from the work.
Algorithm and Optimisation Stages
The GA used in this work is described in detail in [1]. Pseudo code for the algorithm is given in Figure 1, which also shows the software task partitioning discussed in sections 3 and 4.
The computational complexity of image processing operators and their associated execution times, within the framework of a GA, is a major issue. In this work the reduction in execution time has been achieved in four main stages:
i) Reduction of the number of pixel transformations in the fitness function
For typical Image Registration problems the number of pixel transformations is large. For this problem the number can be reduced by searching only in a local area around each IC, so that the whole image is not processed. This local processing uses chip position information obtained from the IC detection work described in [5]. Knowing the position of each chip makes it possible to perform transformations in an area local to the IC. This localised processing gives a major optimisation of the algorithm.
ii) Restriction of the search space
The search space for the GA is reduced using knowledge of the application area. The rotations and displacements of ICs have to be within very fine tolerances for the component to function correctly. Therefore the size of the chromosomes can be restricted and thus the search space is reduced.
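To illustrate how tight placement tolerances shrink the chromosome, the following sketch packs the transform for one IC into a small integer gene. The tolerance values (3 degrees, 4 pixels), the bit layout and all function names are assumptions made for illustration; the paper does not state the actual ranges or encoding.

```python
# Hypothetical tolerances: these values are NOT taken from the paper.
MAX_ROT = 3      # rotation limited to -3..+3 degrees
MAX_SHIFT = 4    # displacement limited to -4..+4 pixels on each axis

def bits_needed(n_values):
    """Smallest number of bits able to index n_values distinct values."""
    return max(1, (n_values - 1).bit_length())

ROT_BITS = bits_needed(2 * MAX_ROT + 1)      # 7 angles  -> 3 bits
SHIFT_BITS = bits_needed(2 * MAX_SHIFT + 1)  # 9 offsets -> 4 bits
GENE_BITS = ROT_BITS + 2 * SHIFT_BITS        # bits per IC in the chromosome

def encode(rot, dx, dy):
    """Pack one IC transform (rotation, x offset, y offset) into an integer."""
    gene = rot + MAX_ROT
    gene = (gene << SHIFT_BITS) | (dx + MAX_SHIFT)
    gene = (gene << SHIFT_BITS) | (dy + MAX_SHIFT)
    return gene

def decode(gene):
    """Unpack a gene, clamping fields that mutation pushed out of range."""
    mask = (1 << SHIFT_BITS) - 1
    dy = min(gene & mask, 2 * MAX_SHIFT) - MAX_SHIFT
    dx = min((gene >> SHIFT_BITS) & mask, 2 * MAX_SHIFT) - MAX_SHIFT
    rot_field = (gene >> (2 * SHIFT_BITS)) & ((1 << ROT_BITS) - 1)
    rot = min(rot_field, 2 * MAX_ROT) - MAX_ROT
    return rot, dx, dy

# A chromosome for several ICs is then simply one gene per IC.
chromosome = [encode(0, 0, 0), encode(-2, 1, 3)]
```

Under these assumed tolerances each IC needs only 11 bits, so the number of candidate transforms per IC is a few hundred rather than the full range of angles and image offsets.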
iii) Pre-computation of IC rotations
A significant amount of time could be spent rotating images of ICs. In the registration of multiple ICs there is a small number of rotation angles. So an effective technique for optimisation of the fitness function is to pre-compute IC rotations and store the rotated IC images in a list for later matching. The fitness function then involves a match procedure between captured and reference IC image. This is a key part of the algorithm which significantly reduces computation time for each generation of the GA.
This process is performed using the pre-computed IC image list which stores the rotated reference IC images for matching to the captured IC images. When the chromosomes are decoded, the value for the rotation is used to index into the reference rotated pre-computation list. The captured image is then offset according to the chromosome displacement values. The captured image and the pre-computed reference image are then matched. This matching procedure involves checking the pixel intensity level of each pixel between the transformed captured image and the pre-computed reference IC image. The fitness is then the number of matched pixels divided by the number of pixels in the reference list image. IC image rotations are not therefore repeatedly computed as the algorithm runs.
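The pre-computation and match procedure can be sketched as follows. This is a minimal illustration, not the implementation from [1]: images are plain lists of rows, rotation is nearest-neighbour about the image centre, a pixel counts as matched when the two intensities are equal, and the names `rotate`, `precompute_rotations` and `fitness` as well as the `top`/`left` IC position arguments are all assumptions.

```python
import math

def rotate(img, degrees):
    """Nearest-neighbour rotation about the image centre (assumed method).
    Output pixels that map outside the source image are set to 0."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse-map each output pixel back into the source image.
            sx = c * (x - cx) + s * (y - cy) + cx
            sy = -s * (x - cx) + c * (y - cy) + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out

def precompute_rotations(reference, max_rot):
    """Rotate the reference IC once per allowed angle, before the GA runs."""
    return {a: rotate(reference, a) for a in range(-max_rot, max_rot + 1)}

def fitness(captured, rotations, rot, dx, dy, top, left):
    """Matched pixels divided by the pixels in the pre-computed image.
    (top, left) is the IC position from the detection stage; (dx, dy) is
    the displacement decoded from the chromosome."""
    ref = rotations[rot]          # table lookup, no rotation at run time
    h, w = len(ref), len(ref[0])
    matched = 0
    for y in range(h):
        for x in range(w):
            py, px = top + dy + y, left + dx + x
            inside = 0 <= py < len(captured) and 0 <= px < len(captured[0])
            if inside and captured[py][px] == ref[y][x]:
                matched += 1
    return matched / (h * w)

# Toy data: a 3x3 reference placed one row lower than its nominal position.
reference = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
captured = [[0] * 6 for _ in range(6)]
for r in range(3):
    for c in range(3):
        captured[2 + r][1 + c] = reference[r][c]

rotations = precompute_rotations(reference, 2)
score = fitness(captured, rotations, rot=0, dx=0, dy=1, top=1, left=1)
```

In this sketch the cost of each fitness evaluation is one dictionary lookup plus a single pass over the IC window, which is the saving the pre-computed list is intended to give.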
iv) Use of Elitism and the Hill climbing operator
Elitism ensures the survival of the fittest member of the population to successive generations. The Hill Climbing operator modifies transforms in small stages to achieve faster convergence. This operator assists in the evolution of chromosomes by modifying them by small amounts. This modification corresponds to moving up or down the solution landscape in small stages. The evolution of the chromosomes was observed and it was found that near-optimal solutions were being derived.
Through small local modifications to the chromosomes, convergence can be achieved in much less time than when using only a standard GA.
The values of rotation and displacement are modified by a single degree or pixel offset. If the fitness of the chromosome is increased then the fitter chromosome is re-encoded and inserted back into the population.
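A minimal sketch of such a hill climbing step is given here. The text only states that each value is changed by a single degree or pixel and that improvements are kept; the order in which neighbours are tried, the choice of keeping the best of the six neighbours, and the names `hill_climb` and `toy_fitness` are assumptions.

```python
def hill_climb(transform, fitness_fn):
    """Try a +/-1 change to each of (rotation, dx, dy) in turn.
    Returns the best transform found and its fitness; the input is
    returned unchanged when no single-step neighbour is fitter."""
    best, best_fit = transform, fitness_fn(transform)
    for i in range(len(transform)):
        for step in (-1, 1):
            candidate = list(transform)
            candidate[i] += step          # one degree or one pixel
            candidate = tuple(candidate)
            f = fitness_fn(candidate)
            if f > best_fit:              # keep only strict improvements
                best, best_fit = candidate, f
    return best, best_fit

# Toy fitness with its optimum at rotation 1, dx 0, dy -2 (illustrative).
def toy_fitness(t):
    rot, dx, dy = t
    return -(abs(rot - 1) + abs(dx) + abs(dy + 2))

improved, improved_fit = hill_climb((0, 0, 0), toy_fitness)
```

In a full GA the improved transform would be re-encoded and written back into the population, and values would also need clamping to the allowed tolerance range, which this sketch omits.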
Execution Time on the Target System
The code was executed on an ARM7 System-on-Chip target with the same parameters as the test on the host system. With a population of fifty individuals the GA takes on average 1400.88 seconds per generation. Software profiling was then carried out to find which parts of the code were taking significant portions of the total execution time. It was found that the fitness function and the hill climbing function were taking almost the entire CPU time. Further profiling showed that the crossover, mutation and selection operators were taking negligible amounts of time. From these results it is clear that, to meet the performance targets of the application, the processing time of the fitness and hill climbing functions will have to be significantly reduced. Therefore in this work we propose the use of a high performance Digital Signal Processor IP core, based on the Texas Instruments TMS320C6000 series, to speed up the execution of the fitness and hill climbing functions. This enhanced architecture is given in the next section.
The speedup required from the DSP implementation can be calculated by comparing the performance of the host implementation with the target execution statistics.
The host implementation takes on average 21 generations and 534 seconds of CPU time to achieve convergence, which is approximately 25 seconds per generation. Therefore the DSP will have to speed up the execution of the fitness and hill climbing operators by a factor of approximately 56 (the target generation time divided by the host generation time: 1400 / 25 = 56).
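The arithmetic behind the required speedup can be checked with a few lines of Python. The figures are those quoted in the text; the variable names are illustrative only.

```python
host_total_s = 534.0        # host CPU time to convergence (from the text)
host_generations = 21       # host generations to convergence (from the text)
target_s_per_gen = 1400.88  # ARM7 time per generation (from the text)

host_s_per_gen = host_total_s / host_generations       # about 25.4 s
required_speedup = target_s_per_gen / host_s_per_gen   # about 55.1
rounded_speedup = 1400 / 25                            # 56, as quoted

print(f"host: {host_s_per_gen:.2f} s per generation")
print(f"required speedup: {required_speedup:.1f} (unrounded), "
      f"{rounded_speedup:.0f} (rounded inputs)")
```

Using the unrounded inputs gives a factor of roughly 55, so the quoted figure of 56 reflects rounding the two generation times to 1400 and 25 seconds.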
The Enhanced Target Architecture
The enhanced target architecture of the proposed system is given in Fig 2. The architecture shows the use of the Digital Signal Processor IP core for running the tasks identified in the previous section. These tasks are the blocks marked E and G in the pseudo code in Fig 1. All other tasks are indicated as being computed on the ARM7. However, the image rotation pre-computations could also be off-loaded to the DSP to reduce the time the code takes to set up before the main processing begins.
The architecture of the DSP IP core is given in Fig 3 and is also described in [6]. The architecture is a high performance Very Long Instruction Word (VLIW) architecture. In this type of architecture multiple, independent functional units are used and multiple instructions are packaged into one very long instruction.
For example, a VLIW instruction may include two integer operations, two floating-point operations, two memory references and a branch; see [7] for further details.
This architecture consists of three main parts: the CPU, peripherals, and memory. Functional units operate in parallel in the Data Paths. The units communicate using a cross path between two register files, each of which contains 16 32-bit registers. Program parallelism is defined at compile time because there is no data dependency checking done in hardware during run time. The 256-bit-wide program memory fetches eight 32-bit instructions every single cycle. The devices come with on-chip program and data memory, which may be configured as a cache. Peripherals include a Direct Memory Access (DMA) controller.
To show that the use of a DSP IP core will give the required enhanced performance the code for the fitness function pixel match procedure was analysed. This analysis involved converting the ARM assembly language for this function, see [8], into Texas assembly language, described in [9] and [10], and then computing the total number of clock cycles taken by the code. The number of cycles taken by the pixel match code was found to be 79.9 × 10^9. Further analysis has shown that 82% of the execution time is spent in the pixel match procedure, which for five generations on the target gives a total of 5804.44 seconds of CPU time.
The high performance TMS320C6416 is rated at 600MHz, giving a 1.67ns instruction cycle time. The following calculation gives the speedup factor if the pixel match code is run on the DSP:
$$\text{speedup factor} = \frac{\text{function total time}}{\text{total cycles} \times \text{cycle time}} = \frac{5804.44}{(79.9 \times 10^{9}) \times (1.67 \times 10^{-9})} \approx 43 \text{ times}$$
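The speedup estimate can be reproduced numerically as a quick check. The inputs are the figures quoted in the text; the variable names are illustrative only.

```python
pixel_match_time_s = 5804.44  # ARM7 time in the pixel match code, five generations
total_cycles = 79.9e9         # DSP cycles estimated for the same code
cycle_time_s = 1.67e-9        # one cycle at 600 MHz, rounded as in the text

dsp_time_s = total_cycles * cycle_time_s   # about 133.4 s on the DSP
speedup = pixel_match_time_s / dsp_time_s  # about 43.5

print(f"estimated DSP time: {dsp_time_s:.1f} s, speedup: {speedup:.1f}x")
```

The denominator is the product of the cycle count and the cycle time, so both factors belong under the division.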
In section 3 we showed that the speedup required was 56 times, so the estimate of 43 times falls short of the target. However, the assembly code transformation from ARM to Texas assembler used for this estimate is suboptimal. The assembler will apply many complex optimisations, which will give the speedup factor required. Full details of these stages are given in [9].
Conclusions
This paper has presented an evolvable hardware system for detecting the orientation and displacement of integrated circuits. This Image Registration process is an adaptive process which uses a GA. The GA derives transforms for matching images of integrated circuits. The paper has presented the GA and the main stages designed to reduce the execution time of the algorithm. The paper goes on to give results for the execution of the algorithm on both the host system and a target System-on-Chip.
The results show that an enhanced target platform is required to speed up the execution of the fitness and hill climbing functions of the GA. An enhanced target platform is then described, with the partitioning of software blocks given. Analysis of the execution time of performance critical code sections shows that the use of a high performance DSP IP core will mean that the performance targets of the application can be met. We have demonstrated in this work the feasibility of running complex EHW tasks on a conventional SOC platform, hence removing the need for specially tailored one-off customised hardware architectures.
Acknowledgements
The authors acknowledge contributions from the Department of Trade 
