Abstract-Due to asynchronous timing and arbitration, asynchronous designs may behave nondeterministically. For testing such systems, this means that the exact timing of a test response, i.e. the tester cycle in which it occurs, cannot be guaranteed. This behavior makes functional tests of asynchronous designs relatively complex or even impossible. Therefore, this paper presents a concept for performing functional tests of asynchronous designs using a test processor infrastructure. To this end, we propose a low-cost 16-bit microprocessor solution with special support for asynchronous handshake signalling that can be integrated into the device-under-test (DUT), mounted on the load board of the tester, or split between both.
I. Introduction
The ongoing evolution in semiconductor industries allows the integration of a complete system within a single chip. However, the integration of a complex system into a single synchronous design may introduce upcoming issues including process variations, clock skew, distribution of the clock signal, increased electro-magnetic interference (EMI) due to simultaneous switching activity etc. [1] . Thus, designers have to consider the use of alternative design methodologies rather than the pure synchronous design approach.
Purely asynchronous, globally-asynchronous locally-synchronous (GALS) and de-synchronization design techniques seem to be a promising option to overcome these issues [2]. The basic concept of data transfer in such systems is based on handshaking. Handshaking systems are more robust against process variations, offer improved modularity [1] and are suitable for security applications such as smart cards.
Due to these properties, much research is being carried out in the field of asynchronous circuit design. This includes circuit development itself as well as the development of tools to automate the design flow. The complementary research field is verification and test. Several methods have been proposed that structurally test asynchronous circuits by using adjusted scan techniques [3]-[7]. A further standard technique is the adaptation and integration of Built-In Self-Test (BIST) into asynchronous designs as, e.g., addressed in [8]-[10]. Furthermore, on-line test techniques have been applied, e.g., to check the correct operation of asynchronous protocols [11], [12].
Apart from these well-established test methods, which are mainly applied in production test, further test strategies are required for debugging during the prototype design phase. Here, in addition to structural tests used to detect manufacturing defects, functional tests of the developed design play a major role in proving the concept. Such functional tests are based on patterns gained from functional simulation of the chip that are either event-based (e.g., EVCD files) or cycle-based (e.g., WGL files). No matter which pattern type is used, the related tester equipment performs exactly the actions described in the pattern. The problem with asynchronous and GALS designs is that they may behave nondeterministically regarding their timing. This behavior cannot be described using standard patterns and, therefore, cannot be handled by either strictly cycle-oriented or event-based hardware testers. As a consequence, a test may fail although the device is fault-free.
This paper presents a concept of how elastic, functional tests of an asynchronous design can be performed using a test processor (TP) solution. We describe the general problem and discuss related work in section II. Section III outlines the concept of integrating an additional TP into a test strategy. In section IV the concept of the TP, its implementation and its instructions are described. Section V outlines the workflow to integrate the TP into the test strategy. Experimental results are given in section VI. Finally, section VII concludes the paper and gives an overview of future work.
II. Discussion of the problem and related work
Purely asynchronous and GALS design methodologies are applied to tolerate process, voltage and temperature (PVT) variations as well as to lower the emission of electromagnetic interference (EMI). Such designs make use of handshake protocols in order to synchronize the modules during data transfer. The circuitry used to generate the respective handshake signals is mainly realized by asynchronous logic composed of special asynchronous cells such as Muller-C and MUTEX elements for synchronization and arbitration. These cells, especially MUTEX cells, may pass through an undetermined period of metastability before their outputs become stable. Placing such elements in the control logic of a circuit, as is usually done in asynchronous designs, leads to nondeterminism in the timing behavior of the entire circuit, as, e.g., also reported by [13]. Thus, a response of an asynchronous design-under-test (DUT) may occur within a time interval rather than at a specific point in time, as is usual for synchronous designs, where all output signals are aligned to a clock signal. This leads to problems when an asynchronous design is to be tested using standard tester equipment.
A further problem is the fact that a standard hardware tester is not able to react to signal events generated by the DUT. Therefore, it is not possible to implement a handshake mechanism between the DUT and the tester or to let the tester capture data aligned to a clock signal generated by the DUT. The same is true for event-based testers, which are not able to react to signal events from the DUT either. (The term event-based tester refers to the fact that such testers do not need a cyclization of the test patterns.)
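The effect of this timing nondeterminism on a fixed pattern can be illustrated with a small model. The following Python sketch is not taken from the paper; the function names, the 12-cycle window and the timeout of 10 cycles are assumptions chosen for illustration. It compares a cycle-exact pattern check with an elastic check that only requires the request to arrive within a timeout.

```python
def req_waveform(rise_cycle, length):
    """Request line sampled once per tester cycle: 0 before the rise, 1 afterwards."""
    return [1 if cycle >= rise_cycle else 0 for cycle in range(length)]


def cycle_based_compare(expected, observed):
    """A cycle-based tester passes only if every sampled cycle matches the pattern."""
    return expected == observed


def elastic_compare(observed, timeout):
    """An elastic test only requires the request to arrive within `timeout` cycles."""
    return 1 in observed[:timeout]


if __name__ == "__main__":
    # Pattern recorded from one simulation run in which the request rose in cycle 5.
    expected = req_waveform(rise_cycle=5, length=12)
    for rise in (4, 5, 7):  # a fault-free device may answer earlier or later
        observed = req_waveform(rise, 12)
        print(rise, cycle_based_compare(expected, observed),
              elastic_compare(observed, timeout=10))
```

In this model the cycle-exact comparison rejects a fault-free device whenever the request does not rise in exactly the simulated cycle, while the elastic check accepts every arrival before the timeout.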
Due to these problems, the test methodology mainly used to perform functional tests is BIST, as proposed in [8]-[10]. However, BIST techniques are very limited when it comes to debugging the produced designs and should, therefore, be complemented by further mechanisms to perform functional tests. To this end, M. W. Heath et al. [13], [14] have developed a way to eliminate the nondeterminism of GALS designs. Their approach, called Synchro-Tokens, uses a token ring with a node on each of two communicating modules. The token is exchanged in a deterministic order, resulting in timing determinism of the entire GALS system. However, this architecture is only applicable to GALS designs in which the number of communicating modules is very limited. The additional overhead for the token exchange makes this methodology infeasible for purely asynchronous designs.
The usage of a TP component for extending the available tester hardware is not a new concept. Several publications proposed TP solutions to implement programmable linear-feedback shift registers [15]-[17]. In [18] a low-cost TP was proposed to implement a multiple-seed, multiple-polynomial linear-feedback shift register that also supports scan-path testing. In [19] the design of a prototype TP is proposed that can be used for functional testing of digital ICs. To this end, the processor can generate pseudo-random followed by deterministic test vectors, receive the output responses of the DUT and compress them. In [20] C. Galke et al. have presented their TP solution used to enable self-tests of systems-on-chip (SoCs). The processor has a simple RISC architecture and a low overhead. It contains specially adapted registers to realize LFSR or MISR functions for pattern de-compaction and pattern filtering. In [21] this concept was extended by making the processor design configurable and enabling the support of built-in logic tests.
III. Concept
In our approach we intend to establish the communication between the tester and the DUT in an easily adjustable and programmable manner using the provided TP solution. To this end the TP will be used to apply and receive patterns including handshake signalling sequences whose nondeterminism will be compensated by the processor, ending up in an elastic test. The general concept of the TP is shown in Fig. 1 . The innovation of our approach is the introduction of special support of asynchronous handshake protocols. This includes the extension of the standard I/O port component by circuitry dedicated to asynchronous handshake signalling as well as the introduction of instructions that control this circuitry.
The TP itself will be placed between the tester and the DUT. It provides standard synchronous interface ports to the tester and ports supporting asynchronous handshaking to the DUT. In order to store test patterns for the DUT as well as the test program for the processor, a memory is required. Its interface needs to support write and read operations from and to different sources and destinations, respectively. For example, besides the standard read/write operations required by the TP, the interface shall provide write operations coming from the tester side in order to program the processor and to upload test patterns. Therefore, two I/O ports are directly connected with the memory interface, enabling the tester to define the address and to write data into or to read data from the memory.
Figure 1: Concept of the test processor for asynchronous devices
The entire solution is divided into four main components: the memory, its interface, the TP core and the port component. Due to its huge overhead, the memory will be placed outside of the chip and will be integrated into the tester equipment. The TP itself can, if carefully designed, be realized in the different ways described below. In practice, the device determines which of these approaches is to be used.
• The simplest solution from the tester point of view is the complete integration of the TP into the device-under-test. This has the major advantage that the processor could have further access to internal signals of the design, thus resulting in improved debugging capabilities. However, the overhead of integrating the processor into the DUT may be too large and, therefore, unsuitable for some designs. Nevertheless, the integration of the TP into the design can be feasible when considering the possibility of performing self-tests once the design is already mounted in the target system. Furthermore, the processor can also be used for other tasks in the design, e.g. power management.
• The second possibility is to place the processor onto the load board of the tester equipment. This approach does not require any additional silicon area for the DUT to enable this test concept. However, there is no possibility for the processor to access internal signals of the DUT. Moreover, the load board has to be specifically designed for the DUT and can most likely not be reused for a different design. (For high-volume industrial designs it is quite common that a load board is designed just for one device. But in the prototyping phase, silicon debug and small series production it is useful to reuse existing boards as far as possible in order to minimize costs.)
• The last possibility is to combine both approaches, leading to a mixed implementation. One possible solution is to place the port component into the DUT. This will, of course, increase the overhead of the technique from the DUT point of view, but has the advantage of providing access to internal signals while keeping the overhead for this at a minimum.
IV. Test processor design

A. Implementation
For the realization of our concept we used the processor description language LISA in combination with CoWare's (now Synopsys') Processor Designer to model and generate the processor. This offers great advantages, e.g., easy adaptation of the core to the demands of the DUT via C preprocessor directives; the possibility of tool generation (assembler, disassembler, compiler etc.); and the availability of an instruction-based debugger. Additionally, the Processor Designer allows the integration of a debug interface into the generated processor. This eases testing the TP itself if it is integrated into the DUT. The current implementation of the TP makes extensive use of preprocessor directives in order to allow a configurable processor design. As an example, the processor can be configured to support different types of asynchronous handshake protocols, i.e. 2-phase single-rail, 4-phase single-rail and 4-phase dual-rail. Thus, one can easily generate a processor supporting any subset of these protocols by setting the corresponding preprocessor switches.
B. Architecture of the test processor
In its current realization the TP is a 16-bit RISC processor supporting 53 instructions. The most important standard instructions are:

Besides these instructions, the processor provides special operations to support asynchronous handshaking and operations to ease the pattern generation required for performing tests. Since many asynchronous devices are designed to process their data with high throughput, the processor needs to be fast in order to perform functional tests. Therefore, the processor has a four-stage pipeline (FE, DC, EX, WB) with operand forwarding and is thus able to execute programs with minimal time overhead. The simplified architecture of the processor is given in Fig. 2. The TP design itself operates synchronously, which allows the integration of scan paths. This drastically increases the testability of the TP if the on-chip approach is selected to implement the test strategy.
The architecture of the register file is identical to the one of the TP provided in [20] . It contains 16 general purpose registers (GPRs), where two of them (R13, R15) can act as LFSR/MISR. Two further GPRs (R12, R14) are used for configuring the feedbacks of the LFSR. In order to generate signatures using the LFSR/MISR registers the TP provides special instructions:
• LFSRL/LFSRH - generate signature in R13/R15
• LFSRLP/LFSRHP Px - write the current value of R13/R15 to I/O port Px and generate new signature
• MISRL/MISRH Rx - combine signature in R13/R15 with the value of register Rx
• MISRLP/MISRHP Px - combine signature in R13/R15 with the value of I/O port Px
For the interaction with the tester and the DUT, respectively, the processor offers the previously mentioned port component that contains eight standard synchronous 16-bit I/O ports. Furthermore, in order to perform handshake signalling, this component provides four handshake signal pairs that are totally independent of the I/O ports. Thus, the test engineer has full flexibility to connect and to program the processor according to the demands and specification of the DUT. For example, the width of the data word to be sent or received may exceed 16 bits. Then the test engineer can connect, e.g., P2 and P4 with the corresponding signals of the data word and use, e.g., handshake signal port HP0 for performing the handshake.
The current version of the TP has separate data and program memories. Both are realized using single-port memory blocks. This reduces the overhead in comparison to one dual-port memory block containing both memory sections. The interface of the memory blocks is designed to allow external accesses from the ATE in order to program the processor and to fill the data memory (e.g. with test patterns). The processor has an address space of 2^16 16-bit words for the data memory and for the program memory, respectively. Currently, there are two modes to access the memory:
• Via memory address register for the entire address space to load data using labels (2 instructions required, 1st for loading the address, 2nd for reading/writing the data)
• Via GPR (requires 1 instruction, but it requires more effort to load the address of the data using labels)
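To make the role of the LFSR/MISR registers more concrete, the following Python sketch models pattern generation and signature compaction on 16-bit words. The paper does not specify the feedback structure of R13/R15, so the Galois (right-shifting) form, the tap mask 0xB400 and the function names are assumptions chosen purely for illustration; in the TP the feedback would be configured through R12/R14.

```python
MASK16 = 0xFFFF


def lfsr_step(state, taps):
    """One step of a 16-bit Galois LFSR: shift right, XOR the taps if the LSB was 1."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps
    return state & MASK16


def misr_step(signature, taps, data):
    """One MISR step: advance the register like an LFSR, then fold in a response word."""
    return (lfsr_step(signature, taps) ^ data) & MASK16


def signature_of(words, taps=0xB400, seed=0):
    """Compact a sequence of 16-bit response words into a single 16-bit signature."""
    sig = seed
    for word in words:
        sig = misr_step(sig, taps, word)
    return sig


if __name__ == "__main__":
    good = [0x1234, 0xABCD, 0x0F0F]
    bad = [0x1234, 0xABCF, 0x0F0F]  # one flipped bit in the second response word
    print(hex(signature_of(good)), hex(signature_of(bad)))
```

Because the assumed tap mask has its most significant bit set, each step is reversible, so a single corrupted response word always changes the final signature in this model.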
C. Architecture of port component
As shown in Fig. 3, the I/O port component consists of eight 16-bit I/O ports. Each port can be connected either to the ATE or to the DUT, depending on the demands of the test application. As illustrated in Fig. 4, each port contains one I/O port register, one configuration register and some additional logic to select the input for the registers. The I/O register stores the data strobed from or to be applied to the DUT, respectively. The configuration register stores a mask to specify the significant bits of the I/O register. This is useful to mask undetermined values (X-values) potentially present in a test response of the DUT. In order to write into and to read from the I/O register, respectively, the processor provides two instructions: PIN and POUT. Accordingly, the mask register can be configured and read out using the PRCFG and PWCFG instructions. Moreover, the port component includes the logic block that realizes asynchronous handshake signalling. Currently, the block provides four handshake signal ports to the DUT. However, the number of handshake ports (HP) can be scaled further with a simple adaptation of the LISA source code. Each HP comprises an input port used for requests or acknowledgments coming from the DUT and an output port used for requests or acknowledgments coming from the TP. For the realization of the dual-rail protocol a different mechanism is required, since in this type of protocol the request is encoded within the data word. Therefore, each data bit requires two signals (d_f, d_t) used to encode a logic-0 (10), a logic-1 (01) or the null-value (00). To implement this, two consecutive ports P_x and P_(x+1) (x ∈ {0, 2}) encode the dual-rail data word. Thus, the data bit at position i is a valid dual-rail encoded value if P_x[i] ≠ P_(x+1)[i].
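The masking and the dual-rail encoding described above can be sketched in a few lines of Python. This is an illustrative model only: the function names are hypothetical, and the assignment of the false rail to the first port and the true rail to the second port is an assumption, since the text does not state which of the two consecutive ports carries which rail.

```python
MASK16 = 0xFFFF
NULL_WORD = (0x0000, 0x0000)  # all bit positions carry the null-value (00)


def masked_equal(response, expected, mask):
    """Compare only the bits marked as significant (1) in the port mask."""
    return (response & mask) == (expected & mask)


def dual_rail_encode(word):
    """Return (false_rail, true_rail) port words for a 16-bit data word.

    Assumption: logic-0 is encoded as (1, 0) and logic-1 as (0, 1) per bit.
    """
    true_rail = word & MASK16
    false_rail = ~word & MASK16
    return false_rail, true_rail


def dual_rail_valid(false_rail, true_rail):
    """True if every bit position holds a valid code word (the two rails differ)."""
    return ((false_rail ^ true_rail) & MASK16) == MASK16


def dual_rail_decode(false_rail, true_rail):
    """Recover the data word; reject words that contain null or illegal positions."""
    if not dual_rail_valid(false_rail, true_rail):
        raise ValueError("not a complete dual-rail code word")
    return true_rail & MASK16


if __name__ == "__main__":
    rails = dual_rail_encode(0x00FF)
    print([hex(r) for r in rails], dual_rail_valid(*rails), dual_rail_valid(*NULL_WORD))
```

In such a model the validity check doubles as the request detection: the receiver treats the word as present only once all bit positions have left the null-value.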
D. Instructions to support asynchronous handshaking
In order to enable asynchronous handshaking the processor provides special instructions divided into different groups. The processor is designed such that it performs all necessary steps to set and receive requests/acknowledgments according to the protocols currently supported.
The following instructions are used for the initialization of the port component:
• SAHP type - Sets the asynchronous handshake protocol type (type ∈ {single_rail_2_phase, single_rail_4_phase, dual_rail_4_phase})
• SRO/SAO/SRI/SAI state - Set the initial states of the handshake signals (state = 0b s_3 s_2 s_1 s_0; s_i ∈ {0, 1}). This is required for supporting protocols that do not start with all handshake signals set to logic-0.
After defining the protocol type and the initial values of the handshake signals, the processor is set up to perform handshakes. For this purpose, there are two types of instructions, namely active and passive handshake instructions. Active handshake instructions are used to set a request or an acknowledgment. Active handshake instructions are:
• SREQ/SACK HPx -Sets request / acknowledgment to initiate data transmission or indicate data receipt at handshake signal port HPx.
• PDROUT Ry, Px - writes dual-rail encoded data from a source register to two consecutive I/O data ports, so that Px[i] and P(x+1)[i] carry the two complementary rails of bit Ry[i].
On the other hand, there are two types of passive handshake instructions, i.e. branch and wait-for instructions. The major difference between these instruction types is that branch instructions perform a branch in case of a detected signal event coming from the DUT. Thus, the branch instructions - BREQ/BACK HPx, @RDestAddr - branch to the destination defined by register RDestAddr if an incoming request / acknowledgment is detected on handshake signal port HPx. In comparison, wait-for instructions stall the pipeline of the processor until a signal event is detected. Hence, the wait-for instructions - WFREQ/WFACK HPx, AFTER RTimeout JMP @RDestAddr - stall the entire pipeline of the processor and wait for a request / acknowledgment at handshake port HPx. In order to prevent program deadlocks, the wait-for instructions have two additional parameters: one to specify a program address and one to determine the number of TP cycles after which the test processor branches to the specified address. This can, e.g., be used to set an error flag. As mentioned, the branch instructions do not stall the pipeline of the processor. This offers more flexibility. For example, there might be situations where the processor is waiting for an acknowledgment at an input of the DUT while the DUT is also waiting for an acknowledgment of the processor. In this case wait-for operations are unsuitable. Using branch instructions one can, e.g., first test for the appearance of the acknowledgment of the DUT and, if it is not present, immediately test for a request of the DUT. However, the wait-for operations react much faster, so their usage is recommended if the behavior of the DUT is strictly specified.
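The difference between the two passive instruction types can be expressed as a small behavioral model. The Python sketch below is an assumption-laden illustration, not the TP's actual micro-architecture: the handshake input is modeled as a function from TP cycle to signal level, events are treated as levels (as in a 4-phase protocol), and a timeout of N is taken to mean that N cycles are sampled.

```python
def wait_for(signal, active_level, timeout):
    """Wait-for behavior: sample once per TP cycle and stall until the event or the timeout.

    Returns (detected, cycles_spent). On timeout the program would branch to
    the error address given in the instruction.
    """
    for cycle in range(timeout):
        if signal(cycle) == active_level:
            return True, cycle
    return False, timeout


def branch_if(signal, active_level, now):
    """Branch behavior: sample the input once at cycle `now` and never stall."""
    return signal(now) == active_level


if __name__ == "__main__":
    # Hypothetical DUT that raises its acknowledgment in cycle 6.
    def ack(cycle):
        return 1 if cycle >= 6 else 0

    print(wait_for(ack, 1, timeout=10))  # event detected before the timeout
    print(wait_for(ack, 1, timeout=4))   # timeout expires first
    print(branch_if(ack, 1, now=3))      # not yet present, program continues
```

The model reflects the trade-off stated above: the wait-for variant reacts in the very cycle the event occurs but blocks everything else, whereas the branch variant lets the program poll several handshake ports in turn.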
For the realization of 4-phase protocols there are two additional instructions that implement the return-to-zero phase of the protocol, i.e. RTZI/RTZO. These instructions perform two micro-operations at once. On the one hand, the handshake signals or, in case of a dual-rail protocol, the data signals are set to the zero-value of the protocol, i.e. the initial value defined using the SAHP instruction. On the other hand, these instructions wait for the corresponding signal event generated by the DUT. Thus, RTZI first waits until the request of the defined port is set to the initial value and then sets the acknowledgment to the initial value, while RTZO first sets the request to the initial value and then waits for the arrival of the acknowledgment from the DUT. Therefore, the format of these instructions is almost identical to that of the wait-for instructions: RTZI/RTZO HPx, AFTER RTimeout JMP @RDestAddr.
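The order of events in one 4-phase single-rail transfer, as seen from the sending side, can be traced with the following Python sketch. It is a simplified illustration under stated assumptions: the DUT's reaction times are passed in as hypothetical delay parameters, the zero-value is logic-0, and the function name is not part of the TP tool chain.

```python
def four_phase_sender(ack_rise_delay, ack_fall_delay, timeout):
    """Event trace (cycle, event) of one 4-phase handshake driven by the sender.

    The sequence mirrors SREQ, WFACK and RTZO: raise the request, wait for the
    acknowledgment, return the request to zero, wait for the acknowledgment to
    return to zero. A wait longer than `timeout` cycles ends the trace.
    """
    trace = [(0, "req=1")]                    # SREQ: raise the request
    if ack_rise_delay > timeout:
        return trace + [(timeout, "timeout")]
    now = ack_rise_delay
    trace.append((now, "ack=1"))              # WFACK: acknowledgment detected
    trace.append((now, "req=0"))              # RTZO, first micro-operation
    if ack_fall_delay > timeout:
        return trace + [(now + timeout, "timeout")]
    now += ack_fall_delay
    trace.append((now, "ack=0"))              # RTZO, second micro-operation
    return trace


if __name__ == "__main__":
    print(four_phase_sender(ack_rise_delay=3, ack_fall_delay=2, timeout=10))
    print(four_phase_sender(ack_rise_delay=12, ack_fall_delay=2, timeout=10))
```

Whatever delays the DUT exhibits, the order of the four events stays the same as long as each wait completes before the timeout, which is the property an elastic test relies on.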
E. Programming the processor
Listing 1 shows an example of what a program for the TP can look like. The first action in a program for the proposed TP is the definition of the handshake protocol type. After that, one needs to define the initial values of the handshake signals. This is required for both input and output handshake signals, since the processor needs to know what has to be checked to indicate an incoming signal event and what needs to be done to generate the signal event according to the protocol. After this instruction sequence, the configuration of the mask for the I/O ports P0 and P1 is defined. In this example all bits of the I/O register are significant and, therefore, the mask is set accordingly. Then, the configuration of the feedbacks for the LFSR and MISR registers is defined. Here, R13 is used as LFSR to generate patterns and R15 is used as MISR to receive patterns and to store the resulting signature. Finally, the immediate value 10 is assigned to R0, which is later used as the timeout for the wait-for operation. Additionally, the addresses of the labels required for the branch instructions are loaded into R2-R4.
After the initialization phase the processor is able to perform handshakes. The instruction block at label _dout is used to transfer the current pattern stored in the LFSR (R13) to I/O port P0 while simultaneously generating a new pattern and to set the respective request. Then, according to the program, the sequence at label _breq is executed, which performs a branch in case a request coming from the DUT is detected. If so, the processor jumps to label _din, where the data coming from the DUT at I/O port P1 is strobed and combined with the signature in the MISR register R15. If not, it executes the code directly following the branch instruction. In this case the TP jumps to the upper part of the loop, where the TP waits for the occurrence of an acknowledgment at HP0. At this point the complete pipeline is stalled until either the acknowledgment occurs or the number of elapsed TP cycles exceeds the timeout limit defined by R0. In the first case the TP continues its operation and jumps to _dout. In the latter case the TP jumps to the error handling sequence, which increments an error counter.
Listing 1: TP program example
; Initialization
_main:  SAHP 2_phase_single_rail      ; setup protocol
        SRO 0b0000                    ; setup initial
        SAI 0b0000                    ; handshake signals
        SRI 0b1111
        ; set timeout
        SLB R2, _err                  ; load address of _err
        SLB R3, _din                  ; load address of _din
        SLB R4, _dout                 ; load address of _dout
; Handshake sequence
_dout:  LFSRLP P0                     ; move pattern to P0 and compute new pattern
        SREQ HP0                      ; set HP0-REQout
        JMP @_breq                    ; loop
_din:   MISRHP P1                     ; combine data in port P1 with MISR signature
        SACK HP1                      ; set HP1-ACKout
        JMP @_back                    ; loop
_back:  WFACK HP0, AFTER R0 JMP @R2   ; wait for HP0-ACKin
        JMP @_dout                    ; jump to _dout after HP0-ACKin was detected
_breq:  BREQ HP1, @R3                 ; check for HP1-REQin, jump to _din
        JMP @_back                    ; loop
_err:   INC R1                        ; inc error counter
        JMP @_breq                    ; return to loop
V. Workflow
The integration of the TP into the test strategy requires further steps to be carried out by the test engineer, and also by the designer in case the on-chip approach is applied. Therefore, we propose the workflow given in Fig. 5, which outlines the required steps. The first activity in the workflow is to analyze the design and to determine the demands of the test. This activity directly delivers the test specification including the level definition and a list of tests to be performed. After that, one needs to clarify whether the TP already fits the identified demands. If not, the TP can be adapted (e.g. with respect to the protocols to be supported), resulting in a customized version of the TP. At this point one needs to select the realization scheme (off/on-chip or mixed approach). If the on-chip or the mixed approach is chosen, (components of) the TP need(s) to be integrated into the DUT. In any case one will end up with the final TP and DUT designs. On the one hand, these designs are required to generate test patterns. On the other hand, the designs are needed for identifying the handshake and data ports of the DUT to be connected with the TP as well as the TP ports to be connected with the tester. Thus, a complete port configuration is delivered that is a prerequisite for the development of the adapter board and for the reorganization of patterns. The reorganization is required since the (functional) patterns are generated for the DUT alone, without consideration of the TP and the timing nondeterminism. Thus, the control signals dedicated to asynchronous handshaking need to be removed from the pattern and replaced by respective handshake operations executed by the TP. Therefore, we plan to develop a tool that separates the data from the handshake vectors (i.e. the control flow). The data vectors directly designate the content of the data memory of the TP. Then each handshake in the control flow is translated into a corresponding sequence of TP instructions. The resulting code is then compiled and afterwards translated into a pattern, which will be extended by additional control vectors required for setting up the TP. These steps of reorganizing the patterns will be performed by a tool that is currently being designed. Finally, each of the original patterns within the previously created test flow will be replaced by the generated patterns.
Figure 5: Adapting the test processor
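The pattern reorganization step could look roughly like the following Python sketch. Since the tool itself is not yet specified, everything here is an assumption made for illustration: the input format (one dictionary per simulated cycle with a "req" level and a "data" word), the 4-phase single-rail protocol, and the choice of HP0, R0 and R2 as operands. Only the instruction formats of SREQ, WFACK and RTZO follow the descriptions in section IV; the data move from memory to the port is left as a comment because its exact syntax is not given.

```python
def reorganize(vectors, port="P0", hp="HP0"):
    """Split cycle-based vectors into data-memory words and a TP handshake sequence.

    Every rising edge of the request marks one transfer: the data word valid at
    that point goes into the data memory and the handshake signals are replaced
    by handshake instructions executed by the TP.
    """
    data_memory = []
    program = []
    prev_req = 0
    for vec in vectors:
        if vec["req"] == 1 and prev_req == 0:
            data_memory.append(vec["data"])
            index = len(data_memory) - 1
            program += [
                f"; apply data memory word {index} to {port}",
                f"SREQ {hp}",
                f"WFACK {hp}, AFTER R0 JMP @R2",
                f"RTZO {hp}, AFTER R0 JMP @R2",
            ]
        prev_req = vec["req"]
    return data_memory, program


if __name__ == "__main__":
    simulated = [
        {"req": 0, "data": 0x0000},
        {"req": 1, "data": 0x0012},
        {"req": 1, "data": 0x0012},
        {"req": 0, "data": 0x0000},
        {"req": 1, "data": 0x0034},
    ]
    memory, code = reorganize(simulated)
    print([hex(word) for word in memory])
    print("\n".join(code))
```

The resulting instruction sequence no longer contains any cycle information about the handshake signals, so the number of tester cycles between request and acknowledgment is free to vary from run to run.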
VI. Experimental results and practical application
In order to show the applicability of integrating the entire TP into a design, we have synthesized different versions of the processor using Synopsys Design Compiler and our IHP 0.25 μm technology. We implemented protocol-specific versions that support only one protocol, one version supporting all protocols and one version with no handshake protocol support. For each version we defined an operating frequency of 25 MHz. In order to put the size of the proposed TP into perspective, we also synthesized two reference designs: a modified version of the TP solution provided in [20] (the T50) with a target frequency of 25 MHz, and a latch-based version of a Texas Instruments MSP430 compatible 16-bit microcontroller (the IPMS430x [22]) that was synthesized for a target frequency of 12.5 MHz. Fig. 6 shows that our TP with no protocol support utilizes approx. 130% of the area of the T50. However, since our processor has a four-stage pipeline that supports features like operand forwarding, which significantly accelerates program execution, this overhead is still acceptable. Furthermore, the processor versions dedicated to one protocol have an area overhead of 8-13% in relation to the version without any protocol support. The processor version that supports all protocols has an area overhead of 17%. Moreover, we validated the TP by implementing the version that supports all asynchronous handshake protocols on a Xilinx board equipped with a Virtex-4 for a frequency of 50 MHz.
With regard to the practical application of the TP, the current operating frequency chosen for the synthesis may be too low for testing high-performance designs. This is all the more relevant because the processor requires at least three instructions in order to apply or receive one 16-bit data word and to perform one handshake. Additionally, the processor needs to both receive and apply data, which amounts to at least six instructions per apply/receive sequence. Thus, with an operating frequency of 25 MHz the effective frequency of applying and receiving data in one sequence does not exceed approx. 4.17 MHz (25 MHz / 6). However, since the TP design is relatively simple, we expect that the frequency can be increased to approx. 100 MHz or more. Furthermore, for the off-chip approach one can think about the utilization of two processors: one for applying and one for receiving patterns.
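The throughput estimate above follows from a simple division, which the following Python sketch reproduces. The function name and parameters are illustrative, and the model assumes that one instruction completes per TP cycle and that exactly the minimum of three instructions is needed per transfer, so the result is an upper bound.

```python
def effective_sequence_rate_mhz(clock_mhz, instr_per_transfer=3, transfers_per_sequence=2):
    """Upper bound on apply-and-receive sequences per microsecond.

    Assumes one instruction per TP cycle and the minimum instruction count
    per transfer; stalls while waiting for the DUT lower the real rate.
    """
    return clock_mhz / (instr_per_transfer * transfers_per_sequence)


if __name__ == "__main__":
    for clock in (25, 50, 100):
        print(clock, round(effective_sequence_rate_mhz(clock), 2))
```

With two processors, one applying and one receiving patterns, each would handle only one transfer per sequence, which corresponds to `transfers_per_sequence=1` in this model.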
In this paper we just want to propose the general concept of our TP and its application in the field of asynchronous circuit testing rather than providing a complete solution.
VII. Conclusions and Future work
In this paper we presented a concept of a TP that aims at functional tests of asynchronous semiconductor devices. To this end, the processor provides special support for asynchronous handshake signalling. We also proposed several realization versions of the concept: one can implement the processor off-chip using an FPGA and mount it onto the load board of the tester; one can integrate the entire processor on-chip into the design; and finally one can combine both approaches by integrating several components into the design and realizing the processor core via the off-chip approach. The processor itself was implemented in a configurable manner using the LISA processor description language. Thus, one can easily generate a processor that supports any combination of the most common asynchronous handshake protocols. First synthesis results have shown that even the overhead for integrating the processor can be tolerated when considering the possibility of performing in-field tests and/or power management. Until now only the FPGA prototype of the processor has been realized in order to show the general applicability. In the near future we plan to further optimize the design with respect to its operating frequency, to support direct memory accesses (DMAs) of the ports and to further improve the instruction set. Furthermore, a tool is required that performs the pattern reorganization. Finally, we intend to develop a special PCB equipped with an FPGA, which shall be mounted onto the load board of the tester equipment to implement the desired test infrastructure.
