Abstract: Scan design has a performance penalty that affects the critical path delay by an added fanout at the origin and a multiplexer at the destination.
vagmwal@eng.auburn.edu
This problem is outlined in a recent paper [10], which also proposes a solution. The reported results [10].
Introduction
A recent paper [10] describes a problem that occurs in scan testing. In scan design, every flip-flop is preceded by a multiplexer that adds approximately two gate delays to the combinational path. Consider Figure 1(a), which pictures the critical path of a sequential circuit. The dotted-line arrow represents the longest combinational path of the circuit, whose delay determines the clock period of the circuit and, hence, the performance. Figure 1(b) shows the same path after scan implementation. The multiplexers, controlled by a common scan enable (EN) signal, select either the normal mode data signal or the scan-in (S_IN) signal into the flip-flops [2]. Scan also adds a fan-out, shown as scan-out (S_OUT), at the output of each flip-flop. Thus, the critical path slows down by the delays of one multiplexer and the fan-out.
978-1-4673-2356-7/12/$26.00 ©2012 IEEE
One remedy often suggested to reduce the performance penalty is partial scan, where the destination flip-flops of critical paths are excluded from scan. This, however, potentially reduces the fault coverage. An ingenious solution that does not resort to partial scan has been proposed in a recent paper [10]. That solution is outlined in Section 2. The present contribution provides a retiming solution, which has certain advantages over the previous method. Retiming is a graph-theoretic technique [7, 8] with applications to digital design optimization. It is outlined in Section 3. Section 4 describes the new retiming solution and Section 5 gives some results for comparison.
Previous Work
A recently proposed method [10] to reduce the performance penalty of scan modifies the critical path of Figure 1(b) as shown in Figure 2. The multiplexer is moved forward to the output of the destination flip-flop (original FF) and is replaced by a fanout that feeds into an added flip-flop (shadow FF) through an additional multiplexer. Thus, in the normal mode (EN = 0) both the original FF and the shadow FF contain the same data. In the scan mode (EN = 1) the scan data is transferred through the shadow FF using the two multiplexers, which completely isolate the original FF.
In general, there can be several critical paths in a circuit. Each critical path is modified with a shadow FF and an additional multiplexer at its input. These multiplexers are controlled by the original scan enable signal (EN). However, all original multiplexers, after being moved to the outputs of the original FFs, are controlled by a common Sel_shadow signal generated by a single SR latch and AND gate arrangement shown at the bottom of Figure 2. Besides the scan enable signal, EN, a new "Test" signal is required. The clock (clk) signal is the same for all flip-flops, though an adjustable delay is needed to balance any clock skew at the SR latch. "Test" ensures that Sel_shadow = 0 during the normal mode. The purpose of this circuit is to synchronize the Sel_shadow signal with the clock.
The design of Figure 2 is shown to perform both normal and scan modes correctly [10]. The cost in hardware overhead is one shadow FF and one multiplexer per critical path, and a single Sel_shadow generation circuit. In addition, a new "Test" signal is required. The timing penalty of the scan multiplexer is reduced to that of a single fan-out. In the present work, we give a simpler design that uses less hardware and eliminates both the fan-out and the "Test" signal. The main idea used is the retiming transformation described next.
The Retiming Transformation
Retiming transformation of a circuit moves all of the memory elements at the input of a combinational block to all of its outputs, or vice versa. Provably, this procedure leaves the function of a synchronous circuit unchanged [7, 8]. Since its first publication in 1983, numerous applications in digital design automation have been found. They include minimization of state variables, reducing logic, reducing power consumption, improving testability and timing optimization [4, 9]. Figure 3 provides a simple illustration of the retiming transformation.
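The following is a minimal behavioral sketch of a forward retiming move, not taken from the cited papers. It assumes a toy two-input combinational block (XOR, chosen arbitrarily), flip-flops that reset to 0, and hypothetical function names. The retimed flip-flop is given the equivalent initial state, here the block evaluated on the reset values.

```python
def simulate_regs_at_inputs(f, a_seq, b_seq, init=0):
    """Flip-flops sit on both inputs of the combinational block f."""
    ra, rb = init, init
    out = []
    for a, b in zip(a_seq, b_seq):
        out.append(f(ra, rb))  # value observed during this clock cycle
        ra, rb = a, b          # clock edge: both input flip-flops capture
    return out


def simulate_reg_at_output(f, a_seq, b_seq, init=0):
    """Retimed circuit: a single flip-flop sits on the output of f."""
    r = f(init, init)          # equivalent initial state (assumption)
    out = []
    for a, b in zip(a_seq, b_seq):
        out.append(r)          # value observed during this clock cycle
        r = f(a, b)            # clock edge: output flip-flop captures
    return out


def xor(x, y):
    return x ^ y


a_seq = [1, 0, 1, 1, 0]
b_seq = [0, 0, 1, 0, 1]
before = simulate_regs_at_inputs(xor, a_seq, b_seq)
after = simulate_reg_at_output(xor, a_seq, b_seq)
print(before == after)
```

Both models produce the same output sequence for the same input sequences, which is the property the transformation relies on.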
Retimed Scan Architecture
The two time penalties on the critical path, namely the multiplexer delay and the fan-out delay, can be removed independently. Figure 4 shows a retiming transformation of the circuit of Figure 1(b), in which the destination flip-flop is moved from the output of the scan multiplexer to its three inputs. The first flip-flop captures the normal mode data, and the second receives and forwards S_IN. The third flip-flop simply delays the EN signal by one clock cycle. According to the retiming rules [7, 8], all three flip-flops have the same clock as before.
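As an illustration only, the sketch below compares a conventional scan cell (multiplexer in front of the flip-flop) with a retimed cell in which the multiplexer follows three flip-flops on the data, S_IN and EN inputs. The function names, the cycle-level model and the reset value of 0 are assumptions made for this example.

```python
def mux(sel, d0, d1):
    """Two-input multiplexer: selects d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def conventional_scan_cell(d_seq, sin_seq, en_seq):
    """Multiplexer feeds the flip-flop, so it lies on the incoming path."""
    q = 0
    out = []
    for d, s_in, en in zip(d_seq, sin_seq, en_seq):
        out.append(q)
        q = mux(en, d, s_in)   # clock edge: capture the selected input
    return out


def retimed_scan_cell(d_seq, sin_seq, en_seq):
    """Flip-flops moved to the three multiplexer inputs."""
    q_orig = q_shadow = en_del = 0
    out = []
    for d, s_in, en in zip(d_seq, sin_seq, en_seq):
        out.append(mux(en_del, q_orig, q_shadow))
        # clock edge: data, scan-in and enable are captured separately
        q_orig, q_shadow, en_del = d, s_in, en
    return out


d_seq = [1, 0, 1, 1, 0, 1]
sin_seq = [0, 1, 1, 0, 0, 1]
en_seq = [0, 0, 1, 1, 0, 1]
conventional = conventional_scan_cell(d_seq, sin_seq, en_seq)
retimed = retimed_scan_cell(d_seq, sin_seq, en_seq)
print(conventional == retimed)
```

In the retimed model the data input reaches its flip-flop directly, so the multiplexer no longer adds delay to the path that ends at that flip-flop.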
Eliminating Multiplexer Penalty
In general, a circuit may have several critical paths ending on separate flip-flops, all of which will be transformed as shown in Figure 4. However, the third flip-flops of all critical path destinations can be combined into just one flip-flop that generates an EN_del signal to control all "pushed out" multiplexers. This is shown in Figure 5.
Eliminating Fanout Penalty
The fan-out added to the source flip-flop of a critical path inserts some extra delay that may often be acceptable. If that is not the case, then the source flip-flop can be moved forward across the fanout as shown in Figure 6. The two resulting flip-flops, named "original FF" and "shadow FF", receive the same data and clock. The original FF feeds data directly to the critical path and the shadow FF serves the scan path. Notice that the critical path in Figure 6 has exactly the same combinational delay as the original non-scan circuit of Figure 1(a).
A Limitation of the Technique
Because our technique basically transfers the scan-induced delays in critical paths to their adjoining paths, it will not work in two specific cases:
1. Two adjoining critical paths, i.e., one critical path feeds into another critical path. Here the destination flip-flop of one path is the same as the origin flip-flop of the other path.
2. Feedback critical path, i.e., a critical path that originates and ends at the same flip-flop.
We should remark, however, that in practical cases two adjoining paths, both near critical, may not have exactly the same delay. In those cases retiming can potentially improve the overall timing of the circuit by transferring excess delay from one path to the other.
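The two limiting cases can be screened for mechanically. The sketch below is a hypothetical helper, not part of the reported flow: it assumes each critical path is given as an (origin flip-flop, destination flip-flop) pair and conservatively labels both members of an adjoining pair.

```python
def classify_critical_paths(paths):
    """Label each (origin_ff, destination_ff) critical path.

    'feedback'  : the path starts and ends at the same flip-flop
    'adjoining' : the path shares a flip-flop with another critical path,
                  as destination of one and origin of the other
    'retimable' : neither limitation applies
    """
    origins = {src for src, _ in paths}
    destinations = {dst for _, dst in paths}
    labels = {}
    for src, dst in paths:
        if src == dst:
            labels[(src, dst)] = "feedback"
        elif dst in origins or src in destinations:
            labels[(src, dst)] = "adjoining"
        else:
            labels[(src, dst)] = "retimable"
    return labels


# Flip-flop names below are made up for illustration.
example_paths = [("FF1", "FF2"), ("FF2", "FF3"), ("FF4", "FF4"), ("FF5", "FF6")]
labels = classify_critical_paths(example_paths)
for path, label in labels.items():
    print(path, label)
```

A path flagged as adjoining may still benefit from retiming when the neighboring path has some slack, as noted above, so the label is a warning rather than a verdict.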
Results
Results for retiming of several circuits are shown in Table 1. The performance penalty was eliminated for all circuits except s35932, which falls under the limitation cases described in Section 4.3. The last three circuits, Mickey-128, Trivium and Grain, are high-speed circuits that implement ECRYPT stream cipher algorithms [1]. In these results, both multiplexer and fanout penalties were eliminated. The previous method [10], used for comparison here, only eliminates the multiplexer penalty.
The number of transformations listed in the second column of Table 1 is related to the number of critical paths in the circuit, all of which were fixed. An iterative procedure is used [10]. Static timing analysis (STA) first identifies a critical path in the scan circuit. That path is retimed and STA is again applied to ascertain that the critical path delay of the circuit is indeed reduced and to find another critical path, which is now retimed. This process stops when the critical path delay of the circuit cannot be reduced. If, however, the critical path delay is found to increase after retiming, then the process stops, accepting the minimum delay solution from the previous iteration. The procedure was first carried out for removing the multiplexer penalty alone and then repeated for the fanout penalty. The number of iterations in the second column of the table shows the total number of iterations for removing both penalties.
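The iterative procedure can be sketched as the loop below. This is an assumption-laden illustration, not the tool flow used for Table 1: `analyze` and `retime` are hypothetical callables standing in for STA and the retiming step, the toy circuit is a dictionary of (base delay, scan penalty) pairs, and the handling of equal-delay critical paths is a choice made for this example.

```python
def iterative_retiming(circuit, analyze, retime):
    """Retime one critical path per iteration until no further gain.

    analyze(circuit) -> (critical_path, delay)
    retime(circuit, path) -> new circuit, or None if nothing can be moved
    """
    best = circuit
    path, best_delay = analyze(best)
    iterations = 0
    while True:
        candidate = retime(best, path)
        if candidate is None:
            break                      # critical delay cannot be reduced
        next_path, delay = analyze(candidate)
        if delay > best_delay:
            break                      # delay increased: keep previous solution
        best, best_delay, path = candidate, delay, next_path
        iterations += 1
    return best, best_delay, iterations


def toy_analyze(circuit):
    """Toy stand-in for STA: the path with the largest total delay."""
    name = max(circuit, key=lambda p: sum(circuit[p]))
    return name, sum(circuit[name])


def toy_retime(circuit, name):
    """Toy stand-in for retiming: remove the scan penalty of one path."""
    base, penalty = circuit[name]
    if penalty == 0:
        return None
    retimed = dict(circuit)
    retimed[name] = (base, 0)
    return retimed


# Path names and delay values are made up for illustration.
toy_circuit = {"p1": (20, 3), "p2": (20, 3), "p3": (12, 3)}
final, final_delay, n_iter = iterative_retiming(toy_circuit, toy_analyze, toy_retime)
print(final_delay, n_iter)
```

The toy model ignores the delay that retiming transfers to adjoining paths, so it never triggers the increase-and-revert branch; a real flow would rely on STA to detect that case.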
The third column shows CPU times for transforming circuits on a Red Hat Enterprise Linux 5 system running on an Intel Xeon CPU E5520, 2.27 GHz, 8 and 587.21 GB local disk. The CPU time depends on the circuit size and the number of critical paths but does not seem to increase excessively. The fourth column shows the percent area overhead for the removal of the multiplexer penalty alone, which includes one shadow flip-flop per critical path and one extra flip-flop per circuit for generating the delayed EN signal as shown in Figure 5. This overhead is proportional to the number of critical paths but reduces as the circuit becomes larger. The overheads for the previously reported technique [10], which also removed the multiplexer penalty only, are shown in column 8 and are higher because an additional multiplexer per critical path was required.
Next, examine the multiplexer delay saving in columns 6 (retiming method) and 9 (previous method [10]). Retiming allows a complete elimination of the multiplexer delay, while the previous method removes the multiplexer but adds a fan-out in its place, contributing some delay. As a result, the critical path delay reduction for retiming is better than that of the previous technique.
For the calculation of path delays, a single-input gate or an inverter is assumed to have one unit of delay. The delay of a gate with fan-in $n_{in}$ and fan-out $n_{out}$ is computed as

$$\text{Gate delay} = 2 \times \lceil \log_2 n_{in} \rceil + n_{out} - 1 \ \text{units},$$

where $n_{in} \geq 2$. Thus, a two-input gate with a single fanout will have a delay of 2 units. The above formula estimates the delay by assuming that a gate with a larger number of inputs is split into a balanced tree of two-input gates. Each fanout beyond one contributes one extra delay unit.
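The delay model is easy to evaluate programmatically. The helper below is a small sketch with a hypothetical function name; it covers only gates with two or more inputs, since the text assigns single-input gates a flat one-unit delay.

```python
INVERTER_DELAY = 1  # single-input gate or inverter, as stated in the text


def gate_delay(n_in, n_out=1):
    """Delay in units of a gate with n_in >= 2 inputs and n_out >= 1 fanouts."""
    if n_in < 2:
        raise ValueError("formula applies to gates with at least two inputs")
    if n_out < 1:
        raise ValueError("a gate must drive at least one fanout")
    # (n_in - 1).bit_length() equals ceil(log2(n_in)) for positive integers,
    # which avoids floating-point rounding in math.log2.
    tree_depth = (n_in - 1).bit_length()
    return 2 * tree_depth + (n_out - 1)


print(gate_delay(2, 1))  # two-input gate, single fanout
print(gate_delay(3, 2))  # three-input gate driving two fanouts
```

For example, a three-input gate is treated as a two-level tree of two-input gates (4 units), and its second fanout adds one more unit.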
Columns 5 and 7 give the result after removing both multiplexer and fanout penalties. The overhead shown in Column 5 includes the overhead of Column 4 and an additional shadow flip-flop inserted in parallel with the flip-flop at the origin of each critical path as shown in Figure 6.
With the exception of s35932, the delay penalty of scan was completely eliminated for all circuits, and Column 7 of Table 1 merely shows the total penalty of conventional scan, which is generally quoted as 5 to 10% [2].
Conclusion
Using the concept of retiming, we have shown how the performance penalty of scan can be completely eliminated in many circuits. There is a hardware overhead penalty that may increase with the number of critical paths. However, for several example circuits considered in this paper, the overhead remains small. In general, one has to choose between the hardware and delay penalties based on the application of the circuit. Because the new design is obtained by retiming transformations alone, the circuit function in both normal and test modes is guaranteed to remain unchanged. That means the new design will perform all types of scan tests, DC as well as delay (including launch-off-shift and launch-off-capture), without any change.
An often stated motivation for partial scan, in which only a subset of flip-flops is scanned, is to avoid the delay penalty in timing-critical circuits [2]. Partial scan may reduce area and delay overheads, but it results in higher test generation complexity and reduced fault coverage. The technique of this paper eliminates the need for such a trade-off. Retiming can also reduce the number of flip-flops in a circuit, thereby reducing the hardware overhead and test time of full [5] or partial [3, 6] scan test. Those techniques globally retime the entire circuit. The retiming application of this paper is local and can be incorporated after any other optimizations have been done.
