Improving Thermal-Safe Test Scheduling for Core-Based Systems-on-Chip Using Shift Frequency Scaling by Tafaj, Enkelejda et al.
several embedded cores are concurrently tested at the system level to reduce test time. Consequently,
a significant amount of research has been devoted to reducing power consumption during
test in order to overcome these issues. Several solutions have been developed for test planning during
embedded core design, as well as during chip-level system integration. Techniques falling in
the first category include low-power scan chain architectures with gated clocks [16, 4, 14], scan cell
and test pattern reordering [3, 5], and low-transition test patterns generated by specialized ATPG
algorithms [19] and low-transition TPGs [18]. The second category of techniques is mainly based
on power-constrained test scheduling algorithms [2, 8, 10, 7, 6, 1, 13, 11, 12] and the recently
proposed thermal-safe test scheduling algorithms [15]. Unlike power-constrained test scheduling
approaches, the thermal-safe test scheduling method we have presented in [15] guarantees hot-spot-
free test schedules by ensuring that a given critical die temperature is not exceeded during test. This
is possible by limiting the maximum test concurrency in each test session based on the thermal
behaviour of the cores under test rather than on their power consumption.
In this paper we use scan shift frequency scaling as a means of lowering die temperature and
investigate its impact on the thermal-safe test scheduling process. In Section 2 we present an
algorithm which determines the appropriate scan shift frequency for each test session in order to
minimize the overall testing time and improve the ability to generate hot-spot-free test schedules
under very tight thermal constraints. Scan shift frequency scaling also resolves any thermal
violations that may arise, an issue which was not explicitly addressed in the approach presented in [15].
An added advantage of this solution is that it does not require any modification of the embedded cores,
which was indicated as a potential solution in [15]. A minor drawback of the proposed approach is that,
during test, the scan shift clock may need to be changed from one test session to another. The
experimental validation of the proposed approach is discussed in Section 3.
2: Thermal-safe test scheduling using scan shift frequency scaling
The mean time to failure (MTTF), a commonly used metric in reliability models, is based
on the Arrhenius equation, which shows that reliability decreases exponentially as the absolute
junction temperature $T$ increases: $\mathrm{MTTF} = A e^{E_a/(kT)}$, where $A$ is an empirical
constant, $E_a$ is the so-called activation energy and $k$ is Boltzmann’s constant [17]. The
semiconductor industry currently uses commonly accepted limits for the maximum operating junction
temperature based on the device package type. These limits have been well accepted as numbers
relating to reasonable device lifetimes and thus failure rates. For example, for devices fabricated
in a molded package, the maximum allowable junction temperature is 150 °C, while for devices
assembled in ceramic or cavity DIP packages, the maximum allowable junction temperature is
175 °C [9]. Based on these practices, the
thermal-safe test scheduling approach proposed in this paper aims to produce solutions ensuring
that the maximum allowable junction temperature will not be exceeded during test. Throughout
this paper, the term “hot-spot” will be used to refer to cores that exceed the maximum allowable
junction temperature during test. Any tests running below this critical temperature are considered
to be “thermally safe”.
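To give a feel for the exponential dependence in the Arrhenius relation, the following minimal sketch compares the MTTF at two junction temperatures. The constant A cancels in the ratio. The activation energy of 0.7 eV and the function name `mttf_ratio` are assumptions chosen for illustration only; they are not values or names taken from this paper.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def mttf_ratio(t_low_c: float, t_high_c: float, ea_ev: float) -> float:
    """Return MTTF(t_low) / MTTF(t_high) for MTTF = A * exp(Ea / (k * T)).

    The empirical constant A cancels in the ratio. Temperatures are given in
    degrees Celsius and converted to absolute temperature (kelvin).
    """
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    exponent = ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_low_k - 1.0 / t_high_k)
    return math.exp(exponent)


# Hypothetical activation energy of 0.7 eV (an assumption, not from the paper).
ratio = mttf_ratio(150.0, 175.0, ea_ev=0.7)
print(f"MTTF at 150 C is about {ratio:.2f}x the MTTF at 175 C")
```

Under this assumed activation energy, a 25 °C difference in junction temperature changes the predicted lifetime by a factor of several, which is why a hard temperature limit is a natural scheduling constraint.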
According to the well-known electro-thermal duality, there is a linear relationship between the die
temperature (T) and the power consumption (P) [17]. Since dynamic power consumption is directly
proportional to the clock frequency, it can be concluded that there is a linear dependency between
the die temperature and the operating clock frequency. In scan-based test, the shift cycles dominate
the testing time, and consequently the thermal behaviour of the silicon die during test. In this work
we exploit the above observations, together with the fact that the scan shift frequency can be changed
without affecting the quality of the test, to use scan shift frequency scaling as a method
of lowering the die temperature during test. The cost of the lower die temperature obtained
by scaling down the scan shift frequency is a longer test time: for example, halving the shift
frequency doubles the test length.
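The trade-off described above can be sketched with a simple linear model. The sketch assumes that the temperature rise above ambient is proportional to the scaling factor applied to the shift frequency, that is, steady-state conditions with static power neglected. These simplifications and the names `max_safe_scale` and `scaled_test_length` are illustrative assumptions; the approach in this paper relies on thermal simulation rather than on a closed-form model.

```python
def max_safe_scale(t_ambient: float, t_at_init: float, t_max: float) -> float:
    """Largest scaling factor s in (0, 1] that keeps the temperature at or below t_max.

    Assumed linear model: T(s) = t_ambient + s * (t_at_init - t_ambient),
    where t_at_init is the temperature reached at the default shift frequency.
    """
    if t_max <= t_ambient:
        raise ValueError("t_max must be above the ambient temperature")
    if t_at_init <= t_max:
        return 1.0  # already thermally safe, no scaling needed
    return (t_max - t_ambient) / (t_at_init - t_ambient)


def scaled_test_length(length_at_init: float, scale: float) -> float:
    """Test length after scaling the shift frequency by `scale` (shift cycles dominate)."""
    if not 0.0 < scale <= 1.0:
        raise ValueError("scale must be in (0, 1]")
    return length_at_init / scale


# Hypothetical numbers: 45 C ambient, 185 C at the default frequency, 150 C limit.
s = max_safe_scale(t_ambient=45.0, t_at_init=185.0, t_max=150.0)
print(s, scaled_test_length(2.0, s))
```

With a scaling factor of 0.5, `scaled_test_length` doubles the test length, matching the halving example in the text.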
The proposed test scheduling algorithm is shown in Figure 1. The algorithm starts from the
set of cores (S) of the target system, the corresponding test compatibility graph (TCG) and the
maximum junction temperature that can be tolerated during test (Tmax). Each core is annotated
with the length of its corresponding test for a given default scan shift frequency (Freqinit). The
TCG captures the concurrency compatibility relationships between the system cores: each node in
the TCG corresponds to a core, and an edge between two nodes means that the two corresponding
cores can be tested concurrently without causing any resource conflicts. The algorithm returns
a thermal-safe test schedule as a list of test sessions and their corresponding scaling factors for
the scan shift frequency. Each test session in the test schedule is a group of cores to be tested
concurrently. It is assumed that all cores tested in the same test session share the same scan shift
clock, but this can vary from one test session to another.
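One way to picture the inputs is to store the TCG as adjacency sets and enumerate groups of mutually compatible cores, which are the candidates for a test session. The sketch below uses the basic Bron-Kerbosch recursion to list maximal cliques and then picks the clique whose cores have the largest summed test length. Restricting the search to maximal cliques, the core names, the test lengths and the helper names are all assumptions made for illustration; Figure 1 remains the authoritative description of the algorithm.

```python
from typing import Dict, List, Set


def maximal_cliques(tcg: Dict[str, Set[str]]) -> List[Set[str]]:
    """Enumerate the maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    found: List[Set[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            found.append(set(r))  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & tcg[v], x & tcg[v])
            p = p - {v}  # v has been fully explored
            x = x | {v}  # exclude v from later cliques at this level

    expand(set(), set(tcg), set())
    return found


def longest_clique(tcg: Dict[str, Set[str]], test_len: Dict[str, float]) -> Set[str]:
    """Clique with the longest test length if its cores were tested sequentially."""
    return max(maximal_cliques(tcg), key=lambda c: sum(test_len[v] for v in c))


# Hypothetical 4-core system: an edge means the two cores may be tested concurrently.
tcg = {
    "c1": {"c2", "c3"},
    "c2": {"c1", "c3"},
    "c3": {"c1", "c2", "c4"},
    "c4": {"c3"},
}
test_len = {"c1": 1.0, "c2": 2.0, "c3": 0.5, "c4": 4.0}  # seconds at the default frequency
print(longest_clique(tcg, test_len))
```

For larger graphs a library routine for clique enumeration would normally be preferable to a hand-written recursion.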
The algorithm starts by computing all the cliques of the TCG and selecting the clique with the longest test
length if its cores are tested sequentially. The corresponding cores are then assigned to
a test session TS (lines 4-8). Next, the scan shift frequency for TS (FreqTS) is set to the default
scan shift frequency (Freqinit) and a thermal simulation is carried out on TS in order to determine
3: Experimental results
Table 1 compares the performance of the proposed algorithm (columns 6-7) with the power-constrained
test scheduling approach presented in [7] (column 3) and the thermal-safe test scheduling
approach with fixed scan shift frequency presented in [15] (columns 4-5). In our experiments, we
have used the benchmark designs from [7]. Details such as physical layout dimensions and realistic
test power and time values needed to be added to the original design descriptions in order to
provide all necessary information for the proposed thermal-safe test scheduling algorithm. Thermal
simulations were performed using the HotSpot tool [17].
The performance of the test scheduling algorithms is compared in terms of test schedule length
(columns 3, 4 and 6), and thermal simulation effort (columns 5 and 7). The second column shows
the thermal constraint Tmax used in each experiment. For each design, three values of Tmax were
used: 130 °C, 150 °C and the maximum temperature corresponding to the test schedule obtained
using the power-constrained test scheduler that was presented in [7]. The cases where a test schedule
could not be generated for the given thermal constraint because of thermal violations are marked
with N/A. As can be seen, the proposed solution is able to compute a thermal-safe test schedule for
all designs and thermal constraints considered. For example, for the circuit muresan 20, both
power-constrained test scheduling from [7] and the thermal-safe test scheduling approach from [15] fail
to meet the thermal constraints of 130 °C and 150 °C because the die temperature of certain cores
exceeds these values at the default scan shift frequency even when tested in a purely sequential test
schedule. Moreover, even in some cases where the first two approaches can compute a thermal-safe
solution, the proposed approach generates a shorter test schedule. For example, for the design
muresan 20, the test schedule generated using the proposed approach is only 4.13 seconds long,
compared with the 5.69-second test schedule generated using the power-constrained approach
presented in [7] and the 4.89-second test schedule computed using the thermal-safe test scheduling
approach presented in [15]. This is because, in some cases, the overall testing time gains due
to the increased test concurrency per test session obtained by scaling down the shift frequency
exceed the increase in test session length due to scaling. The downside of the proposed solution
is the increased thermal simulation effort. For example, for system s with a thermal constraint of
104.48 °C, the thermal simulation length required by the proposed approach is over 28 seconds,
compared with less than 18 seconds required by the approach presented in [15].
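The reason a scaled-down schedule can still be shorter overall can be illustrated with a small calculation. The sketch assumes that a session lasts as long as its longest core test, stretched by the inverse of the session's scaling factor, and that sessions run back to back. These assumptions, the numbers and the function names are illustrative only and are not taken from Table 1.

```python
from typing import List, Tuple

# A session is (core test lengths in seconds at the default frequency, scaling factor).
Session = Tuple[List[float], float]


def session_length(core_lengths: List[float], scale: float) -> float:
    """Length of one session: its longest core test, stretched by 1 / scale."""
    if not 0.0 < scale <= 1.0:
        raise ValueError("scale must be in (0, 1]")
    return max(core_lengths) / scale


def schedule_length(sessions: List[Session]) -> float:
    """Total schedule length when sessions are executed one after another."""
    return sum(session_length(lengths, scale) for lengths, scale in sessions)


# Hypothetical case: two 2-second cores that overheat if tested together at full speed.
sequential = [([2.0], 1.0), ([2.0], 1.0)]  # one core per session, default frequency
concurrent = [([2.0, 2.0], 0.8)]           # both cores together, frequency scaled to 80%

print(schedule_length(sequential), schedule_length(concurrent))
```

In this hypothetical case the concurrent session is 25% longer than a single unscaled session, yet the schedule shrinks from 4.0 s to 2.5 s because one session replaces two.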
The number of test sessions which require scaling down of the scan shift frequency increases as
the thermal constraint Tmax is lowered. This is shown in Table 2 for the design muresan 20. The