Introduction
System designers are adopting the system-on-chip (SoC) design methodology because of its efficiency compared to the traditional system-on-board approach. The main benefit of the SoC approach is that it can drastically shorten the design cycle by allowing pre-designed cores and their associated test sets to be reused. The IEEE 1500 standard [1] supports this test reuse methodology by standardizing a test wrapper.
The use of the SoC design methodology introduces several new problems and challenges in testing [2]. First, the cores that are embedded deep inside the silicon chip require a Test Access Mechanism (TAM) for test data transportation. Several TAM architectures have been proposed, such as TestRail [3], Virtual TAM [4], and TAMs based on transparency [5]. Second, the SoC's core-based design requires a mechanism to isolate the cores during test. This is achieved by the use of core wrappers [1], [3]. Third, the cores can either be tested sequentially at the cost of longer test application time, or in parallel at the cost of larger area overhead and power dissipation. In this regard, various test scheduling solutions have been proposed [6]-[22]. The core test scheduling approaches proposed in [6]-[15] rely on a dedicated TAM. This extraneous TAM is added to the SoC for the sole purpose of delivering the test vectors from external automatic test equipment (ATE) to the core under test (CUT).
Regardless of how efficient the test schedule optimization, wrapper optimization, or TAM optimization algorithms are, adding a dedicated TAM for test data transportation by itself requires considerable area overhead. In addition, the long TAM wires increase routing congestion. With feature sizes below 90 nm, the long wires are likely sites of production defects. Defects in TAMs would prevent any cores from being properly tested, thereby affecting yield. To avoid relying on TAMs, the existing functional communication architecture should be used as an alternative to the extraneous TAM for testing purposes. The test scheduling methods proposed in [6]-[15] cannot be applied to this new problem, which has a single shared bus where each bus wire cannot be individually assigned according to the optimum test schedule. Cores can only be tested sequentially, using the full functional bus width.
In order to maximally reuse the existing chip resources for testing, the authors in [16] described a method based on consecutive transparency, where test access paths are formed by creating transparent paths through existing functional connections between the SoC cores, thereby reusing most of the existing interconnects. However, the drawback of the proposed approach is that it is intrusive; sometimes, establishing the paths requires modification to the core internals, which might affect the critical paths of the cores.
by a brief technical overview in Sect. 4. In Sect. 5, the support architecture design for the efficient utilization of the functional bus during testing is described. Section 6 elaborates the methodology to develop an efficient test schedule using the functional bus. In Sect. 7, we thoroughly evaluate our methodology experimentally. Finally, a brief set of conclusions is offered in Sect. 8.
Related Work
The author in [17] discussed functional bus based test application for SoCs. When utilizing the embedded processor as a tester [18]-[20], direct memory access (DMA) is used to transport test data from the external tester to the embedded memory [18], after which the test data are transported to the CUT through the addressable system bus [19]. In [20], the embedded processor is tested prior to core testing using software-based self-test [23]. The embedded memories are tested by either the processor or the embedded memory built-in self-test (MBIST).
A hybrid TAM architecture which uses the existing functional bus in addition to extra TAMs was proposed in [8]. In this approach, the functional bus is converted into a bundle of TAMs by adding some logic to make the bus wires controllable and observable to external testers. As a result, each functional bus wire can be individually controlled and assigned to cores for the test data transportation, similar to an added TAM. To further minimize the test application time by optimizing TAM utilization, shared test vectors are broadcast to multiple cores.
In [21], the authors propose a test interface architecture between PCI buses and CUTs. The CUTs are tested using pseudo-random test vectors generated by the embedded processor. In [22], a buffer interface between a functional bus and a CUT is proposed, while the control of test application is performed by a Finite State Machine (FSM) based controller. The hardwired controller in [22] has two main weaknesses compared to our approach: (1) the area cost is proportional to the volume of test data, and (2) the FSM-based test schedule is fixed, making it impossible to change. Flexibility in the test schedule is especially important during the hardware debugging stage. The test responses, on the other hand, are not transported through the bus to the test sink. Instead, they are compressed by local embedded multiple input signature registers (MISRs), which could cause aliasing. Furthermore, the MISRs incur hardware overhead.
Due to the use of MISRs, the test application time of [22] should, in most cases, be shorter than that of our approach, which transports the test responses back to the tester. However, the software-based test program in our proposed approach has the advantage of being flexible, and does not incur hardware overhead other than the buffers. The role of the FSM-based test controller in [22] can be taken over by an external tester or an embedded processor. This is further discussed in Sect. 4.

To differentiate the two types of test data transportation approaches, we define the following terms:

Definition 1: Dedicated TAM is a set of dedicated wires that are added to the SoC for the test data transportation between an ATE and all the SoC cores.
Definition 2: Functional TAM refers to the existing SoC's functional interconnects which are transformed and reused for the test data transportation.
In this paper, we illustrate our power-constrained SoC testing approach which utilizes the functional TAM for the test data transportation. In order to take advantage of the functional TAM, we approach the problem from two angles, namely, a support architecture design framework and an algorithmic framework. In the process, we show how our approach greatly simplifies the test program, one of the primary strengths and differentiators of our proposed methodology. Such a simplification is attained through the support of an efficient test architecture, which includes appropriate timing control circuitry.
Motivation
Let us look at some of the possible scenarios regarding packet-based test delivery utilizing the shared functional TAM. To ease description, let us denote each of the small test data units as a test packet. Figure 1 illustrates a sequence of events when test packets are transported between a tester and two CUTs, Core A and Core B, which are interfaced to the functional TAM through dedicated local buffers Buffer A and Buffer B, respectively, as shown in Fig. 2. Bus and Core A/B represent the activities on the functional TAM and at the CUT, respectively. Once a packet carrying test vector data, v_t (labeled V), is received by the local buffer, the test response packet (labeled R) from the previous test vector, v_{t-1}, is returned in the next time slot.
The round-robin packet delivery schedule in Fig. 1 is a reasonable first attempt at scheduling the test delivery because of its fair allocation of the bus.

Fig. 1 Test data transportation using packet-based delivery on the functional TAM. An R packet carrying the test response data is returned to the tester after every V packet, which carries test vector data.

Figure 3 (a) shows a similar delivery pattern for three non-identical CUTs m1, m2, and m3. For each CUT mi, each packet goes through two separate stages of transfer (illustrated by Fig. 2). First, it is delivered from a tester to the buffer through the functional TAM (labeled Bus in Fig. 3, where each time slot represents both V and R packets). In the second stage, the packet is transferred from the buffer into the scan chains (labeled mi in Fig. 3).

Figure 3 (b) shows CUT m3 idle, waiting for test data because the test packet for m3 cannot be delivered until the test packet for m2 has been delivered. However, the test packet for m2 cannot be delivered until the test application of the previous packet of m2 has been completed. Consequently, m3 is starved of test data, and at the same time the bus remains idle while waiting for m2 to complete test application even though m3 needs test data. An analogous situation holds for m1. CUT m2, on the other hand, always receives its test data in a timely manner at the expense of starving m1 and m3.
The problem can be remedied by increasing the packet size for m1 and m3, as in Fig. 3 (b). However, this quick fix implies that larger buffer spaces are required for m1 and m3 to store the larger packets. We can reduce packet sizes for all cores, but the minimum packet size for each core is constrained by the core with the smallest packet size (i.e., m2). Further reduction in packet sizes for m1 and m3 would reintroduce the problem illustrated in Fig. 3 (a).

The communicates to the ATE and the CUT using the functional read/write transactions. The TIC relays the vector data packets from the ATE to the CUT, and the response packets from the CUT to the ATE. Several issues need to be resolved in order to utilize the programmable core as a test source/sink, among them the testing of the programmable core itself and the loading and offloading of the test data to and from the programmable core. In this paper, an external ATE (connected through a TIC) is used as the test source/sink; therefore, these issues regarding the programmable core are not addressed.
Test scheduling that enables the reuse of the functional bus as a functional TAM involves three steps: first, breaking the test set into subsets capable of efficiently utilizing the bus; second, scheduling all tests for every core with the objective of test time minimization; and third, generating a test program which will be executed in real time by the tester to perform test application for all the SoC cores. In order for the benefits of utilizing the functional TAM for testing to outweigh those of the dedicated TAM approach, the test architecture needs to support the algorithmic framework, and vice versa. Otherwise, problems arise such as bus underutilization because of the required arbitration between different cores, and improper test schedules that cause certain cores to starve of test data while other cores hog the bus, resulting in prolonged test application time.
Test data overflow/underflow is a particularly serious potential problem unless the necessary timing synchronization or interlock between the tester and the cores under test is provided. The test support architecture should be designed to also provide timing synchronization, while minimizing the hardware overhead.
Test Support Architecture
A buffer-based test architecture was proposed in [22] in order to enable concurrency of core-based testing using a shared functional TAM. Our general test architecture, with buffers similar to [22], is shown in Fig. 2, with the corresponding test delivery and test application timing diagram similar to Fig. 3. Figure 5 shows the detailed architecture of the interface between the functional TAM and the core through the functional bus protocol interface. Both the functional connections (solid lines) and design-for-testability (DFT) connections (dotted lines) are shown. The components shown in solid black shading are the proposed buffer-based DFT architecture. Boundary cells (BCs) are added to the core PI/POs in order to isolate the core during test. The wrapper scan chains are formed by chaining the input BCs, internal scan chains (ISCs), and output BCs, in that order. Bidirectional I/Os are treated similarly to the ISCs since the BC's scan inputs, and not functional inputs, are used. Bidirectional I/Os are not shown in Fig. 5 to avoid clutter.

During the test application, the test data are delivered to the input buffer and then scanned into the scan chains. At the same time, the test responses are scanned out and stored in the output buffer before being retrieved by the tester for analysis. The buffer consists of four main components (input register, output register, fall-through stack, and FIFO buffer controller), as shown in Fig. 6 (a), which illustrates the input buffer and the corresponding first-in first-out (FIFO) buffer controller. The output buffer (not illustrated) has the same structure as the input buffer but with reverse data flow. The input register latches data from the bus. When the status bit of the input register indicates full, the top of the stack copies the data from the input register, provided that its own status bit indicates that it is empty. After copying, the input register status bit is cleared, preparing it for the next cycle of data from the bus.
The stack will subsequently go through the fall-through stages which will bring the data to the lowest empty slot.
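The input-register and fall-through behavior described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not the circuit of Fig. 6: the class name `FallThroughBuffer`, its methods, and the one-slot-per-cycle fall rate are all hypothetical, and a slot holding `None` stands in for a cleared status bit.

```python
from typing import List, Optional


class FallThroughBuffer:
    """Toy model of an input register feeding a fall-through stack (hypothetical)."""

    def __init__(self, depth: int) -> None:
        self.input_reg: Optional[int] = None  # None models an "empty" status bit
        # Index 0 is the top of the stack, the last index is the lowest slot.
        self.stack: List[Optional[int]] = [None] * depth

    def write_from_bus(self, word: int) -> bool:
        """Latch a bus word; refuse it while the input register is still full."""
        if self.input_reg is not None:
            return False
        self.input_reg = word
        return True

    def clock(self) -> None:
        """Advance one cycle: stored words fall one slot, then the top reloads."""
        # Scan from the bottom up so each word moves at most one slot per cycle.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i] is None and self.stack[i - 1] is not None:
                self.stack[i], self.stack[i - 1] = self.stack[i - 1], None
        # The top slot copies the input register only when the top slot is empty,
        # and the input register status is cleared by the copy.
        if self.stack[0] is None and self.input_reg is not None:
            self.stack[0], self.input_reg = self.input_reg, None

    def read_bottom(self) -> Optional[int]:
        """Remove and return the oldest word, or None if the lowest slot is empty."""
        word, self.stack[-1] = self.stack[-1], None
        return word
```

In this model, words leave the lowest slot in the order in which they were written from the bus, which is the first-in first-out property the buffer controller is meant to guarantee.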
The output register is composed of s_m bits, where s_m is the number of wrapper scan chains for core m, possibly differing from the bus width, w_b. It is interfaced directly to the scan chain inputs.

not applicable to embedded cores that are not directly accessible from the bus. One possible way of mitigating this is by introducing bypass interconnects between the core and the bus; this issue is outside the scope of this paper and is therefore not discussed. In Sect. 6, the scheduling of test vector and response transportation for the embedded cores is discussed. It is assumed that the FIFO buffers and controllers are fault-free and therefore not the target of testing. The test of these DFT architectures can be done either in an integrated fashion or independently of the core tests (i.e., a priori). In this paper, the latter is assumed.
Packet Delivery Scheduling Algorithm
In this section, the issues related to packet delivery scheduling discussed in Sect. 3 are addressed. The packet delivery schedule that minimizes the test application time is developed with two objectives:
1. Minimization of the total required buffer size.
2. Maximization of bus utilization.
The above objectives are sought while at the same time ensuring that all cores receive the test data in a timely manner. In order to satisfy these twin objectives, the buffer size for each core and the test delivery sequence need to be optimal.
The scheduling algorithm consists of two hierarchical steps. The first step (described in Sects. 6.2 and 6.3) is the grouping of cores which can be tested simultaneously under a maximum power constraint. In the second step (defined in Sects. 6.4 and 6.5), for each group of cores, the optimum number of packets (and the corresponding packet size) for every core is determined. Each of these packets is then scheduled for delivery through the functional TAM.
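To make the first hierarchical step concrete, the sketch below groups cores so that the summed test power of each group stays within a maximum power limit. It uses a plain first-fit-decreasing heuristic, which is an assumption chosen for illustration and not the grouping procedure of Sects. 6.2 and 6.3; the function name `group_cores_by_power` and the power figures in the usage note are hypothetical.

```python
from typing import Dict, List


def group_cores_by_power(power: Dict[str, float], p_max: float) -> List[List[str]]:
    """Greedy first-fit-decreasing grouping under a per-group power limit.

    Illustrative heuristic only; it does not claim to reproduce the
    grouping step of the paper.
    """
    groups: List[List[str]] = []
    totals: List[float] = []  # running power sum of each group
    # Place the most power-hungry cores first.
    for core in sorted(power, key=lambda c: power[c], reverse=True):
        if power[core] > p_max:
            raise ValueError(f"core {core} alone exceeds the power limit")
        for g, total in enumerate(totals):
            if total + power[core] <= p_max:
                groups[g].append(core)
                totals[g] += power[core]
                break
        else:
            # No existing group has room, so open a new one.
            groups.append([core])
            totals.append(power[core])
    return groups
```

For example, with hypothetical power values `{"m1": 600, "m2": 500, "m3": 400, "m4": 300}` and a limit of 1000, the sketch yields the groups `[["m1", "m3"], ["m2", "m4"]]`; the cores within each group would then be tested simultaneously, and the groups one after another.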
In this section, the algorithmic framework is discussed in terms of the two hierarchical steps above. We start by defining a set of nomenclature useful in describing the methodology.

The packet set scheduling algorithm consists of three steps. The first step determines how to split the test packet for each core (i.e., finds the split ratios) so that the individual packet sizes are equal. If the packet sizes are not equal, the largest packet becomes the constraint when minimizing the total buffer size, as illustrated in Sect. 3. In the second step, once the split ratios have been identified, the packet sizes are determined by solving a set of linear equations. In the third step, a sequence of packet set delivery schedules is systematically formed.
Terminology
6.5.1 Step 1
Let us consider a test group which has n cores to be tested simultaneously. In the first step of the algorithm, all k<n cores with scan rates smaller than the average scan rate of all cores are assigned a split ratio of one, the smallest split ratio. This is because the other, larger cores will be assigned split ratios larger than or equal to one. Under the PASS scheme, the smallest possible number of packets is desirable when forming a packet set in order to minimize the complexity of the resulting test program. Before proceeding, we define a term to aid the description of the algorithm.
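The selection of split-1 cores in this step can be sketched in a few lines. The function name `split_one_cores` and the scan-rate values in the usage note are hypothetical; the sketch only applies the "below the average scan rate" rule stated above and does not assign the larger split ratios.

```python
from typing import Dict, List


def split_one_cores(scan_rate: Dict[str, float]) -> List[str]:
    """Return the cores whose scan rate is strictly below the group average."""
    average = sum(scan_rate.values()) / len(scan_rate)
    return [core for core, rate in scan_rate.items() if rate < average]
```

With hypothetical scan rates `{"m1": 2, "m2": 4, "m3": 12}`, the average is 6, so `m1` and `m2` are treated as split-1 cores (k = 2, n = 3). Because at least one core must lie at or above the average, k is always smaller than n, in line with the condition k<n.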
Definition 9: Assuming that the bus delivery rate is sufficiently high, a packet set is considered to be in perfect fit if (i) it does not have cores that are waiting for test data, (ii) no two consecutively delivered packets belong to the same core, and (iii) the number of packets between adjacent split-1 packets is equal. Furthermore, all three conditions must still hold when two adjacent perfect-fit packet sets are cascaded, except possibly for the initial or final legs of test application.

Figure 9 shows a perfect-fit delivery sequence, where the test group consists of nine cores, m1 to m9. Between the four split-1 cores (m1-m4), eight packets belonging to other cores (m5-m9) are delivered (perfect-fit condition (iii)). In order to ensure the perfect-fit criteria are not violated when forming a perfect-fit packet set consisting of split-1

To preclude introduction of gaps between the test applications of two consecutive packets of a core (condition (i) of Definition 9) as illustrated by Fig. 3 (b

6.5.3 Step 3
In step one, the cores are assigned to either split-1, split-r, or split-2k groups. Once the split ratios are determined, the complete packet set schedule that fulfills conditions (ii) and (iii) of Definition 9 can be systematically represented by Fig. 11, assuming k and q cores for split-1 and split-2k

Table 1 shows the frequency information for dedicated TAM approaches and two variations of our approach, PASSa and PASSb, with distinct bus frequencies. The scan frequency, f_s, is set to the assumed maximum, F_s = 100 MHz; therefore, all the dedicated TAM-based TATs [9]-[12] are divided by 10^5 to convert from the number of clock cycles to time (milliseconds). The bus frequency, f_b, for PASSb is double that of the dedicated TAM-based and PASSa approaches (but less than the maximum bus operating frequency, F_b) to illustrate the benefit of our buffer-based approach.
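The cycle-to-time conversion used for the dedicated TAM-based results follows directly from the scan frequency: at 100 MHz there are 10^5 clock cycles per millisecond. The helper below is a minimal sketch of that arithmetic; the function name `cycles_to_ms` and the cycle count in the usage note are illustrative and not taken from the tables.

```python
def cycles_to_ms(cycles: float, scan_freq_hz: float = 100e6) -> float:
    """Convert a test application time in scan clock cycles to milliseconds."""
    cycles_per_ms = scan_freq_hz / 1e3  # 100 MHz gives 1e5 cycles per millisecond
    return cycles / cycles_per_ms
```

For instance, a hypothetical test application time of 250,000 cycles corresponds to `cycles_to_ms(250_000)`, which is 2.5 ms at the default 100 MHz scan frequency.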
In Table 2, the TATs for [9], [11], [12] are all equal at the three Pmax values for the h953 circuit. Our approach obtained similar results. No noticeable improvement was achieved when increasing the bus frequency (PASSb). These steady results are due to a single dominant core, m1, that constrains the TAT minimization for this circuit. Figure 13 shows plots of the TAT for different bus widths. For 64- to 128-bit buses, the TAT is constrained by the largest core; therefore, increasing the bus width has no significant effect on test application time. However, for bus widths between 12 and 48 bits, PASSa delivers improvements of 4.8% and 18.2% over [10] for both maximum power (Pmax) values of 3,000 and 10,000 for p22810. PASSb improves by 25.9% to 47.8% when test data delivery time is the limiting factor. Similar trends can be observed for p93791 in Fig. 13 (c) and 13 (d). In fact, our test methodology delivers marked improvements in reducing test application time for smaller bus widths.
For d695 (Table 3), our approach proves to be highly effective, even for the same bus frequency as [10], [12], at all power levels for bus widths ranging from 32 to 80 bits. For 96-bit and wider buses, however, our methodology does not perform as well. It is interesting to note that the dedicated TAM-based approach requires quite elevated levels of TAM overhead in order to outperform our packet scheduling approach using the functional TAM. In Table 4, some performance comparisons with several scheduling approaches are given. The second row shows the TAT for the dedicated TAM-based scheduling without considering power and hierarchy constraints [14]. The TAT is bounded by the lower bound of T_LB = 10.2 ms [14]. When the design hierarchy is considered as a constraint, the resulting TAT [10] is shown on the third row. On rows 4-6, the TAT of a test set sharing and broadcasting approach is given. When using only the functional TAM (fourth row), the TAT is 27% higher than the case of using only the dedicated

existing functional bus for testing purposes. The utilization of the functional bus for power-constrained core-based SoC testing entails a number of challenges. These include the frequency and bit-width mismatch between the bus and the cores under test, and the allocation of bus time slots for an efficient test data delivery schedule that maximizes bus utilization and ensures that all cores always have the test data they need to continue testing simultaneously without exceeding the power constraint.
We have proposed an efficient methodology that overcomes all of these challenges through a test support architecture design framework and an algorithmic design framework. The proposed methodology offers a solution that also minimizes the size of the test program. The experimental results show the benefits of the proposed methodology in reducing test application time, especially for smaller bus widths, while also eliminating the need to add extraneous TAMs to the SoC solely for testing purposes.
