Abstract-Motivated by the increasing design-for-test (DFT) area overhead and potential performance degradation caused by wrapping all the embedded cores for modular system-on-a-chip (SOC) testing, this paper proposes a solution for reducing the number of wrapper boundary register (WBR) cells. By utilizing the functional interconnect topology and the WBRs of the surrounding cores to transfer test stimuli and responses, the WBRs of some cores can be removed without affecting the testability of the SOC. We denote the cores without WBRs as light-wrapped cores and present a new modular SOC test architecture for concurrently testing both the wrapped and the light-wrapped logic cores. Since the WBRs of cores that transfer test stimuli and test responses for light-wrapped cores become shared resources during test, conflicts arise during test scheduling that negatively impact the test application time. To alleviate this problem, we present a novel test access mechanism (TAM) design algorithm for the proposed SOC test architecture. We conduct experiments on several SOC benchmark circuits and demonstrate that, with an acceptable increase in test application time, the number of WBRs can be significantly decreased. This ultimately lessens the DFT area required for modular SOC testing and reduces the propagation delays between cores.
I. INTRODUCTION

SYSTEM-ON-A-CHIP (SOC) design using reusable intellectual property (IP) cores has become a state-of-the-art implementation paradigm that has triggered novel business models based on IP core providers and system integrators [7]. The IP cores are predesigned and preverified by the core providers; however, SOC composition is the duty of the system integrator, who is also in charge of verification and manufacturing test of the entire SOC, including the IP-protected internal cores. Although IP core reuse shortens the design cycle, the rapid increase in SOC complexity makes test development a major implementation bottleneck [29]. This bottleneck is caused by the increasing number of internal cores, which cannot be tested easily since they are not directly accessible from the primary inputs (PIs) and primary outputs (POs). Various solutions for exploiting the SOC's architecture-specific information and using functional interconnect as test access mechanisms (TAMs), either at the core or at the system level, have been proposed [2], [3], [15], [21], [22], [28]. Regardless of their potential benefits in the long term, unless implemented automatically using a reliable test tool flow, these architecture-specific design-for-test (DFT) methodologies do not provide reusability, scalability, or interoperability, and may become the computational bottleneck in the test automation flow. This problem is overcome by modular test strategies [29], which use dedicated bus-based TAMs for test data transportation. However, to enable both core reuse and easy test access, the embedded cores are connected to TAMs using special interfaces called core wrappers. Therefore, when the number of cores or the number of core terminals increases, the area introduced by core wrappers also grows, which in turn adds to the overall cost of test.
To address this issue, the main objective of this paper is to investigate how the wrapper count can be reduced while maintaining the benefits of modular SOC testing.
The rest of the paper is organized as follows. Section II reviews related work on modular SOC testing and motivates the research presented in this paper. Section III introduces light-wrapped cores that can facilitate a decrease in test resource usage necessary for modular SOC testing. In Section IV, we present a novel SOC test architecture with reduced wrapper count and provide the corresponding wrapper/TAM cooptimization algorithms. Section V describes our experiments and Section VI concludes this paper.
II. PRIOR WORK AND MOTIVATION
This section overviews prior work on wrapper design and test architectures, and motivates the proposed research work.
A. Embedded Core Wrapper Design
An overview of the standard IEEE P1500 wrapper is shown in Fig. 1. Its main purpose is core isolation during test and it has three main modes of operation [20]: 1) functional operation, in which the wrapper is transparent; 2) inward-facing test mode (INTEST), in which test access is provided to the core itself; and 3) outward-facing test mode (EXTEST), in which test access is provided to the circuitry outside the core. The wrapper has a mandatory 1-bit input/output pair, wrapper serial input (WSI) and wrapper serial output (WSO), and optionally one or more multibit input/output pairs, wrapper parallel input (WPI) and wrapper parallel output (WPO). The wrapper also comprises wrapper boundary register (WBR) cells to provide controllability and observability for the core terminals and wrapper bypass register (WBY) cells to serve as a bypass for the test data access mechanism. In addition, the wrapper has a wrapper serial control (WSC) port and an internal wrapper instruction register (WIR) used to control the different operational modes of the wrapper. It is important to note that the IEEE P1500 standard for embedded core test standardizes only the wrapper interface. Hence, the internal structure of the wrapper can be adapted to the specific SOC requirements.
Fig. 1. Overview of IEEE P1500's wrapper architecture [20].
0278-0070/$20.00 © 2005 IEEE
Marinissen et al. [16] described a scalable core wrapper called TestShell, which forms the basis for the IEEE P1500 core wrapper [20]. The interconnection of internal scan chains and wrapper cells to the external TAM lines determines the construction of wrapper scan chains (WSCs). Since the test application time (TAT) of a core is dependent on the maximum WSC length, the main objective in wrapper optimization is to build balanced WSCs. Marinissen et al. [17] addressed this problem by describing a COMBINE heuristic for hard cores. Later, Iyengar et al. [9] proposed the Design_wrapper algorithm based on the best fit decreasing heuristic for the bin packing problem, which tries to minimize the core's TAT and required TAM width at the same time. They also showed an important feature of wrapper optimization for hard cores, i.e., TAT varies with TAM width as a "staircase" function. According to this feature, only a few TAM widths between 1 and W max (the maximum TAM width) are relevant when assigning TAM resources to hard cores, and these discrete widths are called Pareto-optimal TAM widths.
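The staircase behavior can be illustrated with a minimal Python sketch. It is not the COMBINE or Design_wrapper algorithm: it uses a simple longest-first greedy assignment as a stand-in, ignores wrapper input and output cells, and the scan chain lengths are hypothetical. The function names are chosen here for illustration only.

```python
import heapq


def balance_wrapper_chains(scan_chain_lengths, width):
    """Assign internal scan chains to `width` wrapper scan chains.

    Longest chains are placed first, each onto the currently shortest
    wrapper chain. Returns the longest resulting wrapper chain length.
    """
    heap = [0] * width  # current length of every wrapper scan chain
    for length in sorted(scan_chain_lengths, reverse=True):
        shortest = heapq.heappop(heap)
        heapq.heappush(heap, shortest + length)
    return max(heap)


def pareto_optimal_widths(scan_chain_lengths, max_width):
    """Return the widths at which the longest wrapper chain gets shorter."""
    widths = []
    best = None
    for width in range(1, max_width + 1):
        longest = balance_wrapper_chains(scan_chain_lengths, width)
        if best is None or longest < best:
            widths.append(width)
            best = longest
    return widths


chains = [8, 8, 6, 6]  # hypothetical internal scan chain lengths
profile = {w: balance_wrapper_chains(chains, w) for w in range(1, 7)}
steps = pareto_optimal_widths(chains, 6)
print(profile)  # longest wrapper chain per TAM width
print(steps)    # widths where the staircase actually drops
```

For this toy core, widths 5 and 6 give no shorter wrapper chain than width 4, so only widths 1 to 4 would be worth considering when assigning TAM lines.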
B. SOC Test Architectures
Three basic types of SOC test architectures have been described in [1]: 1) the multiplexing architecture; 2) the daisy chain architecture; and 3) the distribution architecture. In the multiplexing and the daisy chain architectures, all cores get full access to all TAMs, while in the distribution architecture the total TAM is distributed over all cores. Two popular architectures that support more flexible test schedules have been proposed based on the above architectures. The Test Bus architecture proposed in [24] can be seen as a combination of the multiplexing and the distribution architectures, while the TestRail architecture proposed in [16] is a combination of the daisy chain and the distribution architectures. Based on the TAM line assignment strategy, the above modular test architectures can be further categorized into fixed-width test architectures and flexible-width test architectures. A vast body of research has been carried out for both types of architectures, and only a few representative approaches are summarized next.
For fixed-width test architectures, Iyengar et al. [9] first formulated the integrated wrapper/TAM cooptimization problem and broke it down into a progression of four incremental problems in order of increasing complexity. An integer linear programming (ILP) model was then presented to solve the problem. To decrease the CPU running time, the same authors combined efficient heuristics and ILP methods in [10] . Koranne [14] formulated the test scheduling problem as a network transportation problem and presented a two-approximation algorithm to solve this problem. While the above approaches concentrate on Test Bus architecture, [4] presented an efficient heuristic TR-Architect that works for both Test Bus and TestRail architectures. In [5] and [6] , TR-Architect was extended to account for the wire length cost and test control, respectively.
For flexible-width test architectures, Huang et al. [8] first mapped the test architecture optimization to the well-known two-dimensional bin packing problem and proposed a heuristic method based on the best fit algorithm to solve it. Iyengar et al. [12] presented an improved heuristic for the rectangle packing problem, when cores are supplied with fixed-length scan chains. Next, in [11], the same authors extended their algorithm to incorporate precedence and power constraints while allowing a group of tests to be preemptable, and in [13] they considered minimizing the tester buffer reloads and multisite testing. Zou et al. [30] used sequence pairs, borrowed from the place-and-route literature, to represent the placement of the rectangles and then employed a simulated annealing technique to find an optimal test schedule.
C. Motivation and Summary of Contributions
The previously mentioned solutions for the design of core wrappers and test architectures [1], [4] - [6], [8] - [14], [16], [17], [24], [30] assume that all the cores attached to the TAM wires are fully wrapped, i.e., WBRs are placed on all the functional input and output terminals. While this guarantees core isolation during test, and hence high test quality, some embedded cores may have high pin count, and consequently the DFT area overhead associated with the wrappers will increase the cost of the test. Moreover, since both the core's inputs and outputs are buffered in the wrapper, at least two sets of multiplexers (Fig. 2 shows simple implementation examples of P1500 wrapper boundary cells) are required to switch between the functional and test modes of operation. If placed on the critical paths, these multiplexers will lower the maximum operating frequency, thus having a direct impact on the SOC's performance. An emerging challenge is to find ways of avoiding the performance penalty without affecting the test quality. To solve this problem, an approach based on partial isolation rings was proposed in [23]. Despite avoiding the high number of multiplexers, the main limitation of this solution lies in its computational complexity. This is because prior to deciding which input/output wrapper cells need to be inserted or removed, an analysis needs to be performed to check whether each test vector can be functionally justified. Therefore, the extensive usage of automatic test pattern generation (ATPG) for this analysis will reduce the methodology's scalability and reusability. Furthermore, the dependence of the wrapper cell removal methodology on the test set at hand will also limit the applicability of additional diagnosis data since the inserted DFT hardware will support only the preanalyzed test set.
Fig. 2. Wrapper boundary cell for (a) core input and (b) core output terminal [20].
To lower the DFT area and performance overhead by reducing the number of WBR cells, we believe the main challenge lies in finding a solution that will not only maintain the test quality but also be compatible with the IEEE P1500 standard [20] and, at the same time, preserve the modularity and scalability of the existing tool flows for TAM design and test scheduling. Consequently, the aim of this paper is to investigate the suitability of reusing the functional interconnect for transferring test data to cores whose input and output WBR cells are removed. The main contributions of this paper can be summarized as follows:
• From the system integrator's standpoint, to test the embedded cores and their interconnects, full controllability and observability need to be provided at the inputs and outputs of each core. To ensure the modularity and scalability of an SOC test methodology, the controllability and observability of each embedded core should be test set independent. To achieve this, it is not necessary to wrap all the core's terminals with WBR cells, since the system integrator can also exploit the functional interconnect between cores to transfer the test data. To illustrate this observation, producers and consumers are introduced. For a given Core i , its producers are the cores that feed its PIs and its consumers are the cores that capture its POs in the normal (functional) mode. Fig. 3 shows a part of an SOC, where Core 3 is not wrapped with WBR cells; however, all its producers (Core 1 , Core 2 ) and its consumer (Core 4 ) are P1500-wrapped. For INTEST of Core 3 , the controllability of its input terminals is provided through its producers' output WBR cells while the observability of its output terminals is provided through its consumer's input WBR cells. In other words, we can shift in its test stimuli through the output WBR cells of Core 1 and Core 2 , feed in the test stimuli into Core 3 through its normal functional path, and then capture its test response and shift it out through the input WBR cells of Core 4 . Note, Core 3 cannot be tested using the EXTEST of Core 1 , Core 2 , and Core 4 because the state of the internal scan chain in Core 3 cannot be controlled and observed in EXTEST mode. Since we apply test stimuli and capture test responses through functional paths, all the interconnects are tested implicitly, and hence we do not need to perform EXTEST for Core 3 . 
It should be noted, however, that this implicit testing of interconnects loses the diagnostic information that differentiates between defects in Core 3 's interconnects and defects in Core 3 's internal logic. In this model, a P1500-wrapped core can serve as a producer and consumer at the same time because there are no test resource conflicts for using its wrapper output cells as a producer to shift in test stimuli and its wrapper input cells as a consumer to shift out test responses at the same time. To summarize the above-explained observation, a core can be tested without wrapping its terminals as long as all its producers and consumers are P1500-wrapped. If the core does not have other test modes except the INTEST and EXTEST modes, then it does not need a wrapper at all [ Fig. 4 (a)]; however, the producers and consumers must be updated with a P1500-compliant wrapper [Fig. 4(b) ] to support the proposed test strategy. If the core has other test modes, for example, it contains RAM or ROM blocks and has an additional built-in self-test (BIST) mode to test these internal memories, then, in addition to updating the producers' and consumers' wrappers, the core under test (CUT) needs a light wrapper without WBRs to support these additional modes, i.e., the light wrapper must include WIR and the WSC port to control the operational mode of the core. From now onwards, light-wrapped cores will refer to cores that do not need a wrapper at all or cores with a light wrapper, since both of them remove all the wrapper cells, which in turn reduce the DFT area and may improve the SOC's performance. The light-wrapped core requires either WSI/WSO or WPI/WPO to shift in the test stimuli and shift out the test responses to and from its internal scan chains. It may also include a serial or parallel bypass register (WBY) to enable a shortened test access path to other light-wrapped cores, if necessary. 
It is interesting to note that, if the light-wrapped core does not have internal scan chains, it can be treated as a UDL and the proposed test strategy for it is in essence a parallel EXTEST strategy.
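The producer/consumer condition described above can be expressed as a small check. The following Python sketch is illustrative only: the function names and the encoding of the functional interconnect as (driver, receiver) pairs are assumptions made here, and the connectivity is the one described in the text for Fig. 3.

```python
def producers_and_consumers(links, core):
    """Split the functional neighbours of `core` into producers and consumers.

    `links` is a set of (driver, receiver) pairs describing which core's
    outputs feed which core's inputs in functional mode.
    """
    producers = {src for src, dst in links if dst == core}
    consumers = {dst for src, dst in links if src == core}
    return producers, consumers


def can_be_light_wrapped(links, core, p1500_wrapped):
    """True if every producer and consumer of `core` is P1500-wrapped."""
    producers, consumers = producers_and_consumers(links, core)
    return (producers | consumers) <= set(p1500_wrapped)


# Connectivity described for Fig. 3: Core 1 and Core 2 feed Core 3,
# which feeds Core 4.
links = {(1, 3), (2, 3), (3, 4)}
prod, cons = producers_and_consumers(links, 3)
ok = can_be_light_wrapped(links, 3, {1, 2, 4})
not_ok = can_be_light_wrapped(links, 3, {1, 4})
print(prod, cons, ok, not_ok)
```

With Core 1, Core 2, and Core 4 wrapped, Core 3 qualifies; leaving Core 2 unwrapped removes controllability of part of Core 3's inputs, so the check fails.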
In addition to the DFT hardware modification to support light-wrapped core testing, the P1500 instruction set also needs to be extended. New instructions, LOADPROD for producer cores and UNLOADCONS for consumer cores, are introduced. Moreover, if a core serves as both producer and consumer at the same time, an additional LDUNLDNEIGHBOR instruction is required to transfer test data both in and out of its WBR cells. These instructions are used to set the producer/consumer in the appropriate operational mode to shift in/out test stimuli/responses.
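As a minimal sketch of how the role of a wrapped neighbor maps to these instructions, the Python helper below returns the instruction name for a core in a given test session. The helper itself and its return convention (None when the core plays neither role) are assumptions for illustration; only the three instruction names come from the text.

```python
def neighbour_instruction(is_producer, is_consumer):
    """Pick the extended instruction for a P1500-wrapped neighbour core."""
    if is_producer and is_consumer:
        return "LDUNLDNEIGHBOR"  # load output WBRs and unload input WBRs
    if is_producer:
        return "LOADPROD"        # shift test stimuli into output WBRs
    if is_consumer:
        return "UNLOADCONS"      # shift test responses out of input WBRs
    return None                  # core is not involved in this session


print(neighbour_instruction(True, False))
print(neighbour_instruction(True, True))
```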
IV. NEW SOC TEST ARCHITECTURE AND TEST SCHEDULING
Having introduced the light wrapper concept and outlined its applicability to P1500-based testing, this section focuses on its implications on SOC test architecture and test scheduling. Note, this paper does not address directly the design of hierarchical TAMs and the SOC hierarchy is assumed to be flattened. In addition, in this paper, we do not consider test scheduling constraints introduced by precedence relationship, preemption, and power. The introduction of the above features in the proposed SOC test architecture and test scheduling requires a separate investigation and consequently can be the topic of another study.
To clarify all the issues related to testing these light-wrapped cores, we provide a hypothetical SOC, called m4953, with nine cores and a system bus connecting three cores. The number of scan chains n sc and the functional interconnects of these cores are shown in Fig. 5 .
1 Note, the test infrastructure of the SOC has not been implemented yet and hence is not shown in the figure. Additional test parameters will be given in the experimental section. The name of this SOC follows the benchmark naming convention presented in [19] , where m refers to McMaster University and the number 4953 denotes the test complexity.
A. Test Conflicts Caused by Sharing Functional Interconnect and Producers/Consumers
Before proposing a new SOC test architecture, we analyze the conflicts introduced by light wrappers. In the INTEST mode, all the P1500-wrapped cores can be tested concurrently as long as they use different TAM lines (assuming cores on the same TAM are tested in sequential order). However, because testing light-wrapped cores is dependent on their producers and consumers, TAM line conflicts are not the only ones that limit the test concurrency. Instead, there are five new types of test conflicts, described as follows.
Producer-CUT Conflict: Producer(s) and the CUT cannot be tested at the same time. For example, in Fig. 5 , if Core 6 is a light-wrapped core, Core 2 , Core 5 , and Core 9 should not be tested at the same time as Core 6 . This is because the producer needs to utilize its output WBRs to capture its test responses; however, at the same time, the CUT needs the producer's output WBRs to provide test stimuli. If they are tested concurrently, the test data will be corrupted. 
CUT-Consumer Conflict: The CUT and consumer(s) cannot be tested at the same time. For example, in Fig. 5, if Core 2 is a light-wrapped core, Core 6, Core 7, Core 8, and Core 9 should not be tested at the same time as Core 2. This is because the consumer needs to utilize its input WBRs to deliver test stimuli; however, at the same time, the CUT needs the consumer's input WBRs to capture the test responses. If they are tested concurrently, the test data will be corrupted.
Shared-Producer Conflict: Two light-wrapped cores that connect directly (i.e., on a dedicated nonshared set of lines) to the same producer cannot be tested at the same time. For example, in Fig. 5 , if Core 7 and Core 8 are light-wrapped cores, they cannot be tested at the same time because both of them require the output WBRs of Core 2 to provide the test stimuli.
Shared-Consumer Conflict: Two light-wrapped cores that connect directly to the same consumer cannot be tested at the same time. For example, in Fig. 5 , if Core 3 and Core 6 are light-wrapped cores, they cannot be tested at the same time because both of them need the input WBRs of Core 5 to capture the test responses.
Shared-Bus Conflict: If the producer(s) or consumer(s) connect to the light-wrapped CUT through functional buses, they can imply the previously described test conflicts and hence may not be tested at the same time. For example, in m4953, if Core 1 and Core 5 are two light-wrapped cores connected to the system bus, they cannot be tested at the same time because both of them need the I/O WBRs of Core 8 to provide test stimuli or capture the test response at the same time. However, if we have another wrapped core connected to the bus, for example, Core 4, then Core 1 and Core 5 can be tested together because we can use Core 4 as the producer and consumer of Core 1, and Core 8 as the producer and consumer of Core 5. By sharing the bus lines in consecutive cycles, there is only a one-clock-cycle test application penalty per test pattern (using the same system bus to transfer test data), which is insignificant for scan-based testing.
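The first four conflict types can be collected into a set of incompatible core pairs. The Python sketch below is a simplified illustration: it covers only dedicated (nonshared) connections, leaves out the shared-bus case, and uses only the two links from Core 2 to Core 7 and Core 8 mentioned in the shared-producer example rather than the full topology of Fig. 5. The function name and data encoding are assumptions.

```python
from itertools import combinations


def build_conflicts(links, light_wrapped):
    """Return the set of core pairs that cannot be tested concurrently.

    `links` holds (driver, receiver) pairs for dedicated connections;
    `light_wrapped` is the set of cores without WBR cells.
    """
    producers = {c: {s for s, d in links if d == c} for c in light_wrapped}
    consumers = {c: {d for s, d in links if s == c} for c in light_wrapped}
    edges = set()
    for cut in light_wrapped:
        # Producer-CUT and CUT-consumer conflicts.
        for neighbour in producers[cut] | consumers[cut]:
            edges.add(frozenset((cut, neighbour)))
    for a, b in combinations(sorted(light_wrapped), 2):
        # Shared-producer and shared-consumer conflicts.
        if producers[a] & producers[b] or consumers[a] & consumers[b]:
            edges.add(frozenset((a, b)))
    return edges


links = {(2, 7), (2, 8)}  # Core 2 drives Core 7 and Core 8
conflicts = build_conflicts(links, light_wrapped={7, 8})
print(sorted(tuple(sorted(pair)) for pair in conflicts))
```

Each light-wrapped core conflicts with its producer Core 2, and the two light-wrapped cores conflict with each other because both need the output WBRs of Core 2.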
B. TAM Division Into Three Groups: Producers, CUTs, and Consumers
The previous section has outlined the test conflicts that, if not taken into consideration, may corrupt the test data and render the test useless. Other types of conflicts may appear if the test data are transferred using shared TAM lines between producers, CUT, and consumers. To avoid this type of conflict, which may adversely influence the overall TAT of the SOC, dividing the TAM lines into three groups is proposed, motivated by the following examples.
Example 1: Consider the SOC m4953 shown in Fig. 5 and let us assume that Core 1 and Core 2 are P1500-wrapped cores and Core 6 is a light-wrapped core, which needs Core 2 as a producer. We assume that Core 1 and Core 2 share the same TAM lines (TAM w1 ) and Core 6 connects to a different TAM (TAM w2 ). Since for testing Core 6 we need to use both TAM w1 and TAM w2 to transfer test data, loading a test pattern for Core 1 is prohibited while loading the stimulus for Core 6 . As a result, there is a test conflict between Core 1 and Core 6 even though they connect to different TAM lines and have no functional relationship. This indirect TAM resource conflict may prohibit the overall test concurrency for light-wrapped cores in a large SOC, which will ultimately lead to testing all the light-wrapped cores separately, and thus resulting in very long testing time.
Sharing TAM lines between producers, CUTs, and consumers may also increase the test control complexity, as illustrated in the following example.
Example 2: In the case of m4953 shown in Fig. 5 , if Core 2 is a light-wrapped core, then after the test stimuli are loaded in the output WBRs of Core 1 and Core 4 , we must apply them at the same time. We also need to capture the test response in the input WBRs of Core 6 , Core 7 , Core 8 , and Core 9 at the same time before shifting it out. If the TAM lines are shared between producers, CUTs, and consumers, all of these operations introduce additional synchronization issues and consequently they may increase not only the testing time but also the test control complexity.
If there are only a very small number of light-wrapped cores in the SOC, then using WSI/WSO to load/unload the test stimuli/responses may be a neat solution. After testing all the P1500-compliant cores, we can test these few light-wrapped cores one by one by putting their producers and consumers into the EXTEST mode and shifting in/out their internal scan chains to/from WPI/WPO. However, if the number of light-wrapped cores is large, then the 1-bit TAM provides limited bandwidth for producers and consumers, and hence it will considerably increase the overall TAT of the SOC. When the number of light-wrapped cores is high, we propose a division of the TAM lines into three groups: G prod , G CUT , and G cons , used to load the producers, CUTs, and consumers, respectively. This division will remove the additional test conflicts discussed in Example 1 and the test control complexity discussed in Example 2. Using the setup from Example 1, testing Core 6 will need the assistance of Core 2 to provide the test stimuli. If the output WBRs of Core 2 are loaded through G prod , then Core 1 can still be tested at the same time as Core 6 , even though Core 1 and Core 2 share the same TAM resources in G CUT .
For the G CUT group, we use a flexible-width test architecture, as introduced in Section II. For the G prod and G cons groups, however, we use the daisy chain architecture [1], i.e., long scan chains are constructed over all the producer cores' output terminals and all the consumer cores' input terminals, as depicted in Fig. 6. Producer bypass registers and consumer bypass registers (PBY and CBY in the figure) are introduced in order to shorten the loading/unloading time because only a few cores serve as producers or consumers in a specific test session. The main reason for using the daisy chain architecture for the G prod and G cons groups is to simplify the control complexity. When a producer (consumer) core is in LOADPROD (UNLOADCONS) mode, the producer (consumer) TAM lines go through the core's wrapper boundary cells; otherwise, they go through the bypass register (note, it is unnecessary to introduce an extra bypass instruction for producers and consumers). As a result, although testing light-wrapped cores involves several producers and consumers, they can be controlled by the LOADPROD and UNLOADCONS instructions independently. In addition, the daisy chain architecture for the G prod (G cons ) TAM group can almost always give a near optimal loading (unloading) time for a given TAM width W prod (W cons ). Suppose the number of outputs of a producer is N o ; then its loading time will be N o /W prod . As long as W prod ≤ N o (which is realistic in most of the cases), there is no waste of G prod TAM resources, except for the few bypass cycles, which leads to a near optimal loading time for its producers. The same holds for G cons unloading. It is essential to note that the TAT of a light-wrapped core is dependent on the architectures of all three TAM groups, and the proposed TAM division into three groups facilitates concurrent testing of P1500-wrapped and light-wrapped cores, which is exploited by the algorithms described in the following section.
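The loading-time argument can be checked with a few lines of arithmetic. The Python sketch below makes two assumptions that are not stated in the text: the ratio N o /W prod is rounded up to whole shift cycles, and each bypassed core on the daisy chain costs one cycle through its PBY. The output counts and core names are hypothetical.

```python
from math import ceil


def load_cycles(num_outputs, w_prod):
    """Shift cycles to load one producer's output WBR cells over w_prod lines."""
    return ceil(num_outputs / w_prod)


def idle_slots(num_outputs, w_prod):
    """TAM bit slots left unused while loading that producer."""
    return load_cycles(num_outputs, w_prod) * w_prod - num_outputs


def daisy_chain_cycles(outputs_per_core, active, w_prod, bypass_cycles=1):
    """Load time of the producer daisy chain for one test session.

    Active producers are loaded through their WBR cells; every other core
    on the chain is assumed to cost `bypass_cycles` through its PBY.
    """
    total = 0
    for core, num_outputs in outputs_per_core.items():
        if core in active:
            total += load_cycles(num_outputs, w_prod)
        else:
            total += bypass_cycles
    return total


outputs = {"A": 64, "B": 70, "C": 40}  # hypothetical output counts
session = daisy_chain_cycles(outputs, active={"A", "B"}, w_prod=8)
print(load_cycles(70, 8), idle_slots(70, 8), session)
```

Under these assumptions, the unused capacity per producer is always smaller than W prod bit slots, which is the sense in which the daisy chain wastes almost no G prod bandwidth.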
C. Proposed Algorithms for Wrapper/TAM Cooptimization
The introduction of light-wrapped cores, producers, consumers, and TAM division into three groups requires the development of new algorithms for wrapper/TAM cooptimization, as explained in this section. First we formulate the new problem to be solved.
Problem P LWT−opt : Given the test set parameters for each core (including the number of PIs, POs, bidirectional I/Os, test patterns and scan chains, and each scan chain length), the total TAM width W ttl for the SOC, and the wrapper design constraints C w , determine the width of each TAM group (W prod , W CUT , and W cons corresponding to G prod , G CUT , and G cons ), the TAM width and the wrapper design for each core, and a test schedule for the entire SOC such that 1) the wrapper design constraints C w are satisfied; 2) the total number of light-wrapped cores is maximized; 3) the total number of TAM lines used at any time does not exceed W ttl ; and 4) the overall SOC TAT is minimized.
There are mainly three types of wrapper design constraints C w . 1) If the critical paths appear between cores, then to avoid performance penalty some cores must be light wrapped. 2) If some of the cores are provided with P1500 wrappers and, due to their location and size, their overhead does not affect the performance or the cost of the SOC, then there is no reason to make them light wrapped. 3) If a core is two-pattern tested (e.g., targeting delay faults or CMOS stuck-open faults) as in [27] , which employs the producers' WBR cells to apply the second consecutive pattern, double buffering in the WBRs of the core and its producers is necessary; hence, in this case, the CUT and all its producers must be P1500 wrapped. It is important to note that, if there are UDL blocks in the SOC, the system integrator has two choices: either make the UDL blocks P1500 wrapped and then use them as an input to the algorithms described in this section for problem P LWT−opt (this will guarantee that the maximum number of core wrappers is removed) or treat the UDL blocks as light-wrapped cores, i.e., they must satisfy the first wrapper design constraint when solving P LWT−opt . In either case, there is no loss in fault coverage of UDL since, by construction, it is ensured that each light-wrapped core is controlled by its producers and observed by its consumers. Therefore, the proposed solution can also be used as an alternative to EXTEST for concurrently testing wrapped cores and UDL.
In the rest of this section, we first present the top level algorithm for solving P LWT−opt and then give details on the new procedures and concepts specific to our approach.
TAM Division and Test Scheduling:
The proposed algorithm TAM_Division_And_Schedule to solve P LWT−opt is shown in Fig. 7 . The inputs are the set of cores (C set ), TAM width (W ttl ), functional interconnect relationship between cores (R), wrapper design constraint (C w ), and a weight parameter (weight), used in pruning the search space. The outputs are the number of TAM lines allocated to each TAM group, wrapper type (wrapper_type) and design for each core, SOC test schedule (schedule), and the overall test application time for the entire SOC (TAT soc ). The optimal TAM division, i.e., the combination of W prod , W CUT , and W cons that gives the minimum TAT of the SOC, is acquired through enumeration. The enumerative algorithm begins by determining which cores must be light wrapped (line 1), according to the functional interconnect relationship R and predefined wrapper design constraint (C w ) of the SOC. Based on the generated wrapper type (light wrapped or not) for each core and the test conflicts determined by the functional interconnect relationship between cores (R), a test incompatibility graph (TIG) is created (line 2). Next, the algorithm will enumeratively find the optimal TAM division and the minimum system TAT TAT soc . In the inner loop (lines 5 to 9), the local minimum TAT localmin for a fixed total width of W prod + W cons (W prod_plus_cons ) is computed.
In the outer loop (lines 3 to 13), the algorithm searches for globalmin, among the localmin values, by enumerating W CUT from the maximum possible value W − 2 to 1. During our initial experiments, it was observed that localmin is nearly a convex function with respect to W CUT . That is, it keeps decreasing until it reaches a local minimum value, at which point it starts increasing. This convex attribute can be explained by the fact that when W CUT has a small value, the TAT is dominated by the time to transfer test data through G CUT [for justification, see (1) and (2) explained later in this section]. Increasing W CUT , and hence decreasing W prod + W cons , will finally break this bottleneck. TAT starts to increase when the time required to load/unload the producer/consumer output/input WBR cells starts to dominate the scan time for G CUT . There are some variations around the local minimum value, which can be justified by the heuristic nature of the dynamic rectangle packing algorithm (explained later in this section). Hence, to prune the search space, we enumerate the localmin values in the opposite direction (i.e., from W − 2 to 1), since we want to discard the large localmin values. To accommodate the variations around the minimum value, we use a parameter weight (a real value slightly greater than 1) (lines 10 and 11). It should be noted that during the enumeration process, we do not need to do TAM design for the G prod and G cons groups, since the implementation of the daisy chain architectures for these two groups is straightforward once W prod and W cons are determined. To generate a TAM design for G CUT , we adapt an existing generalized rectangle packing algorithm TAM_Schedule_Optimizer [12] . Due to the usage of the daisy chain architecture for producers/consumers, a dynamic adaptation of this existing algorithm is necessary. We elaborate on each of the main steps of the top-level algorithm in the following paragraphs.
The worst case complexity of the algorithm TAM_Division_And_Schedule is $O(W_{ttl}^{2} \times C(ART))$, where C(ART) is the worst case complexity of algorithm Adapted_TAM_Schedule_Optimizer, which will be detailed at the end of this section.
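The enumeration with pruning described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the pseudocode of Fig. 7: the callback `schedule_tat`, the toy cost function `toy_tat`, and the exact stopping rule (stop once localmin exceeds weight × globalmin) are hypothetical and introduced only for illustration.

```python
def tam_division(w_ttl, schedule_tat, weight=1.1):
    """Enumerate (W_prod, W_CUT, W_cons) and return the best division found.

    schedule_tat(w_prod, w_cut, w_cons) -> TAT is a hypothetical callback
    standing in for the adapted rectangle-packing scheduler.
    """
    global_min, best = float("inf"), None
    # Outer loop: W_CUT from its maximum W_ttl - 2 down to 1.
    for w_cut in range(w_ttl - 2, 0, -1):
        rest = w_ttl - w_cut  # lines left for producers plus consumers
        local_min, local_best = float("inf"), None
        # Inner loop: every split of the remaining lines, at least one each.
        for w_prod in range(1, rest):
            w_cons = rest - w_prod
            tat = schedule_tat(w_prod, w_cut, w_cons)
            if tat < local_min:
                local_min, local_best = tat, (w_prod, w_cut, w_cons)
        if local_min < global_min:
            global_min, best = local_min, local_best
        elif local_min > weight * global_min:
            # Assumed pruning rule: localmin is clearly rising again.
            break
    return best, global_min


calls = []


def toy_tat(w_prod, w_cut, w_cons):
    """Made-up, nearly convex cost used only to exercise the search."""
    calls.append((w_prod, w_cut, w_cons))
    return max(700 / w_cut, 200 / min(w_prod, w_cons))


best, tat = tam_division(10, toy_tat, weight=1.1)
```

With this toy cost and W ttl = 10, the search visits W CUT = 8, 7, 6, and 5 and evaluates 10 divisions instead of the 36 required by a full enumeration.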
Decide Wrapper Type: Not all the cores need to be P1500 wrapped in an SOC; however, to provide full controllability and observability, each light-wrapped core needs to be surrounded by P1500-wrapped cores, i.e., all its producers and consumers must be wrapped. The pseudocode for deciding the wrapper type is shown in Fig. 8 . The algorithm takes the set of cores C set , the functional interconnect relationship R, and the wrapper design constraints C w as the inputs, and it outputs the wrapper type for each core i ∈ C set . First, the cores that need to be wrapped by P1500-compliant wrappers according to the direct functional relationship (i.e., dedicated nonshared communication lines) are identified (lines 1 to 10). In the first loop (lines 1 to 6), we initialize the wrapper status and wrap the cores according to wrapper constraints, if any. For all the other cores, the wrapper is first set to a light-wrapped type and a variable called test_dependency is initialized to the sum of its unwrapped producers and consumers (note, if one core serves as both a producer and a consumer for another core, it is not to be counted twice). This variable is used to indicate the core's test requirements as a light-wrapped core; if this number is large, it means that when this core is light wrapped, we need a large number of P1500-wrapped neighbor cores to test it. For example, in the case of m4953, Core 2 has two producers (Core 1 and Core 4 ) and four consumers (Core 6 , Core 7 , Core 8 , and Core 9 ), hence its test_dependency is initialized to 6. If Core 2 is a light-wrapped core, we need to wrap all its six neighbors. As a result, it is better to wrap Core 2 with a P1500-compliant wrapper. Therefore, the algorithm finds the cores with a large test_dependency and wraps them as P1500-compliant (lines 8 and 9). 
Whenever a core is selected to be wrapped as P1500-compliant, its test_dependency is set to 0 because it does not require any other cores to facilitate its test; the test_dependency of each of its light-wrapped producers/consumers is decremented by 1 (line 10). When functional busses are used, at least one core on each functional bus must be wrapped as P1500-compliant to test all the other light-wrapped cores on the bus (lines 11 to 13). The algorithm selects the core with the smallest number of inputs and outputs to wrap, in order to decrease the time required to load/unload the test stimuli/responses. To illustrate the outcome of the proposed algorithm, in the case of m4953, Core 3 , Core 6 , Core 7 , and Core 8 are selected to be light wrapped, as shown by the shaded boxes in Fig. 5 .
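The wrapper-type decision for direct connections can be illustrated with the following Python sketch. It is a simplified approximation rather than the pseudocode of Fig. 8: it assumes a greedy rule that repeatedly wraps the core with the largest remaining test_dependency, it recomputes the dependency instead of decrementing a counter, and it omits the functional bus step. The function name, the core names, and the example topology are hypothetical.

```python
def decide_wrapper_type(cores, links, must_wrap=()):
    """Return {core: "P1500" or "light"} for direct producer/consumer links.

    cores     : list of core names
    links     : iterable of (producer, consumer) pairs
    must_wrap : cores forced to be P1500-wrapped by design constraints
    """
    if not cores:
        return {}
    # A core that is both producer and consumer of another core is counted
    # once, because neighbors are stored in a set.
    neighbors = {c: set() for c in cores}
    for prod, cons in links:
        if prod != cons:
            neighbors[prod].add(cons)
            neighbors[cons].add(prod)

    wrapped = set(must_wrap)

    def test_dependency(core):
        # Wrapped cores need no help; light-wrapped cores depend on every
        # producer/consumer that is not yet wrapped.
        if core in wrapped:
            return 0
        return len(neighbors[core] - wrapped)

    while True:
        worst = max(cores, key=test_dependency)
        if test_dependency(worst) == 0:
            break  # every light-wrapped core is surrounded by wrapped cores
        wrapped.add(worst)

    return {c: "P1500" if c in wrapped else "light" for c in cores}


# Hypothetical star topology: A feeds B and C, and D feeds A.
result = decide_wrapper_type(["A", "B", "C", "D"],
                             [("A", "B"), ("A", "C"), ("D", "A")])
```

In this example only A is wrapped, since wrapping it removes the dependency of its three neighbors.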
Construct the TIG: If there are test conflicts between two cores (see Section IV-A), then these two core tests cannot be scheduled at the same time and are denoted as incompatible cores. We construct a TIG by treating each core as a node and adding an edge between two nodes if they are incompatible. This TIG is used in Algorithm 3 (Adapted_TAM_Schedule_Optimizer). The TIG generated for m4953 is shown in Fig. 9 . As shown in the figure, edges illustrating incompatibility can exist only between two light-wrapped cores or between a light-wrapped core and its producers/consumers. Two P1500-wrapped cores are always compatible during test because they do not need each other's help to test their internal logic.
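As an illustration, a TIG can be stored as an adjacency map and queried before a core is scheduled. The sketch below assumes that the list of conflicting core pairs has already been derived (the conflict rules of Section IV-A are not modeled here); the function names are hypothetical, and the single conflict used in the example is the Core 2 / Core 6 producer-CUT conflict mentioned for m4953.

```python
from collections import defaultdict


def build_tig(conflict_pairs):
    """Build an undirected test incompatibility graph from conflicting pairs."""
    tig = defaultdict(set)
    for a, b in conflict_pairs:
        if a != b:
            tig[a].add(b)
            tig[b].add(a)
    return tig


def is_compatible(tig, running, candidate):
    """True if the candidate has no TIG edge to any core currently under test."""
    return not (tig.get(candidate, set()) & set(running))


tig = build_tig([("Core2", "Core6")])
can_join_core2 = is_compatible(tig, ["Core2"], "Core6")
can_join_core3 = is_compatible(tig, ["Core3"], "Core7")
```

Here `can_join_core2` is False because of the edge between Core 2 and Core 6, while `can_join_core3` is True because no edge was declared between Core 3 and Core 7.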
Dynamic Rectangle Representation: For a P1500-wrapped core, if the assigned TAM width in G CUT is given, the TAT to apply the entire test set T p is determined by (1) [17] , in which s i (s o ) is the longest wrapper scan-in (scan-out) chain for the core and p is the number of test patterns. When using the Design_wrapper algorithm [9] for wrapper optimization to build balanced WSCs, s i (s o ) has a fixed value for a given TAM width, and hence the core test can be represented as a static rectangle.
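For reference, the scan test time expression commonly used in the wrapper/TAM co-optimization literature, written with the symbols defined above, is T p = (1 + max{s i , s o }) × p + min{s i , s o }. The sketch below assumes that this is the intended form of (1); the function name and the numeric values are hypothetical.

```python
def wrapped_core_tat(s_i, s_o, p):
    """Test time of a P1500-wrapped core for a fixed TAM width.

    s_i, s_o : longest wrapper scan-in / scan-out chain lengths
    p        : number of test patterns
    Assumed formula: (1 + max(s_i, s_o)) * p + min(s_i, s_o).
    """
    return (1 + max(s_i, s_o)) * p + min(s_i, s_o)


# Static rectangle: height is the TAM width, width is the test time.
rectangle = {"tam_width": 4, "tat": wrapped_core_tat(s_i=10, s_o=8, p=5)}
```

Because s i and s o are fixed once the TAM width is chosen, the rectangle can be computed once per candidate width and reused during packing.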
However, for a light-wrapped core, its TAT T l does not only depend on the time to load/unload its own producers (L prod ), consumers (L cons ), and internal scan chains (L in ); T l also depends on the time necessary to load/unload all the concurrently tested light-wrapped cores' producers/consumers. To keep the control and computational complexity low, we propose to align test patterns for all the concurrently tested light-wrapped cores and hence T l is calculated as
where bypass cycles are ignored. As a result, if for a given light-wrapped core the test schedule changes s times, then for each subset of patterns p s (for the s distinct divisions of the time allocated to the given core) the TAT will be computed based on the light-wrapped cores scheduled in each of these s divisions. The following example is used to better illustrate the computation of T l .
Example 3: In the case of m4953, Core 3 is compatible with light-wrapped cores Core 7 and Core 8 . Let us assume Core 7 and Core 8 are selected to be scheduled at the same test time with Core 3 , as shown in Fig. 10 (the given number of test patterns for these three cores has been selected only to illustrate this example). The time necessary to apply a pattern for Core 3 is updated each time the schedule changes. For the first ten patterns, the shifting time for each pattern of both Core 3 and Core 8 will be 
The same reasoning is applied for the last ten patterns when Core 3 is not concurrent with any other light-wrapped cores. The shifting time for each pattern will be max{L prod_3 , L cons_3 , L in_3 }. The variations in the shifting time for each of the three divisions of the schedule for Core 3 can be differentiated using a loadSize = max{ L prod , L cons , max{L in }}, determined by all the concurrently tested light-wrapped cores.
From the above discussion, we can see that T l for a light-wrapped core changes every time the schedule is updated. This dynamic attribute leads to a dynamic rectangle representation of the light-wrapped core's test and is caused by the usage of the daisy chain architecture for producer/consumer TAM groups.
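The dependence of loadSize on the set of concurrently tested light-wrapped cores can be sketched as follows. The sketch assumes that, because of the daisy chain, the producer loads and the consumer loads of all concurrent light-wrapped cores add up, while the internal scan chains are loaded in parallel on their own CUT TAM lines; this summation is an interpretation of the loadSize expression above, and all numeric values and names are made up for illustration.

```python
def load_size(concurrent):
    """Per-pattern load size for a set of concurrently tested light-wrapped cores.

    Each core is a dict with hypothetical keys:
      'prod' : cycles to load its producers' output WBR cells
      'cons' : cycles to unload its consumers' input WBR cells
      'scan' : cycles to load its longest internal scan chain
    """
    if not concurrent:
        return 0
    total_prod = sum(c["prod"] for c in concurrent)    # daisy chain: loads add up
    total_cons = sum(c["cons"] for c in concurrent)
    longest_scan = max(c["scan"] for c in concurrent)  # scan chains load in parallel
    return max(total_prod, total_cons, longest_scan)


core_a = {"prod": 40, "cons": 30, "scan": 100}
core_b = {"prod": 80, "cons": 20, "scan": 50}

# Two schedule divisions for core_a: first overlapped with core_b, then alone.
divisions = [[core_a, core_b], [core_a]]
sizes = [load_size(d) for d in divisions]
```

In this made-up case the overlapped division is dominated by the summed producer load (120 cycles), whereas core_a alone is dominated by its own scan chain (100 cycles), which is why the rectangle must be recomputed whenever the schedule changes.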
Adapted Dynamic Rectangle Packing: To concurrently test the light-wrapped and P1500-wrapped cores, for the CUT TAM group we use an Adapted_TAM_Schedule_Optimizer algorithm (the pseudocode is shown in Fig. 11 ). The algorithm takes the core list C set , TIG, and the TAM division as inputs, and generates the schedule for each core and the overall TAT of the SOC. The algorithm is based on a generalized rectangle packing algorithm TAM_Schedule_Optimizer proposed in [12] . TAM_Schedule_Optimizer first finds the Pareto-optimal TAM widths for each core. Next, a "preferred TAM width" for each core is identified from these Pareto-optimal TAM widths such that the core's TAT is within a small percentage of its testing time at a maximum allowable TAM width. The test for each core is then scheduled using the preferred width, as long as there are enough TAM lines available. If the number of available TAM lines is insufficient to schedule any new tests, the resulting idle time is filled using several heuristics that insert tests to minimize the idle time. Whenever a currently running test completes, the number of available TAM lines is incremented, and the algorithm repeats the scheduling process for the remaining tests. This is only a simplified description of the algorithm; the reader is referred to [12] for more details and terminology. In the following, we emphasize only the specifics of the proposed solution and its differences with respect to the original algorithm from [12] .
As described earlier, for light-wrapped cores, the test cannot be precomputed and represented as a static rectangle; its TAT (the width of the rectangle) varies with its schedule, hence its rectangle representation is computed dynamically (lines 7 and 16). In addition, since the TAT of a light-wrapped core may change dynamically with its schedule, there are no "preferred TAM widths" for such cores. In [12] , the procedure Initialize (line 2) was used to compute the preferred width for each core; the parameters d and p were sometimes manually selected for SOCs with different available TAM widths to get a better result. Since we need to call this procedure many times with different W CUT (see Algorithm 1), it is unlikely that a manual selection will lead to an optimal value. Consequently, in our implementation, we have fixed the two parameters to d = 2 and p = 1.0 (these two values give a generally good "preferred TAM width"); this may result in a different schedule and a slightly longer TAT in some cases when compared to the result in [12] . Line 3 initializes the set of cores that have not finished their schedule C unfinished , the currently available TAM width w_avail, and the current start time for unscheduled cores this_time, respectively. In line 9, the algorithm tries to schedule either a P1500-wrapped core with its preferred TAM width or a light-wrapped core with the maximum allowable test pattern count that is able to fit in the idle rectangle (since test patterns for concurrently tested light-wrapped cores are aligned). When scheduling a light-wrapped core, its rectangle size is determined as follows. The TAM width for this core (its height) is the minimum value that minimizes loadSize, and the TAT of the core (its width) is calculated using (2) . Once a core is scheduled (lines 9 and 10), the available CUT TAM width w_avail is reduced by the CUT TAM width assigned to the core.
Whenever a light-wrapped core is scheduled, L prod , L cons , and loadSize for the currently scheduled light-wrapped cores need to be updated (lines 11 and 19) and the TAT is recalculated (lines 12 and 20). If a light-wrapped core has no internal scan chains and hence does not need any TAM lines in the G CUT group, we may be able to schedule it even when the available CUT TAM width w_avail = 0 in G CUT (lines 15 to 20). This is because only G prod and G cons resources are necessary. Once no core can be tested starting at this_time, the currently scheduled core with the minimum TAT is finished (line 24), its TAM resources are released, and this_time advances to its finishing time; the algorithm then tries to schedule another core with these TAM resources. Note that, due to test conflicts, we are only able to select a compatible core to be scheduled at any time (lines 9, 10, and 18); this is done by checking whether there is an edge in the TIG between the cores currently under test and the to-be-scheduled core.
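The interplay between TAM availability, test conflicts, and the advancing of this_time can be illustrated with a much simplified scheduler. The sketch below handles static rectangles only and is not the algorithm of Fig. 11: it omits preferred widths, dynamic rectangles, pattern alignment, and idle-time heuristics, and it assumes a longest-test-first selection rule. All names and values are hypothetical.

```python
def greedy_schedule(tests, w_total, tig):
    """Schedule static rectangles under TAM width and conflict constraints.

    tests   : {core: (tam_width, duration)}
    w_total : number of TAM lines available
    tig     : {core: set of incompatible cores}
    Returns ({core: start_time}, makespan).
    """
    pending = dict(tests)
    running = {}  # core -> (finish_time, tam_width)
    start = {}
    now, w_avail = 0, w_total
    while pending:
        placed = False
        # Assumed selection rule: try the longest pending test first.
        for core in sorted(pending, key=lambda c: -pending[c][1]):
            width, duration = pending[core]
            compatible = all(r not in tig.get(core, ()) for r in running)
            if width <= w_avail and compatible:
                start[core] = now
                running[core] = (now + duration, width)
                w_avail -= width
                del pending[core]
                placed = True
                break
        if not placed:
            if not running:
                raise ValueError("a test needs more TAM lines than available")
            # Nothing fits: finish the running test that ends first,
            # release its TAM lines, and advance the current time.
            done = min(running, key=lambda c: running[c][0])
            now = running[done][0]
            w_avail += running[done][1]
            del running[done]
    makespan = max([now] + [finish for finish, _ in running.values()])
    return start, makespan


tests = {"A": (2, 10), "B": (1, 8), "C": (1, 6)}
conflicts = {"A": {"C"}, "C": {"A"}}
start_c, tat_c = greedy_schedule(tests, 4, conflicts)
start_n, tat_n = greedy_schedule(tests, 4, {})
```

In this made-up example the conflict between A and C forces C to wait until A completes, so the makespan grows from 10 to 16 even though a TAM line is idle, which mirrors the idle rectangle areas caused by test conflicts.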
The worst case complexity C(ART) of the algorithm Adapted_TAM_Schedule_Optimizer can be estimated as follows. The while loop in line 4 of Fig. 11 is executed N c times, where N c is the number of cores of the SOC. In each such execution, a linear search in the O(N c ) core set is used to find the next core to be scheduled. In addition, these cores are also examined O(N l ) times to determine whether they are compatible with the currently scheduled light-wrapped cores (lines 9, 10, and 15). Moreover, in each such execution, a collection of O(W CUT ) rectangles is generated for O(N l ) light-wrapped cores. The complexity of rectangle generation using Design_wrapper is O(sc log sc + sc · k), in which sc is the number of scan chains in the light-wrapped core and k is the TAM width [9] . As a result, the worst case complexity
Summary: In this section, we have presented a new modular SOC test architecture with reduced wrapper count. We have described the algorithms used for concurrent test scheduling of both P1500-wrapped and light-wrapped cores and have analyzed their computational complexity.
V. EXPERIMENTAL RESULTS
The purpose of our experiments is to find out how much test area can be saved without affecting the test quality, and what the implications of these savings are for testing time. In addition to the hypothetical SOC m4953, benchmark SOCs from the ITC'02 SOC test benchmarking initiative [18] are used in our experiments. Since the functional interconnects are not provided in the original benchmark files, we have randomly generated them to support the proposed approach, including the direct connections between cores and the functional busses [26] . Through randomization, we wanted to investigate the average impact of the proposed algorithms on the number of light-wrapped cores and the overall TAT. We have assumed that each functional bus has a random number (at most min(N c , 8)) of cores attached to it, where N c is the total number of cores in the SOC. In addition, all cores on a bus are assumed to be able to transfer data to and from the bus. We have also assumed that every core has a random number q (1 ≤ q ≤ 3) of producers, while the consumers for each CUT are generated from the producer-CUT relationship.
It should be noted that there are no wrapper design constraints in our experiments and the proposed algorithm in Section IV determines the wrapper type of each core, the optimal TAM division, and the test schedule. We have divided our experiments into three subsections. First, we discuss the test schedule for hypothetical SOC m4953, then we analyze the number of cores that can be light wrapped, and finally we discuss the testing time implications of using the proposed TAM design algorithm for testing the light-wrapped cores.
A. Experiment 1: Test Schedule Comparison for SOC m4953
First, we investigate our pruning technique used for rapidly dividing the available TAM lines into three separate groups: producer, CUT, and consumer for SOC m4953. The test parameters for the cores in m4953 are shown in Table I , in which N in , N out , N bi , and N sc denote the number of inputs, outputs, bidirectionals, and scan chains in the specific core, respectively. The length of each scan chain is shown in column SC length .
The TAT variation with W CUT for m4953 is depicted in Fig. 12 (given the total TAM width W ttl = 10). Using the proposed TAM_Division_And_Schedule algorithm, we obtain the minimum TAT of 955911 clock cycles for W CUT = 7, W prod = 2, and W cons = 1. For this particular case, we deal with a convex function, and the first identified local minimum is also the global minimum. If we set weight = 1.1 (see Algorithm 1), the search for the minimum TAT will start from W CUT = 8 and stop at W CUT = 6. This prunes the search space and hence reduces the computational time to a few seconds even for large SOCs, while still finding the best possible TAM division and schedule.
In Fig. 13(a) , we present the test schedule obtained for m4953 when all the cores are P1500-wrapped. When applying the new Decide_Wrapper_Type, if no wrapper design constraints exist, Core 3 , Core 6 , Core 7 , and Core 8 are selected to be light wrapped, and the test schedule is shown in Fig. 13(b) . We can observe that SOC TAT increases by approximately 45% due to the following three main reasons.
1) The test conflicts introduced in the light-wrapped SOC test model cause more idle rectangle areas in the bin. For example, although Core 6 can fit into the rectangle area above Core 2 , due to the producer-CUT conflict, Core 6 is incompatible with Core 2 and hence they cannot be scheduled at the same time.
2) The number of TAM lines used to access the P1500-compliant cores and the internal scan chains of light-wrapped cores (G CUT ) is decreased from 10 to 7. To support light-wrapped core testing, two TAM lines are used for loading the producers' outputs and one TAM line is used to unload the consumers' inputs.
3) When two light-wrapped cores are scheduled at the same time, the load time for each overlapped pattern may increase. In this example, light-wrapped Core 3 and Core 8 are tested concurrently, and the load time of the overlapped test patterns is dominated by the loading time of the producer outputs.
It can be observed in Fig. 13(b) that Core 7 is scheduled from the same starting time as Core 1 even when no CUT TAM lines are available in G CUT (w_avail = 0). This is because Core 7 has no internal scan chains and its test data can be transferred using only G prod and G cons .
B. Experiment 2: Number of Light Wrappers and Reduction in WBRs
Table II shows the number of cores that can be light wrapped and the resulting reduction in the number of WBRs for the 100 randomly generated interconnect topologies of four benchmark SOCs [18] . N c is the total number of cores while N max_l , N min_l , and N ave_l denote the maximum, minimum, and average number of light-wrapped cores, respectively. N wbr is the total number of WBRs and N max_wbr , N min_wbr , and N ave_wbr denote the maximum, minimum, and average number of WBRs that are removed. The percentage reductions are defined as ∆N l (%) = (N ave_l /N c ) × 100 and ∆N wbr (%) = (N ave_wbr /N wbr ) × 100. To provide the same test quality for core-based SOCs, a high number of cores (approximately 40%) does not need to be wrapped with WBR cells. Depending on the functional interconnect topology, there are cases where the maximum number of light-wrapped cores can be half of the total number of cores (see column 3 in Table II ). More importantly, the number of WBRs that can be removed varies from hundreds for smaller benchmarks to thousands for the larger ones.
Given the fact that each WBR can have an equivalent of 10 to 60 logic gates [25] (depending on the number of flip-flops and the modes used for each cell), we believe that the proposed solution can yield significant savings in DFT area. Because these savings come at a cost, the implications on the testing time are discussed next.
C. Experiment 3: Comparison of Test Application Time
To investigate the implications on test application time, we compare the results of the modular SOC architecture from Section IV against the case when the light-wrapped cores are tested sequentially using serial EXTEST after the test of all the wrapped cores. When the light-wrapped cores are tested sequentially, it is assumed that the entire parallel TAM bandwidth is allocated to their internal scan chains. It is important to note that the use of the optional parallel EXTEST feature of P1500 for producer/consumer loading may improve the scan time if the SOC integrator does not wish to use the proposed producer-CUT-consumer TAM division. However, parallel EXTEST is not applicable to the Test Bus architecture used in this paper, since it requires a TestRail architecture for the wrapped cores [4] . Unwrapping cores in a TestRail architecture and exploiting parallel EXTEST features for loading producers/consumers is the topic of a separate investigation, which is currently being undertaken by the authors.
TABLE II NUMBER OF LIGHT-WRAPPED CORES FOR BENCHMARK SOCS
TABLE III TEST APPLICATION TIME COMPARISON FOR g1023
TABLE IV TEST APPLICATION TIME COMPARISON FOR p34392
TABLE V TEST APPLICATION TIME COMPARISON FOR p93791
TABLE VI TEST APPLICATION TIME COMPARISON FOR t512505
Tables III-VI present test application time results when varying the total TAM width W ttl (note, only results with optimal TAM divisions are reported). T ave_se , T max_se , and T min_se denote the average, maximum, and minimum TAT for the 100 random circuits when serial EXTEST is used to test the light-wrapped cores. T ave_p , T max_p , and T min_p denote the average, maximum, and minimum TAT for the 100 random circuits when the proposed producer-CUT-consumer architecture is used. The percentage changes are calculated using the formula ∆T se (%) = ((T ave_se − T )/T ) × 100 and ∆T p (%) = ((T ave_p − T )/T ) × 100, where T is the test application time result obtained using the algorithm described in [12] . It should be noted that since we did not manually select the d and p parameters, T is slightly different when compared to the result reported in [12] . As seen in all the tables, in almost all the cases, ∆T se is much higher than ∆T p , especially when the total TAM width W ttl is large. This shows the effectiveness of the proposed test architecture. We can observe that T ave_se does not change much with the variation of W ttl when serial EXTEST is used for testing light-wrapped cores. This is because the single-bit loading/unloading time for producers/consumers dominates the overall TAT of the SOC and the increase of W ttl does not help in shortening it. For the proposed producer-CUT-consumer architecture, in contrast, when W ttl is increased the algorithm distributes more TAM lines to the bottleneck TAM group, which decreases the TAT. One exception in the experiments is when W ttl = 8 for SOC t512505, where the serial EXTEST gives a better result.
This is because the number of internal memory elements is much larger than the number of cores' I/Os in this SOC. When W ttl is small, test data transportation into the CUT is the bottleneck. Since the proposed architecture requires at least one TAM line for G prod and one TAM line for G cons , only six TAM lines are left in G CUT to transfer test data to/from the cores' internal memory elements, while all eight TAM lines can be used for the same duty when serial EXTEST is employed.
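The percentage changes ∆T se and ∆T p used in the comparison follow directly from the formula given above, as the short sketch below shows. The function name and the numeric values are placeholders chosen only to exercise the formula; they are not taken from Tables III-VI.

```python
def pct_change(t_ave, t_ref):
    """Percentage change of an average TAT relative to the reference TAT T."""
    return (t_ave - t_ref) / t_ref * 100.0


t_ref = 100_000     # placeholder for T obtained with the algorithm of [12]
t_ave_p = 137_000   # placeholder average TAT, proposed architecture
t_ave_se = 250_000  # placeholder average TAT, serial EXTEST

delta_p = pct_change(t_ave_p, t_ref)
delta_se = pct_change(t_ave_se, t_ref)
```

A positive value means that the architecture under evaluation needs more test time than the fully wrapped reference.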
It can be seen in Tables III-VI that the average increase in TAT over [12] can vary from about 4% to 186% when the proposed architecture is used. For g1023, the penalty is higher than for the other SOCs. This is because, in addition to the reasons analyzed earlier in Experiment 1, the number of internal scanned flip-flops in g1023 is comparable to the number of the producers'/consumers' outputs/inputs. Hence, a large amount of time is necessary to load/unload test stimuli/responses, which requires a high number of TAM lines to be assigned to the producer/consumer TAMs. This leaves fewer TAM lines for CUTs to transport test data to/from the internal scan chains of all the cores. It can also be observed that the difference between the maximum and minimum TAT for different functional interconnect topologies may be very high, which is due to the unbalanced sizes of the cores inside the SOCs. For example, there are three large cores in p34392 (Core 2 , Core 10 , and Core 18 ). When these large cores are light wrapped and the functional interconnect topology causes many test conflicts between them, the TAT will increase significantly. However, this penalty in TAT can be greatly reduced simply by wrapping the large cores (which are involved in many test conflicts) with P1500-compliant wrappers. For SOC p93791, the cores are of medium size and hence no core dominates the whole SOC TAT. As a result, although test conflicts exist between cores, the idle time is not too large (on average the increase in TAT is about 37%). The TAT overhead for SOC t512505 is the smallest (on average about 12%) among the four benchmark SOCs. This is because one large core (Core 31 ) dominates the TAT of the entire SOC and the additional time used to test the other incompatible cores is insignificant.
In summary, removing WBRs to save area will inevitably increase the testing time. However, in this section, we have demonstrated with experimental data that, when employing the proposed approach, the increase in testing time can be contained and is significantly lower than when using serial EXTEST for controlling/observing the inputs/outputs of light-wrapped cores.
VI. CONCLUSION
Unlike in boundary scan-based testing, where chips are manufactured before the board is assembled, in core-based system-on-a-chip (SOC) testing, the system integrator has the option of removing wrapper cells without sacrificing controllability and observability. To exploit this option, this paper has described a modular SOC testing methodology based on light-wrapped cores. The proposed approach is scalable and can be equally applied to both manufacturing test and diagnosis since it exploits only the functional interconnect topology and does not rely on the test data at hand. We have proposed a test access mechanism (TAM) design algorithm based on a division into three separate groups that facilitates concurrent testing of both P1500-wrapped and light-wrapped cores. This limits the increase in testing time caused by sharing wrapper cells between cores and, as long as the test schedule matches the capacity of the tester buffers, the penalty in the amount of time the chip spends on the tester will be insignificant. This makes the proposed solution particularly attractive, since it is capable of decreasing the design for test (DFT) area requirements in complex SOCs and of reducing the propagation delays between cores, which may improve the SOC's performance.
