Abstract-We present two test-data delivery optimization algorithms for system-on-chip (SoC) designs with hundreds of cores, where a network-on-chip (NoC) is used as the interconnection fabric. We first present an effective algorithm based on a subset-sum formulation to solve the test-delivery problem in NoCs with arbitrary topology that use dedicated routing. We further propose an algorithm for the important class of NoCs with grid topology and XY routing. The proposed algorithm is the first to co-optimize the number of access points, access-point locations, pin distribution to access points, and assignment of cores to access points for optimal test resource utilization of such NoCs. Test-time minimization is modeled as an NoC partitioning problem and solved with dynamic programming in polynomial time. Both proposed methods yield high-quality results and are scalable to large SoCs with many cores. We present results on synthetic grid-topology NoC-based SoCs constructed using cores from the ITC'02 benchmark, and demonstrate the scalability of our approach for two SoCs of the future, one with nearly 1000 cores and the other with 1600 cores. Test scheduling under power constraints is also incorporated in the optimization framework.
I. Introduction

Recent years have seen the large-scale integration of an increasing number of embedded cores in system-on-chip (SoC) design. It has been predicted that the number of cores on a single chip will continue to increase in the future [1], [2]. These predictions are being matched by industry trends. For example, Nvidia has developed a manycore chip, called Fermi, that includes 512 CUDA cores [3]. Intel has announced a coprocessor called Knights Corner that has over 50 cores [4].
To support high data bandwidth in such manycore SOCs, a suitable on-chip communication infrastructure is needed. Continued increase in wire lengths relative to feature size, and the associated problems of interconnect delay and power consumption, have motivated research on scalable communication infrastructures. Furthermore, the continuous reduction in the design cycle time requires not only logic cores, but also the on-chip interconnect fabric to be reusable.
Studies have shown that compared to buses, a packet-switched network-on-chip (NoC) is better suited for communication-intensive applications with as few as eight cores [5]. For systems with up to 16 cores, the bus-based interconnection fabric was found to be better only in cases of lighter workloads. For SoCs with a larger number of cores, an NoC offers many benefits, including reduced wire lengths, less power consumption, and better scalability. An NoC is therefore viewed as a promising alternative to today's bus-based communication fabric.
Automatic test equipment (ATE) is used to deliver test patterns to the logic cores in an SoC and analyze test responses. Since the ATE represents a significant investment, it is necessary to develop a systematic approach for utilizing ATE resources optimally. Several studies reported reduction of test cost through effective utilization of test resources, such as tester memory and tester channels, to test an SoC using dedicated test access mechanisms (TAM) [6]-[8].
Instead of implementing a dedicated TAM for many-core SOCs with an NoC interconnection fabric, the NoC infrastructure itself can be used for test data delivery to the cores. Access points are used to interface an ATE with the NoC for transporting test data between ATE and cores over NoC, and special test wrappers are implemented around the cores [9] - [12] . To ensure congestion-free test data delivery and minimize test time, a number of scheduling techniques have been developed [12] - [17] .
The minimization of the overall test time using the NoC is a complex problem requiring co-optimization of the number and placement of access points, of the distribution of ATE channels to these access points, and of the assignment of cores to access points for test data delivery. Most previous approaches neglect one or more of these aspects. To be practical, the optimization technique must be scalable to hundreds of cores. This paper addresses the above challenges. Its major contributions are as follows.
1) We present a novel algorithm for minimizing test time for NoCs with arbitrary topology and dedicated routing that consistently yields shorter test times than previous approaches. This algorithm can also be used to minimize test times for SoCs using dedicated TAMs.
2) We model the test-delivery optimization problem for an NoC-based many-core SoC with grid topology as a grid partitioning problem and develop a dynamic programming solution.
3) We show that optimal grid partitioning leads to significant reductions in test time. The partitioning solution is optimal in that minimum test time is obtained for rectangular partitions. The test times obtained using dynamic programming (DP) for the rectangular NoC topology are close to TAM-independent provable lower bounds (within 10% in most cases).
4) We demonstrate the scalability of the proposed method by deriving optimization results for realistic SoCs of today and for the future. We present results for a 400-core SoC (20 × 20 NoC) of today, and even larger SoCs of the future, with 992 cores (32 × 31 NoC) and 1600 cores (40 × 40 NoC). The results were obtained in reasonable time using modest computing resources.
5) We show how test scheduling can be carried out under power constraints.
6) The proposed dynamic-programming algorithm computes the Pareto frontier for a range of access-point counts and total test-pin counts in a single run. Therefore, a wealth of data can be obtained to effectively choose the number of access points and test channels for the SoC.
The rest of the paper is organized as follows. Section II discusses prior work. Section III introduces the subset-sum problem, and shows how it can be used for solving test scheduling. Section IV provides details about the proposed dynamic programming algorithm. Test scheduling under power constraints is discussed in Section V. Results are presented in Section VI, and Section VII concludes the paper.
0278-0070 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. Related Prior Work
A typical NoC is made up of several network tiles. A network tile, in turn, is composed of a router, a core, and a core-to-network interface [18] . Routers are responsible for routing communication data according to a routing protocol. Routers are interconnected using links that are composed of multiple channels. A network interface translates between a router's and a core's communication protocols.
The topology of the network describes the logical layout of the NoC. In a grid-based layout, a router has multiple input and output ports, e.g., one port each for five different directions (north, south, east, west, and core). A router receives data on one of its input ports and forwards it to one of the output ports according to the routing protocol. In the XY-routing protocol, data is first routed along the x-axis and then along the y-axis. Since this scheme is not affected by on-chip network traffic, routing decisions are deterministic.
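The deterministic nature of XY routing can be illustrated with a minimal sketch. The function name `xy_route` and the representation of a tile as an `(x, y)` coordinate pair are assumptions made for illustration only; they do not come from any particular router implementation.

```python
def xy_route(src, dst):
    """Return the tiles visited from src to dst under XY routing.

    Tiles are hypothetical (x, y) coordinates on the grid. The packet
    first travels along the x-axis until the column matches, and only
    then along the y-axis, so the path depends only on src and dst.
    """
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:          # phase 1: move along the x-axis
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:          # phase 2: move along the y-axis
        y += step_y
        path.append((x, y))
    return path
```

Because the path is a function of the source and destination alone, a packet exchanged between two tiles of an axis-aligned rectangle never leaves that rectangle.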
The testing of the NoC infrastructure, fault detection, and reconfiguration in the presence of faults have been studied in [19] , [20] . A scheduling algorithm that assigns a higher priority to cores requiring longer test time, and finds shorter test-delivery paths to these cores, was proposed in [13] . This approach was extended to take into account power constraints in [14] . An optimization method to identify dedicated routing paths and incorporate precedence constraints and shared BIST resources was provided in [21] .
A differential clocking scheme to reduce power consumption in the cores, and utilize the capability of an NoC to run at a frequency higher than scan clock, was presented in [15] . Multiple clock speeds can also be leveraged to reuse common links connected to cores on a time-sharing basis [16] .
A test-wrapper design for the reuse of functional interconnects was introduced in [9] , [12] . TAM width constraints imposed by the NoC flit width, and the associated problem of flit-bit under-utilization were highlighted in [22] . Other design approaches are presented in [10] and [11] , but these methods do not show how test time reduction can be achieved.
The compression of test input data for reducing the number of test input pins was studied in [23] . Test-output compression to reduce output pin-count was discussed in [17] . Our approach can be extended to incorporate input and output compression as proposed in [23] and [17] .
The need for partitioning the NoC to ensure jitter-less transport of test and response data, and for distributing different bandwidths to these partitions, was introduced in [9]. The TR-architect test scheduling algorithm [24] was extended in [12] to minimize test time in an NoC with arbitrary topology and dedicated routing. However, a systematic method for creating valid partitions in an NoC with a grid topology that supports only XY routing was not considered.
In [17] , a general test-scheduling problem was formulated and solved using ILP. Furthermore, a heuristic approach was also presented, where contention was avoided for routers and links between cores that were assigned to different access points. However, the heuristic is based on ILP that does not scale well with the number of cores and the number of access points. No methodology was discussed on an appropriate selection of the total number of pins or the number of access points. Moreover, no strategy for locating optimal positions of access points was discussed in [17] .
Two methods for delivering test data using NoC have been studied in previous work. In packet-based scheduling [13] , every packet is scheduled independent of others; a path is reserved for one packet and packets belonging to the same test set can be assigned a different set of NoC resources. The second approach, based on dedicated routing [21] , schedules all test packets of a core using the same path in the NoC. Due to its simplicity, test scheduling using dedicated routing has been further explored in [12] and [25] .
III. Test Scheduling for NoCs With Dedicated Routing
A. Problem Description
A graph-theoretic formulation for test-time minimization for NoCs with dedicated routing is presented in [12]. A graph is first constructed from the topology of the NoC, with network tiles as nodes and links as edges. The goal is to partition the graph into disjoint connected components, and distribute pins to each component, such that the maximum test time over all components is minimized. The test time of a component is equal to the sum of the test times of the cores in the component. The test time of a core depends on the number of pins allotted to the component to which the core belongs. We use the same formulation for devising a test-scheduling algorithm for an arbitrary topology. We further assume a dedicated routing path for the testing of each core, in line with previous work on test scheduling [12], [21], [25].
We build on the system model presented in [9], where the ATE interfaces (access points) and core wrappers perform all protocol and width conversion, so that both the ATE and the circuit under test (CUT) are unaware of the on-chip routing protocol and NoC design. We assume a width conversion ratio of one in this paper; any other ratio can easily be accounted for.
B. Outline of Algorithm
If we exclude the initial delay introduced during setting up of dedicated routes in NoC for test-data delivery and response collection, the test time minimization problem for NoCs with dedicated routing can be considered as a special case of the bus-based TAM optimization problem [12] . This is because any partition of the set of cores is a valid solution candidate to the latter, while only those partitions that represent connected components in the graphical representation of the given NoC topology are solution candidates to the former. Hence, we first revisit the TAM optimization problem described in [6] and [24] for a bus-based TAM architecture. By limiting the partitions considered by our algorithm to those that form valid connected components, the algorithm is also directly applicable to the NoC test scheduling problem (Section III-F).
The TAM optimization problem, discussed in [6] and [24] , can be stated as follows. Given a set of cores with respective test time information for a range of pin widths, total TAM width, and the maximum number of TAM partitions, we have to find an assignment of cores to the TAM partitions and the TAM width of each partition such that the overall test time is minimized. We view this problem from a different angle; we partition the set of cores into disjoint subsets, and find an optimal distribution of pins or TAM wires to these subsets to minimize test time. A systematic method of enumerating candidate partitions is discussed.
Our presentation consists of three parts: a brief introduction to the subset-sum problem and a method to solve it, a greedy method to optimally distribute pins to a given partition, and a subset-sum formulation to effectively enumerate candidate partitions in a reasonable time. We first describe the subset-sum problem (SSP) before elaborating on how we utilize SSP to tackle the TAM-optimization problem.
C. Subset-Sum Problem (SSP)
The SSP is posed in the form of a question. Given a set of n integers, $X = \{x_1, x_2, \ldots, x_n\}$, does there exist a nonempty subset whose elements sum to zero? This problem is known to be NP-complete [26]. An equivalent version of this problem asks whether there is a subset of the given set with the sum of its elements equal to a given integer s. No algorithm is known that solves every instance of SSP in polynomial time. However, a method based on dynamic programming solves the problem optimally in pseudo-polynomial time [26]. Fig. 1 outlines this method, which we call SolveSubsetSum. It has runtime complexity $O(n(U_{SSP} - L_{SSP}))$, where $L_{SSP}$ is the least element of the given set, $U_{SSP}$ is the sum of all positive integers of the set, and $L_{SSP} \le s \le U_{SSP}$.
The binary element SUM(i, s) of the 2-D array SUM stores a true if it is possible to create a subset from the set {x 1 , x 2 , ···, x i } with sum s, otherwise it stores a false. From the procedure SolveSubsetSum, it is evident that SUM(i, s) is true only under three conditions.
1) The subset consists only of x i , i.e., s = x i .
2) There exists a subset with sum s without using the element x i , i.e., SUM(i − 1, s) is true.
3) The subset contains $x_i$ in addition to elements from the set $\{x_1, x_2, \ldots, x_{i-1}\}$, i.e., SUM(i − 1, s − $x_i$) is true.
This recursive formulation is the foundation for solving SSP in pseudo-polynomial time and we leverage this idea for TAM optimization. We use repeated invocations of the above method on multiple instances of SSP to enumerate several candidate partitions of the given set of cores in K subsets, and use a greedy approach to distribute pins to these subsets.
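The three conditions for SUM(i, s) can be sketched as a table-filling routine. This is a minimal illustration, not the procedure of Fig. 1: it assumes all elements are positive integers (as test times are), and the names `solve_subset_sum` and `find_subset` are chosen here for illustration.

```python
def solve_subset_sum(xs, s_max):
    """Fill SUM[i][s] for 0 <= i <= len(xs) and 0 <= s <= s_max.

    SUM[i][s] is True when some nonempty subset of xs[:i] sums to s.
    Assumes every element of xs is a positive integer.
    """
    n = len(xs)
    SUM = [[False] * (s_max + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        x = xs[i - 1]
        for s in range(1, s_max + 1):
            SUM[i][s] = (
                s == x                               # condition 1: subset is {x_i}
                or SUM[i - 1][s]                     # condition 2: x_i not used
                or (s > x and SUM[i - 1][s - x])     # condition 3: x_i plus earlier elements
            )
    return SUM


def find_subset(xs, SUM, s):
    """Walk back through SUM and return indices of one subset summing to s."""
    i = len(xs)
    if s <= 0 or not SUM[i][s]:
        return None
    chosen = []
    while s > 0:
        if SUM[i - 1][s]:
            i -= 1                  # x_i is not needed for this sum
        else:
            chosen.append(i - 1)    # x_i must be part of the subset
            s -= xs[i - 1]
            i -= 1
    return chosen
```

The table has n × s_max entries and each entry is computed in constant time, which is where the pseudo-polynomial runtime comes from.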
D. Optimal Pin Distribution for Partition
Suppose the given set of cores is partitioned into K disjoint subsets, and the cores in a subset are assigned to the same TAM. Moreover, no two cores from different subsets use the same TAM. The number of pins available for distribution is P, where P is the total TAM width.
We present a greedy algorithm for distributing P pins and prove that the greedy approach is optimal for a partition of a given set of cores. Fig. 2 outlines the algorithm for pin distribution. In each iteration, the subset having the maximum test time is assigned an additional pin. The algorithm terminates after every pin is assigned.
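A minimal sketch of this greedy loop is given below. It is not the procedure of Fig. 2: the starting point (one pin per subset), the name `distribute_pins`, and the caller-supplied `core_time(core, pins)` function are all assumptions made for illustration.

```python
import math


def distribute_pins(subsets, P, core_time):
    """Greedily distribute P pins over the given subsets of cores.

    subsets   : list of lists of cores (a core can be any object)
    core_time : hypothetical function (core, pins) -> test time
    Assumption: every subset starts with one pin, so P >= len(subsets).
    Returns (pins per subset, resulting overall test time).
    """
    K = len(subsets)
    if P < K:
        raise ValueError("need at least one pin per subset")
    pins = [1] * K

    def subset_time(b):
        # Cores in a subset are tested one after another on the same pins.
        return sum(core_time(core, pins[b]) for core in subsets[b])

    for _ in range(P - K):
        worst = max(range(K), key=subset_time)   # subset with maximum test time
        pins[worst] += 1                         # gets the next pin
    return pins, max(subset_time(b) for b in range(K))
```

For example, with a toy model in which a core with workload `v` needs `math.ceil(v / pins)` cycles, `distribute_pins([[100, 20], [30], [10]], 6, lambda v, p: math.ceil(v / p))` gives all three spare pins to the first subset.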
Let G P = {g 1 , g 2 , · · ·, g P } be the solution generated by the greedy approach, where g i denotes the index of the subset to which the ith pin is assigned. Clearly, 1 ≤ g i ≤ K, for all i. Let O P = {o 1 , o 2 , · · ·, o P } be an arbitrary sequence of pin assignments that leads to an optimal solution.
Let T g (i) denote the SoC test time after the ith pin is assigned using the greedy approach. Similarly, let T o (i) denote the SoC test time in the i th step for the given arbitrary sequence. Furthermore, assume that τ g (i, b) is the test time for the b th subset after the i th iteration using the greedy approach. Let τ o (i, b) denote the same quantity for the arbitrary sequence.
Lemma 1: The test time does not increase in any iteration of the greedy method, i.e., T g (i + 1) ≤ T g (i) .
Proof: The test time of a core does not increase with the addition of a pin. Since a pin is added at each step or iteration to exactly one subset, the test time of that subset can only decrease or remain the same. The test times of other subsets remain unaffected. This is also true for the arbitrary sequence, i.e., T o (i + 1) ≤ T o (i).
Lemma 2: In the partial sequence G k = {g 1 , g 2 , · · ·, g k }, suppose that the pin k was added to the subset b m during the kth step, i.e., g k = b m and T g (k) = τ g (k, b m ). If a pin is removed Proof: According to the greedy procedure, a pin is added to a subset only when it has the maximum test time during an iteration. Since a pin is removed from b , the pin must have been added to b during an iteration,
Therefore, the updated test time of b after removing a pin is τ g (i, b ). Note that the greedy procedure keeps on selecting the same subset until its test time is strictly lower than another subset. Since
Note that the addition of the removed pin to any other subset does not reduce the SoC test time.
If
Theorem 1: The greedy algorithm of Fig. 2 finds an optimal distribution of pins for a given partition on the set of cores.
Proof: We prove the theorem by contradiction. At step i, if the distribution of pins is the same for both methods, T g (i) = T o (i) holds. If the distribution of pins is not the same, then there exists at least one subset that is assigned at least one more pin in the greedy approach than what is assigned to it in the other approach. If we try to rearrange the pins in the greedy sequence to match the pin distribution obtained from the arbitrary approach, then by Lemma 2, the test time can only increase; hence T g (i) ≤ T o (i).
E. Enumeration of Candidate Partitions
The TAM-optimization problem has been shown to be NP-complete using a transformation from the bin-packing problem in [27]; therefore, no polynomial-time algorithm can solve the problem optimally unless P = NP. The test time of cores is mapped to integer elements of the given set in SSP. Assuming that only one pin is available to each TAM, a set consisting of the test times of all cores using just a single pin is created. The test times vary from small to large values depending on the scan chain length and the number of test patterns of individual cores. If the range is very large, it may be computationally prohibitive to run dynamic programming (Fig. 1), both in terms of time and space. Therefore, the test times are scaled down by a factor of the lowest test time before approximating them to nearest integer values. The procedure optimizeTAM of Fig. 3 is called on this set to create K subsets. The greedy method outlined in Fig. 2 is executed on the resulting partition and the test time is noted.
The element SUM(i, s), for every s, holds an answer to the question whether the set $\{x_1, x_2, \ldots, x_i\}$ has a subset, the elements of which sum to s. For every value of s, we get a unique subset if SUM(i, s) holds a true. The procedure findSubset finds such a subset; it searches through the entries of the 2-D array SUM and constructs a solution. This is the backbone for enumerating candidate partitions. Note that the procedure optimizeTAM is a recursive procedure and the maximum recursive depth is equal to the total number of subsets that need to be created. In each recursive step, a subset $X_s$ is created with sum less than or equal to s. Then a new set $X_{filter}$ is obtained by removing the elements in $X_s$ from X. While $X_s$ is appended to the list candidate partition, which stores the subsets in a candidate partition, the set $X_{filter}$ is passed to the next level of recursion. After returning from the recursive call, $X_s$ is removed from candidate partition.
This methodology can potentially enumerate all partitions, but to keep the computation time low, we do not search through all the entries of the array SUM. First, a lower bound $L_{TAM}$ is computed from the elements x ∈ X. The variable s is then varied over a window centered on $L_{TAM}$, whose half-width is a parameter that controls the size of the solution space that is to be searched. Since the size of X decreases in every recursive step, we reduce the value of this parameter by half in our experiments in successive recursive steps (not shown in Fig. 3). Either a timeout can be set on the overall procedure, or a limit can be placed on the maximum number of candidate partitions that are to be enumerated.
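The recursive enumeration can be sketched as follows. This is a simplified illustration under several assumptions, not the optimizeTAM procedure of Fig. 3: the lower bound is taken here to be the sum of the remaining elements divided by the number of subsets still to be created, the window half-width is called `delta`, a dictionary of reachable sums stands in for the 2-D array SUM, and duplicate candidates are not filtered out.

```python
def subset_with_sum_at_most(xs, s):
    """Return indices of a subset of xs with the largest achievable sum <= s."""
    reach = {0: []}                       # reachable sum -> indices achieving it
    for idx, x in enumerate(xs):
        for t, sub in list(reach.items()):   # snapshot: each element used once
            if t + x <= s and t + x not in reach:
                reach[t + x] = sub + [idx]
    return reach[max(reach)]


def enumerate_partitions(xs, K, delta):
    """Yield candidate partitions of the positive integers xs into K subsets."""
    def recurse(remaining, k, d, partial):
        if k == 1:
            yield partial + [remaining]   # last subset takes what is left
            return
        lower = sum(remaining) // k       # assumed lower bound on the target sum
        for s in range(max(1, lower - d), lower + d + 1):
            idxs = set(subset_with_sum_at_most(remaining, s))
            if not idxs or len(idxs) == len(remaining):
                continue                  # both sides must stay nonempty
            chosen = [v for i, v in enumerate(remaining) if i in idxs]
            rest = [v for i, v in enumerate(remaining) if i not in idxs]
            yield from recurse(rest, k - 1, d // 2, partial + [chosen])

    yield from recurse(list(xs), K, delta, [])
```

Each candidate partition produced this way would then be scored by distributing the pins and noting the resulting test time; a cap on the number of candidates can be imposed by the caller simply by stopping the iteration early.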
F. Application to NoC
Similar to extension made for NoCs in [12] , the method described in the previous subsection is modified for solving the test-delivery problem in an NoC. Listed below are the required modifications.
1) The findSubset procedure only returns a subset whose elements form a connected graph.
2) In distributePins, the number of pins assigned to each subset is no more than the channel width.
The method described in [12] creates several connected components in its initial step and subsequently attempts to find a partition that minimizes test time; there is no control on how many connected components are created in the final solution. In contrast, our approach takes the number of access points (or connected components) as an input parameter. In Section VI, we show how restricting the number of connected components in [12] adversely impacts test time. Even if the number of connected components is kept unrestricted, for larger NoCs with a grid topology, we demonstrate that the method in [12] produces a significantly larger number of access points with test times that are worse than the test times produced by our method with much fewer access points. More access points lead to additional DFT cost and higher power consumption.
IV. Dynamic Programming Approach for Grid Topology
The above approach and the approach from [12] are not applicable to an NoC with mesh topology that only supports the XY-routing protocol for on-chip communication, as they create connected components of arbitrary shapes and do not ensure that packets are delivered without any conflict when the routing mechanism is XY. Moreover, XY routing can be viewed as a special case of dedicated routing, if the paths taken by packets in the former case are encoded in packet headers in the latter; therefore, the method that we propose in this section can still be used for NoCs with grid topology that use dedicated routing. We will demonstrate that by using XY routing with grid topologies, we obtain better solutions than reported in [12] for larger NoCs.
A. Preliminaries
The basic idea of our approach is to partition a 2-D grid-structured NoC into rectangular regions, and interface each region with the ATE such that the testing in each region can be carried out in parallel without any contention for routers and links. The cores in a region are tested sequentially, and the regions are tested in parallel. This concept is illustrated in Fig. 4. The figure shows a partition of a 4 × 4 NoC into three rectangular regions. These regions are interfaced with an ATE through ATE interfaces. The ATE drives test in all three regions simultaneously.
The goal of partitioning is to minimize test time for the region with the maximum test time. This method is suitable for NoCs that implement the XY-routing protocol, which is a popular routing algorithm because of its easy hardware implementation and guaranteed deadlock-free routing.
The advantage of optimal partitioning is illustrated in Fig. 5 . Each subfigure shows a different partition of the same 4 × 4 2-D grid network, where each cell represents a network tile. The number in each cell is the number of clock cycles needed to test the core in that network tile. Note that the additional time needed to establish a connection to a core from an access point is neglected here because it is very small compared to the test times of the cores. For the purpose of illustration only, we assume that the test pins are evenly distributed to the access points. The quantity T refers to the total number of cycles that is required to test the chip. Since each region is tested in parallel, the maximum number of clock cycles needed to test any region is T . Fig. 5(b) shows an optimal partitioning that reduces test time by 40% compared to the partitions shown in Fig. 5(c) . The number of regions is three in this figure, but in general, it is an input parameter to the proposed framework.
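The effect of the partition on T can be reproduced with a small sketch. The 4 × 4 grid of cycle counts below is hypothetical and is not the data of Fig. 5; regions are written as `(row, col, height, width)` tuples, a convention assumed here for illustration.

```python
def partition_test_time(grid, regions):
    """Return T, the largest per-region sum of core test times.

    grid    : 2-D list, grid[r][c] = cycles needed to test the core at (r, c)
    regions : list of (row, col, height, width) rectangles that tile the grid
    Cores inside a region are tested sequentially and regions in parallel,
    so T is the maximum over regions of the sum of their core test times.
    """
    def region_time(region):
        r0, c0, h, w = region
        return sum(grid[r][c]
                   for r in range(r0, r0 + h)
                   for c in range(c0, c0 + w))

    return max(region_time(region) for region in regions)


# Hypothetical test times for a 4 x 4 NoC.
GRID = [
    [4, 1, 2, 3],
    [2, 2, 1, 1],
    [5, 1, 3, 2],
    [1, 1, 2, 4],
]
STRIPES = [(0, 0, 1, 4), (1, 0, 2, 4), (3, 0, 1, 4)]    # three horizontal bands
BALANCED = [(0, 0, 4, 1), (0, 1, 2, 3), (2, 1, 2, 3)]   # one column, two blocks
```

On this made-up grid the two three-region partitions give different values of T, which is the gap that optimal partitioning is meant to close.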
B. Problem Formulation
An NoC N with a grid topology can be viewed as a rectangle in a 2-D Cartesian plane with the bottom-left corner of N coinciding with the origin (0, 0) of this plane. Any rectangle R can be uniquely identified with a four-tuple representation [i, j, l, w] where (i, j) is the bottom-left corner of R, and l and w are the length and width of R, respectively. This representation can be captured by a concise expression, R ≡ [i, j, l, w]. If the dimension of N is m × n, we can write N ≡ [0, 0, m, n].
Our goal is to optimally partition N into rectangular regions such that testing can be carried out independently in each region, thereby minimizing overall test time. This particular choice of regions being rectangular is motivated by two reasons: first, the popular XY-routing algorithm [28] , [29] ensures that there is no conflict in test data transportation because data will not cross the boundary of regions; second, it is more tractable to store and use results for subproblems in our dynamic programming-based approach.
We have to ensure that each region can be accessed by the ATE without requiring a dedicated TAM, hence, not all possible partitions of N are admissible. The two partitions shown in Fig. 6 , for example, are inadmissible; region B is enclosed from all sides and it cannot be tested independently. We say a region to be located at a boundary if it is not enclosed, i.e., if it has at least one boundary edge. An edge of such a region R is said to be a boundary edge if it is incident on one of the edges of the rectangle [0, 0, m, n]. In Fig. 6(a) , region A has three boundary edges, regions C and D have two each, E has one, and B has no boundary edges.
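The boundary-edge test on the four-tuple representation can be sketched as below. The function names are illustrative assumptions; the sketch takes i and l to be measured along the side of length m, and j and w along the side of length n.

```python
def boundary_edge_count(region, m, n):
    """Count the edges of region [i, j, l, w] lying on the border of [0, 0, m, n]."""
    i, j, l, w = region
    return sum([
        i == 0,        # edge on the border where the first coordinate is 0
        j == 0,        # edge on the border where the second coordinate is 0
        i + l == m,    # edge on the opposite border along the first axis
        j + w == n,    # edge on the opposite border along the second axis
    ])


def is_at_boundary(region, m, n):
    """A region is admissible only if it has at least one boundary edge."""
    return boundary_edge_count(region, m, n) >= 1
```

A partition is then admissible when `is_at_boundary` holds for every leaf region; a rectangle such as `[1, 1, 2, 2]` in a 4 × 4 grid fails the check because it is enclosed on all sides.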
We create a partition with the help of a sequence of separators. A separator is any line segment parallel to either of the coordinate axes, and divides a region completely into two subregions. We impose the constraint that a separator has to be parallel to one of the axes. This constraint ensures that only rectangular regions are formed. We do not consider a sequence of line segments as shown in Fig. 6 (b) to create a partition; it can be easily shown that such a sequence will always create an enclosed region. A separator is horizontal (vertical) if it is parallel to the axis y = 0 (x = 0).
The partition shown in Fig. 7 (a) can be easily mapped to a binary tree with the region represented by the NoC as the root of the tree and each intermediate node representing the subregion that is created by placing a separator in its parent region [ Fig. 7(b) ]. The orientations of separators are also captured in the tree in terms of how a nonleaf node is intersected. We call a region corresponding to a leaf of this tree as a leaf region. Alternatively, a leaf region can be defined as a region with no subdivisions.
We define the cost of a leaf region as the sum of the test times of the cores in that leaf region. The test time of a core is a function of the number of pins assigned to test the core. The number of pins used to test a core equals the number of pins assigned to the leaf region in which the core is present. The cost of an intermediate region R is the maximum cost over all costs of the leaf regions contained in R. The cost of R also depends on the corresponding partition size, where the size of a partition is the number of leaf regions created by that partition. Let us use $\pi_N(K, P)$ to denote the cost of N, where K is the given partition size (number of regions) and P is the number of available test pins. We can now formulate our problem as follows. Our goal is to determine an optimal sequence of separators such that:
1) each core is present in exactly one leaf region;
2) each leaf region is located at a boundary (boundary-region constraint);
3) $\pi_N(K, P)$ is minimized.
For a given K and P, we use $\pi_R$ to refer to the cost of any region R. Our goal is also to compute the pin assignment to each leaf region (pin-assignment problem), i.e., which of the P pins are used to access different leaf regions.
C. Characterization of Optimal Solution Structure
Dynamic programming is an effective technique to attack optimization problems that possess an optimal substructure, i.e., for which an optimal solution can be computed from optimal solutions to smaller subproblems. Dynamic programming is typically applied when the same subproblem appears multiple times when a problem is decomposed; storing a solution to each subproblem avoids recomputation when it arises again. In this section, we show why we use dynamic programming, and how we find an optimal partition and construct the solution to the problem of test delivery.
In order to obtain an optimal solution, we have to select a sequence of separators that create a partition of size K. To simplify the following discussion, we are not considering the effect of pin width on test time, so we assign a fixed test time to each core. The extension of the algorithm to cover pin distribution is subsequently presented in Section V.
Consider an m × n grid. Suppose that this grid has to be partitioned into two regions. In this case, only one separator is needed. If placed vertically, the separator can be placed at n − 1 different positions. Similarly, there are m − 1 possible positions for placing a horizontal separator, which accounts for a total of m + n − 2 different positions. In order to minimize the test time, we compute the overall test time for each position of the separator and report the position resulting in the minimum test time. Increasing the partition size K to three requires an additional separator to be placed. If the first separator $l_1$ is placed as shown in Fig. 8(a), the second separator $l_2$ can either be placed in Region A or Region B. There are m + a − 2 ways to place $l_2$ in A, and m + n − a − 2 ways to place it in B, resulting in a total of 2m + n − 4 ways. Thus, for each choice of a position for $l_1$, there are several choices for placing $l_2$. The analysis can be continued in this way for larger values of K.
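The counts above can be checked by direct enumeration. The sketch below assumes, as the counts suggest, that the first separator is vertical and leaves region A with a columns and region B with n − a columns; the function names are illustrative.

```python
def separator_positions(rows, cols):
    """List the axis-parallel separators that split a rows x cols region in two."""
    vertical = [("v", c) for c in range(1, cols)]     # cols - 1 positions
    horizontal = [("h", r) for r in range(1, rows)]   # rows - 1 positions
    return vertical + horizontal


def second_separator_count(m, n, a):
    """Ways to place the second separator after a vertical first one at offset a."""
    in_a = len(separator_positions(m, a))        # m + a - 2
    in_b = len(separator_positions(m, n - a))    # m + n - a - 2
    return in_a + in_b
```

For a 4 × 5 grid this gives 7 positions for the first separator and, for every offset a, 9 positions for the second, matching m + n − 2 and 2m + n − 4.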
Suppose we are given an optimal partition for the grid and the position of l 1 is as shown in Fig. 8 . The cost of N (π N ) is the maximum of the costs of the two regions, A and B. Let us consider three cases: 1) the cost of region A, π A , is greater than that of region B (π B ), i.e., π A > π B ; 2) π B > π A ; and 3) π A = π B . In the first case, π A should be the optimal cost for A. If not, optimal cost for A can replace π A to give a better value for π N , which is a contradiction because we assumed π N to be optimal. A similar argument holds for the symmetrical second case: if π B is not optimal, i.e., if there were a better value for π B , we can always substitute this value to yield a better value of π N , which would be a contradiction. The third case can be analyzed in the same way to conclude that an optimal solution for this partitioning problem encapsulates optimal solutions for its subproblems. This property indicates optimal substructure, which is a basic requirement for applying dynamic programming. Let us recursively express this property for the above example
An optimal value of π N can be obtained by sweeping l 1 across all its possible positions. For each different position ρ
where π is an optimal solution to the actual problem and ρ varies over all the possible locations for placing l 1 . Note that if the grid is to be partitioned into K regions, the sum of the partition sizes of the regions A and B will be K. The partition sizes of A and B are not accounted for in the above equation. The partition cost is a function of partition size.
Since changing the partition size of a region has an impact on overall partition cost, the value of π depends on how many separators we assign to regions A and B. For a given P, let us use π R (k) to denote the cost of R when it is partitioned into k regions. Factoring in the partition sizes for A and B, (2) becomes
where 0 < x < K. Let us formulate a general equation for all possible subproblems. For any region R ≡ [i, j, l, w], we rewrite (3) by splitting it into two cases: the case when a separator is swept from top to down, and the case when it is swept from left to right. These two cases and the regions so formed are marked in Fig. 8(b) and (c). The final cost is the minimum cost obtained in these two sweeps
where 0 < ω < w. We use (4) in our algorithm. Since a subproblem has to be solved as part of more than one larger problem, we use dynamic programming to solve this problem. The key idea is to store results for all subproblems and find the optimal cost π, which is the same as π N (K). Next, we discuss the algorithm and the data structures used.
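The recurrences referred to as (1) through (4) in this section can be written out from the prose alone. The block below is a sketch inferred from the surrounding definitions, not a quotation of the original equations. The symbol λ for the position of the second sweep, the subregion names R_1 to R_4, and the mapping of equation numbers to formulas are assumptions introduced for illustration.

```latex
\begin{align}
\pi_N &= \max\bigl(\pi_A,\ \pi_B\bigr) \tag{1}\\
\pi &= \min_{\rho}\ \max\bigl(\pi_A^{\rho},\ \pi_B^{\rho}\bigr) \tag{2}\\
\pi_N(K) &= \min_{\rho}\ \min_{0<x<K}\ \max\bigl(\pi_A^{\rho}(x),\ \pi_B^{\rho}(K-x)\bigr) \tag{3}\\
\pi_R(k) &= \min\Bigl(
  \min_{\substack{0<\omega<w\\ 0<x<k}} \max\bigl(\pi_{R_1}(x),\ \pi_{R_2}(k-x)\bigr),\
  \min_{\substack{0<\lambda<l\\ 0<x<k}} \max\bigl(\pi_{R_3}(x),\ \pi_{R_4}(k-x)\bigr)
\Bigr) \tag{4}
\end{align}
```

Here R_1 and R_2 denote the two subregions of R created by a separator at position ω along the width, and R_3 and R_4 denote the two subregions created by a separator at position λ along the length.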
D. Algorithm
The algorithm begins by enumerating all possible rectangles in N and, for each such rectangle, computing and storing the optimal cost for all partition sizes k, 0 < k ≤ K. This is shown in Fig. 9. The variables l and w are used to vary the length and width of the current rectangle, respectively. The pair (i, j) denotes the bottom-left coordinate of a rectangle. A 5-D integer array M is used for storing the optimal cost of all rectangles and for all partition sizes. The first four dimensions are used for identifying a rectangle, and the last dimension denotes the partition size. The variable k iterates over all partition sizes less than or equal to K. For each rectangle and partition size, the algorithm sweeps the separator, first from left to right, and then from top to bottom. The variable v stores the current position of the vertical separator. The optimal cost of the current rectangle depends on the partition cost of the two new subregions formed by the separator. The partition cost of subregions, in turn, depends on their respective partition sizes. We use a variable x for varying the partition size in one of the subregions. If x is the partition size of one of the subregions, then k − x is the partition size for the other subregion.
In addition to producing the cost of an optimal partition, we also have to keep track of the sequence of separators used to construct this optimal solution. The procedure ConstructSolution, described in [30] , can be used for this purpose.
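The cost computation described in this subsection can be illustrated with a short Python sketch. It is not the procedure of Fig. 9 or Fig. 10: it uses top-down memoization in place of the bottom-up 5-D array M, it assumes that the cost of a leaf region is simply the sum of the test times of its tiles (cores in a region are tested sequentially), and it ignores pins, routing delay, and the separator bookkeeping. The names `min_test_time`, `leaf_cost`, and `cost` are hypothetical.

```python
from functools import lru_cache


def min_test_time(times, K):
    """Minimum over rectangular K-partitions of the largest region cost.

    times: 2-D list, times[r][c] is the (assumed) test time of the tile
    at row r, column c. K: number of regions (access points).
    """
    m, n = len(times), len(times[0])

    # 2-D prefix sums so that the total of any rectangle is O(1).
    pre = [[0] * (n + 1) for _ in range(m + 1)]
    for r in range(m):
        for c in range(n):
            pre[r + 1][c + 1] = (times[r][c] + pre[r][c + 1]
                                 + pre[r + 1][c] - pre[r][c])

    def leaf_cost(i, j, l, w):
        # Rectangle with corner (row i, column j), l rows and w columns.
        return pre[i + l][j + w] - pre[i][j + w] - pre[i + l][j] + pre[i][j]

    @lru_cache(maxsize=None)
    def cost(i, j, l, w, k):
        if k == 1:
            return leaf_cost(i, j, l, w)
        best = float("inf")  # stays infinite if k regions do not fit
        for x in range(1, k):          # partition size of the first subregion
            for v in range(1, w):      # sweep a vertical separator
                best = min(best, max(cost(i, j, l, v, x),
                                     cost(i, j + v, l, w - v, k - x)))
            for h in range(1, l):      # sweep a horizontal separator
                best = min(best, max(cost(i, j, h, w, x),
                                     cost(i + h, j, l - h, w, k - x)))
        return best

    return cost(0, 0, m, n, K)


if __name__ == "__main__":
    grid = [[4, 1],
            [2, 3]]
    for K in range(1, 5):
        print(K, min_test_time(grid, K))
```

Each distinct (rectangle, k) pair is evaluated once and reused by every larger rectangle that contains it, which is the property that motivates storing all subproblem results.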
V. Additional Constraints and Enhancements
In this section, we incorporate the boundary-region constraint, the pin-assignment strategy, and power constraints into our solution approach.
A. Boundary-Region Constraint
The boundary-region constraint mandates the inclusion of only boundary regions in the final solution. As described in [30] , the procedure computeAndStore can be easily modified to enumerate only admissible partitions.
B. Pin-Assignment Problem
The test time of a core varies with the TAM width assigned to it. In this paper, we do not consider the case when the TAM width of a core exceeds the channel width. To incorporate the effect of pin count on the partition cost, we need to add one more dimension to the search space. The overall partition cost can now be expressed as π N (K, P) where P has to be distributed among all the K regions. The optimal substructure equation can be rewritten as
where 0 < x ≤ K, 0 < p < P and ρ varies over all possible positions of the separator being placed. This equation can be further extended to consider the cases of horizontal and vertical separators, as in (4), but the extension is omitted here for brevity. For solving this problem, we evaluate the optimal solution using the same approach as in Section IV. The dimension of array M is increased to six such that M[i, j, l, w, k, p] stores the optimal partition cost of region [i, j, l, w] if the partition size is k and the number of pins assigned to this region is p. For each value of k, we add one more loop that iterates over all possible pin counts. Moreover, when a separator is placed, in addition to varying the partition sizes of the subregions A and B, all possible distributions of the p pins to the two subregions are evaluated. For any region R, π R (k, p) is not evaluated for p < k, because each region should be allotted at least one pin. We use the equality π R (k, p) = π R (k, p − 1) if p takes values that are greater than k times the flit width. The array B is also a 6-D structure now, with each element being a six-tuple. Two more elements are added to each tuple to specify the pins allotted to each subregion.
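The pin-aware optimal substructure relation discussed in this subsection can be written compactly. The formula below is a sketch inferred from the prose and from the structure of (3), not a quotation of the original equation. The superscript ρ marking the separator position and the convention that region A receives x regions and p pins are assumptions.

```latex
\pi_N(K, P) \;=\; \min_{\rho}\ \min_{x,\,p}\
  \max\bigl(\pi_A^{\rho}(x,\, p),\ \pi_B^{\rho}(K - x,\, P - p)\bigr)
```

The indices x and p range over the bounds stated in the text, and region B receives the remaining K − x regions and P − p pins.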
The run-time complexity of the proposed method is
otherwise, as shown in [30]. The space complexity is O(m^3 n^2 KP) [30].
C. Power Constraints
The reduction of test application time by manipulating power profiles of test sets has been studied in [31] . Given a power constraint, the problem of minimizing the test time is presented in [31] . A similar problem in the context of NoC has been formulated and solved using ILP techniques in [32] . The scheduling problem under power constraints was shown to be computationally harder than the general test scheduling problem in [32] , and the proposed method does not scale well with the number of cores. This section elaborates on ideas presented in [31] to initially treat power as a design objective to minimize power consumption, and then subsequently as a design constraint to achieve minimization of the test application time.
1) Profiling of power consumption: Accounting for power during test scheduling entails modeling the power consumed by the individual cores. A simplistic approach is to flatten the power profile of a core to its worst-case instantaneous power consumption value, i.e., its peak value [31]. The simplicity and reliability of this model, called the global peak power approximation model (GP-PAM), is achieved at the cost of including significant false power in the model; false power is power that is not consumed but is still accounted for.
The entire test length of a core is divided into two segments; P hi is the peak power over a test length of L hi , P lo is the peak power over a test length of L lo , and P hi ≥ P lo . The splitting of the power profile provides more flexibility in scheduling cores under a power constraint because it approximates false power more accurately than the GP-PAM model. We exploit the benefits of 2LP-PAM in our DP approach, and discuss how to construct the power profile of a region, i.e., a collective profile of all the cores present in the region. We show how this technique helps in approximating power consumption and creating room for test concurrency between cores from different regions under a power constraint. Note that all cores in a given region are still scheduled sequentially, but the order in which they are scheduled affects the ability to simultaneously schedule cores in other regions. The power profile for a leaf region R is created by the procedure createPowerProfile, as outlined in Fig. 11. The procedure creates the two peaks of the profile.

When the profiles of two regions are merged, overlap of the two P hi peaks is avoided as much as possible, thereby leading to minimization of the power consumption of the two regions taken together. The creation of schedules for cores is abstracted out with shift and flip operations on profiles, and this helps in minimizing power consumption. The shape of the resulting profile depends on the values of the peaks of the profiles that are merged, and on the power constraint. For our running example, the merged profile can look like the profile with four peaks (Fig. 12). To be consistent with the two-peak power model, the number of peaks has to be reduced to two. In Fig. 13(a), different areas are marked that have to be collapsed with the original profile to obtain two peaks. Depending on the magnitude of the areas marked, the final profile can look like Fig. 13(b) or Fig. 13(c). The decision on which areas to collapse depends on the area under the profile, and the profile with the least area is chosen.
We next discuss the impact of power constraints on the merging of profiles. In the case when a power violation occurs, the profile for B is shifted until no violation is caused. If the two lower peaks from the two regions exceed the power limit PL, i.e., P A lo + P B lo > PL, then the two power profiles cannot overlap. In such a case, we sort the four peaks in descending order and place them side-by-side to create the profile shown in Fig. 14(a). Different combinations of the marked areas are grouped together to reduce the number of peaks to two. The final profile can be any one of the profiles shown in Fig. 14(b)-(d). We have implemented a dedicated transformation procedure for all kinds of profiles that can result from the merge operation. The enumeration of these cases and the corresponding transformations are straightforward and therefore omitted.
3) Integration with DP: The proposed method for power profile manipulation has been integrated with our DP approach. When a separator is placed in a region, rather than taking the maximum of the testing times of the two subregions as the test time for that region, we compute the new test time as the test length of the power profile obtained after merging the power profiles of the two subregions. Line 6 in Fig. 10 is modified accordingly. Since any two power profiles are merged without violating the power constraint, the power profile of the NoC N does not violate the constraint. The runtime complexity of the approach under power constraints remains unaffected. The procedure to merge two profiles takes constant time to compute a merged profile. When creating a power profile for a leaf region using the createPowerProfile procedure, the complexity increases by a factor of mn (an efficient implementation using DP increases the complexity by a factor of min(m, n)·log(mn)), but this part only constitutes the initialization step and does not dominate the runtime of the main algorithm.
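The shift step of the merge operation can be illustrated with a small Python sketch. It is a deliberate simplification and not the transformation procedure mentioned above: profiles are represented as per-time-unit power lists rather than as two-peak tuples, only shifts of profile B are tried (no flips), and the merged profile is not collapsed back to two peaks. The function names and the sample numbers are hypothetical.

```python
def min_shift(a, b, limit):
    """Smallest delay s of profile b such that a[t] + b[t - s] <= limit
    wherever the two profiles overlap."""
    for s in range(len(a) + 1):
        overlap = range(s, min(len(a), s + len(b)))
        if all(a[t] + b[t - s] <= limit for t in overlap):
            return s
    return len(a)  # not reached: s == len(a) has an empty overlap


def merge_profiles(a, b, limit):
    """Merge two profiles by shifting b; returns the summed profile."""
    s = min_shift(a, b, limit)
    length = max(len(a), s + len(b))
    merged = [0] * length
    for t, p in enumerate(a):
        merged[t] += p
    for t, p in enumerate(b):
        merged[s + t] += p
    return merged


if __name__ == "__main__":
    # Two-level profiles: a high segment followed by a low segment.
    profile_a = [5, 5, 2, 2, 2]   # P_hi = 5 for 2 units, P_lo = 2 for 3 units
    profile_b = [4, 4, 1, 1]      # P_hi = 4 for 2 units, P_lo = 1 for 2 units
    print(merge_profiles(profile_a, profile_b, limit=7))
    print(len(merge_profiles(profile_a, profile_b, limit=5)))
```

The length of the merged list plays the role of the test time of the parent region in the DP, replacing the maximum of the two subregion test times. With the tighter limit in the second call, the profiles cannot overlap at all under shifting alone, so the merged length is the sum of the two lengths.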
D. Reducing Runtime Complexity by Factor of P
The runtime complexity of the proposed approach was found to be a function of P^2. The number of pins P is much larger in magnitude when compared to other parameters, such as m, n, or k. We next show how the factor of P^2 can be reduced to P. The approach presented in the previous section distributes pins on both sides of the separator such that the sum of the pins on both sides is equal to a given pin count p, and selects the distribution that minimizes test time. All possible combinations are tried, making the search exhaustive. This is repeated for all possible values of p; therefore, a factor of P^2 is seen in the computational complexity. Exhaustive search can be reduced to selective search by observing that, given a position of the separator and an optimal distribution of p pins in the two subregions created by the separator, the optimal distribution for a count of p + 1 pins is obtained by assigning the newly added pin to the subregion having the greater test time.
The modified computeAndStore procedure that implements selective search is shown in Fig. 15 . While computing the minimum test time for a non-leaf region, we maintain an optimal distribution of pins in the two subregions for each position of separator v, for each pin-count p, and partition size x of one of the subregions (the partition size of the other sub-region is computed as k − x, where k is the partition size of the parent region for which we are computing the minimum test time). For the base case when the pin-count p equals k, the number of pins assigned to the subregions equals their respective partition sizes, i.e., x and k − x. This is the only possible (and hence optimal) distribution because each leaf region has to be driven by at least one pin. The arrays M and B are updated whenever a better result is found. A similar procedure is used for sweeping horizontal separators. 
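The selective search can be contrasted with the exhaustive search in a few lines of Python. This is a minimal sketch for one fixed separator position and one fixed split of the partition size, not the procedure of Fig. 15. It assumes that the test time of each subregion is a nonincreasing function of its pin count, and the two sample test-time functions are invented for illustration.

```python
def exhaustive_split(f, g, p, a_min=1, b_min=1):
    """Best max(f(a), g(p - a)) over every split of p pins: O(p) per p."""
    return min(max(f(a), g(p - a)) for a in range(a_min, p - b_min + 1))


def greedy_splits(f, g, p_max, a_min=1, b_min=1):
    """Pin distributions for all pin counts up to p_max in one pass.

    Starting from the minimal allocation, each extra pin goes to the
    subregion that currently has the greater test time.
    """
    a, b = a_min, b_min
    result = {a + b: (a, b, max(f(a), g(b)))}
    while a + b < p_max:
        if f(a) >= g(b):
            a += 1
        else:
            b += 1
        result[a + b] = (a, b, max(f(a), g(b)))
    return result


if __name__ == "__main__":
    # Hypothetical nonincreasing test-time functions of the pin count.
    def test_time_a(pins):
        return -(-120 // pins)        # ceil(120 / pins)

    def test_time_b(pins):
        return -(-60 // pins) + 5     # ceil(60 / pins) + 5

    table = greedy_splits(test_time_a, test_time_b, p_max=12)
    for p, (a, b, t) in sorted(table.items()):
        print(p, a, b, t, exhaustive_split(test_time_a, test_time_b, p))
```

The greedy pass touches each pin count once, so the work per separator position is linear in P, whereas repeating the exhaustive split for every pin count is quadratic in P.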
VI. Experimental Results
In this section, we first describe how we created our test cases. The rest of the section includes the following.
1) For NoCs with arbitrary (irregular) topologies, we compare the results obtained from our approach based on subset-sum with an implementation of [12].
2) For a mesh-based NoC, we compare the results obtained using DP to those obtained with ILP [17]. Since the comparison is based on the problem instances that can be solved with ILP in a reasonable time, we take small problem instances first. These test cases provide a good baseline with which we compare the quality of solutions produced by our approach.
3) Results showing the effect of power constraints on the total test time are reported and compared with [32].
A. Test Scheduling Without Power Constraints
We used six SOCs from the ITC'02 SoC Test Benchmarks [33] , namely, d695, g1023, p34392, p22810, t512505, and p93791, with 10, 14, 19, 28, 31, and 32 cores, respectively. Since we are interested in evaluating performance for a large number of cores, we created additional SoCs by taking cores from the last two SoCs and replicating them. The test times for all TAM widths (constrained by flit width) for each core were obtained using the design wrapper algorithm [6] .
For mesh-based NoCs, we adopted the same approach as outlined in [13], [16], [17] for generating the topology of the NoC. We assumed XY-routing, a switching delay of three clock cycles per router for access point-to-core or core-to-access point path establishment, and one cycle each for transmitting header and tail flits. A flit width of 32 bits was assumed. For each leaf region, we placed an access point at the middle of its longest boundary edge, and computed the cumulative routing delay of the region as the sum of routing delays for all the cores located in that region. It can be easily shown that, by taking the derivative of the routing delay and equating it to zero, the routing delay is minimized when the access point is placed at this specific location of the region. The routing delay for a core during path establishment is three times its Manhattan distance from the access point assigned to that core. We executed the ILP model using the FICO XPress-MP Solver [34] that was also used in [17].
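The routing-delay model stated above can be checked numerically with a short Python sketch. It assumes that the access point is attached to a router on the bottom edge of a rectangular leaf region with one core per tile, and it counts only the path-establishment delay of three cycles per unit of Manhattan distance; the header and tail flit cycles are a constant per core and are left out. The function name and the 5 × 3 region are hypothetical.

```python
def region_routing_delay(cols, rows, ap, hop_cycles=3):
    """Sum of path-establishment delays for every tile in a cols x rows
    region, with the access point attached at tile coordinate ap."""
    ap_x, ap_y = ap
    return sum(hop_cycles * (abs(x - ap_x) + abs(y - ap_y))
               for x in range(cols)
               for y in range(rows))


if __name__ == "__main__":
    cols, rows = 5, 3  # the bottom edge (length 5) is the longest edge
    delays = {x: region_routing_delay(cols, rows, (x, 0)) for x in range(cols)}
    best_x = min(delays, key=delays.get)
    print(delays)
    print("best access-point column:", best_x)
```

Sweeping the access point along the longest edge shows the cumulative delay reaching its minimum at the middle column, in line with the placement rule used in the experiments.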
Each region contains exactly one access point; hence, the number of regions (partition size) in our work equals the number of access points.

In the first experiment, we compare our method for an NoC with an arbitrary topology with the method presented in [12]. Since the latter is based on TR-Architect [24], it does not restrict the number of access points (ATE interfaces) that are used; it reports the number of access points that are required to minimize the test time. We modify the ModifiedCreateStartSolution procedure to initialize a solution with a fixed number of access points (K). Table I shows the results for the two methods for two NoC test cases. The column K * does not restrict the number of access points for the method of [12]. The column [12], K * reports the number of access points required (K * ) along with the minimized test time. We use K * to derive the test time for the proposed method, which is shown in the last column. The NoC examples (irreg1 and irreg2) are taken from [12] and they both consist of nine routers. Each router is assigned a core taken from the SoC d695. Since there are nine routers and d695 consists of 10 cores, we ignore one core, namely Module5, for this experiment. Note that the greedy procedure of pin distribution is optimal at every iteration; hence, solutions are obtained for a range of pin counts in a single run. The CPU time was found to be negligible in all cases. Our method consistently reports better solutions than [12].
Next we present results for a mesh-based NoC. We took a 6 × 6 NoC obtained from the SoC t512505 (31 cores) and compared the results obtained using the two approaches. The ILP proposed in [17] does not address the problem of optimal access point placement. Therefore, we show two different sets of results for the ILP model; one obtained by a random placement of access points (ILP * ) and the other obtained by placing the access point at the locations determined by DP (ILP * * ). The results for different values of K and P are shown in Table III . The CPU time for DP is less than 1 s.
The ILP solver was allowed to execute until it found an optimal solution. Note that an optimal solution for ILP * may be worse than a solution obtained using DP due to the nonoptimal placement of access points. Placing the access points at the locations suggested by the proposed approach almost always resulted in reduced test time. These results demonstrate that the proposed DP approach can produce results that are very close to that obtained by ILP. The DP approach leads to optimal solutions under the constraint of rectangular partitions. The ILP approach can lead to better solutions than DP by allowing nonrectangular partitions. Table IV compares the test times obtained using various methods with TAM-independent lower bounds on test time that are derived using [24] . (Note that these lower bounds cannot always be achieved due to bottleneck cores [6] .) The third column reports the lower of the test times obtained using ILP * and ILP * * . Since the lower-bound expression in [24] is not tied to any particular TAM design and utilizes only the volume of test data that must be transported, these bounds are also applicable to the test-time minimization problem in NoCs. Moreover, an NoC imposes additional constraints on the classical TAM optimization problem; therefore, the lower bounds of [24] hold in our NoC-based TAM scenario as well, and we expect actual test times to be larger than the lower bounds. Nevertheless, closeness to lower bounds is a measure of the effectiveness of an optimization method. For the problem instances listed in Table IV , the test-time results are only 5%-13% higher than provable lower bounds.
We show similar results for a larger 14 × 14 NoC obtained by replicating cores from the SoC benchmark p93791; see Table V . We set a time limit of three hours for ILP for each value of K and P, and report the best intermediate results obtained within that time limit. It was observed that with an increase in the number of regions K, ILP took longer time to report the first intermediate solution. We set three hours as the limit because no noticeable improvement was seen in the ILP intermediate solutions after this duration. The CPU time taken by DP is only 4 minutes and 8 seconds, which is negligible in comparison to the cumulative execution time of 1 day and 12 hours taken by ILP for all the 12 cases shown in Table V . Our approach also reported better results for some larger values of K. Note that for instances for which ILP * * yields lower test times than DP, a combination of the two methods can be used. An effective partition can be first identified using DP, and then the test-pin assignment problem can be solved using ILP, as in [17] . However, for larger problem instances, ILP is not feasible due to high computation requirements.
We next show results for a 20 × 20 NoC obtained by replicating cores from the two SoC benchmarks; see Table VI. Considering the size of the NoC, a CPU time limit of 6 h was set for ILP for each of the 12 cases shown in Table VI. The total CPU time taken by ILP * for all the cases was found to be 2 days and 7 h; ILP * * took 3 days. The CPU time taken by ILP is clearly impractical, and the ILP approach does not scale with the number of cores and the size of the partition. In contrast, the CPU time for DP is only 25 min and 10 s.
To further demonstrate the scalability and benefits of the proposed approach, we evaluated the DP method for an SoC of the future that has nearly 1000 cores (a 32 × 31 NoC). The DP procedure completes in 4 h of CPU time. In order to evaluate the quality of the solution (test time obtained), we developed a simple baseline heuristic for generating a partition. Among all intermediate regions available for partitioning, the region having the largest number of network tiles was selected and a separator was placed randomly. Pins were distributed in the ratio of the dimensions of the leaf regions. We generated 100 such partitions for each value of K with P = 150, and report the mean, minimum, and maximum test time for each case, as shown in Table VII. It can be seen from the table that DP provides consistently superior results, with a reduction in test time of two orders of magnitude compared to the mean test time for the baseline case. Compared to the minimum test time for the baseline case, the test time reduction is in the range of 13% to 48%.
We also considered an SoC with 1600 cores (40 × 40 NoC). The DP solution (for 2 ≤ K ≤ 5) was obtained in 3 h of CPU time for P = 150. The test times obtained from DP were consistently lower than those for the randomized baseline.

We also examined the scalability of the subset-sum-based method for large SoCs. We ran the procedure optimize TAM on the 32 × 31 NoC for K = 4. Fig. 16 shows the percentage reduction in test time reported by the procedure over the DP-based method for varying values of its control parameter and P. The figure also shows the CPU time needed by the procedure for each value of this parameter. It can be seen that when the parameter is high, the subset-sum-based method is capable of producing better results than DP, but takes more CPU time. The reduction in test time was found to be as high as 3.3% when the parameter was set to 20000. The test time reported by the procedure optimize TAM was 1.4% more than that for DP when the parameter was set to 5000. The CPU time varied from 7.2 h to 1.5 h as the parameter was swept from high to low values, whereas DP took 4 h to produce the results. It will be seen later that for the 32 × 31 NoC, the CPU time for DP can be further lowered to 11 min 50 s using the speed-up technique discussed in Section V-D. As mentioned in Section III-E, the optimize TAM procedure scales down the test times of individual cores to avoid computational bottlenecks in solving subset-sum problem instances. This approximation step may lead to the elimination of some valid partitions from the solution space. Therefore, optimality is not guaranteed with the subset-sum-based approach.
We next examine our experimental results for further analysis. [12] for all these cases. The test-time reduction (TTR) column shows the relative reduction of test time obtained by adding an additional access point (using the DP approach). It can be seen that the magnitude of reduction in test time gradually decreases. This is because the test time of a core depends on the number of pins assigned to it and as the partition size increases, the number of pins available per region decreases. By increasing the pin count, we observe that the effect of sudden decrease in TTR can be moderated. For example, for P = 80, the TTR rapidly dipped to 0%, but we were able to moderate the sudden decline by allotting 120 pins, and get further benefits by increasing the pin count to 160. However, the number of available pins on the ATE is limited, hence it is natural to ask what is a suitable choice for the partition size and the pin-count that should be used, and how can we calculate these values. These questions will be addressed in future work.
The row K * in Table VIII shows the result produced by [12] when no restriction is placed on the number of access points to be used. It can be seen that [12] reports an extremely large number of access points, which can be harder to implement in practice. Moreover, a large number of access points can lead to the associated problem of power consumption because of test parallelism. We report lower test times than [12] using fewer access points. For P = 160 (not shown in the table), the improvement achieved by our method over [12] is as high as 37.6% when K is restricted to 8. When K is not restricted, [12] resulted in a test time (using 23 access points) that is worse than the test time reported by DP with only seven access points. Since our approach only creates rectangular partitions, a simple postprocessing step, such as that implemented in the procedure ModifiedReshuffle of [12] , can further reduce test time by moving cores from one region to another. We also report lower bound values for the two values of P in the last row of Table VIII. The test times that we obtained are only 9.4% (12.5%) larger than the provable lower bounds for P = 80 (P = 120).
B. Power-Constrained Test Scheduling
To assess the impact of power constraints on test scheduling, we ran our approach on two NOCs: a 6 × 6 NoC and an NoC with 100 cores (10 × 10), both constructed out of cores from the benchmark circuit p93791. Due to the lack of information on power consumption of these cores, we assumed that the power consumption in a core is directly proportional to the sum of the number of the core's inputs, outputs, bidirectional pins, and memory elements, the same approach as adopted in [32]. All values for power consumption used were relative with respect to the total sum of the power consumption of all cores, which is referred to as system power consumption in [32]. We therefore refer to power in terms of a normalized value relative to the total system power. Table IX compares the test length obtained by our approach with that obtained using the ILP model from [32]. All power constraints are defined as a fraction of the system power. Scheduling with the 1.0 power constraint is equivalent to scheduling without power constraints, as no schedule can exceed the total system power. Since our approach approximates the power consumption for a set of cores by manipulating power profiles to create an approximate profile, the performance of the approach depends on how tightly the approximation scheme bounds the actual power profile from above. The test lengths were found to match closely with the results obtained using [32] for all values of the power limit.
As our approach is necessitated by the intractability of problem instances involving large NoCs, we present the results for a 10×10 NoC in [35, Table X ] for different power constraints and partition sizes. Since, as in this case, each core contributes very little to the system power consumption, the power constraint was set to 25% of the system power consumption at first, and then subsequently the power budget was reduced by 5% at each step. The runtime complexity remains the same as before, and no appreciable difference in runtime was found for the reported cases.
C. Speedup Technique
We next show the effect of the speedup technique, discussed in Section V-D, on the computation time for DP. In [35, Table XI], the third column corresponds to the approach taken for reducing the runtime complexity by a factor of P. The speedup is clearly evident for larger NOCs.
VII. Conclusion
We have developed a scalable solution to the problem of optimizing test-data delivery in an NoC-based manycore SoC. A formulation based on the subset-sum problem has been proposed for NoCs with dedicated routing and arbitrary topologies. For grid topologies supporting XY routing, test-time minimization has been solved using DP, which computes optimal solutions for rectangular partitions. Results for NoC-based manycore SoCs constructed from ITC 2002 benchmarks have shown that the proposed method yields high-quality results and scales to large SoCs with many cores. Test scheduling under power constraints and a speedup technique have been incorporated. Since dynamic programming solutions are recursively constructed from solutions to underlying subproblems, the proposed method can inherently facilitate design-space exploration for effective test planning.
