A novel approach for testing embedded memories in complex systems-on-a-chip (SOCs) is presented. The proposed solution aims to balance the usage of existing on-chip resources and dedicated design for test (DFT) hardware such that the functional power constraints are not exceeded during test, while trading off the testing time against DFT area and performance overhead. The suitability of software-centric and hardware-centric approaches for embedded memory testing is examined and, to combine the advantages of both directions, a new built-in self-test (BIST) architecture is proposed.
Introduction
It is anticipated that embedded memories will account for 95% of the silicon area in complex SOCs by 2016 [7].
Low-cost, high-quality testing of these heterogeneous SOCs, comprising a large number of embedded memories, poses a great challenge to the test community. The emerging solutions should address limited bandwidth, at-speed test for performance validation, long testing times, on-chip DFT area overhead and performance penalty, and excessive power dissipation, which may cause manufacturing yield and reliability concerns [10].
There are two main approaches for testing embedded memories: direct access and BIST [4]. On the one hand, direct access to the embedded memory cores from the limited number of I/O pins needs high-performance automatic test equipment (ATE), as well as a very long testing time. Thus, direct access is infeasible, in particular for complex SOC devices where the transistor-to-pin ratio is high. On the other hand, memory BIST (MBIST) provides at-speed and high-bandwidth access to the embedded memory cores, and it only needs a low-cost ATE to initialize the tests and to inspect the final results. However, although BIST is the state-of-the-art technology for embedded memory testing, unless carefully designed, it may induce excessive power, in addition to performance and area overhead.
Motivation for New Approaches for Power-Conscious Hardware/Software Co-testing
The embedded memory cores in an SOC can be divided into two groups: bus-connected memories (BCMs) and non-bus-connected memories (NBCMs). Although all the embedded memory cores can be tested by adding dedicated memory BIST wrappers, the high area overhead of the BIST circuitry, as well as the performance penalty caused by intrusive DFT hardware, may prove to be the main drawback of this approach. Since a complex SOC usually contains one or more processing elements (e.g., microprocessors), which use on-chip system busses to communicate with the memory cores, reusing the existing on-chip resources for testing the embedded memories can lower the overhead associated with a high number of dedicated MBIST wrappers.
In [12], a test methodology for testing SOCs using a microprocessor core was presented. A software-centric embedded memory testing approach (i.e., memory testing is undertaken by an on-chip processing element, which reuses the given functional resources and interconnect topology) eliminates the overhead caused by BIST wrappers. However, it requires a significantly longer testing time than the existing hardware-centric approaches (i.e., memory testing is done using dedicated DFT, e.g., BIST hardware and interconnect resources) [1, 9, 18], which can add to the overall cost of SOC test. The main purpose of this research is to develop an SOC test solution that maintains the benefits of both software-centric and hardware-centric approaches, while satisfying the power constraints during test. The importance of power dissipation during test and its relationship to test cost parameters such as testing time and DFT area overhead are discussed next.
Power dissipation is becoming a key challenge for deep submicron complementary metal-oxide semiconductor (CMOS) digital integrated circuits. Placing more and more functions on a silicon die has resulted in higher power/heat densities, which impose stringent constraints on packaging and thermal management in order to preserve performance and reliability. While low power design techniques have been employed for more than two decades, the latest International Technology Roadmap for Semiconductors (ITRS) [7] anticipates that power will be limited more by system-level cooling and test constraints than by packaging. This is because, if packaging and thermal management parameters (e.g., heat sinks) are determined based only on the functional operating conditions, the high activity during test will affect both manufacturing yield and reliability [10].
On the one hand, when testing bus-connected memories in an SOC, the power constraints are easily satisfied; however, the main drawback lies in the serialization of the test schedule, which leads to excessive testing time. To address this problem, a processor-programmable solution was proposed in [14], where a BIST circuit was inserted between the processor and the system bus. Although it reduces the testing time, this solution can affect the overall SOC performance, since the BIST circuitry may increase the delay on the critical path. Therefore, new approaches need to be sought to address the problems associated with software-centric approaches.
On the other hand, since not all the embedded memories in an SOC are connected to a bus system, hardware-centric memory BIST approaches are necessary for all the non-bus-connected memories. While the previous hardware-centric approaches [1, 9, 18] have tackled the core-level aspects of memory BIST, distributed solutions are necessary to further reduce the overhead at the system level and to reduce the test control complexity associated with complex and heterogeneous SOCs. For example, a distributed MBIST architecture was proposed in [3]. This architecture has relatively low area overhead; however, its main drawback lies in the control mechanism, which configures all the memory wrappers to run identical test commands in each test session. This implies that memories running different test algorithms (e.g., heterogeneous memories) cannot be tested simultaneously, thus decreasing test control flexibility and increasing testing time. Furthermore, the testing time of each test session is dominated by the largest memory, which may lead to prohibitively long testing time under power constraints. To address power-constrained embedded memory testing, one effective solution is to limit the number of concurrently tested memory blocks using test scheduling under power constraints.
The BIST architecture proposed in [3] can be adapted to this solution; however, it will lead to prohibitively long testing time under power constraints when the large memory blocks dominate the time spent in each test session [3]. Hence, new flexible BIST architectures need to be investigated, which will guarantee both low area and control complexity, as well as high test concurrency under given power constraints. To achieve this, a control mechanism must be provided to convert non-partitioned testing [3] to partitioned testing with run to completion [5], as well as to lower both the area overhead and the routing congestion associated with the test control for non-bus-connected memories.
The remainder of this paper is organized as follows. The proposed power-constrained programmable MBIST architecture, facilitating hardware/software co-testing, is described in Section 2. Section 3 details the software structure and test schedule organization. A new test scheduling algorithm is described in Section 4. Section 5 gives the experimental results and Section 6 concludes the paper.
To test BCMs, unlike the approach reported in [14], the proposed solution uses the standard bus interface to exchange data between the central processing unit (CPU) and the MBIST controller, and hence it can affect only the bus performance. Furthermore, if the critical path of an SOC is not on the system bus (which is a realistic case in practice), the presented solution will not influence the SOC performance. In addition, by using a standard bus interface, the proposed MBIST module can be reused as a soft IP core.
When testing NBCMs, to overcome the disadvantage of supporting only non-partitioned test scheduling, as in [3], the proposed MBIST controller has full controllability of each wrapper. This new MBIST architecture supports partitioned testing with run to completion [5], which gives more flexibility to power-constrained test scheduling. In addition, by running most of the non-time-consuming tasks, such as fetch and decode of test commands, in the processing unit using software, the BIST area overhead of the MBIST controller is reduced.
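The difference between non-partitioned (session-based) testing and partitioned testing with run to completion can be illustrated with a small sketch. The four memory tests, their testing times and power values, and the constraint `P_MAX` are hypothetical numbers chosen only for illustration; they are not taken from the experiments.

```python
from typing import Dict, List, Tuple

# Hypothetical memory tests: name -> (testing time in cycles, test power in mW).
TESTS: Dict[str, Tuple[int, float]] = {
    "A": (100, 30.0),
    "B": (20, 30.0),
    "C": (20, 30.0),
    "D": (20, 30.0),
}
P_MAX = 60.0  # assumed power constraint


def nonpartitioned_time(sessions: List[List[str]]) -> int:
    """Sessions run back to back; each lasts as long as its longest test."""
    return sum(max(TESTS[m][0] for m in session) for session in sessions)


def makespan(start_times: Dict[str, int]) -> int:
    """Finish time of a schedule given an explicit start time per test."""
    return max(start + TESTS[m][0] for m, start in start_times.items())


def peak_power(start_times: Dict[str, int]) -> float:
    """Highest total power over time; it can only rise when a test starts."""
    peak = 0.0
    for t in start_times.values():
        running = [m for m, s in start_times.items() if s <= t < s + TESTS[m][0]]
        peak = max(peak, sum(TESTS[m][1] for m in running))
    return peak


# Non-partitioned: two sessions, each respecting P_MAX.
sessions = [["A", "B"], ["C", "D"]]
# Partitioned with run to completion: C and D start as soon as power frees up.
partitioned = {"A": 0, "B": 0, "C": 20, "D": 40}

print(nonpartitioned_time(sessions))       # 120
print(makespan(partitioned))               # 100
print(peak_power(partitioned) <= P_MAX)    # True
```

In this toy case the first session is held open by the long test A, so the session-based schedule finishes later than the partitioned one under the same power constraint.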
Novel Programmable MBIST Controller
One of the distinctive features of the proposed solution is a shared BIST controller for heterogeneous memories. As shown in Figure 2, the MBIST controller has a bus slave interface to the on-chip CPU and a serial interface to the off-chip ATE. If one wishes to test memories using the ATE, the controller can simply bypass the test data to (from) the serial scan chain, which links all the memory wrappers together. To test BCMs, the MBIST controller passes parallel commands to (from) the BCM wrapper through the parallel interface between them.
To test all the other NBCMs, the MBIST controller has to do parallel/serial conversion, send commands to each wrapper, and then do serial/parallel conversion for the results received from the NBCM wrappers before passing them to the CPU. Another key feature of the proposed MBIST architecture is the interconnect mechanism between the memories, wrappers and controller. Due to its simple, yet powerful, interface, any hardware/software co-testing schedule can be implemented by programming the BIST controller.
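The parallel/serial and serial/parallel conversions mentioned above can be sketched as two small shift routines. The 8-bit command width and the MSB-first bit order are assumptions made for this example; the actual command format of the controller is not specified here.

```python
from typing import List

WORD_BITS = 8  # assumed command width; not stated in the text


def parallel_to_serial(word: int, width: int = WORD_BITS) -> List[int]:
    """Shift a command word out MSB first, one bit per scan clock."""
    return [(word >> i) & 1 for i in range(width - 1, -1, -1)]


def serial_to_parallel(bits: List[int]) -> int:
    """Shift received bits (MSB first) back into a parallel word."""
    word = 0
    for bit in bits:
        word = (word << 1) | bit
    return word


command = 0xA5  # hypothetical command word
stream = parallel_to_serial(command)
print(stream)                            # [1, 0, 1, 0, 0, 1, 0, 1]
print(hex(serial_to_parallel(stream)))   # 0xa5
```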
MBIST Wrappers
Both BCM and NBCM wrappers (as shown in Figure 3 and Figure 4) [...] The BCM wrapper, on the other hand, uses the bus interface to test all the memories connected to the bus. Since the testing process is serialized due to the common test access resource (the functional bus), a single BCM wrapper is needed, which has a parallel connection to the MBIST controller (see Figure 4). Note that both BCM and NBCM wrappers can run different March elements received from the MBIST controller, which implies that they can run different March algorithms. This is very useful for diagnosis purposes, as described in the next section. To further reduce the complexity of manufacturing test, each wrapper has a built-in default March algorithm, which can be activated using a single command.
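To make the notion of a March algorithm built from March elements concrete, the sketch below runs the standard bit-oriented March C- test on a simple memory model. This is an illustration only: it is not the word-oriented March-CW used by the wrappers, and the element encoding, the `Memory` class and the injected stuck-at fault are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# A March element: (address order, list of operations).
# Order is "up" or "down"; operations are ("r", expected) or ("w", value).
Element = Tuple[str, List[Tuple[str, int]]]

# Bit-oriented March C-:
# {any(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); any(r0)}
MARCH_C_MINUS: List[Element] = [
    ("up", [("w", 0)]),
    ("up", [("r", 0), ("w", 1)]),
    ("up", [("r", 1), ("w", 0)]),
    ("down", [("r", 0), ("w", 1)]),
    ("down", [("r", 1), ("w", 0)]),
    ("up", [("r", 0)]),
]


class Memory:
    """One-bit-wide memory model with an optional stuck-at fault."""

    def __init__(self, size: int, stuck: Optional[Tuple[int, int]] = None):
        self.cells = [0] * size
        self.stuck = stuck  # (address, value) or None

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = value

    def read(self, addr: int) -> int:
        if self.stuck is not None and self.stuck[0] == addr:
            return self.stuck[1]
        return self.cells[addr]


def run_march(mem: Memory, elements: List[Element]) -> bool:
    """Apply each element to every address; return True if all reads match."""
    size = len(mem.cells)
    for order, ops in elements:
        addresses = range(size) if order == "up" else range(size - 1, -1, -1)
        for addr in addresses:
            for op, value in ops:
                if op == "w":
                    mem.write(addr, value)
                elif mem.read(addr) != value:
                    return False
    return True


print(run_march(Memory(16), MARCH_C_MINUS))                # True
print(run_march(Memory(16, stuck=(5, 1)), MARCH_C_MINUS))  # False
```

Because the algorithm is just a list of elements, a controller that sends elements one at a time can compose different March algorithms from the same building blocks.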
Software Implementation
The BIST architecture described in the previous section can facilitate hardware/software co-testing for a given test schedule, as described in Section 4. This section details the software implementation necessary to handle the test control flow. First, the test commands are generated using the power-constrained test scheduling algorithm (see Section 4) and loaded into a test memory (e.g., the I-cache), which was pretested using the ATE. The test software then reads commands from the test memory and sends data to (reads responses from) the MBIST controller. Since the time-consuming tasks (i.e., on-chip generation of March tests) are conducted by dedicated BIST hardware, software functions are used only for test control, which has an insignificant effect on the overall testing time. In the following the pseudo-code is given.
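The control flow described in this section can be sketched as a short fetch, decode and dispatch loop. This is a hypothetical illustration, not the authors' pseudo-code: the command tuple format, the opcode string `RUN_DEFAULT` and the `MBISTControllerStub` class are assumptions, and the stub simply reports every test as passing.

```python
from typing import Dict, List, Tuple

# Hypothetical command format: (start cycle, wrapper id, opcode).
Command = Tuple[int, int, str]


class MBISTControllerStub:
    """Stand-in for the bus-slave MBIST controller; records what it is sent."""

    def __init__(self) -> None:
        self.log: List[Tuple[int, str]] = []
        self.results: Dict[int, str] = {}

    def send(self, wrapper: int, opcode: str) -> None:
        self.log.append((wrapper, opcode))
        # The stub reports every test as passing; real hardware would run
        # the March test in the wrapper and return its own status.
        self.results[wrapper] = "pass"

    def read_result(self, wrapper: int) -> str:
        return self.results[wrapper]


def run_test_program(test_memory: List[Command],
                     controller: MBISTControllerStub) -> Dict[int, str]:
    """Fetch and decode commands in schedule order and collect responses."""
    for _, wrapper, opcode in sorted(test_memory):
        controller.send(wrapper, opcode)
    tested = {wrapper for _, wrapper, _ in test_memory}
    return {wrapper: controller.read_result(wrapper) for wrapper in sorted(tested)}


program = [(20, 2, "RUN_DEFAULT"), (0, 0, "RUN_DEFAULT"), (0, 1, "RUN_DEFAULT")]
ctrl = MBISTControllerStub()
print(run_test_program(program, ctrl))  # {0: 'pass', 1: 'pass', 2: 'pass'}
print(ctrl.log[0])                      # (0, 'RUN_DEFAULT')
```

The software only sequences commands and collects status, which mirrors the division of labour in which March test generation stays in the BIST hardware.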
Test Scheduling Algorithm
Since test scheduling is proven to be an NP-complete problem [5], in this section we propose a greedy heuristic to deal with complex SOCs comprising hundreds of memories.
The algorithm MemSchedule takes the test parameters of each memory $(p_i, t_{w_i}, t_{uw_i})$ and the power constraint $P_{max}$ as inputs, and it outputs the test schedule $T_{schedule}$ and the test method $M_i$ for each memory core (i.e., wrapped or unwrapped). Note that this algorithm also increases the number of unwrapped BCMs without affecting the testing time.
The test of each memory is represented as a rectangle whose width is the testing time and whose height is the power dissipation. For BCMs, two rectangles with different testing times are necessary (one for each test method), while for NBCMs one rectangle is sufficient. If two BCMs on the same single-master bus are unwrapped, they cannot be tested at the same time due to bus contention, and we call these two memories incompatible. The proposed algorithm is based on the rectangle packing algorithm TAMscheduleoptimizer proposed in [8] for SOC testing.
The algorithm starts by initializing the test method $M_i$ of each memory: all memories are treated as wrapped except those explicitly specified as unwrapped. Then the testing time $T_i$ of each memory is computed based on its test method (line 2). The currently available power $P_{avl}$ is initialized to $P_{max}$ and the number of unscheduled memories is initialized to the total number of embedded memories (line 3). As long as there is an unscheduled memory, the algorithm first finds a compatible memory $m_i$ with maximum testing time $max_t$ that does not exceed the power dissipation constraint (line 6). If such a memory $m_i$ exists, it is scheduled and the available power is updated (lines 8, 9). If no compatible memory meets the power constraint, we record the idle power dissipation $P_{idle}$, set the available power $P_{avl} = 0$ (line 11) and branch to the end of the schedule of the memory with the minimum end time $min_{end}$. Afterward we update the schedule information, including the new schedule begin time, the number of unscheduled memories $N_{unscheduled}$ and the available power $P_{avl}$ (lines 12-15).
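[Listing: Algorithm MemSchedule. Surviving steps: the available power is updated after a memory is scheduled (line 9); in the else branch (lines 12-15), the schedule of the memory m_j with min_end is finished and the schedule begin time is updated.]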
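The greedy heuristic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MemSchedule listing itself: the `MemTest` record and the `mem_schedule` function are hypothetical names, the test method of each memory is taken as given rather than chosen by the algorithm, incompatibility is modelled by a shared `bus` label on unwrapped BCMs, and the idle power bookkeeping is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class MemTest:
    name: str
    power: float               # test power (height of the rectangle)
    time: int                  # testing time (width of the rectangle)
    bus: Optional[str] = None  # set only for unwrapped BCMs sharing a bus


def mem_schedule(tests: List[MemTest], p_max: float) -> Tuple[Dict[str, int], int]:
    """Greedy power-constrained scheduling; returns start times and makespan."""
    unscheduled = list(tests)
    running: List[Tuple[int, MemTest]] = []  # (end time, test)
    start: Dict[str, int] = {}
    now = 0
    finish = 0
    while unscheduled:
        p_avl = p_max - sum(t.power for _, t in running)
        busy = {t.bus for _, t in running if t.bus is not None}
        # Compatible tests that fit into the currently available power.
        fits = [t for t in unscheduled
                if t.power <= p_avl and t.bus not in busy]
        if fits:
            chosen = max(fits, key=lambda t: t.time)  # longest test first
            start[chosen.name] = now
            running.append((now + chosen.time, chosen))
            finish = max(finish, now + chosen.time)
            unscheduled.remove(chosen)
        elif running:
            # Nothing fits: jump to the earliest end time and release that test.
            ended = min(running, key=lambda r: r[0])
            running.remove(ended)
            now = ended[0]
        else:
            raise ValueError("a test exceeds the power constraint on its own")
    return start, finish


nbcm = [MemTest("A", 30.0, 100), MemTest("B", 30.0, 20),
        MemTest("C", 30.0, 20), MemTest("D", 30.0, 20)]
print(mem_schedule(nbcm, 60.0))  # ({'A': 0, 'B': 0, 'C': 20, 'D': 40}, 100)

# Two unwrapped BCMs on the same (hypothetical) bus are serialized.
bcm = [MemTest("X", 5.0, 50, bus="ahb"), MemTest("Y", 5.0, 40, bus="ahb")]
print(mem_schedule(bcm, 60.0))   # ({'X': 0, 'Y': 50}, 90)
```

Each test runs to completion once started, and a new test begins as soon as enough power is released, which is the partitioned behaviour the controller is designed to support.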
Experimental Results
To estimate and compare different test parameters (e.g., BIST area overhead, testing time, performance penalty, power dissipation) of the proposed solution, a set of experiments has been performed using high-density embedded SRAMs compiled for 0.18 µm CMOS technology. In our LEON-based SOC platform, the NBCMs are a cluster of four 512x16 SRAMs, one 1kx32 SRAM and one 4kx32 SRAM, while the BCMs are one 4kx32 SRAM and one 16kx32 SRAM. The wrappers have been configured to implement 9 March elements: (w), (r), (rw), (rwr), (rww), (rwww), (rwrwr), (rwrwrw), (wwrww), which are the building blocks for most of the known March test algorithms [15, 18]. The last March element (wwrww) is used to run the word-oriented March C- test (March-CW, proposed in [17]). All wrappers also implement the March-CW algorithm as the default memory test algorithm.
Table 1 shows the BIST area overhead for the LEON SOC. By adding a very low area MBIST controller (0.14% in this case), the CPU can control self-testing for most of the embedded memories (note that cache memories and the register file are tested using the ATE, as outlined in Section 3). The MBIST area overhead (controller + wrappers) of this LEON SOC is less than 2%. Timing analysis indicates that the critical path is in the CPU core, which means that our approach does not introduce performance degradation. However, the NBCM wrappers do affect the memory access speed.
Table 2 shows the comparison of three different approaches (software-centric, hardware-centric, and a programmable BIST core attached to the bus to be used with hardware/software co-testing) for bus-connected memories using the LEON platform [6] and applying the March-CW algorithm [17]. As shown in the table, the proposed solution combines the benefits of the other two approaches.
It requires low area overhead, it may affect only the bus performance, and it maintains the flexibility of the software-centric approach, while it reduces the testing time significantly, bringing it close to the best possible testing times given by the hardware-centric approach.
Using the proposed MBIST architecture and the greedy heuristic for test scheduling described in Section 4, Table 3 gives a comparison between partitioned testing with run to completion and non-partitioned test scheduling algorithms under power constraints. The memory cores included in this experiment are of different sizes, with the number of address lines ranging from 7 to 16 and the word size ranging from 8 to 32. The power dissipation of these cores ranges from 3.5 mW to 50 mW at a 100 MHz clock frequency. On the one hand, it can be seen that the maximum testing time of non-partitioned testing is always greater than (or at least equal to) that of partitioned testing with run to completion. The difference between the two testing times varies with the maximum power constraint and the memory configurations. For example, the more the constraints are relaxed (i.e., the higher the achievable test concurrency), the larger the difference in testing time becomes, unless the maximum concurrency has been achieved by both algorithms. On the other hand, because there is more idle time in non-partitioned testing, more BCMs can be unwrapped and tested serially using the on-chip bus. This means that non-partitioned test scheduling may trade a longer testing time for a lower BIST area overhead. It should be noted that the proposed greedy heuristic is very fast (e.g., for 300 memories it takes less than 1 second).
Finally, the trade-off between BIST area overhead and testing time is shown in Table 4. The item "BCM as NBCM" means that the BCM has a dedicated wrapper and is tested as an NBCM. Decreasing the BIST area overhead (i.e., leaving all BCMs unwrapped) increases the testing time.
Similarly, to reduce the testing time, most of the BCMs have to be wrapped thus increasing the BIST area overhead.
Using the test scheduling engine as a trade-off exploration tool is especially beneficial when the testing time for the entire SOC is dominated by the scan testing time of the embedded logic cores. In this case, we can explore different hardware/software co-testing configurations until we match the logic cores' testing time with the lowest DFT area required by dedicated MBIST wrappers.
Conclusion
This paper has shown that for SOCs comprising tens to hundreds of embedded memories, a new solution called hardware/software co-testing can reduce the testing time and the area/performance overhead under power constraints.
