Abstract-Up to 53% of the time spent on testing current Intel microprocessors is needed to test on-chip caches, due to the high complexity of memory tests and to the large number of transistors dedicated to such memories. This paper discusses the methodology used to develop effective and efficient cache tests, and the way it is implemented to optimize the test set used at Intel to test their 512-kB caches manufactured in a 0.13-μm technology. An example is shown where a maximal test set of 15 tests with a corresponding maximum test time of 160.942 ms/chip is optimized to only six tests that require a test time of only 30.498 ms/chip.
I. INTRODUCTION
CACHE memory plays a significant role in today's microprocessors. An ever-increasing portion of the microprocessor is being dedicated to this ultra-fast on-chip memory, in order to ensure a continuous supply of information to high-performance processors. These memories are organized in multiple layers [e.g., level-1 (L1), level-2 (L2), and level-3 (L3) caches], with different performance requirements, and are designed to serve various system requirements (e.g., instruction caches versus data caches) [23]. The amount of on-chip memory embedded alongside the processor occupies about 50% of the chip area [26], and is expected to reach 90% by 2011 [8]. In terms of transistor count, the numbers are even more staggering: memory consumes up to 75% of the transistors in a modern processor design [23].
As the size of caches increases, there is a corresponding increase in the impact of memory testing on the overall test time and fault coverage of the microprocessor. At the same time, as the dimensions of manufactured devices decrease, new and more complex memory faults are observed that require specialized memory tests to ensure their detection [24]. Keeping in mind that the allowed defects per million (DPM) budget for the entire collection of on-chip memory is extremely low, these trends stress the increasing importance of memory testing in microprocessors today and in the future [33].
This paper describes the methodology used industrially in high volume manufacturing (HVM) testing to develop cache tests in modern microprocessors. This paper also identifies some important memory tests and evaluates their effectiveness. In addition, it discusses the application of the test development methodology in a test experiment performed at Intel to test their 512-kB caches manufactured in a 0.13-μm technology [22]. This paper is organized as follows. Section II discusses the methodology of cache test development in a current HVM testing environment. The test development process is divided into four main steps: maximal test set generation, test application, test optimization, and, finally, optimal test set generation. Section III discusses the first step in this methodology, where the maximal test set is generated. Test application (second step) is presented in Section IV, which analyzes the results of a test experiment performed on caches embedded on Intel microprocessors. Section V shows how to perform the third step of test optimization based on feedback from the test application step. The optimal test set (fourth step) is analyzed in Section VI. Finally, Section VII ends with the conclusion.
II. INDUSTRIAL CACHE TESTING
The flow of microprocessor testing is complex and time consuming. Fig. 1 shows a typical representation of such a test flow [14]. The wafer test, where chips are tested on wafer before dicing and packaging, consists fully of structural testing, where internal functional blocks (or structures) are tested separately rather than testing the chip as a whole system through its input/output (I/O) pins. In this stage, the cache is tested using fault localizing tests with the objective of repairing failing cells with redundant ones.
The package test, which is performed on individual packaged components, is the second phase of the test flow, and consists of three stages: burn-in, structural testing, and, finally, functional testing. The burn-in stage thermally stresses the components in order to sensitize early-life reliability problems. The structural testing stage is similar to the one performed in the wafer test phase of the test flow. The cache is tested again in this stage using fault detecting tests with the objective of eliminating faulty chips from the flow. The final stage of the test flow is the functional testing stage, where the whole chip is tested at-speed through its I/O pins to ensure the functionality of the processor, and to sort different chips according to their speed (so-called speed binning).
The specific test flow of on-chip cache testing shares the same general characteristics with the flow of stand-alone memories [6]. Fig. 2 shows the relative breakdown of test time between different processor components for the Pentium 4 processor [26]. Cache testing consumes 53.3% of the total test time, and thereby represents the largest portion of the test flow. Second on the list is logic testing, which consumes 27.0% of the total test time. The remaining 19.7% of the test time is distributed among I/O testing, parametric testing, and other tests.
This paper is mainly concerned with the process of test development and application of fault detecting tests for caches, applied during the structural testing stage of the package test, as shown in Fig. 1 . The test development process described here is also very similar to the one used for fault localizing cache tests implemented in the wafer test phase. The cache test development process is based on the kitchen sink principle, as depicted in Fig. 3 . The test development process starts out with a maximal set of cache tests that ensures detecting all possible faults in the memory. In general, there are more than 50 test algorithms implemented in this stage [33] . This set is not optimal and takes an excessively long test time to complete. All of these tests are applied to a large sample of microprocessor caches, as part of the manufacturing test set used to test microprocessor chips on the production line in the fab. Based on the feedback from this analysis, it is possible to optimize the maximal test set to construct an optimal test set that is particularly suited to the cache under test.
III. MAXIMAL CACHE TEST SET
This section describes the construction and content of the maximal cache test set used to kick off the process of constructing an optimal industrial test set, as shown in Fig. 3. The test set is constructed from a number of different sources: 1) some well-known traditional tests; 2) tests developed specifically for the cache under test; and 3) theoretically derived tests using the fault primitive (FP) analysis [5]. Typically, there are more than 50 tests in the maximal test set [33], but to keep the discussion simple, this paper discusses the test development process for a subset of the maximal test set, referred to here as the base tests (BTs). An overview of the used BTs is given in Section III-A, while the stresses needed during test application are presented in Section III-B.

Table I lists the used BTs along with their test length (TL), where n denotes the number of bits in the cache, C the number of columns, and R the number of rows. The used march notation is explained as follows [30]. A complete march test is delimited by the "{...}" bracket pair, while a march element is delimited by the "(...)" bracket pair. March elements are separated by semicolons, and the operations within a march element are separated by commas. Note that all operations of a march element are performed at a certain address before proceeding to the next address. The latter can be done in either an increasing (⇑) or a decreasing (⇓) address order. When the address order is not relevant, the symbol ⇕ is used.
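The march notation can be made concrete with a small simulation. The following is a minimal sketch, not the tester implementation used in the experiment: it encodes the well-known MATS+ test, {⇕(w0); ⇑(r0,w1); ⇓(r1,w0)}, and runs it on a hypothetical bit-addressable memory model. The `Memory` class, the stuck-at fault injection, and the string encoding of address orders ("any", "up", "down") are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# A march element: (address order, operations), e.g. ("up", ["r0", "w1"]).
MarchElement = Tuple[str, Sequence[str]]

# MATS+ in march notation: {any(w0); up(r0,w1); down(r1,w0)}
MATS_PLUS: List[MarchElement] = [
    ("any", ["w0"]),
    ("up", ["r0", "w1"]),
    ("down", ["r1", "w0"]),
]


class Memory:
    """Hypothetical n-bit memory model with optional stuck-at cells."""

    def __init__(self, n: int, stuck_at: Optional[Dict[int, int]] = None):
        self.cells = [0] * n
        self.stuck_at = dict(stuck_at or {})  # address -> forced value
        for addr, value in self.stuck_at.items():
            self.cells[addr] = value

    def write(self, addr: int, value: int) -> None:
        # A stuck-at cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr: int) -> int:
        return self.cells[addr]


def run_march(test: Sequence[MarchElement], mem: Memory) -> bool:
    """Return True if any read returns an unexpected value (fault detected)."""
    n = len(mem.cells)
    for order, ops in test:
        # "any" is mapped to the increasing order here (an arbitrary choice).
        addresses = range(n - 1, -1, -1) if order == "down" else range(n)
        for addr in addresses:
            # All operations of an element are applied to one address
            # before moving on to the next address.
            for op in ops:
                kind, value = op[0], int(op[1])
                if kind == "w":
                    mem.write(addr, value)
                elif mem.read(addr) != value:
                    return True
    return False
```

In this sketch, `run_march(MATS_PLUS, Memory(16))` reports no fault, while a cell stuck at 1 or at 0 is flagged by the r0 or r1 read, respectively.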
A. Overview of Used BTs
As mentioned previously, the set of used BTs consists of three FP-based (theoretically derived) BTs and 12 well-known traditional BTs. The BTs with the most promising fault coverage and unique fault detection are discussed here [1], [2], [7], [17], [29], [31], [32].
1) FP-Based BTs:
The FP-based BTs consist of three march tests listed in the first block of Table I.
• March SS [20] to target all simple static memory faults. Static faults are faults sensitized by performing at most one operation (e.g., the state of the cell is always stuck at one, or a read operation to a certain cell causes that cell to flip). Simple faults are faults that cannot influence the behavior of each other. That means that the behavior of a simple fault cannot change the behavior of another one, and therefore masking cannot occur.
• March RAW [19] to target some dynamic faults. Dynamic faults are faults that can only be sensitized by performing more than one operation sequentially (e.g., two successive read operations cause the cell to flip; however, if only one read operation is performed, the cell will not flip [3], [11]). March RAW is designed to target dynamic faults caused by read-after-write operations, which have been observed in real designs [19].
• March SL [21] to target all simple linked faults. Linked faults are faults that do influence the behavior of each other [4], [28], [31]. That means that the behavior of a certain fault can change the behavior of another, such that masking can occur.
2) Traditional BTs:
A set of 12 well-known BTs has been selected, with the most promising fault coverage and unique faults detected. These BTs are listed in the second block of Table I. For Hammer, the notation used in Table I means that the write 1 operation is performed 10 times successively to the same cell. Two versions of the Galpat and of the Walking 1/0 tests are used, each with a nonlinear complexity (see the TL column of Table I). As an example, the read operation in GalColumn is restricted to only the cells in the same column as the base cell, instead of galloping throughout the whole memory. In these tests, the base-cell notation means to go through all the bits of the memory in an incrementing fashion, while considering the current cell as the base cell. For GalRow, the row notation means to apply a read 0 (r0) operation in an incrementing order to the cells of the row of the base cell, and to apply a read 1 (r1) operation to the base cell after each r0 operation. A similar explanation applies to the column notation in GalColumn. Similarly, for WalkRow and WalkColumn, the row (column) notation means to apply the indicated operation using an incrementing address order to the row (column) of the base cell, while skipping the base cell.
B. Used Stresses
Each BT has to be applied using several different stress combinations (SCs). An SC specifies the way the test is performed and, therefore, it influences the sequence and/or the type of the memory operations. The used SCs are the addressing directions and the data-backgrounds.
The used addressing directions consist of fx and fy:
• "Fast x" (fx): "Fast x" addressing is simply incrementing or decrementing the address in such a way that each step goes to the next row.
• "Fast y" (fy): "Fast y" addressing is simply incrementing or decrementing the address in such a way that each step goes to the next column.

Table II lists the 61 tests applied at both high voltage and low voltage. A test consists of a BT (i.e., test algorithm) applied using a particular SC. The total number of tests is, therefore, the sum over the 15 BTs of the corresponding number of SCs (#SC), multiplied by the two voltages (high and low), a total of 61 × 2 = 122 tests. The column "TT/SC" in Table II gives the test time, in milliseconds (ms), of each BT using a single SC for the tested chip. To calculate the test time per BT, "TT/SC" has to be multiplied by "#SC" and by two (high and low voltage). The total test time of all tests is 160.942 ms/chip, where the four nonlinear BTs consume about 43% of the total test time.

In Table II, the solid, checkerboard, column stripe, and row stripe data-backgrounds are denoted as "s," "c," "cs," and "rs," respectively. The different addressing orders are denoted as "fx" and "fy." Table II marks, for each BT, which SCs are applied and which are not (e.g., WalkRow is used with one fast addressing direction and the s (solid) data-background). Due to test time constraints, only a subset of SCs has been selected for the traditional BTs, while all SCs have been implemented for the FP-based BTs, as shown in Table II. The impact of SCs on the coverage of traditional BTs is not analyzed further here, since it has already been studied in detail and published many times in the literature [1], [2], [7], [17], [29], [31], [32].
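The stress combinations can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the function names `fast_x`, `fast_y`, and `background` are hypothetical, the array is modeled as `rows` × `cols` cells addressed by (row, column), and the data-background patterns follow the common convention (checkerboard alternates in both directions, column stripe alternates per column, row stripe alternates per row), which the text itself does not spell out.

```python
from typing import Callable, Dict, List, Tuple


def fast_x(rows: int, cols: int) -> List[Tuple[int, int]]:
    """Address order in which each step goes to the next row."""
    return [(r, c) for c in range(cols) for r in range(rows)]


def fast_y(rows: int, cols: int) -> List[Tuple[int, int]]:
    """Address order in which each step goes to the next column."""
    return [(r, c) for r in range(rows) for c in range(cols)]


# Data-background generators keyed by the abbreviations used in the text.
# The exact bit patterns are an assumption based on common usage.
BACKGROUNDS: Dict[str, Callable[[int, int], int]] = {
    "s": lambda r, c: 0,             # solid
    "c": lambda r, c: (r + c) % 2,   # checkerboard
    "cs": lambda r, c: c % 2,        # column stripe
    "rs": lambda r, c: r % 2,        # row stripe
}


def background(name: str, rows: int, cols: int) -> List[List[int]]:
    """Return the initial data pattern for one data-background."""
    pattern = BACKGROUNDS[name]
    return [[pattern(r, c) for c in range(cols)] for r in range(rows)]
```

Decreasing address orders are obtained by reversing the returned lists. A stress combination is then a pair such as (`fast_x`, "s"), and applying a BT under every selected pair at both voltages gives the number of tests.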
IV. TEST APPLICATION RESULTS
The second step in the cache test development process is "test application," as indicated by the shaded block in Fig. 4. This section presents the results of running the tests in Table I on a large number of Intel 512-kB caches. The exact number of caches tested is not given for confidentiality reasons.
All SCs have been implemented at two different voltage levels: high voltage (HVcc) and low voltage (LVcc). These voltages are generated externally by the tester and applied at the inputs of the microprocessor. Testing of the 512-kB caches resulted in the following.
• HVcc testing: 1545 chips failed, of which 1343 chips failed all 61 tests, and 202 chips failed only some tests.
• LVcc testing: 1543 chips failed, of which 1320 chips failed all tests, and 223 chips failed only some tests.

From now on, this paper concentrates only on the chips that did not fail all tests, since they are the most interesting ones for further study and analysis. Fig. 5 shows a Venn diagram of the influence of the voltage levels on detectable faults, as derived from the database of the test results. The total number of devices found to be faulty is 254. The fault coverage (FC) at HVcc testing is 202 out of 254, while the FC at LVcc testing is 223 out of 254. Note that 171 faults are detected at both LVcc and HVcc. In addition, 52 faults are detected at LVcc only, while 31 faults are detected at HVcc only. This clearly demonstrates the necessity of testing at both voltages in order to achieve a good FC. Low voltage testing is important for detecting faults caused by resistive bridges [13], [16], [29], while high voltage testing is important for detecting resistive open defects [9]-[11].
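The Venn-diagram figures follow from simple inclusion-exclusion on the reported counts. The short sketch below recomputes them; the function name `venn_counts` and the dictionary keys are illustrative choices, while the input numbers (202, 223, and 171) are the ones reported in the text.

```python
from typing import Dict


def venn_counts(fc_a: int, fc_b: int, both: int) -> Dict[str, int]:
    """Split two overlapping fault counts into exclusive regions."""
    only_a = fc_a - both
    only_b = fc_b - both
    return {
        "only_a": only_a,
        "only_b": only_b,
        "both": both,
        "total": only_a + only_b + both,  # inclusion-exclusion
    }


# HVcc detects 202 faulty chips, LVcc detects 223, and 171 are common.
counts = venn_counts(fc_a=202, fc_b=223, both=171)
hvcc_coverage = 202 / counts["total"]
lvcc_coverage = 223 / counts["total"]
```

With these inputs, the HVcc-only and LVcc-only regions contain 31 and 52 chips, and the total is 254, so neither voltage alone reaches the full coverage.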
The FC of a BT is defined as the union of the fault coverages of its corresponding SCs. A die belongs to the union (i.e., is considered detected by a BT) if at least one SC of that BT detects the die to be faulty. For example, MATS+ is implemented using fx-s (i.e., "fast x" and solid data-background) and fy-s. The fault is considered detected if at least one of the two MATS+ tests detects the fault (see Table II). Table III shows the unions and the intersections of the 15 BTs for HVcc, while Table IV shows the results for LVcc. A die belongs to the union of two BTs if at least one of the two BTs finds the die to be faulty, and belongs to the intersection of two BTs if both BTs find the die to be faulty. The first column in each table gives the BT number, while the second column gives the name of the BT. The column "FC" lists the fault coverage of the corresponding BT, and the column "UFs" gives the number of unique faults (UFs) each BT detects. Unique faults are faults that are detected by only a single BT. The union and the intersection of each pair of BTs are shown in the rest of the tables. The numbers on the diagonal give the FC of the BTs, which are also listed in the column "FC" (for example, the diagonal entry of March SS in Table III is its FC at HVcc). The part above the main diagonal shows the union for each BT pair, while the part under the diagonal lists the intersection of each BT pair (for example, at HVcc the union of March C- and PMOVI is 185 and their intersection is 179). Based on the two tables and the Venn diagram, one can conclude the following.
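The construction of such tables from raw tester data can be sketched as follows. This is a minimal illustration, assuming that the test results are available as one set of failing die identifiers per BT (already merged over its SCs); the function names and the toy data are hypothetical and are not taken from the experiment.

```python
from itertools import combinations
from typing import Dict, Hashable, Set, Tuple

Fails = Dict[str, Set[Hashable]]


def bt_fails(sc_fails: Dict[str, Set[Hashable]]) -> Set[Hashable]:
    """FC set of a BT: union of the failing dies over all of its SCs."""
    return set().union(*sc_fails.values())


def coverage_tables(fails: Fails):
    """Compute FC, pairwise union/intersection sizes, and unique faults."""
    names = list(fails)
    fc = {bt: len(dies) for bt, dies in fails.items()}
    unions: Dict[Tuple[str, str], int] = {}
    inters: Dict[Tuple[str, str], int] = {}
    for a, b in combinations(names, 2):
        unions[(a, b)] = len(fails[a] | fails[b])
        inters[(a, b)] = len(fails[a] & fails[b])
    unique = {}
    for bt in names:
        # Dies detected by any other BT are not unique to this BT.
        others = set().union(*(fails[o] for o in names if o != bt))
        unique[bt] = len(fails[bt] - others)
    return fc, unions, inters, unique


# Toy data with hypothetical die identifiers.
toy = {"BT_A": {1, 2, 3, 4}, "BT_B": {2, 3, 5}, "BT_C": {3}}
fc, unions, inters, unique = coverage_tables(toy)
```

For the toy data, BT_A and BT_B have a union of five dies and an intersection of two, and they detect two and one unique faults, respectively.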
A. HVcc Testing
1) The total number of faulty chips detected is 202.
2) The best BTs, in terms of FC, are March SL and March G (with equal FC), followed by March SS and March RAW (with equal FC), and then March C- (see Table III for the FC values).
3) There are 12 unique faults, detected with four tests. These are listed in Table V, together with their FC and the number of unique faults (# UFs) each BT detects.
4) The best union pair in terms of the FC is 195, achieved with GalRow and March G, and with GalRow and March SL (see Table III).
B. LVcc Testing
1) The total number of faulty chips detected is 223.
2) The best BTs, in terms of FC, are March C-, followed by March SL, and then March SS and March RAW (with equal FC); see Table IV for the FC values.
3) There are no unique faults detected at LVcc testing.
4) The best union pair in terms of the FC is 220, achieved with March C- and March RAW (see Table IV).

It is important to note here that the three FP-based BTs (i.e., March SS, March SL, and March RAW) score very high for both HVcc and LVcc testing.
Using Tables III and IV , it is possible to determine BTs detecting supersets of faults in comparison with other BTs in this experiment. For example, GalColumn detects a superset of WalkColumn at HVcc testing (see Table III ). This is because the intersection of the two tests is 160 (which is the FC of WalkColumn), and their union is 164 (which is the FC of GalColumn). Keep in mind that in this experiment the number of stresses used with each BT is not the same for all BTs, see Table II . Determining the BTs detecting supersets allows for deriving a reduced set of BTs that has the same FC as the initial test set (see Table I ). The reduced set is given in Table VI ; it consists of nine BTs for HVcc as well as for LVcc, where eight BTs are common BTs.
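The superset criterion can be checked directly from the table entries, or from the underlying die sets when they are available. The sketch below shows both forms; the function names and the toy die sets are hypothetical, and the pruning routine is one simple way to derive a reduced set, not necessarily the procedure used to build Table VI.

```python
from typing import Dict, Hashable, List, Set


def detects_superset(fc_a: int, fc_b: int, union_ab: int, inter_ab: int) -> bool:
    """BT A covers BT B if A∩B equals FC(B) and A∪B equals FC(A)."""
    return inter_ab == fc_b and union_ab == fc_a


def reduce_by_supersets(fails: Dict[str, Set[Hashable]]) -> List[str]:
    """Drop every BT whose fault set is contained in a kept BT's set."""
    kept: List[str] = []
    # Visit larger fault sets first so a subset always meets its superset.
    for bt in sorted(fails, key=lambda b: len(fails[b]), reverse=True):
        if not any(fails[bt] <= fails[k] for k in kept):
            kept.append(bt)
    return kept


# GalColumn (FC 164) versus WalkColumn (FC 160) at HVcc, as in the text.
gal_covers_walk = detects_superset(fc_a=164, fc_b=160, union_ab=164, inter_ab=160)

# Toy die sets with hypothetical identifiers.
toy = {"Gal_like": {1, 2, 3, 4}, "Walk_like": {1, 2, 3}, "Other": {5}}
reduced = reduce_by_supersets(toy)
```

Because a dropped BT only ever detects dies that a kept BT also detects, the union of the reduced set equals the union of the full set.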
V. TEST OPTIMIZATION
The third step in the cache test development process is "test optimization," as shown in the shaded block of Fig. 6 , where the maximal test set is reduced based on the FC feedback from the test application step. In the following, the different BTs are analyzed and compared with each other first and then the impact of stress combinations (SCs) is analyzed. 
A. Analysis of BTs
Here, the FC of the three FP-based BTs (i.e., March SS, March SL, and March RAW, denoted as FP-BTs) is evaluated and compared with the FC of the other 12 BTs. One useful way to do that is to calculate the union of the FC of the FP-BTs and compare it with the union of the FC of the other 12 BTs. Fig. 7(a) shows the Venn diagram of the FC union of the three FP-BTs as compared with the 12 traditional BTs (see Table II). The total FC is 202. Fig. 7(a) shows that 188 out of 202 faults can be detected with the FP-BTs only, while the other 12 BTs detect 200 out of the 202 faults. There are 14 faults that are not covered by the FP-BTs; 11 of them are unique faults (see Table V). Note that the total number of UFs is 12, and that March SL (an FP-BT) detects one of them.
1) Analysis of HVcc Testing:
Consider now the set of the three BTs shown in Table V that detect UFs at HVcc (March SL is excluded), and let "H-UF-BTs" denote this set of BTs. The analysis of the FC of H-UF-BTs reveals that the union of their FC is 198 out of 202 faults, as shown in Fig. 7(b). In addition, the union of H-UF-BTs with the FP-BTs achieves 100% FC (i.e., 202 out of 202). Note that 188 out of 202 faults are covered by the FP-BTs, and that the latter detect 4 faults that are missed by H-UF-BTs. Thus, the FC achieved with the initial test set of 15 BTs can also be achieved with a short test set consisting of six BTs: three FP-BTs and three H-UF-BTs.
Any fault detected with FP-BTs can (probably) be explained since these BTs target well-known predefined faults. However, most detected UFs (by empirical tests) cannot be explained with the well-known fault models. This means that additional faults exist which still should be modeled. The detected UFs call for a detailed analysis in order to understand the defect mechanisms behind them. A deep understanding of the defect mechanisms and their faulty behavior will allow for modeling the faults and for introducing shorter/optimal BTs that cover such faults.
2) Analysis of LVcc Testing:
Fig. 8 shows the Venn diagram of the FC of the three FP-BTs, as compared with the remaining 12 BTs at LVcc testing. All faults detected by the FP-BTs are also detected by the union of the other 12 BTs; these consist of 213 faults out of 223 (i.e., 95.51%).
As shown in Section IV, there are no BTs detecting UFs at LVcc (see Table V). The question now is which faults are missed by the FP-BTs, and which BTs (from the initial BT set) have to be added to the FP-BTs in order to achieve the complete FC (i.e., 223/223). A detailed analysis showed that at least Hammer should be added. The next question is then which kind of faults Hammer detects, and how they can be modeled. These questions still remain to be worked out.
Based on the previous analysis, one can derive an optimal set of BTs detecting all faults at HVcc, as well as at LVcc (see Table VII). Testing at HVcc requires 6 BTs and testing at LVcc requires 4 BTs; 4 BTs are common. Inspecting the table reveals that some of the BTs are empirical tests (e.g., GalRow, Hammer), not designed to target well-defined fault models. Such tests detect faults that cannot be explained with well-known fault models, and that still remain to be understood and modeled. This will allow for developing low-cost fault model-based tests.
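When per-BT die sets are available, a small test set with full coverage can also be searched for automatically. The sketch below uses the standard greedy set-cover heuristic; this is a swapped-in technique for illustration, not the analysis performed in this paper, and it does not guarantee a minimum-size result. The function name and the toy data are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def greedy_select(fails: Dict[str, Set[Hashable]]) -> List[str]:
    """Pick BTs one at a time, each adding the most still-undetected dies."""
    target = set().union(*fails.values())
    covered: Set[Hashable] = set()
    chosen: List[str] = []
    while covered != target:
        # The BT with the largest gain; ties go to the first BT listed.
        best = max(fails, key=lambda bt: len(fails[bt] - covered))
        chosen.append(best)
        covered |= fails[best]
    return chosen


# Toy die sets with hypothetical identifiers; "D" detects a unique fault.
toy = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4}, "D": {5}}
selection = greedy_select(toy)
```

Any BT that detects a unique fault necessarily ends up in the selection, which mirrors the role of the BTs detecting UFs in the analysis above.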
B. Impact of SCs on BTs
In order to identify the best SCs needed to maximize the FC of the three FP-BTs, the impact of the SCs on these three tests is discussed here. The results of a detailed analysis of the SCs are summarized in Table VIII , where the FC of each SC is listed along with the three FP-BTs. Table VIII also lists the minimal number of SCs to be used with each of the three FP-BTs in order to achieve 100% FC. The minimal SCs that have to be 
VI. OPTIMAL TEST SET AND SCS
The last step in the cache test development process is the formulation of the optimal test set to be used for high volume manufacturing (HVM) testing of the memories, as represented by the shaded block in Fig. 9 .
It has been shown in Section V-A that in order to achieve the same FC as that of the initial 15 BTs (with a total of 122 SCs), only a minimal set of six BTs is required (see Table VI). In order to get an idea of the impact of selecting appropriate SCs on the overall test time while keeping the same FC, the minimal number of SCs that have to be used with the minimal test set (i.e., six BTs) is presented next. Table IX gives the SCs that need to be used with each of the six BTs. The column "TT/SC" lists the test time of each BT per SC. The column "#SC" gives the number of SCs each BT requires at HVcc and LVcc. For example, March SS has to be used with 2 SCs at HVcc and 5 SCs at LVcc. An "HL" in Table IX denotes that the SC is used both at HVcc and LVcc, an "L" that it is used only at LVcc, and an "H" that it is used only at HVcc; Table IX also marks the SCs that are not used. For example, Hammer is used with only a single SC, at both HVcc and LVcc. The minimal number of SCs required to achieve the FC achieved with the initial 122 SCs is only 26: 12 SCs at HVcc and 14 SCs at LVcc. Note that Scan was initially used with 4 SCs at HVcc and at LVcc. However, the analysis of the impact of the stresses on the FC at HVcc showed that only three SCs are required in order to achieve the same FC. At LVcc, Scan is not required (see also Table VII). The required test time for the initial test set was 160.942 ms/chip; with the optimal test set, the required test time is just 30.498 ms/chip (i.e., a reduction factor of 5.3).
The previous analysis clearly indicates the importance of test optimization for the overall test time reduction. Optimizing the test set means, in addition to selecting appropriate BTs, also selecting the minimal number of SCs that have to be associated with each BT in order to achieve the maximal FC.
VII. CONCLUSION
This paper presented the process of test set development for on-chip caches of Intel microprocessors, based on the kitchen sink principle. There are four main steps to develop an optimized test set: maximal test set generation, test application, test optimization, and optimal test set generation. This paper also discussed an example of implementing this process in a high volume manufacturing environment. The example shows how to optimize the test time of an initial test set of 15 base tests, each with up to 16 stress combinations, resulting in a total of 122 tests. Test set optimization resulted in a minimal set of only six base tests instead of 15. In addition, the test time has been reduced from a maximum of 160.942 ms/chip to an optimal test time of just 30.498 ms/chip (a reduction factor of about 5.3).
