Introduction
A hard error is caused by a permanent physical defect, whereas a soft error usually refers to a transient one caused by thermal neutrons, high-energy neutrons, or α particles. May first discovered that α particles emitted from radioactive substances caused soft errors in DRAM modules [3]. As the feature size of integrated circuits has reached the nanometer scale, transistors have become more sensitive to soft errors. Soft error estimation and highly reliable design have therefore become a primary concern in mission-critical systems as well as in consumer products.
Recently, several techniques for estimating soft errors have been proposed. In [1, 5, 11], soft error simulation with fault injection is discussed. In [1, 4], a soft error estimation technique for microprocessors is discussed. Soft error simulation in logic circuits is also being studied and developed [7] [8] [9] [10]. The structure of memory modules is so regular and monotonous that their soft errors are comparatively easy to estimate: they can be calculated from the soft error rates (SERs) obtained by field or accelerated tests. These SERs become pessimistic, however, when the memory modules are embedded into computer systems. More specifically, every soft error that occurs in a memory module is regarded as a critical error when the module is under a field or accelerated test. This implicitly assumes that every soft error in a memory cell of a module makes the computer system faulty. Since memory modules are used spatially and temporally in computer systems, that is, only part of the memory is in use at any given time, not every soft error in the memory modules makes the computer system faulty. Therefore, the soft errors in an entire computer system should be estimated in a different way from the one used for memory modules.
Accurate soft error estimation of an entire computer system is a matter of urgent concern. This paper discusses an accurate soft error estimation technique which can be adopted early in the development of computer systems. To estimate the reliability of computer systems accurately, it is necessary to consider their dynamic behavior.
The duration for which a memory module or a logic circuit is used depends on the behavior of the computer system as programmed by its software, and it affects the reliability of the system. Without knowing this behavior, it is hard to estimate whether or not a soft error in memory is critical to the computer system. The complicated structure of the hierarchical memory modules used in computer systems makes soft error estimation even harder. In such a computer system, it is hard to judge which part of the memories is actively used to operate the system. Our soft error estimation technique computes the number of soft errors on all memory modules of the computer system by cycle-accurate instruction-set simulation (ISS), taking their dynamic behavior into account. To the best of our knowledge, this is the first study to model the dynamic behavior of computer systems in order to estimate accurately the number of soft errors during the execution of a program.
The remainder of this paper is organized as follows. In Section 2, we review a model to estimate the number of soft errors on a word item in a memory hierarchy. In Section 3, a soft error estimation model for instruction memory is discussed. This model is simple because instruction memory is basically read-only. In Section 4, a soft error estimation model for data memory is discussed. This model is rather complicated because data memory is readable as well as writable. In Section 5, a simulation-based soft error estimation algorithm for computer systems is discussed. Some experimental results are shown in Section 6. Section 7 concludes this paper with a summary.
Soft Errors on a Data Item
Unlike those of memory components, the soft error rates (SERs) of computer systems vary from moment to moment because computer systems use memory modules spatially and temporally. Since only the active part of the memory modules affects the reliability of a computer system, it is essential to identify that active part in order to estimate accurately the number of soft errors occurring in the system. A universal soft error index other than an SER is necessary to estimate the reliability of computer systems, because an SER is a reliability index suited to components of regular and monotonous structure, such as memory modules, but not to computer systems. In this paper, the number of soft errors that occur while running a program is adopted as the soft error index for computer systems.
In computer systems, a word item is a basic element for computation in CPUs. A word item is an instruction item in an instruction memory and a data item in a data memory. A collection of word items must be processed in order to run a program. We consider the reliability of processing all word items as the reliability of a computer system. The total number of soft errors which are expected to occur on all the word items is regarded as the number of soft errors of the computer system. This section discusses an estimation model for the number of soft errors on a word item.
A CPU-centric computer system typically has a hierarchical structure of memory modules which includes a register file, cache memory modules, and main memory modules. The computer system that we target has $N_{mem}$ levels of memory modules, $M_1, M_2, \cdots, M_{N_{mem}}$, in order of accessibility from/to the CPU. In the hierarchical memory system, instruction items are generally processed as follows.
1. Instruction items are generated by a compiler and loaded into a main memory. The birth time of an instruction item is the time when the instruction item is loaded into the main memory, from the viewpoint of a program execution.
2. When an instruction item is required by the CPU, the CPU fetches the instruction item from the memory module closest to the CPU. The instruction item is duplicated into all levels of memory modules which reside between the CPU and the master memory module.
Note that instruction items are basically read-only. Duplication of instruction items is made unidirectionally from a low level of memory module to a high level of memory module. Data items in data memory are processed as follows.
1. Some data items are given as initial values of a program when the program is generated with a compiler. The birth time of such a data item is the time when the program is loaded into a main memory. The other data items are generated during execution of the program by the CPU. The birth time of a data item which is made on-line is the time when the data item is made and saved to the register file.
2. When a data item is required by a CPU, the CPU fetches the data item from the memory module closest to the CPU. If a write-allocate policy is adopted, the data item is duplicated at all levels of memory modules which reside between the CPU and the master memory module; otherwise it is not duplicated at the intermediate memory modules.
Note that data items are readable as well as writable. This means that data items can be copied from a high level of a memory module to a low level of a memory module, and vice versa. In CPU-centric computer systems, data items are utilized as constituent elements. 
Word item $d$ must be retained in Memory Module $M_i$ for the time $valid\_time_{M_i}(d)$ in order to be transferred to the CPU. The number of soft errors, $error_{system}(d)$, which occur from the birth time to the time when the CPU fetches the word item is given as
where $valid\_time_{M_i}(d)$ is the necessary and minimal time to transfer the word item from the master memory module to the CPU, and depends on the memory architecture. This kind of retention time is obtained exactly by cycle-accurate simulation of the computer system.
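As a rough illustration of how the valid retention times could be turned into a number of soft errors, the sketch below assumes that each memory module has a constant per-word soft error rate per cycle and that the errors on a word item accumulate as rate times valid time, summed over the modules. The function name, the dictionary layout, and the numeric rates are hypothetical and chosen only for illustration.

```python
def soft_errors_on_word_item(valid_time, ser_per_cycle):
    """Expected number of soft errors on one word item.

    valid_time:    cycles the item must be retained in each module,
                   keyed by module name (assumed layout)
    ser_per_cycle: assumed per-word soft error rate per cycle of each module
    """
    # Sum of (rate x valid retention time) over all memory modules.
    return sum(ser_per_cycle[module] * cycles
               for module, cycles in valid_time.items())


# Illustrative numbers only; these are not measured rates.
valid_time = {"main": 1000, "L2": 20, "L1": 2}
ser_per_cycle = {"main": 2e-9, "L2": 1e-9, "L1": 1e-9}
errors = soft_errors_on_word_item(valid_time, ser_per_cycle)
```

In practice the valid times would come from the cycle-accurate simulation rather than from hand-written constants.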
Soft Errors in Instruction Memory
Each instruction item has its own lifetime while a program runs. The lifetimes of instruction items differ from one another and are not necessarily equal to the execution time of the program. Generally speaking, the birth time of an instruction item is the time when it is loaded into main memory, from the viewpoint of a program execution. It is necessary to identify which part of the retention time of an instruction item in a memory module affects the reliability of a computer system. Before the total number of soft errors in instruction memory is discussed, let us examine the number of soft errors on a single instruction item. The ith instruction fetch of a CPU for an instruction item which resides at Address a is denoted by $if(a, i)$. $if(a, 0)$ denotes the time when the instruction is loaded into the main memory. An example of some instruction fetches is shown in Figure 1. In this figure, the boxes show that copies of the instruction item reside in the corresponding memory modules. The labels on the boxes show when the copies of the instruction item are born. In this example, the instruction item is fetched to the CPU three times.
On the first instruction fetch for the instruction item, a copy of the instruction item exists in neither the L1 nor the L2 cache memory. The instruction item resides only in the main memory, so it must be transferred from the main memory to the CPU. While the instruction item is transferred to the CPU, copies of it are made in the L1 and the L2 cache memory modules. In this example, we assume that some latency is necessary to transfer the instruction item from the main memory module to the L1 cache, the L2 cache, and the register file. When the instruction item in a source memory module is fetched to the CPU, any soft errors which occur after the transfer of the instruction item has completed have no influence on the instruction fetch. In the figure, the boxes with slanting lines are the retention times whose soft errors make Instruction Fetch $if(a, 1)$ faulty. Whether soft errors during any other retention times make the computer system faulty is not yet known.
Proceedings of the 7th International Symposium on Quality Electronic Design (ISQED'06)
On the second instruction fetch for the instruction item, the instruction item resides only in the main memory, as on the first instruction fetch, and is again fetched from the main memory to the CPU. The dotted boxes are the retention times whose soft errors make Instruction Fetch $if(a, 2)$ faulty. Whether soft errors during any other retention times make the computer system faulty is not yet known on Instruction Fetch $if(a, 2)$. Note that the soft errors on the box with slanting lines in the main memory have already been counted on Instruction Fetch $if(a, 1)$ and are not counted again on Instruction Fetch $if(a, 2)$, in order to avoid counting soft errors twice.
On the third instruction fetch for the instruction item, the highest level of memory module which retains the instruction item is the L1 cache memory. 
The total number of soft errors in the computer system is shown as follows:
Given the program of the computer system, $valid\_time_{M_j}(i_i)$ can be obtained exactly by performing cycle-accurate simulation of the computer system.
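The rule that retention time already counted on an earlier fetch must not be counted again can be sketched as follows. This is a minimal sketch under stated assumptions: each fetch is described by the retention intervals (module, start, end) that lead up to it, each module has a constant per-word soft error rate per cycle, and a per-address, per-module marker `counted_until` records how far the counting has advanced. All names and rates are hypothetical.

```python
def count_fetch(counted_until, ser, address, intervals):
    """Add the soft errors that one instruction fetch makes influential.

    counted_until: dict (address, module) -> time up to which errors were counted
    ser:           assumed per-word soft error rate per cycle of each module
    intervals:     (module, start, end) retention intervals feeding this fetch
    """
    errors = 0.0
    for module, start, end in intervals:
        key = (address, module)
        # Skip the part of the interval that an earlier fetch already counted.
        start = max(start, counted_until.get(key, start))
        if end > start:
            errors += ser[module] * (end - start)
            counted_until[key] = end
    return errors


ser = {"main": 2.0, "L2": 1.0, "L1": 1.0}  # illustrative rates only
counted_until = {}
# First fetch: item waits in main memory, then passes through L2 and L1.
first = count_fetch(counted_until, ser, 0x100,
                    [("main", 0, 10), ("L2", 10, 12), ("L1", 12, 13)])
# Second fetch, again from main memory: only time after cycle 10 is new.
second = count_fetch(counted_until, ser, 0x100, [("main", 0, 30)])
```

Summing the return values over all fetches of all instruction items would give a total for instruction memory under these assumptions.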
Soft Errors in Data Memory
Data memory is readable as well as writable. It is more complex than instruction memory because word items are transferred bidirectionally between a high level of memory and a low level of memory. Some data items are given as input to a program and the other data items are born during the program execution. Some data items are used and the others are unused even though they reside in memory modules. Soft errors during some of the retention time of a data item are influential in a computer system. Soft errors during the rest of the retention time are not influential even if the data item is used by the CPU. A data item thus has valid and invalid parts of its retention time. It is quite important to distinguish the valid from the invalid retention time of a data item in order to estimate accurately the number of soft errors of a computer system. In this paper, the valid retention time is determined by the following rules.
• A data item which is generated on compilation is born when it is loaded into main memory.
• A data item given as input to a computer system is born when it is input to the computer system.
• A data item is born when the CPU issues a store instruction for the data item.
• A data item is valid at least until the time when the CPU loads the data item and uses it in its operation.
• A data item which a user specifies as a valid one is valid even if the CPU does not issue a load instruction for the data item.
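The rules above can be sketched for a single address as follows. The sketch assumes a time-ordered list of events per address, where "birth" stands for loading at program start or external input; the event names and the function are hypothetical. A data item that a user specifies as valid could be modeled by appending a synthetic "load" event at the time up to which it must remain valid.

```python
def valid_intervals(events):
    """Return (birth, last_use) intervals during which a value is valid.

    events: time-ordered (time, kind) pairs for one address,
            kind in {"birth", "store", "load"} (assumed encoding)
    """
    intervals = []
    birth = None      # birth time of the value currently held
    last_use = None   # time of the latest load of that value
    for time, kind in events:
        if kind in ("birth", "store"):
            # The previous value dies; keep it only if it was ever loaded.
            if birth is not None and last_use is not None:
                intervals.append((birth, last_use))
            birth, last_use = time, None
        elif kind == "load":
            if birth is not None:
                last_use = time
        else:
            raise ValueError(f"unknown event kind: {kind}")
    if birth is not None and last_use is not None:
        intervals.append((birth, last_use))
    return intervals


events = [(0, "store"), (5, "load"), (8, "load"),
          (10, "store"), (20, "store"), (25, "load")]
intervals = valid_intervals(events)
```

In this example the value stored at time 10 is overwritten before it is ever loaded, so its retention time contributes nothing.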
The bidirectional copies between high-level and low-level memory modules must be taken into account in data memories because data memories are readable as well as writable. According to [2], there are two basic options on a cache hit when writing to the cache:
• Write through: the information is written to both the block in the cache and to the block in the lower-level memory.
• Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
The write policy affects the estimated number of soft errors and should be taken into account.
Soft Error Model in a Write-Back System
A soft-error estimation model for write-back systems is discussed in this section. Let the ith store operation of a CPU at Address a be s(a, i) and the jth load operation at Address a be l(a, j). An example of the behavior of a write-back system is shown in Figure 2. Each box in the figure shows the existence of the data item in a memory module. The labels on the boxes show when the data items were born. In the example, two store operations and two load operations are executed. First, a store operation is executed and only the L1 cache is updated with the data item. Neither the L2 cache nor the main memory is updated by the store operation. A load operation on the data item which resides at Address a follows. The data item resides in the L1 cache memory and is transferred from the L1 cache to the CPU. On the issue of Load l(a, 1), the soft errors on the boxes with slanting lines are found to be influential in the reliability of the computer system. Whether the other boxes with Label s(a, 1) are influential in the reliability is not yet known. Next, the data item in the L1 cache is evicted to the L2 cache by another data item. The L2 cache memory becomes the highest level of memory which retains the data item. Next, a load operation l(a, 2) is issued and the data item is transferred from the L2 cache memory to the CPU. With the load operation l(a, 2), the soft errors on the dotted boxes are found to be influential in the reliability of the computer system. Soft errors on the white boxes labeled s(a, 2) are not counted on Load l(a, 2).
Soft Error Model in a Write-Through System
A soft-error estimation model for write-through systems is discussed in this section. An example of the behavior of a write-through system is shown in Figure 3. First, a store operation at Address a is issued. The write-through architecture makes multiple copies of the data item in the cache memories and the main memory. Next, a load operation follows. The CPU fetches the data item from the L1 cache, and soft errors on the boxes with slanting lines are found to be influential in the reliability of the computer system. Next, a store operation s(a, 2) comes. The previous data item at Address a is overwritten, and the white boxes labeled s(a, 1) are no longer influential in the reliability of the computer system. Next, the data item in the L1 cache is replaced with another data item. The L2 cache becomes the highest level of memory which has the data item of Address a. Next, a load operation l(a, 2) follows and the data item is transferred from the L2 cache to the CPU. With Load l(a, 2), soft errors on the dotted boxes are found to be influential in the reliability of the computer system.
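The difference between the two write policies at the moment of a store can be summarized in a few lines. The sketch below only captures which memory levels receive a fresh copy of the data item on a store that hits in the cache; the level names and the function are hypothetical, and eviction and write-back of dirty blocks are deliberately left out.

```python
LEVELS = ["L1", "L2", "main"]  # ordered from closest to the CPU (assumed hierarchy)


def levels_written_on_store(policy, levels=LEVELS):
    """Return the memory levels that get a new copy of the item on a store hit."""
    if policy == "write-through":
        # Every level down to main memory is updated.
        return list(levels)
    if policy == "write-back":
        # Only the cache closest to the CPU is updated.
        return [levels[0]]
    raise ValueError(f"unknown write policy: {policy}")
```

Under write-through, a store therefore restarts the birth time of the copies at every level, whereas under write-back the lower levels keep stale copies until the modified block is replaced.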
Simulation-Based Soft Error Estimation
As discussed in the previous sections, the retention time of a word item in memory modules needs to be obtained so that the number of soft errors in a computer system can be estimated. A cycle-accurate instruction-set simulation (ISS) is executed in order to obtain the retention time of a word item. A simplified algorithm to estimate the number of soft errors which occur while a computer system runs a program to completion is shown in Figure 4. The input to the algorithm is an instruction sequence, and the output from the algorithm is the accurate number of soft errors, $error_{system}$, which occur during program execution.
First, several variables are initialized. $error_{system}$ is initialized to 0. The birth times of all data items are initialized to the time at which the program starts. A for loop follows, in which a cycle-accurate ISS is executed. Each iteration of the loop corresponds to the execution of one instruction. The number of soft errors is counted for every instruction item and is added to $error_{system}$. Whenever $error_{system}$ is updated, the birth time of the corresponding word item is updated to the present time. Additional computation is done if the present instruction is a store or a load operation. If the instruction is a load operation, the number of soft errors on the data item which are found to be influential in the reliability of the computer system is added to $error_{system}$, and the birth time of the data item is updated to the present time. If the instruction is a store operation, the birth times of all word items changed by the store are updated to the present time. After the above procedure has been applied to all instructions, $error_{system}$ is output as the number of soft errors which occur during the program execution.
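The loop described above can be sketched in a strongly simplified form. The sketch assumes a single flat memory instead of a hierarchy, one constant per-word soft error rate per cycle, and a trace of (time, pc, kind, addr) tuples standing in for the cycle-accurate ISS; the trace format and all names are hypothetical and do not reproduce the algorithm of Figure 4.

```python
def estimate_soft_errors(trace, ser, start_time=0):
    """Accumulate soft errors over an instruction trace (flat-memory sketch).

    trace: time-ordered (time, pc, kind, addr) tuples, kind is "load",
           "store", or None for other instructions (assumed format)
    ser:   assumed per-word soft error rate per cycle
    """
    error_system = 0.0
    birth = {}  # word key -> time from which soft errors are not yet counted

    for time, pc, kind, addr in trace:
        # Instruction fetch: count errors on the instruction word since its birth.
        ikey = ("instr", pc)
        error_system += ser * (time - birth.get(ikey, start_time))
        birth[ikey] = time

        dkey = ("data", addr)
        if kind == "load":
            # Errors since the birth of the data item become influential.
            error_system += ser * (time - birth.get(dkey, start_time))
            birth[dkey] = time
        elif kind == "store":
            # The old value is overwritten; the new value is born now.
            birth[dkey] = time
    return error_system


trace = [(10, 0, None, None), (20, 4, "store", 100),
         (30, 8, "load", 100), (40, 0, None, None)]
total = estimate_soft_errors(trace, ser=1.0)
```

Extending the sketch to a memory hierarchy would require one birth time per word and per memory level, with cache fills, evictions, and the write policy updating them.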
Experiments and Results

Experimental Setup
We targeted a microprocessor-based system consisting of an ARM processor (ARMv4T, 200 MHz), an instruction cache module, a data cache module, and a main memory module, as shown in Figure 5. The cache line size and the number of cache sets are 32 bytes and 32, respectively. We adopted the Least Recently Used (LRU) policy as the cache replacement policy. We evaluated the reliability of computer systems with two write policies, write-through and write-back. The cell-upset rates of both SRAM and DRAM modules are shown in Table 1. We used the cell-upset rates shown in [6] as the cell-upset rates of non-ECC SRAMs and non-ECC DRAMs. We used three benchmark programs: Compress version 4.0 (Compress), JPEG encoder version 6b (JPEG), and MPEG2 encoder version 1.2 (MPEG2). We used the GNU C compiler and debugger to generate address traces. We chose to execute 100 million instructions in each benchmark program, which allowed the simulations to finish in a reasonable amount of time. All programs were compiled with the "-O3" option. Table 2 shows the code size in words, the active code size, and the active data size for each benchmark program. The active code and data sizes represent the numbers of different instruction and data addresses, respectively, which are accessed during the execution of 100 million instructions.
Experimental Results
Figure 6 shows the results of our soft error estimation method. Four different memory configurations were considered as follows:
1. non-ECC L1 cache memory and non-ECC main memory,
2. non-ECC L1 cache memory and ECC main memory,
3. ECC L1 cache memory and non-ECC main memory,
4. ECC L1 cache memory and ECC main memory.
The vertical axis represents the number of soft errors that occurred during the execution of 100 million instructions. The horizontal axis represents the number of ways in the data cache. The other cache parameters, i.e., the line size and the number of lines in a cache way, are unchanged. The size of the data cache is, therefore, proportional to the number of cache ways in this experiment. The cache sizes corresponding to the values shown on the horizontal axis are 1kB, 2kB, 4kB, 8kB, 16kB, 32kB, and 64kB, respectively.
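The relation between the number of ways and the cache size follows from the stated line size (32 bytes) and number of sets (32). The short check below assumes that the way counts on the horizontal axis are 1, 2, 4, ..., 64, which is inferred from the listed cache sizes and is not stated explicitly.

```python
LINE_SIZE_BYTES = 32  # cache line size from the experimental setup
NUM_SETS = 32         # number of cache sets from the experimental setup


def data_cache_size_bytes(ways):
    """Cache size when only the number of ways is varied."""
    return LINE_SIZE_BYTES * NUM_SETS * ways


WAYS = [1, 2, 4, 8, 16, 32, 64]  # inferred from the listed cache sizes
sizes_kb = [data_cache_size_bytes(w) // 1024 for w in WAYS]
```

Each way contributes 32 × 32 = 1024 bytes, so the size in kB equals the number of ways.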
The number of soft errors which occurred during a program execution depends on the reliability design of the memory hierarchy. When the cell-upset rate of SRAMs was higher than that of DRAMs, the soft errors in cache memories dominated the soft errors of the whole computer system. The number of soft errors in a computer system, therefore, increased as the size of the cache memories increased. In contrast, when the cell-upset rate of SRAMs was lower than that of DRAMs, the soft errors in main memories dominated the system soft errors. The number of soft errors in a computer system, therefore, increased as the size of the cache memories decreased, because larger cache memories reduced the runtime needed to finish a program. Table 3 shows the number of CPU cycles needed to finish executing the 100 million instructions of each program. Table 4 shows the results of more naive approaches which calculated the number of soft errors using Equations (6) and (7), where $S_{cache}$, $S_{code}$, $AS_{code}$, $AS_{data}$, $N_{cycle}$, $SER_S$, and $SER_D$ denote the cache size, the code size, the active code size, the active data size, the number of CPU cycles, the SER for SRAM per word per cycle, and the SER for DRAM per word per cycle, respectively. M1 and M2 in Table 4 correspond to the calculations using Equations (6) and (7), respectively. Our method corresponds to M3. It is obvious that the naive approaches resulted in a large overestimation of soft errors. This indicates that accumulating the soft error rates of every memory module in a system results in a pessimistic estimation.
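To illustrate why a static summation tends to overestimate, the sketch below assumes a naive form in which every counted SRAM and DRAM word is exposed to soft errors for all $N_{cycle}$ cycles. The function, its parameters, and the numbers are hypothetical; the choice of which sizes to sum is an assumption and is not claimed to match Equations (6) and (7).

```python
def naive_soft_errors(sram_words, dram_words, n_cycle, ser_s, ser_d):
    """Static estimate: every word counts for the whole run.

    sram_words, dram_words: number of words assumed to be exposed
    n_cycle:                number of CPU cycles of the run
    ser_s, ser_d:           SER per word per cycle for SRAM and DRAM
    """
    return (sram_words * ser_s + dram_words * ser_d) * n_cycle


# Illustrative numbers only; these are not the rates of Table 1.
estimate = naive_soft_errors(sram_words=256, dram_words=1000,
                             n_cycle=1_000_000, ser_s=1e-12, ser_d=2e-12)
```

Such an estimate ignores that a word is only influential between its birth and its use, which is the part of the retention time that the simulation-based method counts.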
Conclusion
This paper discussed a simulation-based soft error estimation technique which seeks the accurate number of soft errors which occur while a computer system runs a program to completion. Unlike for memory components, it is not meaningful to seek the static SER of a computer system in the conventional way, that is, by simple summation of the SERs of its memory modules. An important point to emphasize is that seeking the number of soft errors which occur while running a program is essential for accurate soft error estimation of computer systems. We estimated the accurate number of soft errors of computer systems based on the ARMv4T architecture. The experimental results showed the following facts.
• There was a great difference between the number of soft errors derived with our technique and that derived from the naive summations of the static SERs of memory modules. The behavior of computer systems must be taken into account for accurate reliability estimation.
• The SER of a computer system nominally increases when a larger cache memory is adopted because the SER is calculated by summing up the SERs of the memory modules utilized in the system. It was, however, found that the number of soft errors needed to finish a program was reduced with larger cache memories in the computer system which had an ECC L1 cache and a non-ECC main memory. This is because the soft errors in cache memories were negligible and the retention time of data items in the main memory was reduced by the performance improvement.
There are several issues with regard to the reliability of computer systems. Future work includes
• reliable design which determines the memory hierarchy of computer systems with our reliability estimation technique,
• design methodology which can treat a tradeoff between reliability and power consumption.
