Abstract-Recent and projected advances in VLSI fabrication technology will allow for integration of billions of transistors and advanced architectures on a single chip. According to the International Technology Roadmap for Semiconductors (ITRS), widespread reliability challenges are expected for these VLSI fabrication technologies (65nm and below). Effective and efficient on-chip fault-tolerance solutions are needed. A new approach of achieving on-chip fault-tolerance using built-in self-test (BIST) resources is proposed in this paper. The proposed approach reduces production cost, implementation overhead and time-to-market; increases reusability, post-fabrication reconfigurability and productivity; and is scalable across multiple VLSI processes and feature sizes. This will result in obvious advantages of yield enhancement and prolonged lifetime of VLSI chips as well.
I. INTRODUCTION
With the introduction of multiple changes and shrinking feature sizes in VLSI technologies, it has become possible to fabricate billion-transistor chips. Emerging paradigms utilizing these vast on-chip resources are system-on-chip and multi-core architectures [1] [2]. One of the major problems faced by architects of these systems is achieving effective on-chip fault-tolerance without sacrificing too much area, energy and performance.
To further elaborate on the problem, according to ITRS, widespread reliability challenges are expected in near term VLSI fabrication technologies (65nm and below) because of the evolutionary changes in scaling current materials and devices [3] [4] and revolutionary changes associated with new materials and devices. The introduction of multiple materials, processes and structural changes in a short period will increase the difficulty of understanding and controlling failure modes. Therefore, fault-tolerance should be considered a necessity, rather than as a feature.
Traditional redundant logic (like Triple Modular Redundancy [5]), arithmetic coding and algorithm-based fault tolerance (ABFT) approaches are limited in the type and number of faults [6] [7] they address, in addition to introducing hardwired performance and very high implementation overhead in designs (section IV). Traditional approaches require another fabrication run to cover more and different types of faults, adding to the production time and hence time to market, not to mention the total cost of the product. Thus, there is clearly a need for new fault tolerance techniques.
Our philosophy is to take unknown faults, less fault diagnosis and characterization time before production, and unknown failure rates as a specification for providing a fault-tolerance solution for future VLSI architectures. The above requirements translate into a programmable, flexible fault-tolerance solution. A fault-tolerance method with post-fabrication programmable fault-coverage (FC) and fault-type is provided by the proposed approach of utilizing on-chip built-in self-test (BIST) resources. These BIST resources are expected to be already present on VLSI chips because they are becoming standard in the production environment. The proposed approach, in addition to its various other advantages mentioned in section IV, reduces production cost while leaving the trade-off analysis of maximizing FC and minimizing performance overhead programmable until post-fabrication.
Section II describes the proposed approach of BIST resource utilization, section III explores the design space by giving analysis dimensions to an interested designer, section IV elaborates on the advantages of the proposed approach and finally section V concludes this paper.
II. PROPOSED APPROACH OF BIST UTILIZATION
The proposed approach is to perform fault-detection and isolation dynamically (during system operation) using available BIST resources. Our approach turns already present BIST resources into useful resources during a system's normal execution, i.e., BIST components gain more significance as dynamic resources in this new fault-tolerance paradigm. Our approach uses rollback and recovery mechanisms as an integral instrument to provide correct results in case of failures, and thus achieves fault tolerance. Rollback and recovery has been used before for achieving fault-tolerance at the software level of abstraction. The novelty of our approach is not merely using rollback and recovery at a hardware level, but giving BIST resources a new meaning of existence and more significance. Thus, rollback and recovery are just a facilitator in achieving this purpose. The following paragraphs delineate this proposed new fault-tolerance paradigm along with several of the variants that one may consider for effective realization.
1-4244-0173-9/06/$20.00 ©2006 IEEE.
A. Basic Concept
The basic idea of the proposed approach is to periodically test system entities for permanent failures using BIST resources and, in case of any failure, to recover using a checkpoint of the previous intact system state. This can be achieved by scheduling test and recover phases during the system's dynamic execution. After each test and recover phase, each system entity will commit its results and save a checkpoint, unless it fails the test or receives a recover message.
B. Concept of Grain
A grain is defined to be a block or collection of individual entities, which are rolled back and undergo recovery together as one unit upon finding any failure in that collection during a test and recover phase. Grains communicate with each other only after successfully committing their results (Fig. 1). Grain size can be smaller or greater than a whole chip depending on the particular system architecture and application. The determination of grain size and grain boundaries is crucial to minimize the overhead incurred in the recovery process.
(Fig. 2). A hierarchical configuration [8], which reduces wiring delay (due to long wires and a large number of driver transistors) to a logarithmic function of the number of transistors, can be used to reduce the effect of a large number of drivers. This circuit itself should be tested for permanent faults if redundancy is not provided for this small amount of logic.
2) Intergrain communication
By definition of grain, no intergrain communication is allowed until a particular grain commits its computation and communication. Two types of interface buffers have to be used for intergrain communication. Output interface buffers (Fig. 1) are required to save the intergrain communication for the current execution phase, and to forward their contents to the input interface buffers (Fig. 1) of neighbor grains upon successful commit of the current normal execution phase. These interface buffers can be implemented either as separate buffers or as part of the local memory of system entities. In the latter case, with DMA support and more memory per node becoming standard, a host processor will not need to be interrupted for this task. Intergrain communication must be maintained in the input interface buffers for the previous checkpoint of each associated grain. This information is required during recovery of a grain. Another intergrain communication issue is propagation of the continue or stall signal. This can be achieved by using a multiphase extension of the circuit in Fig. 2, e.g., by using a domino-logic distributed NOR gate.
Interface buffer size and the injection of large interface buffer traffic into the network in the normal execution phase are two issues here. Both can be handled by minimizing intergrain communication, which can be achieved by judiciously defining the grain size and mapping the application to grains of different shapes and sizes using the chip partitioning, floorplanning and placement techniques of traditional VLSI CAD.
D. Saving Checkpoints and Performing Recovery
After each successful test and recover phase, a checkpoint of system state is kept in ECC- and redundancy-protected memory and circuitry. The system computation state includes the values of all register files, the program counter, processor status and mode protected registers, processor special purpose registers, etc. System communication state involves saving all the packets in transit at the routers. In addition, external inputs received during the grain's normal execution before commit should be stored. This is important because, during a recovery operation, external inputs are difficult to re-create from outside the chip or system in general. This state saving approach is practical because the capability is inherent in computing elements to handle context switching. Although it has slightly higher memory requirements, processing nodes in future multicore architectures are projected to have abundant local memory anyway to alleviate the memory wall problem, which would otherwise become prohibitive for a large number of on-chip cores [2]. So, the proposed approach does not impose any new restriction on the hardware architecture; it makes use of resources that will already be available in the next generations of processor architecture.
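A minimal sketch of the state saving described above follows, assuming a toy entity whose computation state is just a program counter and a few registers. The class `Entity` and its methods `step`, `commit` and `rollback` are hypothetical, and neither the ECC protection of the checkpoint memory nor the router state is modeled.

```python
import copy


class Entity:
    """Hypothetical system entity with checkpointed state and an input log."""

    def __init__(self):
        self.state = {"pc": 0, "regs": [0, 0, 0, 0]}
        self.checkpoint = copy.deepcopy(self.state)
        self.input_log = []  # external inputs received since the last commit

    def step(self, external_input):
        # Log the input first: it cannot be re-created from outside the chip.
        self.input_log.append(external_input)
        self.state["regs"][0] += external_input
        self.state["pc"] += 1

    def commit(self):
        # Successful test and recover phase: save a new checkpoint.
        self.checkpoint = copy.deepcopy(self.state)
        self.input_log.clear()

    def rollback(self):
        # Failed test: restore the last checkpoint and hand back the logged
        # external inputs so the phase can be replayed.
        pending = list(self.input_log)
        self.state = copy.deepcopy(self.checkpoint)
        self.input_log.clear()
        return pending
```

After `rollback()`, feeding the returned inputs back through `step` reproduces the lost computation. Which hardware resources perform that replay is a decision of the recovery mechanism and is outside this sketch.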
Recovery can be achieved in two ways:
2) Overlapping of test phase with normal execution
Overlapping the test and recover phase of certain grains with the normal execution of other grains can be done to achieve one of the two purposes discussed below. This approach requires special consideration for the testing of inter-grain links because nodes at the other end of those links may be in the normal execution phase. Thus, it requires some test functionality from some nodes in the normal execution phase.
a) For lower buffer requirements and more sustained performance
The objective here is to flatten out (or temporally decompose) the burst of output data, and hence to reduce the buffering requirement at the outputs and achieve high sustained system performance. Sustained performance increases because smaller numbers of entities deliver results at shorter intervals, compared with the non-overlapped case, where all entities deliver results at the same time. Both worst-case latency and throughput are initially low compared with the non-overlapped case, but they reach their counterpart values in steady state at the end of the first run of testing of all grains (Fig. 3). Notice that, for some scheduling policies (e.g., asymmetric time to enter the test and recover phase and number of grains), there can also be fluctuations in steady-state throughput (it can be both higher and lower than in the non-overlapped case) at different times. However, the geometric mean of all fluctuating throughput values is the same as in the non-overlapped case. For this case, the same restriction of intergrain communication only after committing of results applies. Otherwise, correct results cannot be guaranteed because when the first round of testing (when a particular grain commits its results again) finishes for all grains, there is already a dependency (Fig. 3).

This approach allows for the use of idle time of a subset of grains for testing, thus lowering the impact of fault detection on system performance. In case of infrequent failures, this optimistic approach of overlapping will result in increased sustained performance and lower latency. In any case, applications will be notified of erroneous results as soon as an error is detected, and correct results will be provided again after recovery.
For this case, the restriction of intergrain communication only after committing of results is removed, which makes grain boundaries soft and reduces the impact of grain size on system performance as compared to the corresponding non-overlapped, separate test and recover phase case. Only starting execution from a common checkpoint can ensure successful recovery by guaranteeing the successful and correct resolution of any intergrain communication. This approach is especially suitable for multimedia architectures since they can tolerate small errors in information. In addition, this approach can be used in non-real-time execution of large programs, e.g., batch-processing jobs.
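The burst-flattening effect of case a) can be illustrated by counting how many grains commit in each time slot of one test period. The function `commits_per_slot` and the evenly spaced staggering policy are assumptions made for this sketch; the paper does not prescribe a particular schedule.

```python
from collections import Counter


def commits_per_slot(num_grains, period, overlapped):
    """Return a list with the number of grains committing in each slot."""
    if overlapped:
        # Assumed policy: spread the grains' commit points evenly over the period.
        slots = [(g * period) // num_grains for g in range(num_grains)]
    else:
        # Non-overlapped case: every grain commits in the last slot.
        slots = [period - 1] * num_grains
    counts = Counter(slots)
    return [counts.get(t, 0) for t in range(period)]


flat = commits_per_slot(8, 4, overlapped=True)
burst = commits_per_slot(8, 4, overlapped=False)
print(flat, burst)
```

Both schedules commit the same number of grains per period, but the peak number of simultaneous commits, and with it the output buffering needed at any instant, is lower in the overlapped schedule.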
III. ANALYSIS DIMENSIONS
A. Grain Size Determination
Grain size should be carefully chosen for a particular application (Fig. 4). If the grain size is big, then much energy has to be spent on recovery, in addition to its impact on intragrain communication. If the grain size is small, then a significant overhead is incurred in intergrain communication due to the large number of interface buffers.

The time interval or frequency for initiation of test and recover phases should be chosen carefully after analyzing its impact on system level performance (Fig. 4). If the interval is too big, then system latency and recovery time will be very large, which means increased recovery overhead and decreased system performance. In addition, blocking of intergrain communication for a long time will also degrade system performance. If the time interval is very short, then performance and energy overhead also increase because much time and energy is spent in the test and recover phase, increasing the overall program execution time.

Regarding area overhead, arithmetic code and ABFT techniques can only be applied to protect computations that possess certain algebraic structure [6], which limits the use of such techniques. So, the complexity and overhead of building a generic system component, e.g., CPU, router, etc., can be very high and very unfriendly to generic hardware implementation because resource sharing is limited. TMR results in large area overhead for obvious reasons. Area overhead in our approach is minimal because BIST is becoming standard for manufacturer and boot-up tests, at least in production environments, and only small additional control circuitry is required.
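The test-interval trade-off discussed in this section can be made concrete with a first-order model. The model is an assumption introduced here, not a result of the paper: failures are taken to arrive independently at a constant rate, a failure is detected only at the next test phase, and a failed phase is simply repeated in full. The names `efficiency` and `best_interval` are hypothetical.

```python
import math


def efficiency(interval, test_time, failure_rate):
    """Fraction of time spent on useful, committed work.

    One attempt costs interval + test_time and succeeds with probability
    exp(-failure_rate * attempt), so the expected number of attempts per
    committed interval is exp(failure_rate * attempt).
    """
    attempt = interval + test_time
    return interval * math.exp(-failure_rate * attempt) / attempt


def best_interval(test_time, failure_rate):
    """Interval that maximizes efficiency() under the assumed model.

    Setting the derivative of log(efficiency) to zero gives
    T**2 + c*T - c/failure_rate = 0 with c = test_time.
    """
    c = test_time
    return (-c + math.sqrt(c * c + 4.0 * c / failure_rate)) / 2.0


t_opt = best_interval(test_time=1.0, failure_rate=0.001)
for t in (t_opt / 10, t_opt, t_opt * 10):
    print(round(t, 2), round(efficiency(t, 1.0, 0.001), 4))
```

Under these assumptions, efficiency falls off on both sides of the optimum: very short intervals are dominated by test time and very long intervals by repeated work. Energy cost and the blocking of intergrain communication are not captured by this model.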
IV. ADVANTAGES COMPARED
Regarding power consumption, redundancy based approaches require exact replication of hardware, as indicated above. Assuming roughly uniform switching activity factors, power comparisons track area comparisons. Thus, the proposed approach is expected to be less powerhungry as well.
C. Low time to market and production cost
Low time to market is evident because fewer chip respins, if any, are required with the proposed approach. BIST resources can be configured to tune a wide range of potentially bad fabrication results to acceptable ones. Also, design time is low for the proposed approach because no design-specific and fault-environment-specific mechanisms, such as arithmetic codes, need to be hardwired. Low production cost is achieved because of fewer chip respins, low design time (and hence designer salary), high yield and long lifetime of chips.
D. High productivity because of small learning curve
BIST techniques are well developed and have already made their way into chip design for mass production. So, utilizing that knowledge for achieving fault tolerance, without learning many new concepts, will translate into higher designer productivity. For a comparison, productivity is inherently low with arithmetic code and ABFT techniques because of more complex design [6] , and redundancy based approaches will still require understanding of some extra concepts for efficient utilization of redundancy that were not part of the design flow before.
E. Reusability and scalability
The basic framework remains unaltered and highly reusable. Only the various dimensions discussed in section III need to be analyzed, and a performance point needs to be tuned for a particular design and application. The proposed approach is also scalable across multiple process technologies and feature sizes, given the underlying framework of using BIST resources.
V. CONCLUSION
According to ITRS, widespread reliability challenges are expected for recent and future VLSI fabrication technologies, a situation that calls for effective and efficient on-chip fault-tolerance solutions. A new approach of achieving on-chip fault-tolerance using BIST resources is proposed in this paper. These BIST resources are expected to be already present on VLSI chips because they are becoming standard in the production components. The proposed approach reduces production cost, implementation overhead and time-to-market; increases reusability, post-fabrication reconfigurability and productivity; and is scalable across multiple VLSI processes and feature sizes. This approach will result in obvious advantages of yield enhancement and prolonged lifetime of VLSI chips as well.
VI. REFERENCES

