Traditionally the position of reliability analysis in the design and production process of electronic circuits is a position of reliability verification. A completed design is checked on reliability aspects and either rejected or accepted for production. This paper describes a method to model physical failure mechanisms within components in such a way that they can be used for reliability optimization, not after, but during the early phase of the design process. Furthermore a prototype of a CAD software tool is described, which can highlight components likely to fail and automatically adjust circuit parameters to improve product reliability.
INTRODUCTION
At this moment quality and reliability analysis is, in many cases, used as a form of design verification. A completed (or partially completed) design is checked on whether it fulfils certain quality and/or reliability demands. For this purpose verification techniques (such as burn-in) or prediction techniques (such as part-count or failure-rate analysis) are quite often used. These methods do have a number of disadvantages:
1. They are not an integrated part of the design process. Improvements are usually introduced as 'add-ons' after completion of the design. A consequence of this approach is that, in cases where certain reliability or quality demands are not met, the designer is often under pressure to add or change only the minimum required to fulfil the stated demands.
2. They are not very accurate. Differences of a factor of 100 between prediction and practice are not exceptional.
3. They do not take into account differences between individual circuits within a batch. Our research results have shown that there can be considerable differences in reliability within a batch of circuits.
4. Standard reliability prediction methods do not have a relation with commonly used design parameters. Many traditional reliability prediction methods describe the reliability of a component as a function of a certain average stress. It is not possible to derive whether this average stress is relevant and, if it is, what 'causes' this stress.
5. Techniques such as burn-in and analysis of field failures are, generally speaking, quite costly and time-consuming, and there is no guarantee that the real problem causes are found.
We shall now elaborate on a few of these points and then introduce a new approach to reliability improvement as an integrated part of the design process.
A FEW COMMENTS ON THE USE OF TRADITIONAL PREDICTION METHODS FOR RELIABILITY IMPROVEMENT
The reliability figures (e.g. failure rates) found in the main standard handbooks (e.g. MIL-HDBK-217¹, British Telecom HRD-4²) are usually derived by averaging data retrieved from vast numbers of failed components that operated in a variety of applications and under a wide range of environmental conditions. Provided that the sample sets are representative, such a post-mortem count provides accurate population mean estimates in which we can have a high statistical confidence. Considerable care should be taken, however, when applying these global means to lifetime predictions of circuits in specific (classes of) applications and operating under conditions that differ from the 'average'. Doing this would be like using the average age of fish to estimate the expected lifetime of trout swimming in a poisoned river. Therefore, the handbooks usually supply correction factors (π-factors), which cater for the influence on the basic failure rate (λb) of environmental conditions (e.g. temperature), specific operating modes (e.g. switching or analogue) and the quality and construction of the component itself. In generic form the predicted failure rate is the basic failure rate multiplied by these factors, $\lambda = \lambda_b \prod_i \pi_i$.
As several physical failure mechanisms may each cause similar fatal damage,3 but with a very different application-dependent probability, further discrimination on the acceleration of failures is necessary for an accurate prediction. Otherwise we are prone to make the same mistake again: to use the 'global' mean rather than the mean of the subpopulation. However, an inevitable consequence of the above post-mortem approach is that detailed information on the state of the circuit during its failure is lost. (Blown-away components in particular leave few traces from which to determine the cause of their damage.) So, correction factors to discriminate between different failure mechanisms cannot be derived, and the failure rates published in the handbooks are in fact weighted averages of the rates corresponding to the individual contributing failure mechanisms.
Example
If a component, used in a variety of products, fails in 30 per cent of the cases due to mechanism A (mean time between failures (MTBF) = 10 years) and in all remaining cases due to mechanism B (MTBF = 1 year), then the failure rate found by post-mortem counting will be 1/(0.3 × 10 + 0.7 × 1) = 1 per 3.7 years. Although this figure is a statistically unbiased estimate of the mean failure rate, it is not close to either of the really occurring failure rates.
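The arithmetic of this example can be checked with a few lines of Python. This is only a worked restatement of the numbers given above; the variable names are illustrative.

```python
# Two failure mechanisms, A and B, as in the example above.
share_a, mtbf_a = 0.3, 10.0   # 30 per cent of failures, MTBF 10 years
share_b, mtbf_b = 0.7, 1.0    # remaining 70 per cent, MTBF 1 year

# Post-mortem counting sees only the weighted mean life of the mixture.
mean_life = share_a * mtbf_a + share_b * mtbf_b   # years
pooled_rate = 1.0 / mean_life                     # failures per year

rate_a = 1.0 / mtbf_a   # rate that applies if mechanism A dominates
rate_b = 1.0 / mtbf_b   # rate that applies if mechanism B dominates

print(f"pooled: 1 per {mean_life:.1f} years ({pooled_rate:.2f} per year)")
print(f"mechanism A: {rate_a:.2f} per year, mechanism B: {rate_b:.2f} per year")
```

The pooled figure lies between the two mechanism rates and differs from each by roughly a factor of three to four, which is the point the example makes.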
From this, it is clear that the misinterpretation of the handbooks for a specific application, rather than a class, is not likely to produce an accurate lifetime prediction.
More important, since these figures do not relate to failure mechanisms, they can hardly relate back to the physical entities (the stressors, e.g. currents I(t) and voltages V(t)) that influence these mechanisms. Whereas practice has shown that components may also fail due to extreme values of dI/dt and dV/dt, traditional handbooks only supply a somewhat misleading class (!) failure rate, which usually depends only on temperature and average power stress.
Example (simplified, see also the next section)
In a fast switching bipolar transistor the pinch-in effect (a high local value of the collector-emitter current during switch-off) may give rise to extremely high local power dissipation, causing the transistor to fail. The main actual failure cause of the breakdown mechanism is the high local value P(t) of the collector-emitter current on a particular spot within the device. The associated stressor at circuit level is the base terminal current slope dIb(t)/dt. For reliable operation (with respect to the above failure mechanism), dIb(t)/dt should not frequently exceed its upper or lower susceptibility limit.
The actual value of the susceptibility limit may be found through measurements or by applying a 'micro-functional' model (see the next section), which expresses local currents, voltages and powers within the device (e.g. P(t) in the example above) in terms of currents and voltages at the device's terminals (e.g. dIb(t)/dt). This enables simulation and optimization at circuit level, using standard circuit simulators like Spice.
It should be noted that several factors complicate this simple interaction picture. For instance, the device temperature (partly depending on the environmental temperature T at circuit level) also influences the above failure mechanism, causing the transistor to fail at a slightly different value of the stressor dIb(t)/dt. So, we find that this failure mechanism in the device is in fact susceptible not to one stressor but to a set of stressors, which at circuit level translate to dIb(t)/dt, T and probably other stressors as well. A second complicating factor that is incorporated into the model of stressor/susceptibility interaction is the effect of inevitable tolerances in every production process, resulting in a batch of similar but not identical devices and components, and causing a certain spread in the susceptibility. To model this statistical quality, we can no longer speak of the susceptibility limit of a batch, but must assign a susceptibility distribution, transforming the susceptibility into a random variable.
Inaccuracies in the production process also cause spreads of the functional parameters (e.g. resistances, capacitances, gains) of components, and thus of circuit currents and voltages as well. As many of these circuit entities are in fact stressors, it becomes apparent that stressors too are random variables with a probability distribution.
Stressor and susceptibility densities may shift and widen due to drift and ageing.
Furthermore, since a device quality or parameter can determine both the susceptibility and the functional performance, we can expect stressors and susceptibility to be highly correlated.
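The interaction between a random stressor and a random, correlated susceptibility can be illustrated with a minimal sketch. It assumes, purely for illustration, that both quantities are normally distributed with a correlation coefficient rho; the function names, the numerical values and the normal model itself are assumptions and are not taken from the measurements discussed in this paper.

```python
import math
import random

def failure_fraction(mu_stress, sd_stress, mu_susc, sd_susc, rho, n=100_000, seed=1):
    """Monte Carlo estimate of P(stressor > susceptibility), correlated normals."""
    rng = random.Random(seed)
    fails = 0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        stress = mu_stress + sd_stress * z1
        # The susceptibility shares part of the stressor's variation (rho).
        susc = mu_susc + sd_susc * (rho * z1 + math.sqrt(1.0 - rho**2) * z2)
        if stress > susc:
            fails += 1
    return fails / n

def failure_fraction_exact(mu_stress, sd_stress, mu_susc, sd_susc, rho):
    """Closed form for the same normal model, used as a cross-check."""
    sd_margin = math.sqrt(sd_susc**2 + sd_stress**2 - 2.0 * rho * sd_susc * sd_stress)
    z = (mu_susc - mu_stress) / sd_margin
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Assumed values in arbitrary units: stressor 1.0 +/- 0.1, susceptibility 1.4 +/- 0.1.
for rho in (-0.8, 0.0, 0.8):
    exact = failure_fraction_exact(1.0, 0.1, 1.4, 0.1, rho)
    print(f"rho = {rho:+.1f}: expected failing fraction {exact:.2e}")
```

With identical means and spreads, the expected failing fraction changes by several orders of magnitude as the correlation changes sign, which is why the correlation cannot be ignored in this simplified model.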
It is obvious that for an accurate prediction of component failures in a batch, all essential stressors and all associated susceptibilities should be taken into account. The calculations involved will be too complicated to do by hand, but simulation on a computer is feasible, as will be shown in the last section of this paper. Using additional post-processing the program described there can show where in the circuit failures are likely to occur and which parameters have a dominant influence on these failures. It can also give guidelines for improvement. Using this simulation technique it is not only possible to prevent potential quality and reliability problems during the design phase, but also to give a design a certain 'robustness' against possible (unexpected) external influences.
Practical example
To demonstrate the use of reliability optimization using stressor/susceptibility interaction, this paper will use a practical example (see Figure 1). This test circuit is a simplified high-voltage switching circuit.
DEVELOPMENT OF [...]
[...] on, or verified by, practical measurements. Owing to limitations in resources and equipment, the development of complete statistical models has not yet been possible. However, it was often possible to derive 'safe' (i.e. worst-case) susceptibility limits.
The following section will illustrate the development of susceptibility models using the practical example of one of the failure mechanisms related to second breakdown in a bipolar transistor (transistor HV in Figure 1). A detailed discussion of the physical aspects of second breakdown is given by Humphreys.4 As second breakdown effects are closely related to geometrical aspects of transistors, first a brief explanation of the effect of the geometrical transistor structure on the switching behaviour will be given.
Theoretically, a transistor is often assumed to be a homogeneous device having one emitter, one base and one collector. The behaviour in all parts of these terminals is assumed to be identical.
The problem in this respect lies especially in the construction of the base of a transistor. See Figure 2 for a cross-section diagram of a simple n+pnn+ transistor. Owing to the ohmic effects of the base channel combined with the effects of the base-collector capacitance, the base of the transistor will not behave homogeneously but will show considerable differences [...]
During turn-off the collector-emitter current becomes concentrated towards the middle of the emitter area. The charge stored in the transistor is removed first at the edges and later in the middle of the base channel (Figure 3). As a consequence the current through the transistor will 'pinch-in' in the middle of the emitter area. This effect is quite similar to the high reverse current in diodes immediately after re-polarization.
How this pinch-in effect affects transistor behaviour can be demonstrated using a square planar transistor (Figure 4). It is possible to simulate the behaviour of a large, inhomogeneous transistor using a micro-functional model. This micro-functional model consists of an array of small homogeneous transistors, and models the effects of inhomogeneous switching using a base network (see Figure 5).
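The qualitative behaviour, charge leaving the edges of the base first and the middle last, can be reproduced with a toy one-dimensional resistor-capacitor ladder. The sketch below is not the micro-functional model of Figure 5: it is a linear, one-dimensional simplification, and the function name, the cell count and the lumped `coupling` constant are all assumptions made for illustration.

```python
def discharge_profile(n_cells=11, coupling=0.25, steps=200):
    """Remaining base charge per cell after `steps` explicit diffusion steps.

    The base contacts sit just outside cells 0 and n_cells-1 and are held at
    zero charge; `coupling` (at most 0.5 for numerical stability) lumps the
    lateral base resistance and the base-collector capacitance into one number.
    """
    charge = [1.0] * n_cells              # uniformly charged base at t = 0
    for _ in range(steps):
        padded = [0.0] + charge + [0.0]   # base contacts at both edges
        charge = [
            padded[i] + coupling * (padded[i - 1] - 2.0 * padded[i] + padded[i + 1])
            for i in range(1, n_cells + 1)
        ]
    return charge

profile = discharge_profile()
total = sum(profile)
shares = [q / total for q in profile]     # fraction of remaining charge per cell

for index, share in enumerate(shares):
    print(f"cell {index:2d}: {'#' * round(share * 200)}")
```

In this linear toy model the centre cell ends up holding a larger share of the remaining charge than the uniform 1/11 it started with, while the edge cells hold the least. A real device is strongly non-linear, so the actual concentration of current is expected to be more pronounced than this sketch suggests.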
Switching the transistor off will result in a power distribution such as given in Figure 6. From this figure we can easily derive that the interior of the emitter area of the transistor in particular is susceptible to pinch-in effects. These localized power peaks can cause failures in the transistor if they exceed a critical value, which can be calculated.4 A stressor/susceptibility model describing this failure mechanism assumes a relation between the localized power dissipation and the total device power dissipation at circuit level.
Although the development of complete statistical susceptibility models for batches of components has not yet been possible, the development of these models deserves attention in further research. At present manufacturers' databooks tend to present only comparatively simple operating guidelines without distributions and often even without distribution limits. The introduction of detailed susceptibility models in component manufacturers' databooks appears to be a useful enhancement of these books and would give additional insight to their users. An additional advantage of a manufacturer providing susceptibility models is the possibility of implementing these models directly in a computer-aided design system, thus reducing the time required to obtain results.
After studying the failure mechanism in detail and determining the physical entities and their critical values within the device, the second part of modelling stressor/susceptibility interaction involves the translation up to circuit level. For our example this means finding out which stressors (and stressor combinations) contribute to power peaks during turn-off.
The combination of a constant collector current Ic and a transistor already partially switched off (resulting in an increasing collector-emitter voltage Vce) will result in a considerable increase in power density. Hence the combination Ic/Vce forms a stressor pair.
One of the solutions to prevent pinch-in effects seems to be a rapid discharge of the transistor base. There is, however, an important limitation in this respect. A very rapid discharge of the transistor base will cause a remaining 'charge bubble' under the middle of the transistor's emitter area. Rapid discharging may cause a complete charge removal at the edges of the transistor's base channel. In those areas where charge is completely removed the lateral conduction of the base channel drops, thus leaving a remaining charge under the middle of the transistor's emitter area. Therefore it is important that the base discharge rate remains close to an optimum.
Together this gives the stressors in Table I for reverse-bias second breakdown.
DETERMINING STRESSOR/SUSCEPTIBILITY INTERACTION
Generally speaking there are two methods of obtaining actual stressor sets for a given component: measuring stressors and obtaining stressor sets from the results of computer simulation. As mentioned in one of the previous sections the latter option requires the availability of 'micro-functional' models. Unfortunately these are often unavailable. Although many models at circuit level are available for programs such as Spice and Philpac there is, at present, no generally accepted micro-functional model for the more complex multi-parameter devices (such as diodes, transistors, etc.). Another problem is that the available models usually do not describe the (often highly correlated) tolerance effects in components. These tolerance effects (spreads) are essential for the simulation of batch reliability (normally, it is not the average circuit that fails, but one that is close to its tolerance limits). Although many circuit simulators have limited possibilities for introducing parameter tolerances, these tolerances are in practice hardly known, and correlations between parameters are often not known at all. Therefore a considerable effort was put into the development of more comprehensive tolerance models, especially for multi-parameter components (e.g. transistors).
An analysis of two practical circuits3 showed that, for circuits such as presented in the example, the majority of the reliability problems were related to extremes in the stressor function. Figure 7 describes all possible combinations of collector-emitter voltage and collector current for a batch of circuits (the other stressors described in the stressor set are taken into account but not displayed in this figure). The more inward a contour lies, the higher the probability of occurrence of the corresponding combination of voltage and current. The border of the shaded area expresses the combined susceptibility limits for the second breakdown mechanism as well as limitations on average power stress.
STRESSOR/SUSCEPTIBILITY BATCH OPTIMIZATION
The stressor/susceptibility method is implemented as an extension to an existing CAD software tool, called MINNIE.5 The implementation is based on a Monte Carlo simulation, in which probability density functions (including correlations) are assigned to both the functional parameters and the susceptibility limits of the design's components. Values are randomly picked according to these densities to produce a representative sample set that mimics the real-life batch-manufactured product. For each generated sample circuit an analysis (AC, DC, transient) is done using a circuit simulator (e.g. Spice). In every sample circuit and at every simulated time and/or frequency point the actual susceptibility limits are compared against the associated stressor values, and violations (circuit failures) are counted. This can be depicted (Figure 8) by superimposing the susceptibility density as a band onto the results graph and printing a violation count (not shown in Figure 8) at every time/frequency point.
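The sampling and violation-counting loop described above can be sketched as follows. This is not the MINNIE implementation: the circuit simulator is replaced by a hypothetical closed-form stressor (`peak_voltage`), all nominal values and spreads are assumed, and for brevity the parameters are sampled independently, whereas the method described above also includes correlations.

```python
import math
import random

def peak_voltage(v_supply, i_load, l_henry, c_farad):
    """Hypothetical stand-in for one simulator run: peak switch voltage when
    an inductive load current is commutated into a snubber capacitor."""
    return v_supply + i_load * math.sqrt(l_henry / c_farad)

def run_batch(n_samples=5000, seed=7):
    """Generate a batch of sample circuits and flag susceptibility violations."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Functional parameters, drawn around nominal with assumed spreads.
        l = rng.gauss(1.0e-3, 0.05e-3)    # inductance, H
        c = rng.gauss(10.0e-9, 1.0e-9)    # snubber capacitance, F
        # Susceptibility limit of this particular transistor, V (assumed).
        v_limit = rng.gauss(800.0, 30.0)
        stress = peak_voltage(300.0, 1.5, l, c)
        samples.append({"L": l, "C": c, "stress": stress,
                        "limit": v_limit, "fail": stress > v_limit})
    return samples

batch = run_batch()
violations = sum(s["fail"] for s in batch)
print(f"{violations} of {len(batch)} sample circuits violate their limit")
```

In a transient analysis the same comparison would be repeated at every simulated time point, giving one violation count per point as in Figure 8.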
To investigate (at one particular time/frequency point) which parameter values tend to make a circuit fail, we could make a scatter plot (Figure 9) for each parameter. Each dot in the shaded area of Figure 9 represents one failing circuit (for clarity and simplicity the susceptibility limits in this figure are taken to be fixed). A similarly useful plot that can give much insight into failure causes is the so-called pass-fail diagram,6 which can be set up with respect to one time/frequency point as well as to any range. It consists of two (or three) superimposed histograms with the parameter of interest on the common horizontal axis (Figure 10; the sample value is expressed as a fraction of the nominal). All parameter samples associated with circuits (just) passing all susceptibility constraints in the investigated time/frequency range are 'binned' in the pass and 'critical' histograms (grey bars). All the other parameter values end up in the fail histogram (black). A large distance between the centres (means) of the pass and fail histograms (i.e. a small histogram overlap) indicates that the reliability is sensitive to this parameter (left parameter (L1) in Figure 10). Its nominal should be moved towards the centre of the passing circuits (to the right in this case) to improve product reliability. If the pass and fail histograms overlap considerably (as in the right-hand side pass-fail diagram of Figure 10), then the reliability is insensitive to this parameter and adjusting the nominal is useless. When fails occur on both sides of the nominal (middle pass-fail diagram), the only possibility for improvement is to narrow the tolerances.
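The binning behind a pass-fail diagram can be sketched in a few lines. The data below are synthetic (a parameter whose low values tend to cause failures is assumed), the function name and bin limits are illustrative, and the separate 'critical' histogram is omitted for brevity.

```python
import random

def pass_fail_histograms(values, failed, lo=0.8, hi=1.2, n_bins=8):
    """Bin parameter samples (as a fraction of nominal) into pass and fail counts."""
    width = (hi - lo) / n_bins
    passes = [0] * n_bins
    fails = [0] * n_bins
    for value, is_fail in zip(values, failed):
        index = min(max(int((value - lo) / width), 0), n_bins - 1)  # clamp outliers
        (fails if is_fail else passes)[index] += 1
    return passes, fails

# Synthetic batch: circuits with a low parameter value tend to fail (assumed).
rng = random.Random(3)
values = [rng.gauss(1.0, 0.05) for _ in range(2000)]
failed = [v + rng.gauss(0.0, 0.02) < 0.95 for v in values]

passes, fails = pass_fail_histograms(values, failed)
for i, (p, f) in enumerate(zip(passes, fails)):
    left = 0.8 + i * 0.05
    print(f"{left:.2f}-{left + 0.05:.2f}  pass {'#' * (p // 20):<30} fail {'*' * (f // 20)}")
```

For this synthetic parameter the fail histogram sits to the left of the pass histogram, corresponding to the left-hand case of Figure 10, where the nominal should be moved to the right.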
To capture this picture numerically, a statistical test (e.g. Student's t-test) may be applied to determine which parameters have a centre of passes that is a significant distance away from the centre of fails. This enables us to pinpoint those parameters that are dominant for circuit failure. We can highlight the associated components in the circuit diagram on the computer screen, or print their names in the results graph at the position where the violation (failure) occurs (Figure 8).
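A minimal sketch of such a test is given below. It uses Welch's unequal-variance variant of the t statistic rather than the classical Student's form, because the pass and fail groups usually differ in size and spread; the fixed threshold of 3, the synthetic data and the parameter names L1 and R2 are assumptions for illustration.

```python
import math
import random
import statistics

def welch_t(passing, failing):
    """Welch's t statistic for the difference between pass and fail means."""
    mean_p, mean_f = statistics.fmean(passing), statistics.fmean(failing)
    var_p, var_f = statistics.variance(passing), statistics.variance(failing)
    return (mean_p - mean_f) / math.sqrt(var_p / len(passing) + var_f / len(failing))

def dominant_parameters(samples, names, threshold=3.0):
    """Return parameters whose pass and fail centres differ by |t| > threshold."""
    result = {}
    for name in names:
        passing = [s[name] for s in samples if not s["fail"]]
        failing = [s[name] for s in samples if s["fail"]]
        if len(passing) > 1 and len(failing) > 1:
            t = welch_t(passing, failing)
            if abs(t) > threshold:
                result[name] = t
    return result

# Synthetic batch: failure depends on L1 only; R2 is irrelevant (assumed).
rng = random.Random(11)
samples = []
for _ in range(1000):
    l1 = rng.gauss(1.0, 0.05)
    r2 = rng.gauss(1.0, 0.05)
    samples.append({"L1": l1, "R2": r2, "fail": l1 < 0.96})

dominant = dominant_parameters(samples, ["L1", "R2"])
print(dominant)
```

A positive t value means that the centre of passes lies above the centre of fails, so the nominal of that parameter would be moved upwards.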
As explained above, it is likely that the reliability of the product will benefit from a change of each (designable) parameter's nominal in the direction of its centre of passes. (Note that only the parameter's nominal is adapted, not its quality (tolerance), so the operation does not increase component costs.) This process is called design centring. Several algorithms have been developed7 for automated design centring on the functional performance of the product. For this functional design centring the specifications are usually fixed upper and lower limits on output voltages, impedances, power dissipation etc. In our case of design centring (with respect to reliability), the 'specifications' are limits on the stressor values and they are not fixed, but dictated by the component susceptibilities, which are random variables. So, every sample circuit has different values for the stressors and for the susceptibility limits. For our purpose we have implemented a modified version of the centres of gravity algorithm,7 because of its proven robustness and the fact that it can handle a high number of components without losing accuracy or running into convergence problems.
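The idea of centring on the centres of passes and fails can be sketched as follows. This is a basic formulation in which each nominal is moved along the vector from the centre of gravity of the failing samples to that of the passing samples; it is not the modified algorithm implemented in the tool, and the stressor model, spreads, step size and sample counts are all assumed for illustration.

```python
import math
import random

def simulate(nominal, n, rng):
    """Draw n sample circuits around `nominal`; return (params, fail) pairs."""
    out = []
    for _ in range(n):
        l = rng.gauss(nominal["L"], 0.05 * nominal["L"])
        c = rng.gauss(nominal["C"], 0.10 * nominal["C"])
        stress = 300.0 + 1.5 * math.sqrt(l / c)   # hypothetical stressor model, V
        limit = rng.gauss(800.0, 30.0)            # random susceptibility limit, V
        out.append(({"L": l, "C": c}, stress > limit))
    return out

def centre(points, names):
    """Centre of gravity (mean) of a list of parameter dictionaries."""
    return {k: sum(p[k] for p in points) / len(points) for k in names}

def centring_step(nominal, n, rng, step=1.0):
    """One iteration: move the nominal from the fail centre towards the pass centre."""
    runs = simulate(nominal, n, rng)
    passes = [p for p, fail in runs if not fail]
    fails = [p for p, fail in runs if fail]
    fail_rate = len(fails) / n
    if not passes or not fails:
        return nominal, fail_rate                 # nothing to steer by
    g_pass, g_fail = centre(passes, nominal), centre(fails, nominal)
    moved = {k: nominal[k] + step * (g_pass[k] - g_fail[k]) for k in nominal}
    return moved, fail_rate

rng = random.Random(5)
nominal = {"L": 1.0e-3, "C": 10.0e-9}
history = []
for _ in range(5):
    nominal, rate = centring_step(nominal, 4000, rng)
    history.append(rate)
print([f"{r:.3f}" for r in history], nominal)
```

Because the susceptibility limit is redrawn for every sample circuit, the pass/fail boundary is itself random, which is the difference from functional design centring noted above. A practical implementation would also keep the nominals within their designable ranges, which this sketch does not do.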
