Architecture of an Intelligent Test Error Detection Agent by Kirmse, Matthias & Petersohn, Uwe
Institut für Künstliche Intelligenz
TUD-FI12-02 Februar 2012-02-03
Technische Berichte
Technical Reports
ISSN 1430-211X
Fakultät Informatik
Technische Universität Dresden
Fakultät Informatik
D-01062 Dresden
Germany
URL: http://www.inf.tu-dresden.de/
February 3, 2012
Abstract
In this paper we present the architecture of an intelligent test error detection
agent that is able to independently supervise the test process. By means of ratio-
nally applied bin and cause specific retests it should detect and correct the majority
of test errors with minimal additional test effort. To achieve this, the agent utilizes
test error models learned from historical example data to rate single wafer runs.
The resulting run specific test error hypotheses are sequentially combined with in-
formation gained from regular and ordered retests in order to infer and update a
global test error hypothesis. Based on this global hypothesis the agent decides if a
test error exists, what its most probable cause is and which bins are affected. Con-
sequently, it is able to initiate proper retests to check the inferred hypothesis and
if necessary correct the affected test runs. The paper includes a description of the
general architecture and discussions about possible test error models, the inference
approach to generate the test error hypotheses from the given information and a
possible set of rules to act upon the inferred hypothesis.
1 Introduction
High investments and short lifetimes of semiconductor factories force chip
manufacturers to optimize almost every step of their production processes to achieve
maximum yield at minimal cost. Besides design and fabrication, the test of integrated
circuits plays an increasingly important role, as the corresponding costs can make
up to 50% of the total costs [1]. Under these circumstances, it is a valuable approach to
improve the test process using modern machine learning methods.
As test procedures have to cope with the growing complexity of today's ICs, the
underlying test systems become similarly prone to faults as production systems. In many
cases, test system faults cause test errors that lead to functional devices being rated as
non-functional. While test system faults can be detected, for example, by the use of
reference wafers, erroneous die runs can only be corrected by corresponding retests.
Today, there are different standard strategies to detect test errors. First of all, the whole
test process is supervised by experienced test engineers. If they observe suspicious
wafer runs, they have different possibilities to check supposed test errors and narrow
down their causes at the same time. For example, they can order wafer retests on other
test systems. Nevertheless, as a result of the increasing number of tests, it becomes
difficult for them to appropriately supervise all wafer runs.
Therefore, automatic methods are additionally used, such as bin limits defined on the
basis of the test engineers' experience. Their drawback, however, is that they only cover
a small fraction of obvious test errors and continuously require manual effort to update
the limits. Regularly applied retests are another common automatic approach.
Thereby, selected wafers are tested twice at defined intervals. Significant differences
between both runs then indicate test errors and result in additional checks and retests
of the intermediate wafer runs. Although this method is more reliable than static limits,
several wafer runs may be affected before a test system fault is detected, leading to
unnecessary retests. Additionally, normal fail wafer retests are often not very effective,
as all fail bins are retested despite the fact that in most cases only a small percentage
of them are really test errors.
To overcome these drawbacks we propose TED, a novel test error detection agent that
should supervise the test process in a similar way to human test engineers. Based on
underlying test error models, it assigns test and bin error probabilities to single wafer
runs and combines them with retest evidence to infer and continuously update valid
test error hypotheses. This should enable TED to detect test errors at an early stage,
make assumptions about their most probable cause and correct them effectively by bin
specific retests.
2 Test Errors and Retest Types
In this section we want to explain basic problem relevant terms that are used through-
out the paper. Thereby, we first describe basic test error definitions and subsequently
possible test and retest types.
Our test error definition is based on the correctness of measured bin classes, as they are
the final test result. Consequently, we consider a die run as erroneous if the measured
bin class differs from the correct one. Transferring this principle to wafer level, we
define a wafer run to be erroneous if at least a percentage $\epsilon$ of the corresponding
die runs are erroneous. The error limit $\epsilon$ should reflect the cost-utility limit of
retests for the tested product.
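As an illustration, the two definitions above can be sketched in a few lines of Python. The function names and the parameter `error_limit` (the percentage limit, expressed as a fraction) are hypothetical and chosen for this sketch only. Note that the correct bin classes are not known during a real test run, so such a check is only possible on labeled example data.

```python
from typing import Sequence


def die_run_is_erroneous(measured_bin: int, correct_bin: int) -> bool:
    """A die run is erroneous if the measured bin class differs from the correct one."""
    return measured_bin != correct_bin


def wafer_run_is_erroneous(measured_bins: Sequence[int],
                           correct_bins: Sequence[int],
                           error_limit: float) -> bool:
    """A wafer run is erroneous if at least `error_limit` (a fraction between
    0 and 1) of its die runs are erroneous."""
    if len(measured_bins) != len(correct_bins):
        raise ValueError("both bin sequences must describe the same dice")
    if not measured_bins:
        return False  # assumption of this sketch: an empty wafer run is not erroneous
    n_errors = sum(
        die_run_is_erroneous(m, c) for m, c in zip(measured_bins, correct_bins)
    )
    return n_errors / len(measured_bins) >= error_limit
```

With one erroneous die out of four, the wafer run counts as erroneous for a limit of 0.2 but not for a limit of 0.5.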
In this paper, we roughly distinguish between three different test error types: test
system related test errors (TS), probe card related test errors (PC) and a third category
of miscellaneous test errors (MI). The first two types are self-explanatory. They are
permanent equipment faults, which require the current lot run to be interrupted and
continued on proper test equipment while the faulty part is repaired. Additionally,
all wafer runs that have been affected up to this point have to be retested in order
to correct the test errors.
On the other hand, the third category MI includes all test errors that are not caused
by permanent test equipment faults. For example, the contact of a needle can vary
depending on how much time has passed since the last polish. For dice that are already
at the limit regarding contact sensitive tests, those small variations can make the
difference between fail and pass bin. In a nutshell, all cases of test errors for which a
retest with the same test equipment leads to an improvement belong to the third category.
In general, we consider the following types of tests and retests. Normally, each wafer
in a lot is tested at least once. During this first wafer run, all dice on the wafer are tested
and assigned to a bin class. Additionally, there is the regular first wafer retest. Thereby,
the first wafer of each lot is tested again after all other wafers have been tested. During
this retest type only defined reticles on the wafer are tested again. If there are significant
differences between both wafer runs, the affected lot goes into hold and is checked by
an engineer. Finally, fail wafer retests are ordered by test engineers if they suspect wafer
runs to be erroneous. This retest type normally includes the retest of all fail bins.
3 TED: Test Error Detection Agent
3.1 Agent Problem Description
According to [2], an agent problem can be described by four basic terms: perceptions,
actions, goals and environment. Basically, TED operates on lot level. This means that
the agent treats each lot independently of the others and does not exchange
information between different lot runs. For each lot run it perceives a sequence of single
wafer runs.
Based on this wafer run sequence, the agent has the choice between the following
actions. First, it can simply order the next regular first wafer test including the test of
all dice. Second, it can initiate a fail wafer retest with the same test equipment, with
another probe card or on another test system. For each fail wafer retest, it can
additionally specify which bins should be retested.
The primary goal of our test agent is to detect and correct as many test errors as pos-
sible. Moreover, this goal should be achieved with minimal retest costs. Transferring
this overall objective to lot level, TED aims to reach a state where it is sure about the
correctness of the lot run and to achieve this again with minimal retest effort.
With respect to the terms in [2], the given test environment is inaccessible,
nondeterministic and discrete. It is inaccessible as we never know exactly the state of the
test system and the correctness of the single test runs but can only draw inferences about
them from uncertain evidence. It is nondeterministic as we do not know in advance what our
actions, for example retests, will yield. And it is discrete if we just use bin classes as
aggregated test results, resulting in a finite number of possible variations for single
wafer runs.
3.2 Basic Principles
To pursue the main objectives TED has to be able to make rational decisions about
retests in terms of necessity, possible equipment changes and retest bins. Consequently,
it has to infer internal beliefs about the existence of test errors, about the test error type
and the affected bins. In our model those beliefs are represented by a set of
probabilities, which together form the global test error hypothesis. This global hypothesis
is continuously updated during the lot run and is used both to check the goal condition
and to infer rational retest decisions.
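\[
\begin{aligned}
h &= (h_{TS}, h_{PC}, h_{MI}) \\
h_{TS} &= (p_{TS}(TE), p_{TS}(b_1), \ldots, p_{TS}(b_k)) \\
h_{PC} &= (p_{PC}(TE), p_{PC}(b_1), \ldots, p_{PC}(b_k)) \\
h_{MI} &= (p_{MI}(TE), p_{MI}(b_1), \ldots, p_{MI}(b_k))
\end{aligned}
\tag{1}
\]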
The global test error hypothesis for a lot consists of a separate hypothesis $h_{type}$ for
each test error type. They, in turn, include the probability $p_{type}(TE)$ that the lot run is
affected by a test error of that specific type and the probabilities $p_{type}(b_i)$ reflecting
the belief of the agent that bin $b_i$ is affected by the test error.
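One possible in-memory representation of such a hypothesis is sketched below. The class and function names are hypothetical and not taken from the paper. The sketch starts all bin probabilities at one, which is the prior the paper prescribes for each new lot run, while the type priors have to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Dict, List

ERROR_TYPES = ("TS", "PC", "MI")  # test system, probe card, miscellaneous


@dataclass
class TypeHypothesis:
    """Beliefs for one test error type: p(TE) and one probability per bin."""
    p_te: float
    p_bins: List[float]


def initial_global_hypothesis(type_priors: Dict[str, float],
                              n_bins: int) -> Dict[str, TypeHypothesis]:
    """Build the global hypothesis for a new lot run.

    `type_priors` holds the prior test error probability per error type
    (assumed to be estimated elsewhere); bin priors start at one.
    """
    return {
        t: TypeHypothesis(p_te=type_priors[t], p_bins=[1.0] * n_bins)
        for t in ERROR_TYPES
    }
```

The same structure can also hold a model hypothesis or a retest hypothesis, since the paper defines them with equivalent components.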
To infer these probabilities TED uses two kinds of information. First, it applies special
test error models to each wafer run n, resulting in model hypotheses $m^n$. Both the
separation into type specific hypotheses and the meaning of the single values
are equivalent to the global test error hypothesis. For example, $m^n_{TS}(TE)$ reflects the
certainty of the test error model that the run n is affected by a test error caused by the
test system.
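\[
\begin{aligned}
m^n &= (m^n_{TS}, m^n_{PC}, m^n_{MI}) \\
m^n_{TS} &= (m^n_{TS}(TE), m^n_{TS}(b_1), \ldots, m^n_{TS}(b_k)) \\
m^n_{PC} &= (m^n_{PC}(TE), m^n_{PC}(b_1), \ldots, m^n_{PC}(b_k)) \\
m^n_{MI} &= (m^n_{MI}(TE), m^n_{MI}(b_1), \ldots, m^n_{MI}(b_k))
\end{aligned}
\tag{2}
\]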
The second information source is regular and fail wafer retests. If wafer run n is a
retest, TED additionally derives a retest hypothesis
$r^n = (r^n_{TS} \mid r^n_{PC} \mid r^n_{MI})$ with equivalent
probabilities. In contrast to the other two types of hypotheses, we cannot necessarily
derive all test error type specific hypotheses from each retest. For example, from a retest
with the same test equipment we can only infer
$r^n_{MI} = (r^n_{MI}(TE), r^n_{MI}(b_1), \ldots, r^n_{MI}(b_k))$,
while a retest with a different probe card allows us to draw conclusions about $r^n_{PC}$ and
$r^n_{MI}$. In summary, to infer the global test error hypothesis after wafer run n, we use
the previous global hypothesis together with the current model hypothesis and, if
available, the retest hypothesis.
\[
h^{n-1} \times m^n \times r^n \rightarrow h^n
\tag{3}
\]
3.3 Agent Architecture
Figure 1 depicts the basic architecture of TED. As can be seen, TED consists of three
main parts: the test error models, the hypothesis module and the action inference module.
Figure 1: General agent architecture
During the processing of a lot run, they interact in the following way. First, each
wafer run is evaluated by the test error models, resulting in a corresponding model
hypothesis $m^n$. If it is a retest of a previous run, a proper retest hypothesis is generated
by the hypothesis module. In the next step, the hypothesis module updates the current
state based on the new information, especially the global test error hypothesis as
indicated in formula (3). The state at each time step n is defined by the following tuple.
\[
\begin{aligned}
z^n &= (M^n, R^n, h^n) \\
M^n &= (m^1, \ldots, m^n) \\
R^n &= (r^1, \ldots, r^n) \\
h^n &= (h^n_{TS}, h^n_{PC}, h^n_{MI})
\end{aligned}
\tag{4}
\]
Thereby, $M^n$ is the sequence of generated model hypotheses and $R^n$ the sequence of
inferred retest hypotheses up to this point. $r^i$ can be empty if run i has not been a
retest run.
Finally, the updated state is used by the action inference module to derive the next
action. The agent basically has the choice to continue the normal wafer run sequence
or order a retest. For a retest it additionally has to specify whether and how to change
the test equipment and which bins have to be retested. Section 3.6 describes in more
detail how the best action is derived from the current state.
3.4 Test Error Models
In general, test error models have to implement the mapping from wafer run test re-
sults to test error and bin error probabilities.
\[
\text{test results} \rightarrow (m(TE), m(b_i))
\tag{5}
\]
Thereby, it is possible to use different model types to generate different parts of the
hypothesis or to use a model ensemble to improve the overall prediction accuracy.
Naturally, the agent performance depends strongly on the performance of the test
error models used.
In our proposed variant, we use two different model types to generate $m^n_{type}(TE)$ and
the bin error probabilities $m^n_{type}(b_i)$. The first one is derived from a test error
classification model described in [3]. This approach includes a classification model ensemble
consisting of support vector machines, decision trees and Bayesian networks. The
ensemble is used to classify single wafer runs as correct or erroneous solely based on their
bin frequencies. Although SVMs and decision trees are not probabilistic models,
they can deliver confidences that can be interpreted as the probabilities $m^n_{type}(TE)$.
Altogether, we have to learn three model instances, one for each test error type.
To obtain the second part of the model hypothesis, the bin error probabilities $m^n_{type}(b_i)$,
we developed a new probabilistic model. Unfortunately, it is not yet published and we
can only give a brief overview due to space limitations. The approach is based on a
novel die probability model, which classifies die runs as correct or erroneous solely
based on their bins and the bins of adjacent dice on the wafer. To achieve this, the
underlying Bayesian network represents both production caused bin relations of adjacent
dice and test error influences on the die binning. Based on the die classification in
combination with the corresponding bins we can then estimate bin error probabilities
for the wafer run. Again, we have to learn three instances.
3.5 Hypothesis Module
The hypothesis module has two main tasks, first to map retest evidences to retest hy-
potheses and second to combine all given information to update the current state, es-
pecially the current global test error hypothesis.
If the current wafer run is a retest, the hypothesis module infers the test and bin error
probabilities by comparing it to the original wafer run. In general, the type of the retest
determines the conclusions we can draw from it. For example, retests with the same
test equipment only allow us to calculate $r_{MI}$, but not to draw conclusions about probe
card related test errors.
To calculate the concrete error probabilities we use the following equations.
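\[
\begin{aligned}
r(TE) &= \min\left(1, \frac{\text{run error rate}}{\epsilon}\right) \\
r(b_i) &= \frac{\text{retest bin frequ}(b_i)}{\text{orig bin frequ}(b_i)}
\end{aligned}
\tag{6}
\]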
Thereby, run error rate is the percentage of die runs on the wafer that changed their
bin classes in the retest and are therefore considered erroneous, and $\epsilon$ is the error
limit introduced in Section 2. The first equation reflects how close the original wafer run
is to being considered a test error with respect to $\epsilon$. Furthermore, retest bin frequ
refers to the count of the bin $b_i$ in the retest run and orig bin frequ to the corresponding
count in the original run. The second equation is an estimate of the real bin error
probability, as it represents the share of die runs with a specific bin that have been
erroneous in the original run.
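The two retest equations can be sketched as follows. The function names, the dictionary-based bin counts and the parameter `error_limit` are assumptions of this sketch. It caps the test error value at one so that the result remains a valid probability, and it leaves the decision of how the retest bin counts are obtained to the caller.

```python
from typing import Dict


def retest_test_error_prob(run_error_rate: float, error_limit: float) -> float:
    """r(TE): ratio of the observed run error rate to the error limit, capped at 1."""
    if error_limit <= 0:
        raise ValueError("error_limit must be positive")
    return min(1.0, run_error_rate / error_limit)


def retest_bin_error_probs(retest_bin_counts: Dict[int, int],
                           orig_bin_counts: Dict[int, int]) -> Dict[int, float]:
    """r(b_i): count of bin b_i in the retest divided by its count in the original run.

    Bins that did not occur in the original run are skipped to avoid a
    division by zero (a choice made for this sketch).
    """
    return {
        b: retest_bin_counts.get(b, 0) / n_orig
        for b, n_orig in orig_bin_counts.items()
        if n_orig > 0
    }
```

For example, a run error rate of 0.25 with an error limit of 0.5 yields a value of 0.5, while any rate at or above the limit yields 1.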
The primary task of the hypothesis module is to update the overall test error
hypothesis with the current model hypothesis and a potential retest hypothesis. To
combine the corresponding single probabilities from the different hypotheses we use the
following sequential Bayesian update rule.
\[
p(x \mid e_n, \ldots, e_1) = \frac{p(e_n \mid x)\, p(x \mid e_{n-1}, \ldots, e_1)}{p(e_n)}
\tag{7}
\]
In general, x refers to a hidden variable whose state we do not know, and
$e_n, \ldots, e_1$ is the sequence of evidence we have to estimate that state. In our case,
the hidden variables are the real test and bin error states of the lot run, represented by
the global hypothesis. From this point of view, $p_{TS}(TE)$ represents the probability that
the lot run is affected by a test system related test error. The available evidence in our
special domain consists of the probabilities delivered by the test error models and derived
from retests.
The Bayesian update rule generally implies two assumptions: first, that the state of the
hidden variable does not change over time and, second, that the pieces of evidence are
independent of each other. Although those strict assumptions do not completely hold in
the current domain, as in most real world domains, we use this established update rule
because in most cases it yields sufficiently good estimates. Nevertheless, there are more
complex alternatives like hidden Markov models, which also represent state changes, or the
Dempster-Shafer theory [4] for the combination of evidence.
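A minimal sketch of this sequential update for a binary hidden variable (test error present or not) is given below. The function names are hypothetical. As an assumption of the sketch, the normalization term is computed by the law of total probability, which requires the likelihood of the evidence under both states of the hidden variable.

```python
from typing import Iterable, Tuple


def bayes_update(prior: float, lik_error: float, lik_no_error: float) -> float:
    """One update step: posterior belief in the error given one piece of evidence.

    lik_error    -- p(e_n | error)
    lik_no_error -- p(e_n | no error)
    """
    evidence = lik_error * prior + lik_no_error * (1.0 - prior)
    if evidence == 0.0:
        return prior  # evidence impossible under both states: keep the belief
    return lik_error * prior / evidence


def sequential_update(prior: float,
                      likelihoods: Iterable[Tuple[float, float]]) -> float:
    """Apply the update rule to a sequence of evidence, one piece at a time.

    Each posterior becomes the prior of the next step.
    """
    belief = prior
    for lik_error, lik_no_error in likelihoods:
        belief = bayes_update(belief, lik_error, lik_no_error)
    return belief
```

Under the independence assumption stated above, the order in which the pieces of evidence are processed does not change the final belief, which makes the rule convenient for combining model and retest hypotheses as they arrive.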
In order to apply the Bayesian update rule in our agent, we first have to determine the
likelihood functions $p(e_n \mid x)$ and prior probabilities $p(x)$. As we do not know them
theoretically, we have to estimate them based on a given example set. Concretely, we
have to learn $p_{type}(m_{type}(TE) \mid TE)$, $p_{type}(TE)$, $p_{type}(m_{type}(b_i) \mid b_i)$
and $p_{type}(b_i)$ for all test error types and bins. Thereby, the prior distributions
$p_{type}(TE)$ are discrete and can, for example, be estimated by counting the corresponding
value frequencies in the example set. On the other hand, the prior bin test error
probabilities $p_{type}(b_i)$ are set to one for each new lot run. The reason is that those
error probabilities are used to determine the retest bins and we want all fail bins to be
retested at least once in the first retest of the lot run. Before we can determine the
likelihoods, we first have to discretize the model and retest probabilities. Then we can
estimate the conditional probabilities again by counting corresponding value occurrences
in the example set.
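The counting-based estimation can be sketched as follows. The equal-width discretization, the number of intervals and the additive smoothing are assumptions of this sketch and are not specified in the paper. The example set is assumed to consist of pairs of a model probability and a label stating whether the test error really existed.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def discretize(p: float, n_intervals: int = 5) -> int:
    """Map a probability in [0, 1] to the index of an equal-width interval."""
    return min(int(p * n_intervals), n_intervals - 1)


def estimate_prior(examples: Iterable[Tuple[float, bool]]) -> float:
    """Prior error probability: relative frequency of erroneous examples."""
    labels = [is_error for _, is_error in examples]
    return sum(labels) / len(labels)


def estimate_likelihoods(examples: Iterable[Tuple[float, bool]],
                         n_intervals: int = 5,
                         smoothing: float = 1.0) -> Dict[bool, List[float]]:
    """Estimate p(discretized model probability | error state) by counting.

    `smoothing` is an additive constant (assumed here) that keeps intervals
    without examples from receiving a likelihood of exactly zero.
    """
    counts = {True: Counter(), False: Counter()}
    for p, is_error in examples:
        counts[bool(is_error)][discretize(p, n_intervals)] += 1
    table = {}
    for state, counter in counts.items():
        total = sum(counter.values()) + smoothing * n_intervals
        table[state] = [
            (counter[i] + smoothing) / total for i in range(n_intervals)
        ]
    return table
```

The resulting table can be used to look up the two likelihoods needed for one Bayesian update step once a new model or retest probability has been discretized in the same way.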
Finally, we want to give a short example of the update process. Assume TED
perceives a retest run with the same test equipment. Then the hypothesis module would
execute two update steps. First, it would update the global test error hypotheses
$h_{TS}$, $h_{PC}$ and $h_{MI}$ with the corresponding model hypotheses $m^n_{TS}$, $m^n_{PC}$
and $m^n_{MI}$. More precisely, it would, among others, update the probe card related error
probability of bin 5 at time step n in the following way.
\[
p^n_{PC}(b_5 \mid m^n_{PC}(b_5)) =
\frac{p_{PC}(m^n_{PC}(b_5) \mid b_5)\, p^{n-1}_{PC}(b_5)}{p_{PC}(m^n_{PC}(b_5))}
\tag{8}
\]
Thereby, $p^n_{PC}(b_5 \mid m^n_{PC}(b_5))$ is the updated probability that the lot run
contains a probe card related test error for bin 5, given the corresponding model
estimate $m^n_{PC}(b_5)$. $p_{PC}(m^n_{PC}(b_5) \mid b_5)$ is the likelihood that the model
would deliver this probability if this test error really existed, and $p^{n-1}_{PC}(b_5)$ is
the prior belief of the test agent that this test error exists. Finally, the denominator
$p_{PC}(m^n_{PC}(b_5))$ represents the general probability that the test error
model delivers this probability for this bin error.
In the second step the hypothesis module would equally update $h_{MI}$ with the retest
hypothesis $r^n_{MI}$. Apart from this, a retest with the same test equipment does not allow
any conclusions about the equipment related test error probabilities.
3.6 Action Inference Module
The purpose of the action inference module is to map the current state to a possible
action. As already mentioned, that can be a regular first wafer test or a fail wafer retest
with specified bins and possible equipment changes.
\[
\forall\, type: \quad p_{type}(TE) < \theta_{type}
\tag{9}
\]
The lot specific goal of TED can partly be formalized by formula (9). It means that the
agent aims to reach a state where each wafer has been tested at least once and the global
test error probability for each type is lower than a defined limit $\theta_{type}$.
A very simplified scheme of TED's action policy is shown in Figure 2. As long as
all global test error probabilities are below the corresponding limits, TED continues
with the normal wafer run sequence. Only if they exceed the limits does the agent react.
Thereby, we distinguish between equipment related and miscellaneous test errors. For
supposed equipment related test errors the agent has to stop the lot run to limit the
number of possibly affected wafer runs. Depending on the test error type, the test is
then continued with another probe card or on another test system. The two remaining
questions are which wafer should be retested first in order to effectively check the test
error hypothesis and with which bins. Thereby, TED first retests the wafer run with the
highest corresponding error probability, as it provides the most information. That is
reasonable because if the retest shows no improvement, it is unlikely that the
other supposedly affected wafer runs would improve. Consequently, by updating the
global hypothesis with the retest evidence, the test error probability would decrease
significantly and the agent would simply return to the normal wafer run sequence.
Otherwise, in case of a validated test error, the agent would add all wafer runs with
$m^n_{type}(TE) > \theta_{type}$, where $\theta_{type}$ is the type specific limit, as
additional retests before returning to the normal sequence.
Figure 2: Simplified action policy
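The simplified policy can be sketched as a small decision function. All names, the returned dictionary layout and the order in which the equipment related types are checked (test system before probe card) are assumptions of this sketch. Wafer runs are indexed from zero, and at least one wafer run per type is assumed to have been evaluated.

```python
from typing import Dict, List

# Equipment change per equipment related error type, following the text.
EQUIPMENT_CHANGE = {"TS": "another test system", "PC": "another probe card"}


def most_suspicious_run(model_te: List[float]) -> int:
    """Index of the wafer run with the highest model error probability."""
    return max(range(len(model_te)), key=lambda i: model_te[i])


def choose_action(p_te: Dict[str, float],
                  limits: Dict[str, float],
                  model_te: Dict[str, List[float]]) -> Dict[str, object]:
    """Map the global test error probabilities to the next action.

    p_te     -- global test error probability per error type
    limits   -- type specific limit per error type
    model_te -- model test error probabilities of all wafer runs so far, per type
    """
    # Equipment related test errors interrupt the lot run immediately.
    for error_type in ("TS", "PC"):
        if p_te[error_type] >= limits[error_type]:
            return {
                "action": "interrupt_and_retest",
                "error_type": error_type,
                "equipment": EQUIPMENT_CHANGE[error_type],
                "wafer_run": most_suspicious_run(model_te[error_type]),
            }
    # Miscellaneous test errors are retested after the normal sequence.
    if p_te["MI"] >= limits["MI"]:
        return {
            "action": "retest_after_sequence",
            "error_type": "MI",
            "equipment": "same equipment",
            "wafer_run": most_suspicious_run(model_te["MI"]),
        }
    return {"action": "continue"}
```

The sketch covers only the choice of the first retest. Adding the remaining suspicious wafer runs after a validated test error would follow as a second step once the retest evidence has been used to update the global hypothesis.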
The selection of the corresponding retest bins is based on the current global bin error
probabilities. For a fail wafer retest at time step n, TED selects all bins whose error
probability reaches the corresponding limit, i.e. $p^n_{type}(b_i) \geq \theta_{type}$, for
retests. To ensure that for the first retest of a lot run all fail bins are retested, it is
necessary to set their prior error probabilities $p^0_{type}(b_i)$ to one. Subsequently,
continuous updates of these probabilities by model results and especially retest evidence
will lead to selective retests of only those bins with a high error probability.
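The bin selection rule can be illustrated with a short sketch. The function name and the convention of numbering bins from one are assumptions made for this illustration.

```python
from typing import List


def select_retest_bins(p_bins: List[float], limit: float) -> List[int]:
    """Return the numbers (starting at 1) of all bins whose current error
    probability is at least `limit`."""
    return [i for i, p in enumerate(p_bins, start=1) if p >= limit]
```

With all priors set to one, every fail bin is selected for the first retest. After updates have lowered some of the probabilities, for example to `[1.0, 0.1, 0.6]` with a limit of 0.5, only bins 1 and 3 remain selected.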
MI test errors are handled differently from equipment test errors. For them it makes no
sense to interrupt the normal test sequence for an immediate retest of the supposedly
affected wafer run as the test equipment stays the same. Therefore, the retest can like-
wise be added after the normal test sequence is finished. For MI test errors it is equally
rational to retest first the wafer run with the highest corresponding error probability.
Also, continuously refined bin error probabilities lead to more selective and therefore
effective retests.
4 Related Work
A statistical approach for online test error detection can be found in [5]. It describes the
application of statistical process control methods to the test process. Based on batches
of regularly executed retests, control charts are proposed, which represent the
average difference between tests and retests of the selected parameters, i.e., the random
measurement error or repeatability. The authors assumed from experience that a
"significant change in the systematic error always correlates with a change in the random
error". Therefore, an out of control situation of these control charts should give an
early hint at underlying systematic measurement errors.
In [6] a similar statistical approach is used to handle parallel measurement systems.
The authors developed a linear statistical model for this kind of systems, including the
process-variability, the tester-variability and a random error for each tester. Based on
batches of retests, similar to [5], they use common analysis of variance techniques to get
both the variance contribution of each single measurement instrument to the overall
variance and their variation over time. Control charts based on these criteria are used
to decide if variations in the resulting signals are due to process or test variations.
Therefore, they are able to identify faulty measurement instruments.
As both proposed methods are of statistical nature, they suffer from the same disad-
vantage, namely that a potential test system fault is found at the earliest when the
affected batch is completed. Besides, both methods use only univariate control charts,
which are less appropriate to find complex test error patterns than machine learning
methods, naturally handling multivariate feature spaces.
5 Conclusion
In this paper we proposed a new architecture for a test error agent that should be able
to effectively detect, diagnose and correct test errors. It incorporates learned knowl-
edge in the form of supervised machine learning models and local retest evidence to
build reliable hypotheses about wafer run and bin specific test errors. Subsequently,
those hypotheses are used to make rational decisions about when, how and what to
retest in order to find a maximal amount of test errors with minimal additional test
costs. Both experimental studies and the application in a real test process will have
to show the performance of this agent approach. Thereby, the machine learning models,
the belief update mechanism and the action inference approach are crucial for the
agent performance and consequently have to be examined in future studies.
References
[1] I. A. Grout, Integrated Circuit Test Engineering: Modern Techniques, 1st ed. Springer,
2005.
[2] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 1st ed. Prentice
Hall, Jan. 1995.
[3] M. Kirmse, U. Petersohn, and E. Paffrath, “Application of machine learning meth-
ods to online test error detection in semiconductor test,” in Proceedings of World
Academy of Science, Engineering and Technology, vol. 69, Amsterdam, 2010, pp. 383–
389.
[4] G. Shafer, A Mathematical Theory of Evidence. Princeton, NJ: Princeton University
Press, 1976, vol. 1.
[5] J. van der Peet and G. van Boxem, “SPC on the IC-Production test process,” Test
Conference, International, vol. 0, p. 605, 1996.
[6] H. Shu-guang, Q. Er-shi, and L. Li, “Study on the model of analysis and control
of parallel measurement systems,” in Management Science and Engineering, 2007.
ICMSE 2007. International Conference on, 2007, pp. 633–638.
