One very undesirable effect that quite often occurs when training neural network classifiers is that the same classifier does not attain the same performance when trained several times, even when (pretty much) the same training data is used. The larger the variation in performance, the less stable (or robust) the classifier appears to be. The stability of a (neural network) classifier can therefore be defined as the probability that the classification result changes as a result of some small disturbance of the classifier. Such a disturbance can be due to a slightly different training set or to a different initial weight setting. In general, the stability of a neural network classifier depends on:

• The structure/architecture (number of hidden neurons, number of hidden layers, etc.);
• The training scheme (including the initial weight setting and the order in which training patterns are presented each epoch, but also the learning rate, momentum and number of training epochs);
• The size and composition of the training set.

Our primary aim is to derive a method (or methods) that can tell us what kind of stability problems a given classifier might have on a given data set. We are especially interested in whether such methods can also work when only a limited amount of data is available, since we assume that both the size of the training set and the size of the test set have their (negative) impact on any possible test outcome. Next, we would of course like to know whether (and if so, how) any stability problems that arise can be avoided.

In general, the stability of a neural network classifier can be estimated by training a number of classifiers and determining, for instance, the standard deviation of the performances attained on a test set; the higher the standard deviation, the more unstable the classifier appears to be. However, not every set that can be used for testing gives the same result.
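The estimation procedure above can be sketched as follows. This is a minimal illustration only, using a small logistic-regression classifier on synthetic two-class data rather than the actual networks and data sets discussed here; the seed varies only the initial weight setting, one of the disturbances listed above.

```python
import numpy as np

def train_classifier(X, y, seed, epochs=200, lr=0.1):
    """Train a single-layer (logistic-regression) classifier with
    gradient descent; the seed controls the initial weight setting."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(class 1)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    return float(np.mean(((X @ w + b) > 0).astype(int) == y))

# Synthetic two-class data (illustrative placeholder only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
X_test = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y_test = np.array([0] * 100 + [1] * 100)

# Train several classifiers that differ only in their initial weights,
# then take the standard deviation of the attained test performances:
# a high standard deviation suggests an unstable classifier.
accs = [accuracy(*train_classifier(X, y, seed=s), X_test, y_test)
        for s in range(10)]
print(np.mean(accs), np.std(accs))
```

In practice the disturbance would also include resampling the training set (e.g. bootstrap replicates) and reshuffling the presentation order, not just the initial weights.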
If the test set is too small, or if (part of) the test set is also used to train the classifier, the results are not representative of the classifier's 'real' performance or 'real' stability. As a result, if the amount of available data is too limited, it is simply impossible to tell whether results that indicate an unstable classifier are due to the limited size of the training set or to the limited size of the test set. Increasing the amount of data has a positive effect on the estimated stability: using more data for training (usually) actually increases the classifier's stability, whereas testing with more data, of course, only gives a more 'stable' estimate. In addition, some amount of instability can be due to the choice of the classifier's initial settings, but as it is usually infeasible to choose the best settings beforehand, there is not much that can be done about this.

Even if we have used enough data and our classifier looks fairly stable, we may still have a stability problem: our classifier might end up working on data with class probabilities that differ from those in our training and test sets. If the classifier happens to be sensitive to variations in the class ratios, it will most likely perform (far) below the performance we estimated for it. We have designed a procedure to check in advance whether this might lead to unexpected problems. We have performed experiments on some artificially generated data sets; according to our results, this problem does not occur if there is no overlap between the classes in our data, and it appeared to cause far more problems for neural network classifiers with more than one hidden layer. Besides choosing a proper network architecture, however, there seems to be not much that can be done about serious class-sensitivity problems, other than making sure that the class ratios in the training and test sets match the 'real world' class ratios as well as possible.
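A simple version of such a class-ratio check can be sketched as follows. The helper names are hypothetical, not the procedure used in our experiments: given any trained classifier (here a trivial threshold rule on synthetic 1-D data) and a labelled test set, we resample the test set at several class ratios and look at the spread of the resulting accuracies.

```python
import numpy as np

def resample_to_ratio(X, y, ratio, n, rng):
    """Draw a test set of size n in which a fraction `ratio` of the
    patterns belongs to class 1 (sampling with replacement)."""
    n1 = int(round(ratio * n))
    idx0 = rng.choice(np.flatnonzero(y == 0), size=n - n1, replace=True)
    idx1 = rng.choice(np.flatnonzero(y == 1), size=n1, replace=True)
    idx = np.concatenate([idx0, idx1])
    return X[idx], y[idx]

def class_ratio_sensitivity(predict, X, y,
                            ratios=(0.1, 0.3, 0.5, 0.7, 0.9),
                            n=500, seed=0):
    """Accuracy per class ratio; a large spread over the ratios indicates
    a classifier that is sensitive to deviations from the class ratios
    it was trained and tested with."""
    rng = np.random.default_rng(seed)
    accs = {}
    for r in ratios:
        Xr, yr = resample_to_ratio(X, y, r, n, rng)
        accs[r] = float(np.mean(predict(Xr) == yr))
    return accs

# Example with a trivial threshold classifier on synthetic 1-D data
# (two overlapping Gaussian classes; illustrative placeholder only).
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-1, 1, 300), rng.normal(1, 1, 300)])[:, None]
y = np.array([0] * 300 + [1] * 300)
predict = lambda X: (X[:, 0] > 0).astype(int)
sens = class_ratio_sensitivity(predict, X, y)
print(sens)
```

Note that the check only diagnoses the sensitivity; as stated above, apart from the choice of architecture, the main remedy is to match the training- and test-set class ratios to the ratios expected in operation.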