Estimation of error rate for linear discriminant functions by resampling: Non-Gaussian populations
This article presents simulation results comparing various resampling estimators of classification error rate for linear-discriminant-type classification algorithms. Three non-Gaussian multivariate populations are studied, namely exponential, Cauchy, and uniform. Simulations are conducted for small sample sizes, two-class and three-class problems, and 2-D, 3-D, and 5-D distributions. Estimation procedures and sample sizes are the same as in our previous study of Gaussian populations; again, 200 bootstrap replications are used for each simulation trial. For exponential and uniform distributions the 0.632 estimator generally performs best. However, for Cauchy distributions the convex bootstrap and the e0 estimator often outperform the 0.632 estimator.
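The 0.632 estimator compared above blends the optimistic resubstitution (apparent) error with the pessimistic out-of-bag bootstrap error e0: err_.632 = 0.368·err_app + 0.632·e0. The following is a minimal NumPy sketch under stated assumptions; the pooled-covariance discriminant and all function names (`lda_fit`, `lda_predict`, `err632`) are illustrative, not code from the paper.

```python
import numpy as np

def lda_fit(X, y):
    # Pooled-covariance linear discriminant: per-class means plus a shared
    # within-class covariance (divided by n - k degrees of freedom).
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    S = sum(np.cov(X[y == c].T) * (np.sum(y == c) - 1) for c in classes)
    Sinv = np.linalg.pinv(S / (len(y) - len(classes)))
    return classes, means, Sinv

def lda_predict(model, X):
    classes, means, Sinv = model
    # Linear scores m' S^-1 x - 0.5 m' S^-1 m, assuming equal class priors.
    scores = X @ Sinv @ means.T - 0.5 * np.einsum("ij,jk,ik->i", means, Sinv, means)
    return classes[np.argmax(scores, axis=1)]

def err632(X, y, B=200, seed=None):
    # 0.632 estimator: 0.368 * apparent error + 0.632 * out-of-bag (e0) error.
    rng = np.random.default_rng(seed)
    n = len(y)
    classes = np.unique(y)
    app = np.mean(lda_predict(lda_fit(X, y), X) != y)  # resubstitution error
    wrong, total = 0, 0
    for _ in range(B):
        idx = rng.integers(0, n, n)             # bootstrap sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # points left out of this replicate
        if oob.size == 0 or min(np.sum(y[idx] == c) for c in classes) < 2:
            continue                            # need >= 2 per class for covariance
        model = lda_fit(X[idx], y[idx])
        wrong += np.sum(lda_predict(model, X[oob]) != y[oob])
        total += oob.size
    e0 = wrong / total                          # the "e0" bootstrap error estimate
    return 0.368 * app + 0.632 * e0
```

For heavy-tailed populations such as the Cauchy, the pooled covariance itself becomes unstable, which is one intuition for why e0-style estimators can beat the 0.632 weighting there.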
Sampling theory methodology applicable to data validation studies
In data validation studies, surveys are conducted to obtain information about the data collection process and the uses of the data. In many cases standard sampling techniques can be used. Two methods, stratified random sampling and cluster sampling, were used for surveys in the Form 4 data validation study. Form 4 is a data collection system on monthly generation and consumption of fuels by electric power plants. A description of those applications is given. Sometimes time and cost constraints make more sophisticated controlled sampling approaches necessary. One such approach, using balanced incomplete block designs, is described; an appendix surveys the existence results for these designs. Sequential methods, which may prove to be more cost-effective, are discussed, as are sequential approaches to the problem of determining the size of a population. Problems requiring further research are also discussed. Some preliminary results on the problem of stratification with respect to more than one variable are included; these results were obtained for the Form 4 respondent population. The Form 4 study indicated that standard statistical sampling methods could be useful in data validation surveys. For example, at least 30 percent of the respondents do not report net generation as the instructions define it, and only 25 percent of the state regulatory agencies use the Form 4 data. Such inferences were possible only because statistical sampling procedures were used. 3 tables.
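A standard design decision in stratified random sampling of the kind described above is how to split a fixed total sample size across strata. One common rule (Neyman allocation, proportional to stratum size times stratum standard deviation) can be sketched as follows; the function name and the largest-remainder rounding are illustrative choices, not from the study.

```python
import numpy as np

def neyman_allocation(N, S, n):
    # N: stratum population sizes, S: stratum standard deviations,
    # n: total sample size. Allocation n_h proportional to N_h * S_h.
    w = np.asarray(N, float) * np.asarray(S, float)
    exact = n * w / w.sum()
    alloc = np.floor(exact).astype(int)
    # Hand remaining units to the strata with the largest fractional parts
    # so the integer allocations sum exactly to n.
    for i in np.argsort(exact - alloc)[::-1][: n - alloc.sum()]:
        alloc[i] += 1
    return alloc
```

Variable strata (e.g. large plants with volatile fuel consumption) thus receive proportionally more of the survey budget than small, homogeneous ones.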
Influence function and its application to data validation
Hampel's influence function has been used by Devlin, Gnanadesikan, and Kettenring to detect bivariate observations which have unusual influence on estimates of correlation. In the validation of energy data systems such observations may sometimes be considered to be outliers, and their identification may be valuable for the detection of errors in a data base. When data are used in regression equations, those observations which have the greatest effect on the multiple correlation coefficient or the regression coefficients are of interest. The contours of constant influence are derived for the multiple correlation coefficient in the case of regressing two variables on a third. In some problems the analytic form of the influence function may be difficult to derive; in such cases the empirical estimator of the influence function, as proposed by Mallows, may be useful for detecting outliers. For FPC Form 4 power plant data, the correlation between generation and consumption is a parameter of interest to users of the data. Estimates of the contours of constant influence were determined and used to detect outliers with respect to bivariate correlation.
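An empirical influence estimate of the Mallows type can be approximated by a leave-one-out computation: the scaled change in the correlation coefficient when each observation is deleted. A minimal sketch, assuming Pearson correlation as the target statistic (the function name `corr_influence` is hypothetical, not from the paper):

```python
import numpy as np

def corr_influence(x, y):
    # Empirical (leave-one-out) influence of each observation on Pearson r:
    # infl_i = (n - 1) * (r_full - r_without_i).
    n = len(x)
    r_full = np.corrcoef(x, y)[0, 1]
    infl = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        infl[i] = (n - 1) * (r_full - np.corrcoef(x[mask], y[mask])[0, 1])
    return infl
```

A generation/consumption pair with an unusually large |infl_i| is a candidate outlier of exactly the kind the contours of constant influence are meant to flag.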
Technique for detecting a small magnitude loss of special nuclear material
The detection of losses of special nuclear materials has been the subject of much research in recent years. The standard industry practice using ID/LEID will detect large-magnitude losses. Time series techniques such as the Kalman filter or CUSUM methods will detect small-magnitude losses if they occur regularly over a sustained period of time. To date no technique has been proposed which adequately addresses the problem of detecting a small-magnitude loss occurring in a single period. This paper proposes a method for detecting such a loss. The approach makes use of Hampel's influence function, which measures the effect of a single inventory difference on a group of statistics. An inventory difference for a period in which a loss occurs can be expected to produce an abnormality in the calculated statistics, and this abnormality is measurable by the influence function. It is shown that a one-period loss smaller in magnitude than the LEID can be detected using this approach.
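To give a feel for the idea, the simplest such statistic is the sample mean of the inventory differences, whose influence function at observation x_i is just x_i − x̄. The sketch below flags periods whose standardized influence on the mean is abnormally large; it is a toy illustration of the influence-function approach, not the paper's method, and the function name and threshold are assumptions.

```python
import numpy as np

def flag_single_period_loss(id_series, z=2.5):
    # Influence of each inventory difference on the sample mean:
    # IF_mean(x_i) = x_i - xbar. Standardize it and flag periods whose
    # influence exceeds z sample standard deviations.
    x = np.asarray(id_series, float)
    score = (x - x.mean()) / x.std(ddof=1)
    return np.flatnonzero(np.abs(score) > z)
```

In practice the paper applies the influence function to a group of statistics rather than the mean alone, which is what lets a one-period loss below the LEID stand out.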