5 research outputs found
Block-regularized 5×2 Cross-validated McNemar's Test for Comparing Two Classification Algorithms
In the task of comparing two classification algorithms, the widely-used
McNemar's test aims to infer the presence of a significant difference between
the error rates of the two classification algorithms. However, the power of the
conventional McNemar's test is usually low because the hold-out (HO)
method in the test uses only a single train-validation split, which usually
produces a high-variance estimate of the error rates. In contrast, a
cross-validation (CV) method repeats the HO method multiple times and
produces a stable estimate. Therefore, a CV method can substantially
improve the power of McNemar's test. Among all types of CV methods, a
block-regularized 5×2 CV (BCV) has been shown in many previous studies
to be superior to the other CV methods in the comparison task of algorithms
because the 5×2 BCV can produce a high-quality estimator of the error
rate by regularizing the numbers of overlapping records between all training
sets. In this study, we compress the 10 correlated contingency tables in the
5×2 BCV to form an effective contingency table. Then, we define a
5×2 BCV McNemar's test on the basis of the effective contingency table.
We demonstrate the reasonable type I error and the promising power of the
proposed 5×2 BCV McNemar's test on multiple simulated and real-world
data sets.
Comment: 12 pages, 6 figures, and 5 tables
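For orientation, the conventional hold-out McNemar's test that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration, not the 5×2 BCV variant proposed in the paper: the abstract does not say how the 10 contingency tables are compressed, so that step is left out. The function name `mcnemar_test` and the use of the continuity-corrected statistic are assumptions made for this sketch.

```python
import math

def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar's test on one validation set.

    Hypothetical helper for illustration. Returns (statistic, p_value).
    """
    # b: algorithm A correct, algorithm B wrong
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    # c: algorithm A wrong, algorithm B correct
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        # No discordant records: no evidence of a difference
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

Only the discordant records (where exactly one algorithm is correct) enter the statistic, which is why a single small validation split can leave the test with little power.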
Estimating the Confidence Interval of Expected Performance Curve in Biometric Authentication Using Joint Bootstrap
Evaluating biometric authentication performance is a complex task because the performance depends on the size and composition of the user set and on the choice of samples. We propose to reduce the dependency of performance on these three factors by deriving appropriate confidence intervals. In this study, we focus on deriving a confidence region based on the recently proposed Expected Performance Curve (EPC). An EPC differs from the conventional DET or ROC curve because it assumes that the test class-conditional (client and impostor) score distributions are unknown, and this includes the choice of the decision threshold for various operating points. Instead, an EPC selects thresholds based on the training set and applies them to the test set. The proposed technique is useful, for example, to quote realistic upper and lower bounds of the decision cost function used in the NIST annual speaker evaluation. Our findings, based on the 24 systems submitted to the NIST2005 evaluation, show that the confidence region obtained from our proposed algorithm can correctly predict the performance on an unseen database with twice as many users, with an average coverage of 95% (over all 24 systems). Coverage is the proportion of the unseen EPC covered by the derived confidence interval.
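The abstract does not describe the joint bootstrap itself, so the following is only a rough sketch of the general idea of a bootstrap confidence interval that accounts for the user set: users are resampled with replacement and an error measure is recomputed at a fixed threshold. The half total error rate, the percentile interval, and the names `hter` and `bootstrap_ci` are all assumptions for illustration, not the authors' algorithm.

```python
import random

def hter(users, threshold):
    """Half total error rate at a fixed threshold.

    users: list of (client_scores, impostor_scores) tuples, one per user.
    """
    fr = fa = n_client = n_imp = 0
    for client, impostor in users:
        fr += sum(s < threshold for s in client)      # false rejections
        fa += sum(s >= threshold for s in impostor)   # false acceptances
        n_client += len(client)
        n_imp += len(impostor)
    frr = fr / n_client if n_client else 0.0
    far = fa / n_imp if n_imp else 0.0
    return 0.5 * (far + frr)

def bootstrap_ci(users, threshold, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling whole users (assumed scheme)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(users) for _ in users]
        stats.append(hter(sample, threshold))
    stats.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot))
    return stats[lo_idx], stats[hi_idx]
```

In keeping with the EPC idea, `threshold` would be chosen on the training set beforehand and then held fixed while the test users are resampled.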
A Multitask Learning Approach to Document Representation using Unlabeled Data
Text categorization is intrinsically a supervised learning task, which aims at relating a given text document to one or more predefined categories. Unfortunately, labeling such databases of documents is a painful task. We present in this paper a method that takes advantage of the huge amounts of unlabeled text documents available in digital format to counterbalance the relatively small amount of labeled text documents available. A Siamese MLP is trained in a multi-task framework in order to solve two concurrent tasks: using the unlabeled data, we search for a mapping from the documents' bag-of-words representation to a new feature space emphasizing similarities and dissimilarities among documents; simultaneously, this mapping is constrained to also give good text categorization performance over the labeled dataset. Experimental results on Reuters RCV1 suggest that, as expected, performance over the labeled task increases as the amount of unlabeled data increases.
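One way to picture the two concurrent tasks is as a weighted sum of a pairwise similarity loss on unlabeled documents and a classification loss on labeled ones, both computed through the same shared mapping. The sketch below is a toy illustration under several assumptions that the abstract does not confirm: a single tanh layer as the shared mapping, a contrastive loss with a margin, a binary logistic loss, and the names `embed`, `contrastive_loss`, `logistic_loss`, and `multitask_loss`.

```python
import math

def embed(x, W):
    """Shared mapping from a bag-of-words vector to the new feature space."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def contrastive_loss(e1, e2, similar, margin=1.0):
    """Pull similar pairs together, push dissimilar pairs past the margin."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))
    return d ** 2 if similar else max(0.0, margin - d) ** 2

def logistic_loss(e, v, label):
    """Binary cross-entropy of a linear classifier on the embedding."""
    z = sum(vi * ei for vi, ei in zip(v, e))
    p = 1.0 / (1.0 + math.exp(-z))
    return -math.log(p) if label == 1 else -math.log(1.0 - p)

def multitask_loss(pair, labeled, W, v, weight=0.5):
    """Weighted sum of the unlabeled-pair loss and the labeled loss."""
    x1, x2, similar = pair      # unlabeled pair with an assumed similarity flag
    x, y = labeled              # labeled document and its binary category
    unsup = contrastive_loss(embed(x1, W), embed(x2, W), similar)
    sup = logistic_loss(embed(x, W), v, y)
    return weight * unsup + (1 - weight) * sup
```

Because both terms depend on the same weights `W`, minimizing the sum constrains the mapping learned from unlabeled pairs to remain useful for categorization.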
Benchmarking non-parametric statistical tests
Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests under various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical tests, varying the class imbalance, the compared models, the performance measure, and the sample size. The main result is that, provided the evaluation sets are big enough, non-parametric tests are relatively reliable in all conditions.
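As an example of a non-parametric test for a measure other than the classification error, the sketch below applies a paired percentile bootstrap to the difference in F1 between two models. The abstract does not name the tests that were benchmarked, so the choice of F1, the percentile decision rule, and the names `f1` and `paired_bootstrap_f1` are assumptions for illustration.

```python
import random

def f1(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def paired_bootstrap_f1(y_true, pred_a, pred_b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for F1(A) - F1(B) on the same examples."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # Resample example indices so both models see the same replicate
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        fa = f1(yt, [pred_a[i] for i in idx])
        fb = f1(yt, [pred_b[i] for i in idx])
        diffs.append(fa - fb)
    diffs.sort()
    lo = diffs[int((1 - level) / 2 * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    # Declare a difference only if the interval excludes zero
    return lo, hi, (lo > 0 or hi < 0)
```

With small evaluation sets, many replicates contain few positive examples and the interval becomes wide, which is consistent with the abstract's point that reliability depends on the evaluation set being big enough.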