Identifying Mislabeled Training Data
This paper presents a new approach to identifying and eliminating mislabeled
training instances for supervised learning. The goal of this approach is to
improve classification accuracies produced by learning algorithms by improving
the quality of the training data. Our approach uses a set of learning
algorithms to create classifiers that serve as noise filters for the training
data. We evaluate single-algorithm, majority-vote, and consensus filters on five
datasets that are prone to labeling errors. Our experiments illustrate that
filtering significantly improves classification accuracy for noise levels up to
30 percent. An analytical and empirical evaluation of the precision of our
approach shows that consensus filters are conservative at throwing away good
data at the expense of retaining bad data and that majority filters are better
at detecting bad data at the expense of throwing away good data. This suggests
that for situations in which there is a paucity of data, consensus filters are
preferable, whereas majority vote filters are preferable for situations with an
abundance of data.
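The majority-vote and consensus filters described above can be sketched with cross-validated predictions from an ensemble of classifiers. This is an illustrative reconstruction, not the paper's exact setup: the three scikit-learn classifiers, the 5-fold scheme, and the function name `filter_noise` are all assumptions.

```python
# Hedged sketch of ensemble-based noise filtering: an instance is flagged
# as mislabeled when the filters' cross-validated predictions disagree
# with its label (majority: more than half disagree; consensus: all do).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def filter_noise(X, y, mode="majority"):
    """Return a boolean mask of instances to KEEP after noise filtering."""
    learners = [DecisionTreeClassifier(random_state=0),
                GaussianNB(),
                KNeighborsClassifier()]
    disagreements = np.zeros(len(y), dtype=int)
    for clf in learners:
        preds = cross_val_predict(clf, X, y, cv=5)
        disagreements += (preds != y)
    if mode == "majority":                       # over half the filters object
        flagged = disagreements > len(learners) / 2
    else:                                        # consensus: every filter objects
        flagged = disagreements == len(learners)
    return ~flagged

X, y = make_classification(n_samples=300, flip_y=0.1, random_state=0)
keep = filter_noise(X, y, mode="consensus")
X_clean, y_clean = X[keep], y[keep]
```

By construction the consensus filter discards a subset of what the majority filter discards, which matches the abstract's observation that consensus filtering is the more conservative of the two.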
Binary Hypothesis Testing Game with Training Data
We introduce a game-theoretic framework to study the hypothesis testing
problem, in the presence of an adversary aiming at preventing a correct
decision. Specifically, the paper considers a scenario in which an analyst has
to decide whether a test sequence has been drawn according to a probability
mass function (pmf) P_X or not. In turn, the goal of the adversary is to take a
sequence generated according to a different pmf and modify it in such a way to
induce a decision error. P_X is known only through one or more training
sequences. We derive the asymptotic equilibrium of the game under the
assumption that the analyst relies only on first order statistics of the test
sequence, and compute the asymptotic payoff of the game when the length of the
test sequence tends to infinity. We introduce the concept of
indistinguishability region, as the set of pmf's that cannot be distinguished
reliably from P_X in the presence of attacks. Two different scenarios are
considered: in the first, the analyst and the adversary share the same
training sequence; in the second, they rely on independent sequences.
The obtained results are compared to a version of the game in which the pmf P_X
is perfectly known to the analyst and the adversary.
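A decision rule that relies only on first-order statistics of the test sequence can be sketched as follows. This is a minimal illustration, not the paper's game-theoretic equilibrium: the KL-divergence acceptance test, the threshold `lam`, and the uniform source pmf are all assumptions made for the example.

```python
# Hedged sketch: accept H0 (test sequence drawn from P_X) when the
# KL divergence between the test sequence's empirical pmf and the
# training-based estimate of P_X is below a threshold.
import numpy as np

def empirical_pmf(seq, alphabet_size):
    counts = np.bincount(seq, minlength=alphabet_size)
    return counts / counts.sum()

def kl(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def accept_h0(test_seq, train_seq, alphabet_size, lam=0.05):
    p_hat = empirical_pmf(train_seq, alphabet_size)  # P_X known only via training
    q_hat = empirical_pmf(test_seq, alphabet_size)   # first-order test statistics
    return kl(q_hat, p_hat) < lam

rng = np.random.default_rng(0)
train = rng.integers(0, 4, size=10_000)                       # samples of P_X
same = rng.integers(0, 4, size=10_000)                        # same source
skewed = rng.choice(4, size=10_000, p=[0.7, 0.1, 0.1, 0.1])   # different pmf
print(accept_h0(same, train, 4), accept_h0(skewed, train, 4))
```

An adversary in the paper's setting would modify the `skewed` sequence to push its empirical pmf inside the acceptance region; pmf's for which this is always possible form the indistinguishability region.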
Eliminating Redundant Training Data Using Unsupervised Clustering Techniques
Training data for supervised learning neural networks can be clustered such that the input/output pairs in each cluster are redundant. Redundant training data can adversely affect training time. In this paper we apply two clustering algorithms, ART2-A and the Generalized Equality Classifier, to identify training-data clusters and thus reduce the training data and the training time. The approach is demonstrated for a high-dimensional nonlinear continuous-time mapping. The demonstration shows a six-fold decrease in training time at little or no loss of accuracy in the handling of evaluation data.
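The general idea of removing redundant input/output pairs by clustering can be sketched as below. Note the substitution: k-means stands in for the ART2-A and Generalized Equality Classifier algorithms the paper actually uses, and the cluster count and representative-selection rule are assumptions for illustration.

```python
# Hedged sketch: cluster the training inputs, then keep one representative
# pair per cluster (the member nearest the cluster centre), shrinking the
# training set while preserving its coverage of the input space.
import numpy as np
from sklearn.cluster import KMeans

def reduce_training_set(X, y, n_clusters=50, seed=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    keep = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        keep.append(members[np.argmin(dists)])   # nearest member represents cluster
    keep = np.sort(np.array(keep))
    return X[keep], y[keep]

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))                   # high-dimensional inputs
y = np.sin(X).sum(axis=1)                        # continuous nonlinear mapping
X_red, y_red = reduce_training_set(X, y)         # 1000 pairs -> 50 pairs
```

Training on `X_red, y_red` instead of the full set is what yields the reported speed-up; the accuracy trade-off depends on how well the clusters capture genuinely redundant pairs.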
Automating Coreference: The Role of Annotated Training Data
We report here on a study of interannotator agreement in the coreference task
as defined by the Message Understanding Conference (MUC-6 and MUC-7). Based on
feedback from annotators, we clarified and simplified the annotation
specification. We then performed an analysis of disagreement among several
annotators, concluding that only 16% of the disagreements represented genuine
disagreement about coreference; the remainder of the cases were mostly
typographical errors or omissions, easily reconciled. Initially, we measured
interannotator agreement in the low 80s for precision and recall. To try to
improve upon this, we ran several experiments. In our final experiment, we
separated the tagging of candidate noun phrases from the linking of actual
coreferring expressions. This method shows promise - interannotator agreement
climbed to the low 90s - but it needs more extensive validation. These results
position the research community to broaden the coreference task to multiple
languages, and possibly to different kinds of coreference.

Comment: 4 pages, 5 figures. To appear in the AAAI Spring Symposium on
Applying Machine Learning to Discourse Processing. The Alembic Workbench
annotation tool described in this paper is available at
http://www.mitre.org/resources/centers/advanced_info/g04h/workbench.htm
Convolutional Analysis Operator Learning: Dependence on Training Data
Convolutional analysis operator learning (CAOL) enables the unsupervised
training of (hierarchical) convolutional sparsifying operators or autoencoders
from large datasets. One can use many training images for CAOL, but a precise
understanding of the impact of doing so has remained an open question. This
paper presents a series of results that lend insight into the impact of dataset
size on the filter update in CAOL. The first result is a general deterministic
bound on errors in the estimated filters, and is followed by a bound on the
expected errors as the number of training samples increases. The second result
provides a high probability analogue. The bounds depend on properties of the
training data, and we investigate their empirical values with real data. Taken
together, these results provide evidence for the potential benefit of using
more training data in CAOL.

Comment: 5 pages, 2 figures
Data Dropout: Optimizing Training Data for Convolutional Neural Networks
Deep learning models learn to fit training data while they are highly
expected to generalize well to testing data. Most works aim at finding such
models by creatively designing architectures and fine-tuning parameters. To
adapt to particular tasks, hand-crafted information such as image prior has
also been incorporated into end-to-end learning. However, very little progress
has been made on investigating how an individual training sample will influence
the generalization ability of a model. In other words, to achieve high
generalization accuracy, do we really need all the samples in a training
dataset? In this paper, we demonstrate that deep learning models such as
convolutional neural networks may not favor all training samples, and
generalization accuracy can be further improved by dropping those unfavorable
samples. Specifically, the influence of removing a training sample is
quantifiable, and we propose a Two-Round Training approach, aiming to achieve
higher generalization accuracy. We locate unfavorable samples after the first
round of training, and then retrain the model from scratch with the reduced
training dataset in the second round. Since our approach is essentially
different from fine-tuning or further training, the computational cost should
not be a concern. Our extensive experimental results indicate that, with
identical settings, the proposed approach can boost performance of the
well-known networks on both high-level computer vision problems such as image
classification, and low-level vision problems such as image denoising.
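The two-round training idea can be sketched in miniature. This is a toy reconstruction, not the paper's method: logistic regression stands in for a convolutional network, per-sample training loss stands in for the paper's influence measure, and the 5% drop fraction is an assumed hyperparameter.

```python
# Hedged sketch of two-round training: fit once, score each training
# sample, drop the "unfavorable" (hardest-to-fit) samples, then retrain
# from scratch on the reduced dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, flip_y=0.1, random_state=0)

# Round 1: train on all samples and score each by its log-loss.
clf = LogisticRegression(max_iter=1000).fit(X, y)
p_true = clf.predict_proba(X)[np.arange(len(y)), y]
sample_loss = -np.log(np.clip(p_true, 1e-12, 1.0))

# Drop the 5% of samples with the highest loss (proxy for "unfavorable").
keep = sample_loss < np.quantile(sample_loss, 0.95)
X2, y2 = X[keep], y[keep]

# Round 2: retrain from scratch on the reduced training set.
clf2 = LogisticRegression(max_iter=1000).fit(X2, y2)
```

In the paper the scoring step quantifies each sample's influence on generalization accuracy rather than its raw training loss, and the retrained model is evaluated on held-out data; this sketch only shows the two-round control flow.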