Modeling Noise in Paraphrase Detection
Noisy labels in training data present a challenging issue in classification tasks, misleading a model towards incorrect decisions during training. In this paper, we propose the use of a linear noise model to augment pre-trained language models to account for label noise in fine-tuning. We test our approach in a paraphrase detection task with various levels of noise and five different languages. Our experiments demonstrate the effectiveness of the additional noise model in making the training procedures more robust and stable. Furthermore, we show that this model can be applied without further knowledge about annotation confidence and reliability of individual training examples and we analyse our results in light of data selection and sampling strategies. Peer reviewed
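As a rough illustration of what a linear noise model on top of a classifier might look like (a sketch under stated assumptions, not the paper's implementation): the classifier's clean-class probabilities are multiplied by a row-stochastic matrix T, where T[i, j] is the assumed probability that true class i is observed as label j, and the loss is computed against the observed noisy labels. The function names, the example matrix, and the toy probabilities below are all hypothetical.

```python
import numpy as np

def noisy_posterior(clean_probs, T):
    """Map clean-class probabilities to noisy-label probabilities: p_noisy = p_clean @ T."""
    return clean_probs @ T

def noisy_nll(clean_probs, noisy_labels, T):
    """Mean negative log-likelihood of the observed (noisy) labels."""
    p = noisy_posterior(clean_probs, T)
    picked = p[np.arange(len(noisy_labels)), noisy_labels]
    return float(-np.mean(np.log(picked + 1e-12)))

# Hypothetical binary paraphrase task with 20% symmetric label noise (assumed values).
T = np.array([[0.8, 0.2],
              [0.2, 0.8]])
clean_probs = np.array([[0.9, 0.1],   # classifier output for example 1
                        [0.3, 0.7]])  # classifier output for example 2
noisy_labels = np.array([0, 1])
loss = noisy_nll(clean_probs, noisy_labels, T)
```

In a real fine-tuning setup T would typically be a trainable parameter rather than a fixed matrix; it is fixed here only to keep the sketch short.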
Generating the Ground Truth: Synthetic Data for Label Noise Research
Most real-world classification tasks suffer from label noise to some extent.
Such noise in the data adversely affects the generalization error of learned
models and complicates the evaluation of noise-handling methods, as their
performance cannot be accurately measured without clean labels. In label noise
research, typically either noisy or incomplex simulated data are accepted as a
baseline, into which additional noise with known properties is injected. In
this paper, we propose SYNLABEL, a framework that aims to improve upon the
aforementioned methodologies. It allows for creating a noiseless dataset
informed by real data, by either pre-specifying or learning a function and
defining it as the ground truth function from which labels are generated.
Furthermore, by resampling a number of values for selected features in the
function domain, evaluating the function and aggregating the resulting labels,
each data point can be assigned a soft label or label distribution. Such
distributions allow for direct injection and quantification of label noise. The
generated datasets serve as a clean baseline of adjustable complexity into
which different types of noise may be introduced. We illustrate how the
framework can be applied, how it enables quantification of label noise and how
it improves over existing methodologies.
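The resampling idea described in this abstract can be sketched roughly as follows. This is a toy illustration, not the SYNLABEL code: the ground-truth function, the choice of resampled feature, and the fixed grid of candidate values are all assumptions made for the example (in practice the values would presumably be drawn from some distribution).

```python
import numpy as np

def ground_truth(X):
    """Hypothetical ground-truth function: class 1 if x0 + x1 > 1, else class 0."""
    return (X[:, 0] + X[:, 1] > 1.0).astype(int)

def soft_labels(X, hidden_feature, candidate_values, n_classes=2):
    """Resample one feature over candidate values and aggregate the resulting labels."""
    counts = np.zeros((X.shape[0], n_classes))
    for v in candidate_values:
        X_resampled = X.copy()
        X_resampled[:, hidden_feature] = v  # overwrite the selected feature
        labels = ground_truth(X_resampled)
        counts[np.arange(X.shape[0]), labels] += 1
    # Normalise the counts into a label distribution per data point.
    return counts / len(candidate_values)

X = np.array([[0.2, 0.5],
              [0.9, 0.5]])
dist = soft_labels(X, hidden_feature=1, candidate_values=[0.0, 0.5, 1.0, 1.5])
```

Each row of dist is a label distribution for one data point; a row far from one-hot indicates that the label depends strongly on the resampled feature.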
Instance-Dependent Noisy Label Learning via Graphical Modelling
Noisy labels are unavoidable yet troublesome in the ecosystem of deep
learning because models can easily overfit them. There are many types of label
noise, such as symmetric, asymmetric and instance-dependent noise (IDN), with
IDN being the only type that depends on image information. Such dependence on
image information makes IDN a critical type of label noise to study, given that
labelling mistakes are caused in large part by insufficient or ambiguous
information about the visual classes present in images. Aiming to provide an
effective technique to address IDN, we present a new graphical modelling
approach called InstanceGM, that combines discriminative and generative models.
The main contributions of InstanceGM are: i) the use of the continuous
Bernoulli distribution to train the generative model, offering significant
training advantages, and ii) the exploration of a state-of-the-art noisy-label
discriminative classifier to generate clean labels from instance-dependent
noisy-label samples. InstanceGM is competitive with current noisy-label
learning approaches, particularly in IDN benchmarks using synthetic and
real-world datasets, where our method shows better accuracy than the
competitors in most experiments. Comment: Accepted at WACV 202
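To make the distinction between the noise types named above more concrete, the following sketch builds transition matrices for symmetric and asymmetric noise and uses them to corrupt clean labels. It is an illustrative assumption, not part of InstanceGM: the function names are hypothetical, and the "flip to the next class" rule is just one common way to simulate asymmetric noise. Instance-dependent noise is not modelled here, since it would require the flip probability to depend on each sample's features rather than only on its class.

```python
import numpy as np

def symmetric_T(n_classes, rate):
    """Each label flips to any other class with equal probability."""
    T = np.full((n_classes, n_classes), rate / (n_classes - 1))
    np.fill_diagonal(T, 1.0 - rate)
    return T

def asymmetric_T(n_classes, rate):
    """Each class i flips only to class (i + 1) % n_classes (assumed flipping rule)."""
    T = np.eye(n_classes) * (1.0 - rate)
    for i in range(n_classes):
        T[i, (i + 1) % n_classes] += rate
    return T

def inject_noise(labels, T, rng):
    """Sample a noisy label for each clean label from the matching row of T."""
    return np.array([rng.choice(len(T), p=T[y]) for y in labels])

rng = np.random.default_rng(0)
clean = np.array([0, 1, 2, 0, 1, 2])
noisy = inject_noise(clean, asymmetric_T(3, 0.4), rng)
```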
Noise Models in Classification: Unified Nomenclature, Extended Taxonomy and Pragmatic Categorization
This paper presents the first review of noise models in classification covering both label and
attribute noise. The review reveals the lack of a unified nomenclature in this field. To address
this problem, a tripartite nomenclature based on the structural analysis of existing noise models is
proposed. Additionally, the current taxonomies of noise models are revised, combined,
and updated to better reflect the nature of any model. Finally, a categorization of noise models is
proposed from a practical point of view, depending on the characteristics of the noise and the purpose
of the study. These contributions provide a variety of models to introduce noise, their characteristics
according to the proposed taxonomy, and a unified way of naming them, which will facilitate their
identification and study, as well as the reproducibility of future research.
Learning with Noisy Labels by Efficient Transition Matrix Estimation to Combat Label Miscorrection
Recent studies on learning with noisy labels have shown remarkable
performance by exploiting a small clean dataset. In particular, model agnostic
meta-learning-based label correction methods further improve performance by
correcting noisy labels on the fly. However, there is no safeguard against label
miscorrection, resulting in unavoidable performance degradation. Moreover,
every training step requires at least three back-propagations, significantly
slowing down training. To mitigate these issues, we propose a robust
and efficient method that learns a label transition matrix on the fly.
Employing the transition matrix makes the classifier skeptical about all the
corrected samples, which alleviates the miscorrection issue. We also introduce
a two-head architecture to efficiently estimate the label transition matrix
every iteration within a single back-propagation, so that the estimated matrix
closely follows the shifting noise distribution induced by label correction.
Extensive experiments demonstrate that our approach shows the best performance
in training efficiency while having comparable or better accuracy than existing
methods. Comment: ECCV202
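For readers unfamiliar with label transition matrices, the sketch below shows a simple count-based estimate from a small clean set. This is a deliberately simplified stand-in and not the method of the paper, which learns the matrix with a second network head during training; the function name and the toy labels are hypothetical.

```python
import numpy as np

def estimate_T(clean_labels, observed_labels, n_classes):
    """Estimate T[i, j] = P(observed label j | clean label i) by counting pairs."""
    counts = np.zeros((n_classes, n_classes))
    np.add.at(counts, (clean_labels, observed_labels), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Classes with no clean samples fall back to an identity row (assume no noise).
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), np.eye(n_classes))

# Toy clean set: true labels and the labels produced by a noisy or corrected source.
clean = np.array([0, 0, 0, 0, 1, 1])
observed = np.array([0, 0, 0, 1, 1, 1])
T_hat = estimate_T(clean, observed, n_classes=3)
```

Re-running such an estimate periodically would be one crude way to track a noise distribution that shifts as labels are corrected, at the cost of ignoring the efficiency concerns the abstract raises.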