Crowd Learning with Candidate Labeling: an EM-based Solution
Crowdsourcing is widely used in machine learning for data labeling. Traditionally, annotators are asked to provide a single label for each instance; novel approaches instead allow annotators, in case of doubt, to choose a subset of labels (a candidate set) as a way to extract more information from them. In both the traditional and the novel approaches, the reliability of the labelers can be modeled from the collections of labels that they provide. In this paper, we propose an Expectation-Maximization-based method for crowdsourced data with candidate sets. The method iteratively maximizes the likelihood of the parameters that model the reliability of the labelers while estimating the ground truth. The experimental results suggest that the proposed method performs better than the baseline aggregation schemes in terms of estimated accuracy.
BES-2016-078095
SVP-2014-068574
IT609-13
TIN2016-78365-
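The alternation described in the abstract (estimate the ground truth, then re-estimate labeler reliability) can be illustrated with a minimal sketch. It is not the paper's model: it assumes a deliberately simple, hypothetical parameterization in which each annotator has a single reliability value, the probability that their candidate set contains the true class. The function name `em_candidate_labels` and the `(instance, annotator, candidate_set)` input format are also assumptions made for illustration.

```python
import numpy as np

def em_candidate_labels(annotations, n_classes, n_iter=50):
    """Toy EM over candidate sets (illustrative, not the paper's model).

    annotations: list of (instance, annotator, candidate_set) triples,
    with integer ids starting at 0 and every instance annotated at least once.
    Returns (posterior over classes per instance, reliability per annotator).
    """
    items = np.array([i for i, _, _ in annotations])
    annots = np.array([a for _, a, _ in annotations])
    n_items, n_annot = items.max() + 1, annots.max() + 1

    # member[k, c] = 1 if class c belongs to the k-th candidate set
    member = np.zeros((len(annotations), n_classes))
    for k, (_, _, cand) in enumerate(annotations):
        member[k, list(cand)] = 1.0

    # Initialize the posterior with votes split evenly inside each set
    post = np.full((n_items, n_classes), 1e-9)
    np.add.at(post, items, member / member.sum(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

    counts = np.maximum(np.bincount(annots, minlength=n_annot), 1)
    for _ in range(n_iter):
        # M-step: reliability = expected fraction of sets containing the truth
        hit = (post[items] * member).sum(axis=1)
        rel = np.bincount(annots, weights=hit, minlength=n_annot) / counts
        rel = np.clip(rel, 1e-6, 1 - 1e-6)
        prior = post.mean(axis=0)

        # E-step: class posterior per instance under the current reliabilities
        log_post = np.tile(np.log(prior + 1e-12), (n_items, 1))
        r = rel[annots][:, None]
        np.add.at(log_post, items,
                  member * np.log(r) + (1 - member) * np.log(1 - r))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post, rel

# Two consistent annotators (0, 1) and one who disagrees with them (2)
data = [(0, 0, {0}), (0, 1, {0, 1}), (0, 2, {2}),
        (1, 0, {1}), (1, 1, {1}),    (1, 2, {0}),
        (2, 0, {2}), (2, 1, {0, 2}), (2, 2, {1}),
        (3, 0, {0}), (3, 1, {0}),    (3, 2, {1})]
posterior, reliability = em_candidate_labels(data, n_classes=3)
estimated_labels = posterior.argmax(axis=1)
```

Note that this one-parameter model does not penalize large candidate sets, so an annotator who always selects every label would look perfectly reliable; a realistic model would need to account for set size.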
Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification
With the abundance of industrial datasets, imbalanced classification has
become a common problem in several application domains. Oversampling is an
effective method to solve imbalanced classification. One of the main challenges
of the existing oversampling methods is to accurately label the new synthetic
samples. Inaccurate labels of the synthetic samples would distort the
distribution of the dataset and possibly worsen the classification performance.
This paper introduces the idea of weakly supervised learning to handle the
inaccurate labeling of synthetic samples caused by traditional oversampling
methods. Graph semi-supervised SMOTE is developed to improve the credibility of
the synthetic samples' labels. In addition, we propose cost-sensitive
neighborhood components analysis for high-dimensional datasets and a
bootstrap-based ensemble framework for highly imbalanced datasets. The proposed
method achieves good classification performance on 8 synthetic datasets and 3
real-world datasets, especially for problems with high imbalance and high
dimensionality. Its average performance and robustness are better than those of
the benchmark methods.
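To make the labeling issue concrete, here is a minimal sketch of plain SMOTE-style interpolation, the kind of traditional oversampling the abstract refers to. It is not the graph semi-supervised variant proposed in the paper. The function name `smote_sketch` and its parameters are hypothetical. Every synthetic point is simply assumed to belong to the minority class, which is the assumption the abstract identifies as a source of inaccurate labels.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation.

    X_min: array of shape (n, d) with at least two minority-class samples.
    Each synthetic point lies on the segment between a minority sample and
    one of its k nearest minority neighbors, and inherits the minority label.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances; exclude each point from its own neighbors
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                  # seed samples
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                           # position on segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_synthetic = smote_sketch(X_minority, n_new=10, k=2)
```

Because the interpolation ignores majority-class samples, a synthetic point can land in a region dominated by the majority class while still carrying the minority label.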