
2 research outputs found

Crowd Learning with Candidate Labeling: an EM-based Solution

Author: AP Dawid
AP Dempster
E Côme
J Hernández-González
J Zhang
JC Falmagne
M Venanzi
PL López-Cruz
SJ Brams
VC Raykar
YX Ding
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Crowdsourcing is now widely used in machine learning for data labeling. In the traditional setting, annotators provide a single label for each instance; more recent approaches allow annotators who are in doubt to choose a subset of labels, so that more information can be extracted from them. In both settings, the reliability of the labelers can be modeled from the collections of labels that they provide. In this paper, we propose an Expectation-Maximization-based method for crowdsourced data with candidate sets. The method iteratively maximizes the likelihood of the parameters that model the reliability of the labelers while estimating the ground truth. The experimental results suggest that the proposed method performs better than the baseline aggregation schemes in terms of estimated accuracy.

BES-2016-078095 SVP-2014-068574 IT609-13 TIN2016-78365-
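
The alternation described in this abstract (re-estimate labeler reliability, then re-estimate the ground truth) can be illustrated with a minimal sketch. It is not the paper's model. It assumes a deliberately simple generative story: each annotator includes the true label in the candidate set with probability `p_true` and includes each wrong label independently with probability `p_wrong`. The function name `em_candidate_sets`, both parameter names, and the input layout are hypothetical.

```python
import math

def em_candidate_sets(votes, n_classes, n_iter=50, eps=1e-6):
    """EM sketch for candidate-set crowd labels (assumed model, not the paper's).

    votes: list of (item, annotator, candidate_set) with classes 0..n_classes-1.
    Requires n_classes >= 2. Returns (posteriors, p_true, p_wrong).
    """
    items = sorted({i for i, _, _ in votes})
    annotators = sorted({a for _, a, _ in votes})
    clip = lambda x: min(max(x, eps), 1.0 - eps)

    # Initial posteriors: a candidate set of size m adds 1/m to each of its classes.
    post = {i: [eps] * n_classes for i in items}
    for i, _, s in votes:
        for c in s:
            post[i][c] += 1.0 / len(s)
    for i in items:
        total = sum(post[i])
        post[i] = [v / total for v in post[i]]

    p_true, p_wrong = {}, {}
    for _ in range(n_iter):
        # M-step: class prior and per-annotator inclusion rates.
        prior = [clip(sum(post[i][c] for i in items) / len(items))
                 for c in range(n_classes)]
        hit = {a: 0.0 for a in annotators}
        wrong = {a: 0.0 for a in annotators}
        count = {a: 0 for a in annotators}
        for i, a, s in votes:
            h = sum(post[i][c] for c in s)   # expected "true label was included"
            hit[a] += h
            wrong[a] += len(s) - h           # expected number of wrong labels included
            count[a] += 1
        for a in annotators:
            p_true[a] = clip(hit[a] / count[a])
            p_wrong[a] = clip(wrong[a] / (count[a] * (n_classes - 1)))

        # E-step: posterior over the true class of each item (in log space).
        log_post = {i: [math.log(prior[c]) for c in range(n_classes)] for i in items}
        for i, a, s in votes:
            for c in range(n_classes):
                inc = 1 if c in s else 0
                n_wrong_in = len(s) - inc
                n_wrong_out = (n_classes - 1) - n_wrong_in
                log_post[i][c] += (
                    math.log(p_true[a] if inc else 1.0 - p_true[a])
                    + n_wrong_in * math.log(p_wrong[a])
                    + n_wrong_out * math.log(1.0 - p_wrong[a])
                )
        for i in items:
            m = max(log_post[i])
            w = [math.exp(v - m) for v in log_post[i]]
            total = sum(w)
            post[i] = [v / total for v in w]
    return post, p_true, p_wrong
```

A call such as `em_candidate_sets([(0, "a", {0}), (0, "b", {0, 1})], n_classes=3)` returns a posterior over classes for each item together with the two estimated rates for each annotator. Taking the most probable class per item gives an aggregated label.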

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

BCAM's Institutional Repository Data

Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification

Author: Li Yan-Fu
Qian Min
Publication date: 06/10/2020

With the abundance of industrial datasets, imbalanced classification has become a common problem in several application domains. Oversampling is an effective way to address imbalanced classification. One of the main challenges for existing oversampling methods is labeling the new synthetic samples accurately. Inaccurate labels on the synthetic samples distort the distribution of the dataset and may worsen classification performance. This paper introduces the idea of weakly supervised learning to handle the inaccurate labeling of synthetic samples caused by traditional oversampling methods. Graph semi-supervised SMOTE is developed to improve the credibility of the synthetic samples' labels. In addition, we propose cost-sensitive neighborhood components analysis for high-dimensional datasets and a bootstrap-based ensemble framework for highly imbalanced datasets. The proposed method achieved good classification performance on 8 synthetic datasets and 3 real-world datasets, especially for high-imbalance and high-dimensionality problems. Its average performance and robustness are better than those of the benchmark methods.
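
For context on the labeling problem this abstract raises, the following is a minimal sketch of plain SMOTE-style interpolation, not the graph semi-supervised variant the paper proposes. Each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbors and is simply assumed to belong to the minority class, which is the assumption the paper questions. The function name `smote_sketch` and its parameters are hypothetical.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (at least two).
    Plain SMOTE-style interpolation; every output is assumed to be minority class.
    """
    rng = random.Random(seed)

    def dist2(u, v):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(u, v))

    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Indices of the k nearest other minority samples.
        others = [j for j in range(len(minority)) if j != base_idx]
        others.sort(key=lambda j: dist2(base, minority[j]))
        mate = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([x + gap * (y - x) for x, y in zip(base, mate)])
    return synthetic
```

Because every synthetic point lies between two minority samples, it can land in a region dominated by the majority class, where the assumed minority label is doubtful.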

arXiv.org e-Print Archive