Class-Imbalanced Complementary-Label Learning via Weighted Loss
Complementary-label learning (CLL) is widely used in weakly supervised
classification, but it faces a significant challenge in real-world datasets
when confronted with class-imbalanced training samples. In such scenarios, the
number of samples in one class is considerably lower than in other classes,
which consequently leads to a decline in the accuracy of predictions.
Unfortunately, existing CLL approaches have not investigated this problem. To
alleviate this challenge, we propose a novel problem setting that enables
learning from class-imbalanced complementary labels for multi-class
classification. To tackle this problem, we propose a novel CLL approach called
Weighted Complementary-Label Learning (WCLL). The proposed method models a
weighted empirical risk minimization loss by utilizing the class-imbalanced
complementary labels, which is also applicable to multi-class imbalanced
training samples. Furthermore, we derive an estimation error bound to provide
theoretical assurance. To evaluate our approach, we conduct extensive
experiments on several widely-used benchmark datasets and a real-world dataset,
and compare our method with existing state-of-the-art methods. The proposed
approach shows significant improvement in these datasets, even in the case of
multiple class-imbalanced scenarios. Notably, the proposed method not only
utilizes complementary labels to train a classifier but also solves the problem
of class imbalance.
Comment: 9 pages, 9 figures, 3 tables
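The weighted empirical risk idea can be illustrated with a minimal sketch. This is an assumption-laden simplification, not the authors' exact WCLL objective: it weights a standard complementary-label loss (push probability mass away from the complementary class) by per-class weights, e.g. inverse estimated class frequency, to counter imbalance.

```python
import numpy as np

def weighted_cll_loss(probs, comp_labels, class_weights):
    """Weighted complementary-label loss (illustrative sketch, not the
    paper's exact WCLL formulation).

    probs         : (n, K) predicted class probabilities
    comp_labels   : (n,) complementary labels (the class each sample
                    is known NOT to belong to)
    class_weights : (K,) weights, e.g. inverse estimated class frequency

    Per-sample loss: -w[ybar] * log(1 - p[ybar]), which pushes
    probability mass away from the complementary class, with the weight
    compensating for class imbalance.
    """
    p_bar = probs[np.arange(len(comp_labels)), comp_labels]
    w = class_weights[comp_labels]
    return float(np.mean(-w * np.log(1.0 - p_bar + 1e-12)))
```

With uniform weights this reduces to an ordinary complementary-label loss; increasing the weight of a minority class amplifies the gradient signal from its complementary labels.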
Exploiting Universum data in AdaBoost using gradient descent
Recently, Universum data, which does not belong to any class of the training data, has been used to train better classifiers. In this paper, we propose a novel boosting algorithm called UAdaBoost that can improve the classification performance of AdaBoost with Universum data. UAdaBoost chooses a function by minimizing the loss for labeled data and Universum data. The cost function is minimized by a greedy, stagewise, functional gradient procedure. Each training stage of UAdaBoost is fast and efficient. The standard AdaBoost weights labeled samples during training iterations, while UAdaBoost gives an explicit weighting scheme for Universum samples as well. In addition, this paper describes the practical conditions for the effectiveness of Universum learning. These conditions are based on the analysis of the distribution of ensemble predictions over training samples. Experiments on handwritten digit classification and gender classification problems are presented. As exhibited by our experimental results, the proposed method can achieve superior performance over standard AdaBoost by selecting proper Universum data. © 2014 Elsevier B.V.
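The combined cost the abstract describes can be sketched as follows. This is a hedged illustration, not the paper's exact objective: exponential loss on labeled data plus a symmetric penalty that pulls ensemble predictions on Universum samples toward zero margin, i.e. toward "undecided".

```python
import numpy as np

def uada_cost(F_labeled, y, F_universum, lam=1.0):
    """Illustrative UAdaBoost-style cost (a sketch under assumptions,
    not the paper's exact formulation).

    F_labeled   : ensemble outputs on labeled samples
    y           : labels in {-1, +1}
    F_universum : ensemble outputs on Universum samples
    lam         : trade-off between labeled loss and Universum penalty
    """
    # Standard AdaBoost exponential loss on labeled data.
    labeled = np.exp(-y * F_labeled).sum()
    # Universum term: exp(F) + exp(-F) is minimized at F = 0, so the
    # ensemble is encouraged to stay undecided on Universum samples.
    universum = lam * (np.exp(F_universum) + np.exp(-F_universum)).sum()
    return float(labeled + universum)
```

A stagewise procedure would then pick, at each round, the weak learner and step size that most reduce this combined cost, which induces explicit weights on both labeled and Universum samples.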
Does the dataset meet your expectations? Explaining sample representation in image data
Since the behavior of a neural network model is adversely affected by a lack of diversity in training data, we present a method that identifies and explains such deficiencies. When a dataset is labeled, we note that annotations alone are capable of providing a human-interpretable summary of sample diversity. This allows explaining any lack of diversity as the mismatch found when comparing the actual distribution of annotations in the dataset with an expected distribution of annotations, specified manually to capture essential label diversity. While, in many practical cases, labeling (samples → annotations) is expensive, its inverse, simulation (annotations → samples), can be cheaper. By mapping the expected distribution of annotations into test samples using parametric simulation, we present a method that explains sample representation using the mismatch in diversity between simulated and collected data. We then apply the method to examine a dataset of geometric shapes to qualitatively and quantitatively explain sample representation in terms of comprehensible aspects such as size, position, and pixel brightness.
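The core comparison, actual versus expected annotation distributions, can be sketched with a simple divergence. This is a minimal illustration of the mismatch idea only; the paper's method additionally maps expectations to samples via parametric simulation.

```python
import numpy as np

def diversity_mismatch(observed_counts, expected_probs):
    """Mismatch between the actual distribution of annotations in a
    dataset and a manually specified expected distribution (sketch).

    observed_counts : per-label annotation counts in the collected data
    expected_probs  : expected label distribution (sums to 1)

    Returns the total-variation distance in [0, 1]: 0 means the dataset
    matches expectations; larger values flag under- or over-represented
    labels.
    """
    obs = np.asarray(observed_counts, dtype=float)
    obs = obs / obs.sum()
    exp = np.asarray(expected_probs, dtype=float)
    return float(0.5 * np.abs(obs - exp).sum())
```

Per-label differences `obs - exp` would then point to exactly which aspects (e.g. size or position bins) are under-represented.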
A Survey on Negative Transfer
Transfer learning (TL) tries to utilize data or knowledge from one or more
source domains to facilitate the learning in a target domain. It is
particularly useful when the target domain has few or no labeled data, due to
annotation expense, privacy concerns, etc. Unfortunately, the effectiveness of
TL is not always guaranteed. Negative transfer (NT), i.e., the source domain
data/knowledge cause reduced learning performance in the target domain, has
been a long-standing and challenging problem in TL. Various approaches to
handle NT have been proposed in the literature. However, this field lacks a
systematic survey on the formalization of NT, the factors behind it, and the
algorithms that handle NT. This paper proposes to fill this gap. First, the
definition of negative transfer is considered and a taxonomy of its factors is
discussed. Then, nearly fifty representative approaches for handling NT are
categorized and reviewed from four perspectives: secure transfer, domain
similarity estimation, distant transfer, and negative transfer mitigation. NT
in related fields, e.g., multi-task learning, lifelong learning, and
adversarial attacks, is also discussed.
Machine learning for large-scale wearable sensor data in Parkinson's disease: concepts, promises, pitfalls, and futures
For the treatment and monitoring of Parkinson's disease (PD) to be scientific, a key requirement is that measurement of disease stages and severity is quantitative, reliable, and repeatable. The last 50 years in PD research have been dominated by qualitative, subjective ratings obtained by human interpretation of the presentation of disease signs and symptoms at clinical visits. More recently, "wearable," sensor-based, quantitative, objective, and easy-to-use systems for quantifying PD signs for large numbers of participants over extended durations have been developed. This technology has the potential to significantly improve both clinical diagnosis and management in PD and the conduct of clinical studies. However, the large-scale, high-dimensional character of the data captured by these wearable sensors requires sophisticated signal processing and machine-learning algorithms to transform it into scientifically and clinically meaningful information. Such algorithms that "learn" from data have shown remarkable success in making accurate predictions for complex problems in which human skill has been required to date, but they are challenging to evaluate and apply without a basic understanding of the underlying logic on which they are based. This article contains a nontechnical tutorial review of relevant machine-learning algorithms, also describing their limitations and how these can be overcome. It discusses implications of this technology and a practical road map for realizing its full potential in PD research and practice.
An efficiency curve for evaluating imbalanced classifiers considering intrinsic data characteristics: Experimental analysis
Balancing the accuracy rates of the majority and minority classes is challenging in imbalanced
classification. Furthermore, data characteristics have a significant impact on the performance
of imbalanced classifiers, which are generally neglected by existing evaluation
methods. The objective of this study is to introduce a new criterion to comprehensively
evaluate imbalanced classifiers. Specifically, we introduce an efficiency curve that is established
using data envelopment analysis without explicit inputs (DEA-WEI), to determine
the trade-off between the benefits of improved minority class accuracy and the cost of
reduced majority class accuracy. Next, we analyze the impact of the imbalance
ratio and typical imbalanced data characteristics on the efficiency of the
classifiers. Empirical analyses using 68 imbalanced datasets reveal that
traditional classifiers such as C4.5 and the k-nearest neighbor are more
effective on disjunct data, whereas ensemble and undersampling techniques are
more effective for overlapping and noisy data. The efficiency of
cost-sensitive classifiers decreases dramatically as the imbalance ratio
increases. Finally, we investigate the reasons for the different efficiencies
of classifiers on imbalanced data and recommend steps to select appropriate
classifiers for imbalanced data based on data characteristics.
National Natural Science Foundation of China (NSFC) grants 71874023, 71725001, 71771037, 7197104
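The benefit-versus-cost trade-off underlying the efficiency curve can be illustrated with a toy score. This is an assumption-laden simplification in the spirit of the idea, not the paper's DEA-WEI formulation: minority-class accuracy gained per unit of majority-class accuracy sacrificed, relative to a baseline classifier.

```python
def classifier_efficiency(min_acc, maj_acc, base_min_acc, base_maj_acc):
    """Toy efficiency score for an imbalanced classifier (illustrative
    sketch, not the paper's DEA-WEI efficiency curve).

    min_acc, maj_acc           : minority/majority accuracy of the classifier
    base_min_acc, base_maj_acc : same accuracies for a baseline classifier

    Returns the minority-accuracy benefit gained per unit of
    majority-accuracy cost; higher is a more efficient trade-off.
    """
    benefit = min_acc - base_min_acc            # improved minority accuracy
    cost = max(base_maj_acc - maj_acc, 1e-12)   # reduced majority accuracy
    return benefit / cost
```

Plotting such scores across classifiers and imbalance ratios would trace out an efficiency frontier, which is the kind of comparison the DEA-based curve formalizes.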