Tackling the supervised label ranking problem by bagging weak learners
Preference learning is the branch of machine learning in charge of inducing preference models from data. In this paper we focus on the task known as the label ranking problem, whose goal is to predict a ranking over the different labels the class variable can take. Our contribution is twofold: (i) taking the tree-based algorithm LRT described in [1] as a basis, we design weaker tree-based models which can be learnt more efficiently; and (ii) we show that bagging these weak learners improves not only the LRT algorithm but also the state-of-the-art one (IBLR [1]). Furthermore, the bagging algorithm which takes the weak LRT-based models as base classifiers is competitive in time with the LRT and IBLR methods. To check the goodness of our proposal, we conduct a broad experimental study over the standard benchmark used in the label ranking literature.
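As a rough illustration of the bagging mechanism this abstract builds on, the sketch below bags weak one-split "stumps" on a toy classification task. This is only the generic idea: the LRT base learner and the label ranking setting itself are not reproduced here, and all function names and data are illustrative.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Weak base learner: best single-feature threshold split,
    with each side predicting its majority class."""
    best = None
    for f in range(len(X[0])):
        for t in {x[f] for x in X}:
            left = [yi for x, yi in zip(X, y) if x[f] <= t]
            right = [yi for x, yi in zip(X, y) if x[f] > t]
            if not left or not right:
                continue
            lpred = Counter(left).most_common(1)[0][0]
            rpred = Counter(right).most_common(1)[0][0]
            err = sum(yi != lpred for yi in left) + sum(yi != rpred for yi in right)
            if best is None or err < best[0]:
                best = (err, f, t, lpred, rpred)
    _, f, t, lp, rp = best
    return lambda x: lp if x[f] <= t else rp

def bagging(X, y, n_estimators=25, seed=0):
    """Fit weak stumps on bootstrap replicates; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]

# Toy 2-D data with a simple threshold boundary on the first feature.
rng = random.Random(1)
X = [(rng.random(), rng.random()) for _ in range(150)]
y = [int(a > 0.5) for a, b in X]
predict = bagging(X, y)
acc = sum(predict(x) == yi for x, yi in zip(X, y)) / len(y)
print(round(acc, 2))
```

Because each base model sees a different bootstrap replicate, the vote averages out individual stumps' variance, which is the same reason bagging helps the weak tree-based models described in the abstract.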
Boosting Applied to Word Sense Disambiguation
In this paper Schapire and Singer's AdaBoost.MH boosting algorithm is applied
to the Word Sense Disambiguation (WSD) problem. Initial experiments on a set of
15 selected polysemous words show that the boosting approach surpasses Naive
Bayes and Exemplar-based approaches, which represent state-of-the-art accuracy
on supervised WSD. In order to make boosting practical for a real learning
domain of thousands of words, several ways of accelerating the algorithm by
reducing the feature space are studied. The best variant, which we call
LazyBoosting, is tested on the largest sense-tagged corpus available, containing
192,800 examples of the 191 most frequent and ambiguous English words. Again,
boosting compares favourably to the other benchmark algorithms.
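The boosting loop underlying this work can be sketched generically. The code below is the standard binary AdaBoost algorithm with one-feature threshold stumps on toy data; AdaBoost.MH itself is a multi-label variant, and none of the WSD features or corpus details from the abstract are reproduced here.

```python
import math
import random

def weighted_stump(X, y, w):
    """Best single-threshold split under example weights w; labels are +1/-1."""
    best = None
    for f in range(len(X[0])):
        for t in {x[f] for x in X}:
            for sign in (1, -1):
                pred = [sign if x[f] <= t else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    err, f, t, sign = best
    return err, (lambda x: sign if x[f] <= t else -sign)

def adaboost(X, y, rounds=30):
    n = len(X)
    w = [1.0 / n] * n                     # uniform initial example weights
    ensemble = []                         # (alpha, weak hypothesis) pairs
    for _ in range(rounds):
        err, h = weighted_stump(X, y, w)
        err = max(err, 1e-10)             # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # accurate stumps vote louder
        ensemble.append((alpha, h))
        # Re-weight: misclassified examples get heavier, correct ones lighter.
        w = [wi * math.exp(-alpha * yi * h(x)) for wi, x, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Toy data with a diagonal boundary that no single axis-aligned stump can fit,
# but that a weighted combination of stumps can approximate.
rng = random.Random(0)
X = [(rng.random(), rng.random()) for _ in range(100)]
y = [1 if a + b > 1 else -1 for a, b in X]
clf = adaboost(X, y)
acc = sum(clf(x) == yi for x, yi in zip(X, y)) / len(y)
print(round(acc, 2))
```

The re-weighting step is also what makes the paper's LazyBoosting-style accelerations attractive: most of the per-round cost is the weak-learner search over features, so restricting that search shrinks the dominant term.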
Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics
In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalanced data learning is of great importance and a significant challenge in many real applications. Dealing with a minority class normally needs new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation. We propose a new ensemble learning framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reverse data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalanced learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to the state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle: active learning.
By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting that the DECIDL framework is very robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem: protein methylation prediction. Extensive computational results show that the DECIDL method performs very well on this imbalanced data mining task. Importantly, the experimental results have confirmed our new contributions to this particular data learning problem.
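The "artificial example generation" idea this abstract relies on can be illustrated with a SMOTE-style interpolation scheme. The sketch below is only in the spirit of SMOTE (one of the baselines named above); the dissertation's own DBEG method is not specified in the abstract, and the point set here is invented for illustration.

```python
import random

def interpolate_minority(minority, n_new, k=3, seed=0):
    """Create synthetic minority-class points by interpolating between a
    random minority point and one of its k nearest minority neighbours,
    as SMOTE does."""
    rng = random.Random(seed)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: dist2(p, q))[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # pick a point somewhere on the segment p -> q
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic

# Five minority points in 2-D; generate ten synthetic ones to rebalance.
minority = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.3, 0.25), (0.25, 0.15)]
new_points = interpolate_minority(minority, n_new=10)
print(len(new_points))  # prints 10
```

Because each synthetic point lies on a segment between two real minority examples, the generated data stays inside the minority region rather than drifting into majority territory, which is the usual argument for interpolation-based balancing over plain random oversampling.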
A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework
Class imbalance poses new challenges when it comes to classifying data
streams. Many algorithms recently proposed in the literature tackle this
problem using a variety of data-level, algorithm-level, and ensemble
approaches. However, there is a lack of standardized and agreed-upon procedures
on how to evaluate these algorithms. This work presents a taxonomy of
algorithms for imbalanced data streams and proposes a standardized, exhaustive,
and informative experimental testbed to evaluate algorithms in a collection of
diverse and challenging imbalanced data stream scenarios. The experimental
study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced
data streams that combine static and dynamic class imbalance ratios,
instance-level difficulties, concept drift, real-world and semi-synthetic
datasets in binary and multi-class scenarios. This leads to the largest
experimental study conducted so far in the data stream mining domain. We
discuss the advantages and disadvantages of state-of-the-art classifiers in
each of these scenarios and we provide general recommendations to end-users for
selecting the best algorithms for imbalanced data streams. Additionally, we
formulate open challenges and future directions for this domain. Our
experimental testbed is fully reproducible and easy to extend with new methods.
In this way, we propose the first standardized approach to conducting
experiments on imbalanced data streams, one that other researchers can use
for a trustworthy and fair evaluation of newly proposed methods. Our
framework can be downloaded from
https://github.com/canoalberto/imbalanced-streams
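To make the "static and dynamic class imbalance ratios" mentioned above concrete, the sketch below tracks the minority-to-majority ratio of a stream over a sliding window, the kind of statistic that varies across the benchmark's scenarios. This is a generic stdlib illustration, not code from the surveyed framework; the class name and simulated stream are invented.

```python
from collections import Counter, deque

class ImbalanceTracker:
    """Running class frequencies over the last `window` stream instances."""

    def __init__(self, window=100):
        self.buf = deque(maxlen=window)
        self.counts = Counter()

    def update(self, label):
        if len(self.buf) == self.buf.maxlen:
            # The deque will evict its oldest element on append; forget it first.
            self.counts[self.buf[0]] -= 1
        self.buf.append(label)
        self.counts[label] += 1

    def ratio(self):
        """Minority/majority frequency ratio in the current window (1.0 = balanced)."""
        present = [c for c in self.counts.values() if c > 0]
        return min(present) / max(present)

# Simulated drift: the stream starts 9:1 imbalanced, then becomes balanced.
tracker = ImbalanceTracker(window=50)
stream = [0] * 90 + [1] * 10 + [0, 1] * 50
for label in stream:
    tracker.update(label)
print(round(tracker.ratio(), 2))  # prints 1.0 after the balanced phase
```

A window-based estimate like this is what lets stream algorithms react to dynamic imbalance: a global count would still report the old 9:1 skew long after the drift, while the windowed ratio converges to the current class distribution.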
- …