An Ensemble Multilabel Classification for Disease Risk Prediction
It is important to identify and prevent disease risk as early as possible through regular physical examinations. We formulate disease risk prediction as a multilabel classification problem and propose a novel Ensemble Label Power-set Pruned datasets Joint Decomposition (ELPPJD) method. First, we transform the multilabel classification problem into a multiclass classification problem. Then, we propose pruned datasets and joint decomposition methods to deal with the imbalanced learning problem. Two strategies, size balanced (SB) and label similarity (LS), are designed to decompose the training dataset. The experimental dataset comes from real physical examination records. We compare the performance of the ELPPJD method under the two decomposition strategies, and we also compare ELPPJD with the classic multilabel classification methods RAkEL and HOMER. The experimental results show that the ELPPJD method with the label similarity strategy has outstanding performance.
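The first step in the abstract, turning a multilabel problem into a multiclass one, is commonly done with a label power-set transformation: each distinct set of labels becomes one class. The sketch below illustrates only that generic transformation, not the ELPPJD method itself; the function names and the disease labels are hypothetical.

```python
from collections import Counter


def label_powerset_encode(label_sets):
    """Map every distinct label set to one multiclass id (generic sketch)."""
    mapping = {}   # frozenset of labels -> class id
    classes = []   # class id for each example, in input order
    for labels in label_sets:
        key = frozenset(labels)
        if key not in mapping:
            mapping[key] = len(mapping)
        classes.append(mapping[key])
    return mapping, classes


def label_powerset_decode(mapping, class_id):
    """Recover the original label set from a class id."""
    inverse = {cid: labels for labels, cid in mapping.items()}
    return set(inverse[class_id])


# Hypothetical examination records, each with a set of risk labels.
records = [
    {"hypertension", "diabetes"},
    {"hypertension"},
    {"hypertension", "diabetes"},
    set(),
]

mapping, classes = label_powerset_encode(records)
class_counts = Counter(classes)  # uneven counts show the imbalance problem
decoded = label_powerset_decode(mapping, classes[0])
```

Because every label combination becomes its own class, rare combinations yield classes with very few examples, which is the imbalance the abstract says the pruning and decomposition steps are meant to address.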
Learning preferences for large scale multi-label problems
Although the majority of machine learning approaches aim to solve binary classification problems, several real-world applications require specialized algorithms able to handle many different classes, as in single-label multi-class and multi-label classification problems. The Label Ranking framework is a generalization of the above-mentioned settings, which aims to map instances from the input space to a total order over the set of possible labels. However, these algorithms are generally more complex than binary ones, and applying them to large-scale datasets can be intractable. The main contribution of this work is a novel, general, online preference-based label ranking framework. The proposed framework is able to solve binary, multi-class, multi-label and ranking problems. A comparison with other baselines shows its effectiveness and efficiency on a real-world large-scale multi-label task.
API design for machine learning software: experiences from the scikit-learn project
Scikit-learn is an increasingly popular machine learning library. Written
in Python, it is designed to be simple and efficient, accessible to
non-experts, and reusable in various contexts. In this paper, we present and
discuss our design choices for the application programming interface (API) of
the project. In particular, we describe the simple and elegant interface shared
by all learning and processing units in the library and then discuss its
advantages in terms of composition and reusability. The paper also comments on
implementation details specific to the Python ecosystem and analyzes obstacles
faced by users and developers of the library.
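The shared interface and its benefit for composition can be illustrated with a small, dependency-free sketch. The classes below are toy stand-ins written for this example (MeanCenterer, NearestCentroidClassifier and SimplePipeline are hypothetical names, not scikit-learn code); they only mimic the convention of `fit`, `transform` and `predict` methods so that independent units can be chained.

```python
class MeanCenterer:
    """Toy transformer: learns column means in fit, subtracts them in transform."""

    def fit(self, X, y=None):
        n = len(X)
        self.means_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


class NearestCentroidClassifier:
    """Toy predictor: assigns each row to the class with the closest centroid."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, v in enumerate(row):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {
            c: [s / counts[c] for s in acc] for c, acc in sums.items()
        }
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids_, key=lambda c: dist2(row, self.centroids_[c]))
            for row in X
        ]


class SimplePipeline:
    """Chains transformers and a final predictor behind the same interface."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1].predict(X)


X_train = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
y_train = ["low", "high", "high", "high"][:1] + ["low", "high", "high"]
model = SimplePipeline([MeanCenterer(), NearestCentroidClassifier()])
model.fit(X_train, y_train)
predictions = model.predict([[0.1, 0.1], [5.0, 5.0]])
```

Since the pipeline itself exposes `fit` and `predict`, it can be used anywhere a single predictor could, which is the composition property the abstract highlights.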
Fast Label Embeddings via Randomized Linear Algebra
Many modern multiclass and multilabel problems are characterized by
increasingly large output spaces. For these problems, label embeddings have
been shown to be a useful primitive that can improve computational and
statistical efficiency. In this work we utilize a correspondence between rank
constrained estimation and low dimensional label embeddings that uncovers a
fast label embedding algorithm which works in both the multiclass and
multilabel settings. The result is a randomized algorithm whose running time is
exponentially faster than naive algorithms. We demonstrate our techniques on
two large-scale public datasets, from the Large Scale Hierarchical Text
Challenge and the Open Directory Project, where we obtain state of the art
results.
Comment: To appear in the proceedings of the ECML/PKDD 2015 conference.
Reference implementation available at https://github.com/pmineiro/randembe
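
To make the idea of a low-dimensional label embedding obtained with randomized linear algebra concrete, here is a minimal sketch using a generic randomized range finder followed by a small SVD. It is not the paper's algorithm or its reference implementation; the function name, parameters and the synthetic label matrix are assumptions made for illustration, and NumPy is assumed to be installed.

```python
import numpy as np


def randomized_label_embedding(Y, k, oversample=5, n_iter=2, seed=0):
    """Approximate the top-k right singular vectors of a label matrix Y.

    Y has shape (n_examples, n_labels). Returns V with shape (n_labels, k),
    so that Y @ V is a k-dimensional embedding of each example's labels.
    """
    rng = np.random.default_rng(seed)
    n_labels = Y.shape[1]
    # Random test matrix probes the range of Y with k + oversample directions.
    omega = rng.standard_normal((n_labels, k + oversample))
    Q, _ = np.linalg.qr(Y @ omega)
    # A few power iterations sharpen the estimate of the dominant subspace.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Y.T @ Q)
        Q, _ = np.linalg.qr(Y @ Q)
    # Project Y onto the small subspace and take an exact SVD there.
    B = Q.T @ Y
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k].T, s[:k]


# Synthetic rank-3 "label" matrix: 200 examples, 50 labels.
rng = np.random.default_rng(1)
Y = rng.random((200, 3)) @ rng.random((3, 50))

V, s = randomized_label_embedding(Y, k=3)
Z = Y @ V                      # embedded labels, shape (200, 3)
rel_error = np.linalg.norm(Y - Z @ V.T) / np.linalg.norm(Y)
```

The expensive decomposition is only ever applied to the small projected matrix `B`, which is where randomized methods of this kind save time compared with a full SVD of `Y`.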