7,642 research outputs found
Interpretable multiclass classification by MDL-based rule lists
Interpretable classifiers have recently witnessed an increase in attention
from the data mining community because they are inherently easier to understand
and explain than their more complex counterparts. Examples of interpretable
classification models include decision trees, rule sets, and rule lists.
Learning such models often involves optimizing hyperparameters, which typically
requires substantial amounts of data and may result in relatively large models.
In this paper, we consider the problem of learning compact yet accurate
probabilistic rule lists for multiclass classification. Specifically, we
propose a novel formalization based on probabilistic rule lists and the minimum
description length (MDL) principle. This results in virtually parameter-free
model selection that naturally allows to trade-off model complexity with
goodness of fit, by which overfitting and the need for hyperparameter tuning
are effectively avoided. Finally, we introduce the Classy algorithm, which
greedily finds rule lists according to the proposed criterion. We empirically
demonstrate that Classy selects small probabilistic rule lists that outperform
state-of-the-art classifiers when it comes to the combination of predictive
performance and interpretability. We show that Classy is insensitive to its
only parameter, i.e., the candidate set, and that compression on the training
set correlates with classification performance, validating our MDL-based
selection criterion
Boosting the concordance index for survival data - a unified framework to derive and evaluate biomarker combinations
The development of molecular signatures for the prediction of time-to-event
outcomes is a methodologically challenging task in bioinformatics and
biostatistics. Although there are numerous approaches for the derivation of
marker combinations and their evaluation, the underlying methodology often
suffers from the problem that different optimization criteria are mixed during
the feature selection, estimation and evaluation steps. This might result in
marker combinations that are only suboptimal regarding the evaluation criterion
of interest. To address this issue, we propose a unified framework to derive
and evaluate biomarker combinations. Our approach is based on the concordance
index for time-to-event data, which is a non-parametric measure to quantify the
discrimatory power of a prediction rule. Specifically, we propose a
component-wise boosting algorithm that results in linear biomarker combinations
that are optimal with respect to a smoothed version of the concordance index.
We investigate the performance of our algorithm in a large-scale simulation
study and in two molecular data sets for the prediction of survival in breast
cancer patients. Our numerical results show that the new approach is not only
methodologically sound but can also lead to a higher discriminatory power than
traditional approaches for the derivation of gene signatures.Comment: revised manuscript - added simulation study, additional result
Robust Classification for Imprecise Environments
In real-world environments it usually is difficult to specify target
operating conditions precisely, for example, target misclassification costs.
This uncertainty makes building robust classification systems problematic. We
show that it is possible to build a hybrid classifier that will perform at
least as well as the best available classifier for any target conditions. In
some cases, the performance of the hybrid actually can surpass that of the best
known classifier. This robust performance extends across a wide variety of
comparison frameworks, including the optimization of metrics such as accuracy,
expected cost, lift, precision, recall, and workforce utilization. The hybrid
also is efficient to build, to store, and to update. The hybrid is based on a
method for the comparison of classifier performance that is robust to imprecise
class distributions and misclassification costs. The ROC convex hull (ROCCH)
method combines techniques from ROC analysis, decision analysis and
computational geometry, and adapts them to the particulars of analyzing learned
classifiers. The method is efficient and incremental, minimizes the management
of classifier performance data, and allows for clear visual comparisons and
sensitivity analyses. Finally, we point to empirical evidence that a robust
hybrid classifier indeed is needed for many real-world problems.Comment: 24 pages, 12 figures. To be published in Machine Learning Journal.
For related papers, see http://www.hpl.hp.com/personal/Tom_Fawcett/ROCCH
- …