Constrained Naïve Bayes with application to unbalanced data classification
The Naïve Bayes is a tractable and efficient approach for statistical classification. In general classification problems, the consequences of misclassification may differ considerably across classes, making it crucial to control misclassification rates in the most critical and, in many real-world problems, minority classes, possibly at the expense of higher misclassification rates in less problematic classes. One traditional approach to this problem assigns misclassification costs to the different classes and applies the Bayes rule by optimizing a loss function. However, fixing precise values for such misclassification costs may be problematic in real-world applications. In this paper we address the issue of misclassification for the Naïve Bayes classifier. Instead of requesting precise values of misclassification costs, threshold values are used for different performance measures. This is done by adding constraints to the optimization problem underlying the estimation process. Our findings show that, at a reasonable computational cost, the performance measures under consideration indeed achieve the desired levels, yielding a user-friendly constrained classification procedure.
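The constrained decision rule described in the abstract can be illustrated with a toy example: fit a Gaussian Naive Bayes model, then, instead of fixing misclassification costs, pick a decision threshold that satisfies a user-specified recall constraint on the minority class. All function names and the grid search over thresholds below are illustrative stand-ins, not the paper's actual optimization procedure.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class Gaussian Naive Bayes parameters: (prior, means, variances)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def posterior_minority(X, params, minority):
    """P(minority | x) under the fitted Naive Bayes model."""
    log_joint = {}
    for c, (prior, mu, var) in params.items():
        ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        log_joint[c] = np.log(prior) + ll
    stacked = np.vstack(list(log_joint.values()))
    stacked -= stacked.max(axis=0)  # numerical stability before exponentiating
    probs = np.exp(stacked) / np.exp(stacked).sum(axis=0)
    return probs[list(log_joint.keys()).index(minority)]

def threshold_for_recall(p_min, y, minority, target_recall):
    """Largest decision threshold whose training recall on the minority class
    meets the target -- a crude stand-in for the paper's constraint."""
    n_min = (y == minority).sum()
    for t in np.linspace(0.5, 0.0, 51):  # scan downward from the neutral 0.5
        recall = ((p_min >= t) & (y == minority)).sum() / n_min
        if recall >= target_recall:
            return t
    return 0.0

# Unbalanced toy data: 90 majority points vs 10 minority points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)
params = fit_gaussian_nb(X, y)
p = posterior_minority(X, params, 1)
t = threshold_for_recall(p, y, 1, target_recall=0.95)
```

Lowering the threshold trades false positives on the majority class for the required minority recall, which is the qualitative behavior the constrained formulation controls.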
Variable selection for Naive Bayes classification
The Naive Bayes has proven to be a tractable and efficient method for classification in multivariate analysis. However, features are usually correlated, a fact that violates the Naive Bayes' assumption of conditional independence and may deteriorate the method's performance. Moreover, datasets are often characterized by a large number of features, which may complicate the interpretation of the results as well as slow down the method's execution. In this paper we propose a sparse version of the Naive Bayes classifier that is characterized by three properties. First, the sparsity is achieved taking into account the correlation structure of the covariates. Second, different performance measures can be used to guide the selection of features. Third, performance constraints on groups of higher interest can be included. Our proposal leads to a smart search, which yields competitive running times while retaining flexibility in the choice of performance measure for classification. Our findings show that, when compared against well-referenced feature selection approaches, the proposed sparse Naive Bayes obtains competitive results regarding accuracy, sparsity and running times for balanced datasets. In the case of datasets with unbalanced (or differently important) classes, a better compromise between classification rates for the different classes is achieved.

This research is partially supported by research grants and projects MTM2015-65915-R (Ministerio de Economia y Competitividad, Spain) and PID2019-110886RB-I00 (Ministerio de Ciencia, Innovacion y Universidades, Spain), FQM-329 and P18-FR-2369 (Junta de Andalucia, Spain), PR2019-029 (Universidad de Cadiz, Spain), Fundacion BBVA and the EC H2020 MSCA RISE NeEDS Project (Grant agreement ID: 822214). This support is gratefully acknowledged.
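A minimal, hypothetical sketch of correlation-aware feature selection in the spirit of this abstract: features are ranked by a simple class-separation score and greedily added, skipping any candidate too correlated with an already-selected feature. The t-like score and the `max_corr` cutoff are illustrative assumptions, not the paper's actual sparse formulation.

```python
import numpy as np

def greedy_nb_features(X, y, k, max_corr=0.8):
    """Greedy forward selection for a two-class problem: rank features by a
    two-sample t-like separation score, then add them in order while rejecting
    any candidate whose absolute correlation with a feature already selected
    exceeds max_corr. Both the score and the cutoff are illustrative."""
    C = np.corrcoef(X, rowvar=False)
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    spread = np.sqrt(X[y == 0].var(axis=0) + X[y == 1].var(axis=0)) + 1e-9
    score = np.abs(m0 - m1) / spread
    selected = []
    for j in np.argsort(-score):  # best-separating features first
        if all(abs(C[j, i]) <= max_corr for i in selected):
            selected.append(int(j))
        if len(selected) == k:
            break
    return selected

# Toy data: feature 1 is a near-duplicate of feature 0; feature 2 is noise.
rng = np.random.default_rng(1)
y = np.array([0] * 100 + [1] * 100)
f0 = y + 0.1 * rng.normal(size=200)
f1 = f0 + 0.01 * rng.normal(size=200)
f2 = rng.normal(size=200)
X = np.column_stack([f0, f1, f2])
sel = greedy_nb_features(X, y, k=2)
```

Because features 0 and 1 are almost perfectly correlated, only one of them enters the selection, which is the kind of correlation-aware sparsity the abstract describes.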
Refining gene signatures: a Bayesian approach
Background: In high-density arrays, the identification of relevant genes for disease classification is complicated not only by the curse of dimensionality but also by the highly correlated nature of the array data. In this paper, we are interested in the question of how many and which genes should be selected for disease class prediction. Our work consists of a Bayesian supervised statistical learning approach to refine gene signatures with a regularization which penalizes the correlation between the variables selected.

Results: Our simulation results show that we can most often recover the correct subset of genes that predict the class as compared to other methods, even when accuracy and subset size remain the same. On real microarray datasets, we show that our approach can refine gene signatures to obtain either the same or better predictive performance than other existing methods with a smaller number of genes.

Conclusions: Our novel Bayesian approach includes a prior which penalizes highly correlated features in model selection and is able to extract key genes in the highly correlated context of microarray data. The methodology in the paper is described in the context of microarray data, but can be applied to any array data (such as microRNA, for example) as a first step towards predictive modeling of cancer pathways. A user-friendly software implementation of the method is available.
Getting the Most Out of Your Data: Multitask Bayesian Network Structure Learning, Predicting Good Probabilities and Ensemble Selection
First, I consider the problem of simultaneously learning the
structures of multiple Bayesian networks from multiple related
datasets. I present a multitask Bayes net structure learning
algorithm that is able to learn more accurate network structures by
transferring useful information between the datasets. The algorithm
extends the score and search techniques used in traditional structure
learning to the multitask case by defining a scoring function for
sets of structures (one structure for each task) and an
efficient procedure for searching for a high-scoring set of
structures. I also address the task selection problem in the context
of multitask Bayes net structure learning. Unlike in other multitask
learning scenarios, in the Bayes net structure learning setting there
is a clear definition of task relatedness: two tasks are related if
they have similar structures. This allows one to automatically select
a set of related tasks to be used by multitask structure learning.
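The idea of a scoring function for sets of structures can be sketched as the sum of per-task data scores minus a penalty on structural disagreement between tasks. In this hypothetical sketch the per-task score is passed in as a callable, and the edge-difference penalty with weight `lam` is an illustrative stand-in for the dissertation's actual multitask scoring function.

```python
import numpy as np

def multitask_score(structures, per_task_score, lam=1.0):
    """Score a set of DAG adjacency matrices, one per task: the sum of each
    task's own data score minus lam times the number of edges on which each
    pair of structures disagrees. Related tasks (similar structures) pay a
    smaller penalty, which is the transfer mechanism being illustrated."""
    total = sum(per_task_score(t, G) for t, G in enumerate(structures))
    penalty = 0
    for i in range(len(structures)):
        for j in range(i + 1, len(structures)):
            penalty += np.abs(structures[i] - structures[j]).sum()
    return total - lam * penalty

# Two tiny 2-node structures: identical vs reversed edge.
G1 = np.array([[0, 1], [0, 0]])
G2 = np.array([[0, 0], [1, 0]])
const_score = lambda t, G: 0.0  # placeholder data score, same for all tasks
same = multitask_score([G1, G1], const_score, lam=1.0)
diff = multitask_score([G1, G2], const_score, lam=1.0)
```

With the data score held fixed, the set of identical structures scores higher, capturing the prior that related tasks should share structure.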
Second, I examine the relationship between the predictions made by
different supervised learning algorithms and true posterior
probabilities. I show that quasi-maximum-margin methods such as
boosted decision trees and SVMs push probability mass away from 0 and
1, yielding a characteristic sigmoid-shaped distortion in the predicted
probabilities. Naive Bayes pushes probabilities toward 0 and 1. Other
models such as neural nets, logistic regression and bagged trees
usually do not have these biases and predict well calibrated
probabilities. I experiment with two ways of correcting the biased
probabilities predicted by some learning methods: Platt Scaling and
Isotonic Regression. I qualitatively examine what distortions these
calibration methods are suitable for and quantitatively examine how
much data they need to be effective.
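Isotonic Regression, one of the two calibration methods compared, can be sketched with the classic pool-adjacent-violators algorithm (PAVA): sort predictions by score, then pool adjacent blocks until the fitted probabilities are non-decreasing in the score. This is a generic PAVA implementation for illustration, not code from the dissertation.

```python
import numpy as np

def isotonic_fit(scores, labels):
    """Map classifier scores to calibrated probabilities that are
    non-decreasing in the score, via pool-adjacent-violators."""
    order = np.argsort(scores)
    y = labels[order].astype(float)
    vals, wts = [], []  # blocks of (pooled value, pooled weight)
    for yi in y:
        vals.append(yi)
        wts.append(1.0)
        # Merge backwards while adjacent blocks violate monotonicity.
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / (wts[-2] + wts[-1])
            w = wts[-2] + wts[-1]
            vals, wts = vals[:-2] + [v], wts[:-2] + [w]
    # Expand pooled blocks back to one fitted value per example.
    fitted = np.repeat(vals, [int(w) for w in wts])
    out = np.empty_like(fitted)
    out[order] = fitted
    return out

# Noisy toy scores whose true probability of the positive label grows with score.
rng = np.random.default_rng(2)
scores = rng.uniform(size=50)
labels = (rng.uniform(size=50) < scores).astype(int)
cal = isotonic_fit(scores, labels)
```

Because PAVA fits a step function, it can correct arbitrary monotone distortions (including the sigmoid-shaped one described above) but needs more data than the two-parameter Platt Scaling fit to avoid overfitting the steps.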
Third, I present a method for constructing ensembles from libraries of
thousands of models. Model libraries are generated using different
learning algorithms and parameter settings. Forward stepwise
selection is used to add to the ensemble the models that maximize its
performance. The main drawback of ensemble selection is that it
builds models that are very large and slow at test time. This
drawback, however, can be overcome with little or no loss in
performance by using model compression.

The work in this dissertation was supported by NSF grants 0347318, 0412930, 0427914, and 0612031.
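Forward stepwise ensemble selection as described can be sketched as greedy selection with replacement over a library of model predictions: at each round, add the model whose inclusion most improves the ensemble's performance on a hillclimb set. The fixed `rounds` budget and the plain accuracy metric here are illustrative simplifications of the full procedure.

```python
import numpy as np

def ensemble_selection(preds, y, rounds=10):
    """Greedy forward selection with replacement.
    preds: (n_models, n_samples) predicted positive-class probabilities.
    y: true 0/1 labels of the hillclimb set.
    Returns the list of chosen model indices (repeats allowed)."""
    chosen, ens_sum = [], np.zeros(preds.shape[1])
    for _ in range(rounds):
        best, best_acc = None, -1.0
        for m in range(len(preds)):
            avg = (ens_sum + preds[m]) / (len(chosen) + 1)
            acc = ((avg >= 0.5) == y).mean()
            if acc > best_acc:
                best, best_acc = m, acc
        chosen.append(best)
        ens_sum += preds[best]
    return chosen

# Toy library: one accurate model, one anti-correlated, one uninformative.
y = np.array([0, 1, 0, 1])
preds = np.array([
    [0.1, 0.9, 0.1, 0.9],  # accurate model
    [0.9, 0.1, 0.9, 0.1],  # anti-correlated model
    [0.5, 0.5, 0.5, 0.5],  # uninformative model
])
chosen = ensemble_selection(preds, y, rounds=5)
```

Selecting with replacement lets strong models receive more weight; the test-time cost the abstract mentions comes from the many distinct base models the final averaged ensemble must evaluate.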