
    Predicting Dangerous Seismic Events in Coal Mines under Distribution Drift


    Tissue recognition for contrast enhanced ultrasound videos


    Constrained Naïve Bayes with application to unbalanced data classification

    The Naïve Bayes is a tractable and efficient approach for statistical classification. In general classification problems, the consequences of misclassification may differ considerably between classes, making it crucial to control misclassification rates in the most critical and, in many real-world problems, minority classes, possibly at the expense of higher misclassification rates in less problematic classes. One traditional approach to address this problem consists of assigning misclassification costs to the different classes and applying the Bayes rule by optimizing a loss function. However, fixing precise values for such misclassification costs may be problematic in real-world applications. In this paper we address the issue of misclassification for the Naïve Bayes classifier. Instead of requesting precise values of misclassification costs, threshold values are used for different performance measures. This is done by adding constraints to the optimization problem underlying the estimation process. Our findings show that, under a reasonable computational cost, the performance measures under consideration indeed achieve the desired levels, yielding a user-friendly constrained classification procedure.
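    The constrained estimation procedure itself is specified in the paper; as a rough illustration of the underlying idea only, the sketch below (Python with scikit-learn) fits a plain Gaussian Naive Bayes and then tunes its decision threshold so that a user-chosen recall floor on the minority class is met. The synthetic dataset, the `min_recall` value and the threshold grid are illustrative assumptions, not the authors' method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Illustrative stand-in for the paper's constrained estimation: fit a plain
# Gaussian Naive Bayes, then choose the largest decision threshold on the
# minority-class posterior that still satisfies a recall floor.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GaussianNB().fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # posterior of the minority class

min_recall = 0.90  # assumed threshold value for the performance measure
best_t = 0.01      # fallback: predict almost everything as minority
for t in np.linspace(0.01, 0.99, 99):
    if recall_score(y_te, (proba >= t).astype(int)) >= min_recall:
        best_t = t  # keep the largest threshold meeting the constraint

y_hat = (proba >= best_t).astype(int)
print(f"threshold={best_t:.2f}, minority recall={recall_score(y_te, y_hat):.3f}")
```

    In practice the threshold would be tuned on a separate validation split; it is tuned on the evaluation data above only for brevity.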

    Variable selection for Naive Bayes classification

    The Naive Bayes has proven to be a tractable and efficient method for classification in multivariate analysis. However, features are usually correlated, a fact that violates the Naive Bayes assumption of conditional independence and may deteriorate the method's performance. Moreover, datasets are often characterized by a large number of features, which may complicate the interpretation of the results as well as slow down the method's execution. In this paper we propose a sparse version of the Naive Bayes classifier that is characterized by three properties. First, sparsity is achieved taking into account the correlation structure of the covariates. Second, different performance measures can be used to guide the selection of features. Third, performance constraints on groups of higher interest can be included. Our proposal leads to a smart search with competitive running times, while retaining flexibility in the performance measure used for classification. Our findings show that, when compared against well-referenced feature selection approaches, the proposed sparse Naive Bayes obtains competitive results regarding accuracy, sparsity and running times for balanced datasets. In the case of datasets with unbalanced (or with different importance) classes, a better compromise between classification rates for the different classes is achieved. This research is partially supported by research grants and projects MTM2015-65915-R (Ministerio de Economía y Competitividad, Spain) and PID2019-110886RB-I00 (Ministerio de Ciencia, Innovación y Universidades, Spain), FQM-329 and P18-FR-2369 (Junta de Andalucía, Spain), PR2019-029 (Universidad de Cádiz, Spain), Fundación BBVA and EC H2020 MSCA RISE NeEDS Project (Grant agreement ID: 822214). This support is gratefully acknowledged.
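    The paper's smart search is not reproduced here; the following sketch shows one simple way to combine the ingredients the abstract lists: greedy forward selection for a Gaussian Naive Bayes, guided by a chosen performance measure (validation accuracy here) and blocked from adding features too correlated with those already selected. The `max_corr` cutoff and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical greedy stand-in for the paper's smart search: grow a feature
# subset one variable at a time, skipping candidates whose absolute Pearson
# correlation with any already-selected feature exceeds `max_corr`.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=1)
corr = np.abs(np.corrcoef(X_tr, rowvar=False))  # feature-feature correlations

max_corr, selected, best_acc = 0.5, [], 0.0
improved = True
while improved:
    improved = False
    for j in range(X.shape[1]):
        if j in selected or any(corr[j, s] > max_corr for s in selected):
            continue  # enforce the correlation (sparsity) constraint
        trial = selected + [j]
        acc = accuracy_score(
            y_va, GaussianNB().fit(X_tr[:, trial], y_tr).predict(X_va[:, trial])
        )
        if acc > best_acc:
            best_acc, best_j, improved = acc, j, True
    if improved:
        selected.append(best_j)

print(f"selected features: {sorted(selected)}, validation accuracy: {best_acc:.3f}")
```

    Swapping `accuracy_score` for another metric reflects the paper's second property, that different performance measures can guide the selection.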

    Refining gene signatures: a Bayesian approach

    Background: In high density arrays, the identification of relevant genes for disease classification is complicated by not only the curse of dimensionality but also the highly correlated nature of the array data. In this paper, we are interested in the question of how many and which genes should be selected for a disease class prediction. Our work consists of a Bayesian supervised statistical learning approach to refine gene signatures with a regularization which penalizes for the correlation between the variables selected.

    Results: Our simulation results show that we can most often recover the correct subset of genes that predict the class as compared to other methods, even when accuracy and subset size remain the same. On real microarray datasets, we show that our approach can refine gene signatures to obtain either the same or better predictive performance than other existing methods with a smaller number of genes.

    Conclusions: Our novel Bayesian approach includes a prior which penalizes highly correlated features in model selection and is able to extract key genes in the highly correlated context of microarray data. The methodology in the paper is described in the context of microarray data, but can be applied to any array data (such as micro RNA, for example) as a first step towards predictive modeling of cancer pathways. A user-friendly software implementation of the method is available.
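    The paper's approach is Bayesian, with a prior penalizing correlated features; as a loose, non-Bayesian analogue of that idea, one can score candidate gene subsets by predictive performance minus a redundancy penalty. The sketch below does exactly that on synthetic data; the penalty weight `lam`, the subset sizes and the use of logistic regression are all illustrative assumptions, not the paper's model.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Loose analogue (not the paper's Bayesian machinery): score each candidate
# gene subset by cross-validated accuracy minus a penalty on within-subset
# correlation, mimicking a prior that disfavours redundant signatures.
X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=2)
corr = np.abs(np.corrcoef(X, rowvar=False))
lam = 0.5  # assumed penalty weight; the paper encodes this through a prior

def penalized_score(subset):
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[:, list(subset)], y, cv=5).mean()
    pairs = list(combinations(subset, 2))
    redundancy = np.mean([corr[i, j] for i, j in pairs]) if pairs else 0.0
    return acc - lam * redundancy

# Exhaustive search over small signatures, feasible only at this toy scale.
best = max((s for k in (2, 3) for s in combinations(range(12), k)),
           key=penalized_score)
print("best subset:", best, "score:", round(penalized_score(best), 3))
```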

    Getting the Most Out of Your Data: Multitask Bayesian Network Structure Learning, Predicting Good Probabilities and Ensemble Selection

    First, I consider the problem of simultaneously learning the structures of multiple Bayesian networks from multiple related datasets. I present a multitask Bayes net structure learning algorithm that is able to learn more accurate network structures by transferring useful information between the datasets. The algorithm extends the score and search techniques used in traditional structure learning to the multitask case by defining a scoring function for sets of structures (one structure for each task) and an efficient procedure for searching for a high scoring set of structures. I also address the task selection problem in the context of multitask Bayes net structure learning. Unlike in other multitask learning scenarios, in the Bayes net structure learning setting there is a clear definition of task relatedness: two tasks are related if they have similar structures. This allows one to automatically select a set of related tasks to be used by multitask structure learning.

    Second, I examine the relationship between the predictions made by different supervised learning algorithms and true posterior probabilities. I show that quasi-maximum margin methods such as boosted decision trees and SVMs push probability mass away from 0 and 1, yielding a characteristic sigmoid-shaped distortion in the predicted probabilities. Naive Bayes pushes probabilities toward 0 and 1. Other models such as neural nets, logistic regression and bagged trees usually do not have these biases and predict well calibrated probabilities. I experiment with two ways of correcting the biased probabilities predicted by some learning methods: Platt Scaling and Isotonic Regression. I qualitatively examine what distortions these calibration methods are suitable for and quantitatively examine how much data they need to be effective.

    Third, I present a method for constructing ensembles from libraries of thousands of models. Model libraries are generated using different learning algorithms and parameter settings. Forward stepwise selection is used to add to the ensemble the models that maximize its performance. The main drawback of ensemble selection is that it builds models that are very large and slow at test time. This drawback, however, can be overcome with little or no loss in performance by using model compression.

    The work in this dissertation was supported by NSF grants 0347318, 0412930, 0427914, and 0612031.
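    Of the three contributions, the calibration study maps most directly onto standard tooling: both corrections it examines, Platt Scaling and Isotonic Regression, are available in scikit-learn. A minimal sketch, assuming a synthetic dataset and Gaussian Naive Bayes as the miscalibrated base model:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Naive Bayes pushes posteriors toward 0 and 1, as the abstract notes; in
# scikit-learn, method="sigmoid" is Platt Scaling and method="isotonic" is
# Isotonic Regression. Lower Brier score means better-calibrated probabilities.
X, y = make_classification(n_samples=5000, n_features=20, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

raw = GaussianNB().fit(X_tr, y_tr)
print("raw Brier score:",
      round(brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]), 4))

for method in ("sigmoid", "isotonic"):
    cal = CalibratedClassifierCV(GaussianNB(), method=method, cv=5).fit(X_tr, y_tr)
    p = cal.predict_proba(X_te)[:, 1]
    print(method, "Brier score:", round(brier_score_loss(y_te, p), 4))
```

    Consistent with the dissertation's finding that calibration methods need enough data to be effective, isotonic regression in particular tends to overfit on small calibration sets.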