
    Bayesian Prediction Model Based on Attribute Weighting and Kernel Density Estimations

    Although the naïve Bayes learner has been shown to perform reasonably well in machine learning, it often has problems handling real-world data. The first problem is the conditional independence assumption; the second is the use of a frequency estimator. We therefore propose methods that address these two problems within naïve Bayes algorithms. An attribute weighting method handles the conditional independence assumption, while our proposed smooth kernel method weakens the negative effects of the frequency estimator. In this paper, we propose a compact Bayes model in which a smooth kernel augments weights on likelihood estimation. We also choose an attribute weighting method that employs a mutual information metric to cooperate with the framework. Experiments were conducted on UCI benchmark datasets, and the accuracy of the proposed learner was compared with that of standard naïve Bayes. The experimental results demonstrate the effectiveness and efficiency of the proposed learning algorithm.
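The abstract above mentions attribute weights derived from a mutual information metric but gives no formula. The following is a minimal sketch of one way such weights could be computed for discrete attributes. The function names and the normalisation (weights averaging to 1) are assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X; Y) in nats between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def attribute_weights(columns, labels):
    """Assumed scheme: scale per-attribute mutual information so weights average to 1."""
    scores = [mutual_information(col, labels) for col in columns]
    total = sum(scores)
    if total == 0:
        # No attribute carries information about the class: fall back to equal weights.
        return [1.0] * len(columns)
    return [len(columns) * s / total for s in scores]

# Hypothetical toy data
labels = ["a", "a", "b", "b"]
informative = [0, 0, 1, 1]      # determines the class exactly
uninformative = [0, 1, 0, 1]    # statistically independent of the class
weights = attribute_weights([informative, uninformative], labels)
```

On this toy data the informative attribute receives all of the weight and the independent attribute receives none, which is the behaviour one would want from an information-based weighting.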

    Prediction of lung tumor types based on protein attributes by machine learning algorithms


    A "non-parametric" version of the naive Bayes classifier

    Many algorithms have been proposed for the machine learning task of classification. One of the simplest methods, the naive Bayes classifier, has often been found to give good performance despite the fact that its underlying assumptions (of independence and a Normal distribution of the variables) are perhaps violated. In previous work, we applied naive Bayes and other standard algorithms to a breast cancer database from Nottingham City Hospital in which the variables are highly non-Normal and found that the algorithm performed well when predicting a class that had been derived from the same data. However, when we then applied naive Bayes to predict an alternative clinical variable, it performed much worse than other techniques. This motivated us to propose an alternative method, based on naive Bayes, which removes the requirement for the variables to be Normally distributed, but retains the essential structure and other underlying assumptions of the method. We tested our novel algorithm on our breast cancer data and on three UCI datasets which also exhibited strong violations of Normality. We found our algorithm outperformed naive Bayes in all four cases and outperformed multinomial logistic regression (MLR) in two cases. We conclude that our method offers a competitive alternative to MLR and naive Bayes when dealing with data sets in which non-Normal distributions are observed.
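The abstract above does not say which density estimator replaces the Normal assumption. As an illustration only, the sketch below swaps in a Gaussian kernel density estimate per class and feature while keeping the naive Bayes independence structure. The function names, the fixed bandwidth, and the toy data are all assumptions.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density estimate built from Gaussian kernels centred on the samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def fit_kde_naive_bayes(rows, labels, bandwidth=0.5):
    """Estimate a class prior and one KDE per (class, feature) pair."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        prior = len(members) / len(rows)
        densities = [gaussian_kde([r[j] for r in members], bandwidth)
                     for j in range(len(rows[0]))]
        model[cls] = (prior, densities)
    return model

def predict(model, row, floor=1e-300):
    """Pick the class with the highest log prior plus summed log densities."""
    def score(cls):
        prior, densities = model[cls]
        # The floor guards against log(0) when a density underflows.
        return math.log(prior) + sum(math.log(max(d(v), floor))
                                     for d, v in zip(densities, row))
    return max(model, key=score)

# Hypothetical toy data: class "pos" is bimodal in its single feature,
# so a single Normal fitted to it would be centred on the "neg" cluster.
rows = [[-3.0], [-2.8], [3.0], [2.9], [0.0], [0.2], [-0.1], [0.1]]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
model = fit_kde_naive_bayes(rows, labels)
```

A kernel estimate can follow the two separate modes of the "pos" class, which is the kind of non-Normal shape the abstract is concerned with. In practice the bandwidth would need to be chosen per feature rather than fixed.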

    Applications Of Machine Learning In Biology And Medicine

    Machine learning as a field is defined to be the set of computational algorithms that improve their performance by assimilating data. As such, the field as a whole has found applications in many diverse disciplines from robotics and communication in engineering to economics and finance, and also biology and medicine. It should not come as a surprise that many popular methods in use today have completely different origins. Despite this heterogeneity, different methods can be divided into standard tasks, such as supervised, unsupervised, semi-supervised and reinforcement learning. Although machine learning as a field can be formalized as methods trying to solve certain standard tasks, applying these tasks on datasets from different fields comes with certain caveats, and sometimes is fraught with challenges. In this thesis, we develop general procedures and novel solutions, dealing with practical problems that arise when modeling biological and medical data. Cost sensitive learning is an important area of research in machine learning which addresses the widespread and practical problem of dealing with different costs during the learning and deployment of classification algorithms. In many applications such as credit fraud detection, network intrusion and specifically medical diagnosis domains, prior class distributions are highly skewed, which makes the training examples very much unbalanced. Combining this with uneven misclassification costs renders standard machine learning approaches useless in learning an acceptable decision function. We experimentally show the benefits and shortcomings of various methods that convert cost blind learning algorithms to cost sensitive ones. Using the results and best practices found for cost sensitive learning, we design and develop a machine learning approach to ontology mapping. Next, we present a novel approach to deal with uncertainty in classification when costs are unknown or otherwise hard to assign. 
    Support Vector Machines (SVM) are considered to be among the most successful approaches for classification. However, the prediction for instances near the decision boundary depends more on the specific parameter selection or on noise in the data than on a clear difference in features. In many applications, such as medical diagnosis, these regions should be labeled as uncertain rather than assigned to any particular class. Furthermore, instances may belong to novel disease subtypes that are not from any previously known class. In such applications, declining to make a prediction could be beneficial when more powerful but expensive tests are available. We develop a novel approach for optimal selection of the threshold and show its successful application on three biological and medical datasets. The last part of this thesis provides novel solutions for handling high-dimensional data. Although high-dimensional data is found in many disciplines, current life science research almost always involves high-dimensional genomics/proteomics data. The "omics" data provide a wealth of information and have changed the research landscape in biology and medicine. However, these data are plagued with noise, redundancy and collinearity, which makes the discovery process very difficult and costly. Any method that can accurately detect irrelevant and noisy variables in omics data would be highly valuable. We present Robust Feature Selection (RFS), a randomized feature selection approach dedicated to low-sample high-dimensional data. RFS combines an embedded feature selection method with a randomization procedure for stability. Recent advances in sparse recovery and estimation methods have provided efficient and asymptotically consistent feature selection algorithms. However, these methods lack finite sample error control due to instability. Furthermore, the chances of correct recovery diminish with more collinearity among features.
    To overcome these difficulties, RFS uses a randomization procedure to provide an accurate and stable feature selection method. We thoroughly evaluate RFS by comparing it to a number of popular univariate and multivariate feature selection methods and show a marked improvement in the prediction accuracy of a diagnostic signature, while preserving good stability.
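The idea in the abstract above of labeling instances near the decision boundary as uncertain can be sketched with a simple reject band around zero on a signed decision score. This illustrates the general idea only; it is not the thesis's procedure for selecting the threshold optimally, and the function names and toy scores are assumptions.

```python
def classify_with_reject(score, threshold):
    """Map a signed decision score to +1, -1, or None (abstain) inside the reject band."""
    if abs(score) < threshold:
        return None
    return 1 if score > 0 else -1

def coverage_and_accuracy(scores, labels, threshold):
    """Fraction of instances that receive a prediction, and accuracy on those instances."""
    decided = [(classify_with_reject(s, threshold), y) for s, y in zip(scores, labels)]
    accepted = [(p, y) for p, y in decided if p is not None]
    coverage = len(accepted) / len(scores)
    accuracy = (sum(p == y for p, y in accepted) / len(accepted)
                if accepted else float("nan"))
    return coverage, accuracy

# Hypothetical decision scores (e.g. signed distances to a separating hyperplane)
scores = [2.1, 0.2, -0.1, -1.7, 0.4, -2.5]
labels = [1, -1, 1, -1, 1, -1]

no_reject = coverage_and_accuracy(scores, labels, 0.0)
with_reject = coverage_and_accuracy(scores, labels, 0.5)
```

Widening the band trades coverage for accuracy on the accepted instances. On this toy data the two misclassified points both lie close to the boundary, so abstaining on them removes the errors at the cost of leaving half of the instances unlabeled.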

    Alleviating Naive Bayes attribute independence assumption by attribute weighting

    Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than on those that are less predictive. In this paper, we argue that, for naive Bayes, attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE. © 2013 Nayyar A. Zaidi, Jesus Cerquides, Mark J. Carman and Geoffrey I. Webb. This research has been supported by the Australian Research Council under grant DP110101427 and Asian Office of Aerospace Research and Development, Air Force Office of Scientific Research under contract FA23861214030. The authors would like to thank Mark Hall for providing the code for CFS and MH. The authors would also like to thank anonymous reviewers for their insightful comments that helped improve the paper tremendously. Peer Reviewed
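The abstract above names the objective but not the model form. A common way to write an attribute-weighted naive Bayes posterior, given here as an assumed formulation rather than a quotation from the paper, raises each conditional probability to a per-attribute weight $w_i$. The symbols $a$ (number of attributes), $N$ (number of training instances), and $\pi_y$ (class prior) are notation introduced for this sketch.

```latex
\hat{P}(y \mid \mathbf{x}; \mathbf{w})
  = \frac{\pi_y \prod_{i=1}^{a} P(x_i \mid y)^{w_i}}
         {\sum_{y'} \pi_{y'} \prod_{i=1}^{a} P(x_i \mid y')^{w_i}},
\qquad
\mathrm{NCLL}(\mathbf{w})
  = -\sum_{j=1}^{N} \log \hat{P}\bigl(y^{(j)} \mid \mathbf{x}^{(j)}; \mathbf{w}\bigr)
```

Setting every $w_i = 1$ recovers standard naive Bayes. Weights below 1 damp the contribution of attributes whose evidence would otherwise be over-counted when attributes are correlated.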

    Evaluation of machine-learning methods for ligand-based virtual screening

    Machine-learning methods can be used for virtual screening by analysing the structural characteristics of molecules of known (in)activity, and we here discuss the use of kernel discrimination and naive Bayesian classifier (NBC) methods for this purpose. We report a kernel method that allows the processing of molecules represented by binary, integer and real-valued descriptors, and show that it is little different in screening performance from a previously described kernel that had been developed specifically for the analysis of binary fingerprint representations of molecular structure. We then evaluate the performance of an NBC when the training-set contains only a very few active molecules. In such cases, a simpler approach based on group fusion would appear to provide superior screening performance, especially when structurally heterogeneous datasets are to be processed.

    Constrained Naïve Bayes with application to unbalanced data classification

    The Naïve Bayes is a tractable and efficient approach for statistical classification. In general classification problems, the consequences of misclassifications may be rather different in different classes, making it crucial to control misclassification rates in the most critical and, in many real-world problems, minority cases, possibly at the expense of higher misclassification rates in less problematic classes. One traditional approach to address this problem consists of assigning misclassification costs to the different classes and applying the Bayes rule, by optimizing a loss function. However, fixing precise values for such misclassification costs may be problematic in real-world applications. In this paper we address the issue of misclassification for the Naïve Bayes classifier. Instead of requesting precise values of misclassification costs, threshold values are used for different performance measures. This is done by adding constraints to the optimization problem underlying the estimation process. Our findings show that, under a reasonable computational cost, indeed, the performance measures under consideration achieve the desired levels yielding a user-friendly constrained classification procedure.
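The paper summarised above builds its performance thresholds into the estimation problem itself. A much simpler relative of that idea, shown here only to make the notion of a required performance level concrete, keeps the classifier fixed and chooses a decision threshold afterwards so that recall on the critical class reaches a target. This is a swapped-in post-hoc technique, not the constrained estimation of the paper, and the function names and toy scores are assumptions.

```python
def recall_at(scores, labels, threshold):
    """Recall of the positive (label 1) class when predicting 1 for score >= threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)

def highest_threshold_meeting_recall(scores, labels, min_recall):
    """Largest observed score usable as a threshold with positive recall >= min_recall."""
    # Recall can only grow as the threshold drops, so scan from the top down.
    for t in sorted(set(scores), reverse=True):
        if recall_at(scores, labels, t) >= min_recall:
            return t
    return min(scores)

# Hypothetical posterior scores for the minority (positive) class
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 1, 0]

t_relaxed = highest_threshold_meeting_recall(scores, labels, 0.6)
t_strict = highest_threshold_meeting_recall(scores, labels, 1.0)
```

Demanding higher recall on the positive class pushes the threshold down, which admits more false positives from the other class. That is the trade-off the abstract describes as accepting higher misclassification rates in less problematic classes.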