15 research outputs found

    Locally weighted learning: How and when does it work in Bayesian networks?

    © 2016, Taylor and Francis Ltd. All rights reserved. A Bayesian network (BN), a simple graphical notation for conditional independence assertions, promises to represent the probabilistic relationships between diseases and symptoms. Learning the structure of a Bayesian network classifier (BNC) encodes conditional independence assumptions between attributes, which may deteriorate classification performance. One major approach to mitigating the BNC's primary weakness (the attribute independence assumption) is locally weighted learning, which has been shown to work well for naive Bayes (NB), a BNC with a simple structure. However, it is not known whether, or how effectively, it improves the performance of BNCs with complex structures. In this paper, we first survey complex structure models for BNCs and their improvements, then carry out a systematic experimental analysis of the effectiveness of locally weighted methods for complex BNCs, e.g., tree-augmented naive Bayes (TAN), averaged one-dependence estimators (AODE) and hidden naive Bayes (HNB), measured by classification accuracy (ACC) and the area under the ROC curve (AUC). Experiments and comparisons on 36 benchmark data sets from the University of California, Irvine (UCI) repository in the Weka system demonstrate that locally weighted techniques only slightly outperform unweighted complex BNCs on ACC and AUC. In other words, although locally weighting can significantly improve the performance of NB (a BNC with a simple structure), it does not work well on BNCs with complex structures, because the performance improvements of these BNCs are attributable to their structures rather than to locally weighting.
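    As a rough illustration of the locally weighted approach discussed above, the sketch below shows a minimal locally weighted naive Bayes classifier: training instances near the test instance receive kernel weights, and the NB counts are computed from those weights. The Hamming distance, linear kernel, neighbourhood size and Laplace smoothing are our assumptions for the sketch, not the exact settings used in the surveyed experiments.

```python
import numpy as np

def locally_weighted_nb_predict(X_train, y_train, x_test, k=50):
    """Predict the class of x_test with a locally weighted naive Bayes.

    Attributes are assumed discrete; the linear kernel over the k nearest
    neighbours and the Laplace smoothing are illustrative choices only.
    """
    # Hamming distance of every training instance to the test instance.
    dist = (X_train != x_test).sum(axis=1).astype(float)

    # Linear kernel: instances beyond the k-th neighbour get weight 0.
    bandwidth = np.sort(dist)[min(k, len(dist) - 1)] + 1e-9
    w = np.maximum(1.0 - dist / bandwidth, 0.0)

    classes = np.unique(y_train)
    n_attr = X_train.shape[1]
    log_post = []
    for c in classes:
        wc = w[y_train == c]
        Xc = X_train[y_train == c]
        # Weighted class prior with Laplace smoothing.
        log_p = np.log((wc.sum() + 1.0) / (w.sum() + len(classes)))
        for j in range(n_attr):
            n_vals = len(np.unique(X_train[:, j]))
            match = wc[Xc[:, j] == x_test[j]].sum()
            # Weighted conditional P(x_j | y = c) with Laplace smoothing.
            log_p += np.log((match + 1.0) / (wc.sum() + n_vals))
        log_post.append(log_p)
    return classes[int(np.argmax(log_post))]
```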

    Predicting automobile insurance fraud using classical and machine learning models

    Insurance fraud claims have become a major problem in the insurance industry. Several investigations have been carried out to eliminate negative impacts on the insurance industry as this immoral act has caused the loss of billions of dollars. In this paper, a comparative study was carried out to assess the performance of various classification models, namely logistic regression, neural network (NN), support vector machine (SVM), tree augmented naïve Bayes (NB), decision tree (DT), random forest (RF) and AdaBoost with different model settings for predicting automobile insurance fraud claims. Results reveal that the tree augmented NB outperformed other models based on several performance metrics with accuracy (79.35%), sensitivity (44.70%), misclassification rate (20.65%), area under curve (0.81) and Gini (0.62). In addition, the result shows that the AdaBoost algorithm can improve the classification performance of the decision tree. These findings are useful for insurance professionals to identify potential insurance fraud claim cases.
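    The reported Gini value is consistent with the reported AUC, since for a binary classifier Gini = 2·AUC − 1 (here 2 × 0.81 − 1 ≈ 0.62). A minimal sketch of how such metrics could be computed with scikit-learn follows; the variable names and the 0.5 threshold are illustrative placeholders, not the study's actual settings.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def fraud_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, misclassification rate, AUC and Gini
    for a binary fraud classifier, from predicted fraud probabilities."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)           # recall on the fraud (positive) class
    misclassification = 1.0 - accuracy
    auc = roc_auc_score(y_true, y_score)
    gini = 2.0 * auc - 1.0                 # Gini coefficient is a rescaled AUC
    return accuracy, sensitivity, misclassification, auc, gini
```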

    SODE: Self-Adaptive One-Dependence Estimators for classification

    © 2015 Elsevier Ltd. SuperParent-One-Dependence Estimators (SPODEs) represent a family of semi-naive Bayesian classifiers that relax the attribute independence assumption of Naive Bayes (NB) by allowing each attribute to depend on a common single attribute (the superparent). SPODEs can effectively handle data with attribute dependencies while still inheriting NB's key advantages, such as computational efficiency and robustness on high-dimensional data. In practice, determining an optimal superparent for SPODEs is difficult. One common approach is to use a weighted combination of multiple SPODEs, each having a different superparent with a properly assigned weight value (i.e., one weight per superparent attribute). In this paper, we propose a self-adaptive SPODE ensemble, named SODE, which uses immunity theory from artificial immune systems to automatically and self-adaptively select the weight of each individual SPODE. SODE does not need to know the importance of individual SPODEs or the relevance among them, and can flexibly and efficiently search for optimal weight values for each SPODE during the learning process. Extensive experiments and comparisons on 56 benchmark data sets, and validations on image and text classification, demonstrate that SODE outperforms state-of-the-art weighted SPODE algorithms and is suitable for a wide range of learning tasks. The results also confirm that SODE provides an appropriate balance between runtime efficiency and accuracy.
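    For context, a weighted SPODE ensemble of the kind SODE optimizes scores a class as a weighted sum over superparents, P(y | x) ∝ Σ_p w_p P(y, x_p) Π_{j≠p} P(x_j | y, x_p). The sketch below illustrates that scoring step with the weights taken as given; the immune-system search that SODE uses to find the weights is not shown, and the discrete attributes and Laplace smoothing are our assumptions.

```python
import numpy as np

def weighted_spode_scores(X, y, x_test, weights, alpha=1.0):
    """Score each class under a weighted ensemble of SPODEs.

    Each attribute p acts as a superparent and `weights[p]` is its ensemble
    weight (in SODE these would be found by the artificial immune system;
    here they are simply taken as given). Attributes are assumed discrete.
    """
    n, d = X.shape
    classes = np.unique(y)
    scores = np.zeros(len(classes))
    for ci, c in enumerate(classes):
        Xc = X[y == c]
        for p in range(d):
            # Smoothed joint P(y = c, x_p = x_test[p]).
            n_vals_p = len(np.unique(X[:, p]))
            mask = Xc[:, p] == x_test[p]
            prob = (mask.sum() + alpha) / (n + alpha * len(classes) * n_vals_p)
            for j in range(d):
                if j == p:
                    continue
                # Smoothed conditional P(x_j = x_test[j] | y = c, x_p).
                n_vals_j = len(np.unique(X[:, j]))
                prob *= ((Xc[mask][:, j] == x_test[j]).sum() + alpha) / (
                    mask.sum() + alpha * n_vals_j)
            scores[ci] += weights[p] * prob
    return classes, scores
```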

    Alleviating Naive Bayes attribute independence assumption by attribute weighting

    Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of the numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than on those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective function. We perform extensive evaluations and find that WANBIA is a competitive alternative to state-of-the-art classifiers such as Random Forest, Logistic Regression and A1DE. © 2013 Nayyar A. Zaidi, Jesus Cerquides, Mark J. Carman and Geoffrey I. Webb. This research has been supported by the Australian Research Council under grant DP110101427 and the Asian Office of Aerospace Research and Development, Air Force Office of Scientific Research, under contract FA23861214030. The authors would like to thank Mark Hall for providing the code for CFS and MH. The authors would also like to thank the anonymous reviewers for their insightful comments, which helped improve the paper tremendously. Peer Reviewed
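    The weighting scheme WANBIA optimizes raises each naive Bayes conditional to an attribute-specific power, P(y | x; w) ∝ P(y) Π_i P(x_i | y)^{w_i}, and chooses w to minimize the negative conditional log likelihood or the mean squared error. A minimal sketch of the weighted posterior is given below; the optimization of w itself is not shown, and the array layout is our own convention.

```python
import numpy as np

def wanbia_log_posterior(log_prior, log_cond, w):
    """Unnormalised log P(y | x) for attribute-weighted naive Bayes.

    log_prior : shape (n_classes,), log P(y)
    log_cond  : shape (n_classes, n_attributes), log P(x_i | y) for the
                observed attribute values of one instance
    w         : shape (n_attributes,), the attribute weights

    The weighted model raises each conditional to the power w_i, i.e.
    log P(y | x) = log P(y) + sum_i w_i * log P(x_i | y) + const.
    """
    return log_prior + log_cond @ w

def posterior(log_prior, log_cond, w):
    """Normalised posterior over classes (softmax of the weighted scores)."""
    s = wanbia_log_posterior(log_prior, log_cond, w)
    s -= s.max()                  # numerical stability before exponentiation
    p = np.exp(s)
    return p / p.sum()
```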

    A novel dissolved oxygen prediction model based on enhanced semi-naive Bayes for ocean ranches in northeast China

    A challenge in achieving intelligent marine ranching is the prediction of dissolved oxygen (DO), which directly reflects marine ranching environmental conditions. With accurate DO predictions, timely human intervention can be made in marine pasture water environments to avoid problems such as reduced yields or marine crop death due to low oxygen concentrations in the water. We use an enhanced semi-naive Bayes model for prediction, based on an analysis of DO data from marine pastures in northeastern China over the past three years. Building on the semi-naive Bayes model, this paper takes the possible values of the DO difference series as categories, counts the possible values of the first-order difference series and of the difference series of the preceding interval for each possible value, and selects the most probable difference value at the next moment. The prediction accuracy is optimized by adjusting the attribute length and frequency threshold of the difference sequence. The enhanced semi-naive Bayes model is compared with LSTM, RBF, SVR and other models, and error functions and Willmott's index of agreement are used to evaluate prediction accuracy. The experimental results show that the proposed model achieves high prediction accuracy for DO attributes in marine pastures.
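    Read literally, the prediction scheme amounts to a first-order frequency model over discretized DO differences. The sketch below is one possible rendering under that reading; the grid width and the fallback to the overall mode are our assumptions, not the paper's exact attribute-length and frequency-threshold tuning.

```python
import numpy as np
from collections import Counter, defaultdict

def predict_next_do(series, step=0.1):
    """Predict the next dissolved-oxygen value from a series of measurements.

    The series is reduced to first-order differences, each difference is
    discretised onto a grid of width `step` (an illustrative choice), counts
    of which difference follows which are collected, and the most frequent
    successor of the last observed difference extends the series by one step.
    """
    series = np.asarray(series, dtype=float)
    bins = np.round(np.diff(series) / step).astype(int)   # discretised differences
    transitions = defaultdict(Counter)
    for prev, nxt in zip(bins[:-1], bins[1:]):
        transitions[prev][nxt] += 1

    last = bins[-1]
    if transitions[last]:
        next_bin = transitions[last].most_common(1)[0][0]
    else:                          # unseen context: fall back to the overall mode
        next_bin = Counter(bins.tolist()).most_common(1)[0][0]
    return series[-1] + next_bin * step
```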

    A case study of applying boosting naive bayes to claim fraud diagnosis


    Data Mining Techniques for Complex User-Generated Data

    Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments to effectively analyze these large data collections and extract hidden and useful knowledge. A vast amount of User-Generated Data (UGD) is created every day, such as user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise for the UGD analysis process, such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data needed to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterised by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets coming from three example domains characterized by the above critical issues are considered as reference cases, i.e., the health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches to discover useful knowledge from different domains.

    Improving Binary Classifier Performance Through an Informed Sampling Approach and Imputation

    In the last two decades or so, some of the most substantial advances in machine learning have related to sampling techniques. For example, boosting uses weighted sampling to improve model training, and active learning uses the unlabeled data gathered so far to decide which data points are the most relevant to ask an oracle to label. This thesis introduces a novel sampling technique that uses feature entropy to guide the sampling, a process we call informed sampling. The central idea is that the reliability of model parameter learning may be more sensitive to variables that have low, or high, entropy; therefore, adapting the sampling rate of variables based on their entropy may lead to better parameter estimates. In a series of papers, we first test this hypothesis for three classifier models, Logistic Regression (LR), Naive Bayes (NB), and Tree Augmented Naive Bayes (TAN), on a binary classification task with a 0-1 loss function. The results show that high-entropy sampling (a higher sampling rate for high-entropy variables) systematically improves the prediction performance of the TAN classifier. For the NB and LR classifiers, however, the picture is more mixed: improvements are obtained for only half of the 11 datasets used, and when they occur they usually come from high-entropy sampling, seldom from low-entropy sampling. This first experiment is replicated in a second study, this time in a more realistic context where the entropy of variables is unknown a priori and is instead estimated from seed data and adjusted on the fly.
The results showed that, using a seed dataset of 1% of the total number of instances (which ranged from a few hundred to around 1000), the improvements obtained in the former study hold for TAN, with an average improvement of 13% in RMSE reduction. For the same seed size, improvements were also obtained for the Naive Bayes classifier, by a factor of 8%, from low- instead of high-entropy sampling. The pattern of improvements for LR was almost the same as in the former study. Given that classifier improvements can be obtained through informed sampling, but that the pattern of improvements varies with the informed sampling approach and the classifier model, we further investigate how imputation methods affect this pattern. This question is important because informed sampling necessarily implies missing values, and many classifiers either require the imputation of missing values or can be improved by imputation; imputation and informed sampling are therefore likely to be combined in practice. The obvious question is whether the gains obtained from each are additive or whether they relate in a more complex manner. The gains from imputation methods are first studied in isolation, with a comparative analysis of the performance of some new and some well-known imputation algorithms, to determine to what extent the pattern of improvements is stable across classifiers for binary classification with a 0-1 loss function. Here too, the results show that the patterns of improvement of imputation algorithms can vary substantially per model and per missing-value rate. We also investigate the improvements along a different dimension, namely whether the sampling rate per record is stable or varies. Minor but statistically significant differences are observed in the results, showing that this dimension can also affect classifier performance. In a final paper, the levels of improvement from informed sampling are first compared with those from a number of imputation techniques. Next, we empirically investigate whether the gains obtained from sampling and imputation are additive, or whether they combine in a more complex manner. The results show that the individual gains from informed sampling and imputation are within the same range, and that combining high-entropy informed sampling with imputation brings significant gains to classifier performance, but generally not as a simple sum of the individual improvements. It is also noteworthy that, despite the encouraging results for some combinations of informed sampling and imputation algorithms, a detailed analysis of individual dataset results reveals that these combinations rarely bring classification performance above the top imputation algorithms or informed sampling by themselves. The results of our studies provide evidence of the effectiveness of informed sampling for improving the binary classification performance of the TAN model, and high-entropy sampling is shown to be the preferable scheme. In the context of Computerized Adaptive Testing, for example, this can be translated to favoring highly uncertain questions (items of average difficulty). The variable number of items administered is another factor that should be taken into account when imputation is involved.
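A minimal sketch of the core informed sampling idea follows: per-variable entropies are estimated from a seed sample, and each variable's observation rate grows with its entropy (high-entropy sampling). The proportional rescaling to a global budget is our assumption, not the thesis's exact scheme.

```python
import numpy as np

def entropy(values):
    """Shannon entropy (in bits) of a discrete variable."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def informed_sampling_mask(X_seed, X, budget=0.5, rng=None):
    """High-entropy informed sampling mask for the data matrix X.

    Per-variable entropies are estimated from the seed data, and each
    variable's observation rate is made proportional to its entropy,
    rescaled so the average rate matches `budget` (an illustrative scheme).
    Entries where the mask is False are left unobserved, i.e. missing,
    for the downstream classifier or imputation step.
    """
    rng = np.random.default_rng() if rng is None else rng
    ent = np.array([entropy(X_seed[:, j]) for j in range(X_seed.shape[1])])
    if ent.mean() > 0:
        rates = np.clip(ent / ent.mean() * budget, 0.0, 1.0)
    else:
        rates = np.full(ent.shape, budget)
    return rng.random(X.shape) < rates     # one observation rate per column
```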

    Development and benchmarking a novel scatter search algorithm for learning probabilistic graphical models in healthcare

    Healthcare data sets are often small, which makes building accurate inference models difficult. Many machine learning algorithms exist, but many are black boxes. Explainable models are essential in healthcare, so that practitioners can understand the developed model and incorporate domain knowledge into it. Probabilistic graphical models offer a visual way to represent relationships in data. Here we develop a new scatter search algorithm to learn Bayesian networks. This machine learning approach is applied to three case studies to assess its effectiveness in comparison with traditional machine learning techniques. First, a new scatter search approach is presented to construct the structure of a Bayesian network: statistical tests are used to build small directed acyclic graphs, which are combined in an iterative process to build up multiple larger graphs, with probability distributions fitted as the graphs are built. The resulting graphs are then scored on classification performance, and the algorithm finishes once no new solutions can be found. The first study examines the effectiveness of the scatter-search-constructed Bayesian network against other machine learning algorithms of the same class, benchmarked on standard datasets from the UCI Machine Learning Repository, which has many published studies. The second study assesses the effectiveness of the scatter search Bayesian network for classifying ovarian cancer patients; multiple other machine learning algorithms were applied alongside the Bayesian network. All data from this study were collected by clinicians from the Aneurin Bevan University Health Board, and the study concluded that machine learning techniques could be applied to classify patients based on early indicators. The third and final study applied machine learning techniques to no-show breast cancer follow-up patients; once again, the scatter search Bayesian network was used alongside other machine learning approaches. Socio-demographic and socio-economic factors involving low- to middle-income families were used together with feature selection techniques to improve machine learning performance. It was found that machine learning, when used with feature selection, could classify no-show patients with reasonable accuracy.
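    The scatter search loop described above can be summarized in a short skeleton: keep a reference set of small candidate structures, repeatedly combine pairs into larger graphs, retain the best-scoring ones, and stop when an iteration produces no new solution. The `combine` and `score` callables below are hypothetical placeholders standing in for the thesis's statistical-test construction, distribution fitting and classification scoring.

```python
import itertools

def scatter_search(initial_dags, combine, score, ref_size=10, max_iter=50):
    """Skeleton of a scatter search over Bayesian-network structures.

    `initial_dags` are small candidate structures (e.g. edge sets built from
    statistical tests), `combine` merges two candidates into a larger graph,
    and `score` fits the probability distributions and returns classification
    performance. Only the control flow is shown here.
    """
    ref_set = sorted(initial_dags, key=score, reverse=True)[:ref_size]
    for _ in range(max_iter):
        seen = {frozenset(d) for d in ref_set}
        candidates = [combine(a, b) for a, b in itertools.combinations(ref_set, 2)]
        new = [c for c in candidates if frozenset(c) not in seen]
        if not new:                        # no new solutions: the search terminates
            break
        ref_set = sorted(ref_set + new, key=score, reverse=True)[:ref_size]
    return ref_set[0]
```

    A candidate here could be as simple as a set of directed edges, with `score` fitting conditional probability tables and returning held-out classification accuracy.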