
    Interval Estimation Naïve Bayes

    Abstract. Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier that assumes conditional independence among features given the class, called naïve Bayes, is competitive with state-of-the-art classifiers. This paper proposes a new naïve Bayes classifier called Interval Estimation naïve Bayes, which works in two phases. In the first phase, an interval estimate is computed for each probability needed to specify the naïve Bayes model. In the second phase, the best combination of values inside these intervals is found with a heuristic search guided by the accuracy of the resulting classifiers. The values found by the search become the new parameters of the naïve Bayes classifier. The new approach is shown to be quite competitive with simple naïve Bayes. Experimental tests were run on 21 data sets from the UCI repository.
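    The first phase described above can be illustrated with a minimal sketch. The abstract does not say which interval estimator is used, so the Wald (normal approximation) interval below is an assumption, and the function name `wald_interval` and the default `z=1.96` are hypothetical choices for illustration. The heuristic search of the second phase is not shown.

```python
import math


def wald_interval(count, total, z=1.96):
    """Approximate confidence interval for the proportion count / total.

    Assumption: a Wald (normal approximation) interval; the paper's
    actual estimator is not given in the abstract.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = count / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    # Clip to [0, 1] so both bounds remain valid probabilities.
    return max(0.0, p - half_width), min(1.0, p + half_width)
```

    Under these assumptions, a search procedure would then pick one value inside each interval and keep the combination that gives the most accurate classifier.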

    Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains

    BACKGROUND: The subcellular location of a protein is closely related to its function. It would be worthwhile to develop a method that predicts the subcellular location of a protein when only its amino acid sequence is known. Although many efforts have been made to predict subcellular location from sequence information alone, further research is needed to improve prediction accuracy. RESULTS: A novel method called HensBC is introduced to predict protein subcellular location. HensBC is a recursive algorithm that constructs a hierarchical ensemble of classifiers. The classifiers used are Bayesian classifiers based on Markov chain models. We tested our method on six datasets, including a Gram-negative bacteria dataset, data for discriminating outer membrane proteins, and an apoptosis protein dataset. We observed that our method can predict the subcellular location with high accuracy. Another advantage of the proposed method is that it can improve prediction accuracy for classes with few training sequences, which makes it useful for datasets with an imbalanced class distribution. CONCLUSION: This study introduces an algorithm that uses only the primary sequence of a protein to predict its subcellular location. The proposed recursive scheme represents an interesting methodology for learning and combining classifiers. Empirical results indicate that the method is computationally efficient and competitive with previously reported approaches in prediction accuracy. The code for the software is available upon request.
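    A Bayesian classifier based on Markov chain models can be sketched as follows. This is not HensBC itself: it shows only a single base classifier, not the recursive hierarchical ensemble, and the first-order chain, the additive smoothing constant `alpha`, and the class name `MarkovChainClassifier` are all assumptions made for illustration. Sequences are assumed to be non-empty and drawn from the given alphabet.

```python
import math
from collections import defaultdict


class MarkovChainClassifier:
    """One first-order Markov chain per class, combined with Bayes' rule."""

    def __init__(self, alphabet, alpha=1.0):
        self.alphabet = list(alphabet)
        self.alpha = alpha  # additive smoothing (an assumption)
        self.log_prior, self.log_init, self.log_trans = {}, {}, {}

    def fit(self, sequences, labels):
        by_class = defaultdict(list)
        for seq, label in zip(sequences, labels):
            by_class[label].append(seq)
        for label, seqs in by_class.items():
            self.log_prior[label] = math.log(len(seqs) / len(sequences))
            # Start every count at alpha so no probability is zero.
            init = {a: self.alpha for a in self.alphabet}
            trans = {a: {b: self.alpha for b in self.alphabet}
                     for a in self.alphabet}
            for seq in seqs:
                init[seq[0]] += 1
                for a, b in zip(seq, seq[1:]):
                    trans[a][b] += 1
            init_total = sum(init.values())
            self.log_init[label] = {a: math.log(c / init_total)
                                    for a, c in init.items()}
            self.log_trans[label] = {
                a: {b: math.log(c / sum(row.values()))
                    for b, c in row.items()}
                for a, row in trans.items()
            }
        return self

    def log_score(self, seq, label):
        """Unnormalised log posterior: log prior plus chain log-likelihood."""
        score = self.log_prior[label] + self.log_init[label][seq[0]]
        for a, b in zip(seq, seq[1:]):
            score += self.log_trans[label][a][b]
        return score

    def predict(self, seq):
        return max(self.log_prior, key=lambda lab: self.log_score(seq, lab))
```

    For protein data the alphabet would be the amino acid letters; the toy two-letter alphabet used when trying this out is only for readability.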

    Model-based classification for subcellular localization prediction of proteins


    Utilização de técnicas de data mining na previsão do plano terapêutico em medicina intensiva (Using data mining techniques to predict the therapeutic plan in intensive care medicine)

    Master's dissertation in Information Systems Engineering and Management. One of the main problems in intensive care medicine concerns the therapeutic plan, in particular which drugs should be administered to a patient and when. Within the therapeutic plan, rapid interpretation and accurate assessment of physiological data are crucial for efficient and effective decision-making by doctors. To support those decisions, this work aims to predict the sepsis level and the best treatment for patients with microbiological problems, based on sepsis levels. A set of Data Mining (DM) models was developed, using forecasting techniques and classification models, to help a doctor decide which therapy is appropriate and which has a high success rate. The data used in the DM models were collected at the Intensive Care Department of the Hospital de Santo António, Porto, Portugal. Classification DM models were considered to predict the sepsis level and the therapeutic plan for patients with sepsis in a supervised learning approach. Models were induced with the following algorithms: Decision Trees, Support Vector Machines and the Naïve Bayes classifier. The Confusion Matrix, with its associated metrics, and Cross-validation were used for evaluation. The associated metrics analysed were the total error rate, sensitivity, specificity and accuracy, which made it possible to identify the most relevant measures for predicting the sepsis level and the treatment plan under study. In conclusion, it was possible to predict the sepsis level with high accuracy, but not the medication. Although the sepsis models achieved good results, the therapeutic plan models did not reach the same level of accuracy. The results show that in general there is a weak correlation between sepsis level and therapeutic plan, with respect to the drug group. However, for some drug groups the models performed well (accuracy exceeded 80% in some classes).
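    The evaluation metrics named in this abstract follow directly from the cells of a confusion matrix. The sketch below assumes a binary problem with one positive class; the function name `confusion_metrics` and its argument names are hypothetical, and the dissertation's own multi-class setup is not reproduced here.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Metrics from a binary confusion matrix.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives (binary case is an assumption).
    """
    total = tp + fp + fn + tn
    return {
        "error_rate": (fp + fn) / total,   # share of wrong predictions
        "sensitivity": tp / (tp + fn),     # true positive rate
        "specificity": tn / (tn + fp),     # true negative rate
        "accuracy": (tp + tn) / total,     # share of correct predictions
    }
```

    With cross-validation, these counts would typically be accumulated over the held-out folds before the ratios are computed.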

    Master of Science

    Thesis. Respiratory Syncytial Virus (RSV), a major cause of bronchiolitis, has a large impact on the census of pediatric hospitals during outbreaks. Using readily available data, reliable prediction of the week these outbreaks will start could help pediatric hospitals better prepare for staffing and supplies. Naïve Bayes (NB) classifier models were constructed using weather data from 1985 to 2008, considering only variables that were available in real time and that could be used to forecast the week in which an RSV outbreak would occur in Salt Lake County, Utah (SLC). Outbreak start dates were documented by a panel of experts using 32,509 records with ICD-9 coded RSV and bronchiolitis diagnoses from Intermountain Healthcare hospitals and clinics for the RSV seasons from 1985 to 2008. NB models predicted RSV outbreaks up to three weeks in advance of the start date with an estimated sensitivity of up to 67% and estimated specificities as high as 94% to 100%. Temperature and wind speed were the best overall predictors, but other weather variables also showed relevance depending on how far in advance the predictions were made. The weather conditions predictive of an RSV outbreak in this study were similar to those that lead to temperature inversions in the Salt Lake Valley. We demonstrate that Naïve Bayes classifier models based on weather data available in real time have the potential to be used as effective predictive models. These models may be able to predict the week that an RSV outbreak will occur with clinical relevance. Their clinical usefulness will be field tested during the next five years.

    Abstraction, aggregation and recursion for generating accurate and simple classifiers

    An important goal of inductive learning is to generate accurate and compact classifiers from data. In a typical inductive learning scenario, instances in a data set are simply represented as ordered tuples of attribute values. In our research, we explore three methodologies to improve the accuracy and compactness of the classifiers: abstraction, aggregation, and recursion.

    Firstly, abstraction is aimed at the design and analysis of algorithms that generate and deal with taxonomies for the construction of compact and robust classifiers. In many applications of the data-driven knowledge discovery process, taxonomies have been shown to be useful in constructing compact, robust, and comprehensible classifiers. However, in many application domains, human-designed taxonomies are unavailable. We introduce algorithms for automated construction of taxonomies inductively from both structured (such as UCI Repository) and unstructured (such as text and biological sequences) data. We introduce AVT-Learner, an algorithm for automated construction of attribute value taxonomies (AVT) from data, and Word Taxonomy Learner (WTL), an algorithm for automated construction of word taxonomy from text and sequence data. We describe experiments on the UCI data sets and compare the performance of AVT-NBL (an AVT-guided Naive Bayes Learner) with that of the standard Naive Bayes Learner (NBL). Our results show that the AVTs generated by AVT-Learner are competitive with human-generated AVTs (in cases where such AVTs are available). AVT-NBL using AVTs generated by AVT-Learner achieves classification accuracies that are comparable to or higher than those obtained by NBL, and the resulting classifiers are significantly more compact than those generated by NBL. Similarly, our experimental results of WTL and WTNBL on protein localization sequences and Reuters newswire text categorization data sets show that the proposed algorithms can generate Naive Bayes classifiers that are more compact and often more accurate than those produced by the standard Naive Bayes learner for the Multinomial Model.

    Secondly, we apply aggregation to construct features as a multiset of values for the intrusion detection task. For this task, we propose a bag of system calls representation for system call traces and describe misuse and anomaly detection results on the University of New Mexico (UNM) and MIT Lincoln Lab (MIT LL) system call sequences with the proposed representation. With the feature representation as input, we compare the performance of several machine learning techniques for misuse detection and show experimental results on anomaly detection. The results show that standard machine learning and clustering techniques using the simple bag of system calls representation, based on the system call traces generated by the operating system's kernel, are effective and often perform better than approaches that use foreign contiguous sequences in detecting intrusive behaviors of compromised processes.

    Finally, we construct a set of classifiers by recursive application of the Naive Bayes learning algorithms. The Naive Bayes (NB) classifier relies on the assumption that the instances in each class can be described by a single generative model. This assumption can be restrictive in many real world classification tasks. We describe the recursive Naive Bayes learner (RNBL), which relaxes this assumption by constructing a tree of Naive Bayes classifiers for sequence classification, where each individual NB classifier in the tree is based on an event model (one model for each class at each node in the tree). In our experiments on protein sequences, Reuters newswire documents and UC-Irvine benchmark data sets, we observe that RNBL substantially outperforms the NB classifier. Furthermore, our experiments on the protein sequences and the text documents show that RNBL outperforms the C4.5 decision tree learner (using tests on sequence composition statistics as the splitting criterion) and yields accuracies that are comparable to those of support vector machines (SVM) using similar information.
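    The bag of system calls representation mentioned in this abstract treats a trace as a multiset, discarding the order of calls and keeping only their counts. A minimal sketch, assuming a fixed, known vocabulary of system call names; the function name `bag_of_system_calls` and the example call names are hypothetical and not taken from the UNM or MIT LL data.

```python
from collections import Counter


def bag_of_system_calls(trace, vocabulary):
    """Map an ordered trace of system call names to a count vector.

    The output has one entry per vocabulary item, in vocabulary order.
    Calls that never occur in the trace get a count of zero; calls
    outside the vocabulary are ignored (an assumption of this sketch).
    """
    counts = Counter(trace)
    return [counts[call] for call in vocabulary]
```

    Each trace then becomes a fixed-length feature vector that can be passed to a standard classifier or clustering method.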