40 research outputs found
Locally weighted learning: How and when does it work in Bayesian networks?
© 2016, Taylor and Francis Ltd. All rights reserved. A Bayesian network (BN), a simple graphical notation for conditional independence assertions, is well suited to representing probabilistic relationships such as those between diseases and symptoms. Learning the structure of a Bayesian network classifier (BNC) encodes conditional independence assumptions between attributes, which may degrade classification performance. One major approach to mitigating the BNC’s primary weakness (the attribute independence assumption) is the locally weighted approach. This type of approach has been shown to achieve good performance for naive Bayes (NB), a BNC with a simple structure. However, it is not known whether, or how effectively, it improves the performance of complex BNCs. In this paper, we first survey the complex structure models for BNCs and their improvements, then carry out a systematic experimental analysis to investigate the effectiveness of the locally weighted method for complex BNCs, e.g., tree-augmented naive Bayes (TAN), averaged one-dependence estimators (AODE) and hidden naive Bayes (HNB), measured by classification accuracy (ACC) and the area under the ROC curve (AUC). Experiments and comparisons on 36 benchmark data sets collected from the University of California, Irvine (UCI), run in the Weka system, demonstrate that locally weighted versions only slightly outperform unweighted complex BNCs on ACC and AUC. In other words, although local weighting can significantly improve the performance of NB, it does not work as well on BNCs with complex structures. This is because the performance improvements of BNCs are attributed to their structures, not to the local weighting.
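To make the locally weighted idea concrete, the following is a minimal sketch of a locally weighted naive Bayes prediction for categorical attributes. It is not the authors' implementation: the abstract does not state the weighting scheme, so the Hamming distance, the k-nearest-neighbour linear kernel, the Laplace smoothing, and every function and variable name here are illustrative assumptions.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of attribute positions where two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def locally_weighted_nb(train_X, train_y, query, k=5):
    """Class posteriors for `query` from a naive Bayes model fitted only to
    the k nearest training instances, each weighted by its closeness.
    Assumed scheme: linear kernel w = 1 - d / (d_k + 1)."""
    ranked = sorted(zip(train_X, train_y), key=lambda xy: hamming(xy[0], query))
    neighbours = ranked[:k]
    # Bandwidth: distance of the farthest neighbour, plus 1 so that it
    # still receives a small positive weight.
    d_max = hamming(neighbours[-1][0], query) + 1
    weighted = [(x, y, 1.0 - hamming(x, query) / d_max) for x, y in neighbours]

    classes = sorted(set(train_y))
    # Distinct values per attribute, used for Laplace smoothing.
    values = [set(x[j] for x in train_X) for j in range(len(query))]

    class_w = defaultdict(float)  # class -> summed weight
    cond_w = defaultdict(float)   # (class, attribute index, value) -> summed weight
    for x, y, w in weighted:
        class_w[y] += w
        for j, v in enumerate(x):
            cond_w[(y, j, v)] += w
    total_w = sum(class_w.values())

    scores = {}
    for c in classes:
        p = (class_w[c] + 1.0) / (total_w + len(classes))
        for j, v in enumerate(query):
            p *= (cond_w[(c, j, v)] + 1.0) / (class_w[c] + len(values[j]))
        scores[c] = p
    norm = sum(scores.values())
    return {c: p / norm for c, p in scores.items()}


# Toy data, invented for illustration only.
train_X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
           ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
train_y = ["no", "no", "yes", "yes", "no", "yes"]
result = locally_weighted_nb(train_X, train_y, ("rainy", "mild"), k=4)
predicted = max(result, key=result.get)
```

The same local weighting could in principle wrap a more complex BNC such as TAN or AODE; the sketch uses naive Bayes only because its estimates are simple weighted counts.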
The role of classifiers in feature selection: Number vs nature
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Wrapper feature selection approaches are widely used to select a small subset of relevant features from a dataset. However, Wrappers suffer from the fact that they use only a single classifier when selecting the features. The problem with using a single classifier is that each classifier is of a different nature and has its own biases, so each classifier will select different feature subsets. To address this problem, this thesis investigates the effects of using different classifiers for Wrapper feature selection. More specifically, it investigates the effects of using different numbers of classifiers and classifiers of different natures.
This aim is achieved by proposing a new data mining method called Wrapper-based Decision Trees (WDT). The WDT method has the ability to combine multiple classifiers from four different families, including Bayesian Network, Decision Tree, Nearest Neighbour and Support Vector Machine, to select relevant features and visualise the relationships among the selected features using decision trees. Specifically, the WDT method is applied to investigate three research questions of this thesis: (1) the effects of number of classifiers on feature selection results; (2) the effects of nature of classifiers on feature selection results; and (3) which of the two (i.e., number or nature of classifiers) has more of an effect on feature selection results. Two types of user preference datasets derived from Human-Computer Interaction (HCI) are used with WDT to assist in answering these three research questions.
The results from the investigation revealed that the number of classifiers and the nature of classifiers greatly affect feature selection results. In terms of the number of classifiers, the results showed that few classifiers selected many relevant features whereas many classifiers selected few relevant features. In addition, it was found that using three classifiers resulted in highly accurate feature subsets. In terms of the nature of classifiers, it was shown that Decision Tree, Bayesian Network and Nearest Neighbour classifiers caused significant differences in both the number of features selected and the accuracy levels of the features. A comparison of results regarding the number and the nature of classifiers revealed that the former has more of an effect on feature selection than the latter.
The thesis makes contributions to three communities: data mining, feature selection, and HCI. For the data mining community, this thesis proposes a new method called WDT, which integrates the use of multiple classifiers for feature selection with decision trees to effectively select and visualise the most relevant features within a dataset. For the feature selection community, the results of this thesis have shown that the number of classifiers and the nature of classifiers can truly affect the feature selection process. The results, and the suggestions based on them, can provide useful insight about classifiers when performing feature selection. For the HCI community, this thesis has shown the usefulness of feature selection for identifying a small number of highly relevant features for determining the preferences of different users.
Performance Evaluation of Smart Decision Support Systems on Healthcare
Medical activity requires responsibility not only for clinical knowledge and skill but
also for the management of an enormous amount of information related to patient care. It is
through proper treatment of information that experts can consistently build a healthy wellness
policy. The primary objective for the development of decision support systems (DSSs) is
to provide information to specialists when and where they are needed. These systems provide
information, models, and data manipulation tools to help experts make better decisions in a
variety of situations.
Most of the challenges that smart DSSs face come from the great difficulty of dealing
with large volumes of information, which is continuously generated by the most diverse types
of devices and equipment, requiring high computational resources. This situation makes this
type of system prone to retrieving information too slowly for decision making. As a
result of this adversity, information quality and the provision of an infrastructure capable
of promoting the integration and articulation among different health information systems (HIS)
have become promising research topics in the field of electronic health (e-health) and, for this
same reason, are addressed in this research. The work described in this thesis is motivated
by the need to propose novel approaches to deal with problems inherent to the acquisition,
cleaning, integration, and aggregation of data obtained from different sources in e-health environments,
as well as their analysis.
To ensure the success of data integration and analysis in e-health environments, it
is essential that machine-learning (ML) algorithms ensure system reliability. However, in this
type of environment, it is not possible to guarantee a fully reliable scenario. This makes
smart DSSs susceptible to prediction failures, which severely compromise overall system
performance. On the other hand, systems can also have their performance compromised by
information overload.
To solve some of these problems, this thesis presents several proposals and studies
on the impact of ML algorithms in the monitoring and management of hypertensive disorders
related to high-risk pregnancy. The primary goal of the proposals presented in this thesis is
to improve the overall performance of health information systems. In particular, ML-based
methods are exploited to improve the prediction accuracy and optimize the use of monitoring
device resources. It was demonstrated that the use of this type of strategy and methodology
contributes to a significant increase in the performance of smart DSSs, not only in precision
but also in reducing the computational cost of the classification process.
The observed results seek to advance the state of the art in methods
and strategies based on AI that aim to overcome some challenges that emerge from the integration
and performance of smart DSSs. With the use of algorithms based on AI, it is possible to
quickly and automatically analyze a larger volume of complex data and focus on more accurate
results, providing high-value predictions for better decision making in real time and without
human intervention.
Methods to Improve Virtual Screening of Potential Drug Leads for Specific Pharmacodynamic and Toxicological Properties
Ph.D. (Doctor of Philosophy)
The classification performance of Bayesian Networks Classifiers: a case study of detecting Denial of Service (DoS) attacks in cloud computing environments
In this research we propose a Bayesian network approach as a promising classification technique for detecting malicious traffic due to Denial of Service (DoS) attacks. Bayesian networks have been applied successfully in numerous fields fraught with uncertainty. They have excelled in classification tasks, e.g. text analysis, medical diagnosis, and environmental modeling and management. The detection of DoS attacks has received tremendous attention in the field of network security. DoS attacks have proved to be detrimental and are the bane of cloud computing environments. Large business enterprises have been, or still are, unwilling to outsource their businesses to the cloud because of the intrusions that cloud platforms are prone to. To make use of Bayesian networks it is imperative to understand the “ecosystem” of factors that are external to modeling the Bayesian algorithm itself. Understanding these factors has proven to result in comparable improvement in classification performance beyond the augmentation of the existing algorithms. The literature discusses the factors that affect classification capability; however, the effects of these factors are not universal and tend to be unique to each problem domain. This study investigates the effects of modeling parameters on the classification performance of Bayesian network classifiers in detecting DoS attacks on cloud platforms. We analyzed how structural complexity, training sample size, the choice of discretization method and, lastly, the score function affect, both individually and collectively, the performance of distinguishing normal traffic from DoS attacks on the cloud. To study these factors, we conducted a series of experiments in detecting live DoS attacks launched against a deployed cloud and then examined the classification performance, in terms of accuracy, of different classes of Bayesian networks.
The NSL-KDD dataset was used as our training set. We used ownCloud software to deploy our cloud platform. To launch DoS attacks, we used the hacker-friendly hping3 utility. A live packet capture was used as our test set. WEKA version 3.7.12 was used for our experiments. Our results show that increasing model complexity improves classification performance. This is attributed to the increase in the number of attribute correlations. Increasing the training sample size also improved classification ability. Our findings show that the choice of discretization algorithm does matter in the quest for optimal classification performance. Furthermore, our results indicate that the choice of scoring function does not affect the classification performance of Bayesian networks. The conclusions drawn from this research are prescriptive, particularly for novice machine learning researchers, and offer valuable recommendations for achieving optimal classification performance with Bayesian network classifiers.
The influence of human factors on user's preferences of web-based applications: A data mining approach
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University on 20/12/2010. As the Web is fast becoming an integral feature of many of our daily lives, designers are faced with the challenge of designing Web-based applications for an increasingly diverse user group. In order to develop applications that successfully meet the needs of this user group, designers have to understand the influence of human factors upon users’ needs and preferences. To address this issue, this thesis presents an investigation that analyses the influence of three human factors (cognitive style, prior knowledge and gender differences) on users’ preferences for Web-based applications. In particular, two applications are studied: Web search tools and Web-based instruction tools. Previous research has suggested a number of relationships between these three human factors, so this thesis was driven by three research questions. Firstly, to what extent are the two cognitive style dimensions of Witkin’s Field Dependence/Independence and Pask’s Holism/Serialism similar? Secondly, to what extent do computer experts have the same preferences as Internet experts, and computer novices the same preferences as Internet novices? Finally, to what extent are Field Independent users, experts and males alike, and Field Dependent users, novices and females alike? As traditional statistical analysis methods would struggle to capture such relationships effectively, this thesis proposes an integrated data mining approach that combines feature selection and decision trees to capture users’ preferences. From this, a framework is developed that integrates the combined effect of the three human factors and can be used to inform system designers.
The findings suggest, firstly, that there are links between these three human factors. In terms of cognitive style, the relationship between Field Dependent users and Holists can be seen more clearly than the relationship between Field Independent users and Serialists. In terms of prior knowledge, although there is shown to be a link between computer experience and Internet experience, computer experts are shown to have similar preferences to Internet novices. In terms of the relationship between all three human factors, the results of this study showed that the links between cognitive style and gender and between cognitive style and system experience were stronger than the relationship between system experience and gender. This work contributes both theory and methodology to multiple academic communities, including human-computer interaction, information retrieval and data mining. In terms of theory, it has helped to deepen the understanding of the effects of single and multiple human factors on users’ preferences for Web-based applications. In terms of methodology, an integrated data mining analysis approach was proposed and was shown to be able to capture users’ preferences.
Modified Mahalanobis Taguchi System for Imbalance Data Classification
The Mahalanobis Taguchi System (MTS) is considered one of the most promising binary classification algorithms for handling imbalanced data. Unfortunately, MTS lacks a method for determining an efficient threshold for the binary classification. In this paper, a nonlinear optimization model, named the Modified Mahalanobis Taguchi System (MMTS), is formulated based on minimizing the distance between the MTS Receiver Operating Characteristic (ROC) curve and the theoretical optimal point. To validate the MMTS classification efficacy, it has been benchmarked against Support Vector Machines (SVMs), Naive Bayes (NB), Probabilistic Mahalanobis Taguchi Systems (PTM), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Conformal Transformation (ACT), Kernel Boundary Alignment (KBA), Hidden Naive Bayes (HNB), and other improved Naive Bayes algorithms. MMTS outperforms the benchmarked algorithms, especially when the imbalance ratio is greater than 400. A real-life case study in the manufacturing sector is used to demonstrate the applicability of the proposed model and to compare its performance with the Mahalanobis Genetic Algorithm (MGA).
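The threshold criterion described above can be illustrated with a short sketch. It assumes that each sample already has a distance-like score (for MTS, a Mahalanobis distance) where larger values indicate the positive class, and it takes the theoretical optimal point to be (FPR = 0, TPR = 1). The paper formulates a nonlinear optimization model; the exhaustive search over observed scores below is a simplified stand-in, and all names and data are hypothetical.

```python
import math


def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for each distinct score used as a cut-off.
    A sample is predicted positive when its score >= threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / neg, tp / pos))
    return points


def closest_to_ideal_threshold(scores, labels):
    """Pick the cut-off whose ROC point is nearest to (FPR=0, TPR=1)
    in Euclidean distance."""
    return min(roc_points(scores, labels),
               key=lambda p: math.hypot(p[1], 1.0 - p[2]))


# Toy scores and labels, invented for illustration only.
scores = [0.2, 0.5, 0.9, 1.4, 2.8, 3.1, 3.3, 4.0]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
threshold, fpr, tpr = closest_to_ideal_threshold(scores, labels)
```

On this toy data the search selects the cut-off 2.8, which captures all three positives at the cost of one false positive.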