12 research outputs found

    A Probability-based Evolutionary Algorithm with Mutations to Learn Bayesian Networks

    Bayesian networks are regarded as one of the essential tools for analyzing causal relationships between events from data. Learning the structure of highly reliable Bayesian networks from data as quickly as possible is an important problem that several studies have tried to address. In recent years, probability-based evolutionary algorithms have been proposed as a new, efficient approach to learning Bayesian networks. In this paper, we target one of these probability-based evolutionary algorithms, PBIL (Population-Based Incremental Learning), and propose a new mutation operator. Through performance evaluation, we found that the proposed mutation operator performs well in learning Bayesian networks.
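
    The abstract does not describe the proposed operator itself, but the baseline PBIL scheme it modifies can be sketched roughly as follows; the mutation step shown here (a small random perturbation of the probability vector, in the spirit of Baluja's original formulation) is an illustrative assumption, not the operator proposed in the paper, and the OneMax objective is only a stand-in for a network score.

```python
import numpy as np

def pbil(score, n_bits, pop_size=50, iters=200, lr=0.1,
         mut_prob=0.02, mut_shift=0.05, rng=None):
    """Generic PBIL loop: `score` maps a 0/1 vector to a fitness to maximize."""
    rng = np.random.default_rng(rng)
    p = np.full(n_bits, 0.5)                    # probability vector over bits
    best, best_score = None, -np.inf
    for _ in range(iters):
        pop = (rng.random((pop_size, n_bits)) < p).astype(int)   # sample population
        scores = np.array([score(ind) for ind in pop])
        elite = pop[scores.argmax()]
        if scores.max() > best_score:
            best, best_score = elite.copy(), scores.max()
        p = (1 - lr) * p + lr * elite           # shift probabilities toward the elite
        mask = rng.random(n_bits) < mut_prob    # mutate a few entries of p
        p[mask] = (1 - mut_shift) * p[mask] + mut_shift * rng.integers(0, 2, mask.sum())
        p = p.clip(0.01, 0.99)                  # keep probabilities away from 0 and 1
    return best, best_score

# Example: maximize the number of ones in a 40-bit string (OneMax)
print(pbil(lambda x: int(x.sum()), n_bits=40))
```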

    Bayesian Machine Learning Techniques for revealing complex interactions among genetic and clinical factors in association with extra-intestinal Manifestations in IBD patients

    The objective of the study is to assess the predictive performance of three different techniques as classifiers for extra-intestinal manifestations in 152 patients with Crohn's disease. Naïve Bayes, Bayesian Additive Regression Trees, and Bayesian Networks are considered; the Bayesian Networks are implemented using the Greedy Thick Thinning algorithm for learning dependencies among variables and the EM algorithm for learning the conditional probabilities associated with each variable. Three sets of variables were considered: (i) disease characteristics: presentation, behavior and location; (ii) risk factors: age, gender, smoking and family history; and (iii) genetic polymorphisms of the NOD2, CD14, TNFA, IL12B, and IL1RN genes, whose involvement in Crohn's disease is known or suspected. Extra-intestinal manifestations occurred in 75 patients. Bayesian Networks achieved an accuracy of 82% when considering only clinical factors and 89% when also considering genetic information, outperforming the other techniques. CD14 has only a small predictive capability. Adding TNFA and IL12B to the 3020insC NOD2 variant improved the accuracy.

    Distributed Knowledge Discovery in Large Scale Peer-to-Peer Networks

    Explosive growth in the availability of various kinds of data in distributed locations has created an unprecedented opportunity to develop distributed knowledge discovery (DKD) techniques. DKD embraces the growing trend of merging computation with communication by performing distributed data analysis and modeling with minimal communication of data. Most current state-of-the-art DKD systems suffer from a lack of scalability, robustness and adaptability due to their dependence on a centralized model for building the knowledge discovery model. Peer-to-peer networks offer a more scalable and fault-tolerant computing platform for building distributed knowledge discovery models than client-server platforms. Algorithms and communication protocols have been developed for file search and discovery services in peer-to-peer networks. The file search algorithms are concerned with identifying a peer and discovering a file on that specified peer, so most current peer-to-peer networks for file search act as directory services. The problem of distributed knowledge discovery is different from file search services, so new issues and challenges have to be addressed. The algorithms and communication protocols for knowledge discovery are concerned with ensuring that every peer in the network discovers the correct knowledge discovery model, as if it were given the combined database. Therefore, algorithms and communication protocols for DKD mainly deal with distributed computing. The distributed computations are entirely asynchronous, impose very little communication overhead, transparently tolerate network topology changes and peer failures, and quickly adjust to changes in the data as they occur. Another important aspect of the distributed computations in a peer-to-peer network is that most of the communication between peer nodes is local, i.e., the knowledge discovery model is learned at each peer using information gathered from a very small neighborhood, whose size is independent of the size of the peer-to-peer network. The peer-to-peer constraints on data and/or computation are hard constraints, so the challenge is to show that it is still possible to extract useful information from the distributed data effectively and dependably. The implementation of a distributed algorithm in an asynchronous and decentralized environment is the hardest challenge. DKD in a peer-to-peer network raises issues related to the impracticality of global communication and global synchronization, on-the-fly data updates, lack of control, accuracy of computation, the need to share resources with other applications, and frequent failure and recovery of resources. We propose a methodology based on novel distributed algorithms and communication protocols to perform DKD in a peer-to-peer network. We investigate the performance of our algorithms and communication protocols by means of analysis and simulations.
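
    As a rough illustration of the neighborhood-local, asynchronous style of computation described above (not the specific algorithms or protocols of the thesis), the sketch below simulates pairwise gossip averaging: each peer repeatedly exchanges a local estimate with a randomly chosen neighbor, and all peers converge toward the global average using only local communication.

```python
import random

def gossip_average(adjacency, values, rounds=200, rng=None):
    """Simulate asynchronous pairwise gossip: at each step one peer averages
    its local estimate with a randomly chosen neighbor, so every peer drifts
    toward the global average while communicating only locally."""
    rng = rng or random.Random(0)
    est = dict(values)                          # peer id -> current local estimate
    for _ in range(rounds):
        peer = rng.choice(list(adjacency))      # an asynchronous "wake-up"
        if not adjacency[peer]:
            continue
        other = rng.choice(list(adjacency[peer]))
        avg = (est[peer] + est[other]) / 2.0    # one local exchange
        est[peer] = est[other] = avg
    return est

# Example: a small ring of 5 peers, each holding one local statistic
ring = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
print(gossip_average(ring, {i: float(i) for i in range(5)}))
```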

    Data Mining of Biomedical Databases

    Data mining can be defined as the nontrivial extraction of implicit, previously unknown and potentially useful information from data. This thesis is focused on data mining in biomedicine, one of its most interesting fields of application. Different kinds of biomedical data sets require different data mining approaches, and two approaches are treated in this thesis, divided into two separate and independent parts. The first part deals with Bayesian Networks, one of the most successful tools for medical diagnosis and therapy follow-up. Formally, a Bayesian Network (BN) is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph. An algorithm for Bayesian network structure learning has been developed as a variation of the standard search-and-score approach. The proposed approach avoids the creation of redundant network structures that may include non-significant connections between variables. In particular, the algorithm determines which relationships between the variables must be prevented by exploiting the binarization of a square matrix containing the mutual information (MI) among all pairs of variables. Four different binarization methods are implemented. The binarized MI matrix is used as a pre-conditioning step for the subsequent greedy search procedure that optimizes the network score, reducing the number of possible search paths. This approach has been tested on four different datasets and compared against the standard search-and-score algorithm as implemented in the DEAL package, with successful results. Moreover, a comparison among different network scores has been performed. The second part of this thesis is focused on data mining of microarray databases. An algorithm has been developed to analyze Illumina microRNA microarray data in a systematic and easy way. The algorithm includes two parts. The first part is the pre-processing, characterized by two steps: variance stabilization and normalization. Variance stabilization is performed to abrogate, or at least reduce, heteroskedasticity, while normalization is performed to minimize systematic effects that are not constant among different samples of an experiment and that are not due to the factors under investigation. Three alternative variance stabilization strategies and three alternative normalization approaches are included, so considering all possible combinations of variance stabilization and normalization strategies, nine different ways to pre-process the data are obtained. The second part of the algorithm deals with the statistical analysis for differential expression detection. Linear models and empirical Bayes methods are used. The final result is the list of microRNAs significantly differentially expressed between two conditions. The algorithm has been tested on three different real datasets and partially validated with an independent approach (quantitative real-time PCR). Moreover, the influence of different pre-processing methods on the discovery of differentially expressed microRNAs has been studied and a comparison among the different normalization methods has been performed. This is the first study comparing normalization techniques for Illumina microRNA microarray data.
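
    The binarization rules and score functions are not spelled out in the abstract; the following sketch only illustrates the general pre-conditioning idea of thresholding a pairwise mutual-information matrix to forbid edges before a score-based search. The MI estimator and the threshold choice (the mean off-diagonal MI) are assumptions made for the example, not the thesis's four binarization methods.

```python
import numpy as np
from itertools import product

def mutual_information(x, y):
    """Empirical MI (in nats) between two discrete variables given as 1-D arrays."""
    mi = 0.0
    for vx, vy in product(np.unique(x), np.unique(y)):
        pxy = np.mean((x == vx) & (y == vy))
        px, py = np.mean(x == vx), np.mean(y == vy)
        if pxy > 0:
            mi += pxy * np.log(pxy / (px * py))
    return mi

def forbidden_edge_mask(data):
    """Binarize the pairwise MI matrix: variable pairs whose MI falls below a
    threshold (here, the mean off-diagonal MI) are marked as forbidden edges,
    so the subsequent greedy search never considers adding them."""
    n_vars = data.shape[1]
    mi = np.zeros((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            mi[i, j] = mi[j, i] = mutual_information(data[:, i], data[:, j])
    threshold = mi[~np.eye(n_vars, dtype=bool)].mean()
    return mi < threshold          # True = edge i-j is excluded from the search
```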

    Advances in Evolutionary Algorithms

    With recent trends towards massive data sets and significant computational power, combined with advances in evolutionary algorithms, evolutionary computation is becoming much more relevant to practice. The aim of this book is to present recent improvements, innovative ideas and concepts in a part of the huge field of evolutionary algorithms.

    Estimation of distribution algorithms in logistics : Analysis, design, and application

    This thesis considers the analysis, design and application of Estimation of Distribution Algorithms (EDAs) in logistics. It approaches continuous nonlinear optimization problems (standard test problems and stochastic transportation problems) as well as location problems, strategic safety stock placement problems and lot-sizing problems. The thesis adds to the existing literature by proposing theoretical advances for continuous EDAs and practical applications of discrete EDAs. Thus, it should be of interest to researchers in evolutionary computation, as well as practitioners in need of efficient algorithms for the above-mentioned problems.

    Learning the structure of Bayesian networks: application to genetical-genomics data

    Learning the structure of a gene regulatory network is a complex task, due to both the high number of variables involved (several thousand) and the small number of available samples (a few hundred). Among the proposed approaches, we use the Bayesian network framework: learning a regulatory network amounts to learning the structure of a Bayesian network in which each variable represents a gene and each arc a regulation between genes. In the first part of this thesis, we are interested in learning the structure of generic Bayesian networks through local search. We explore the space of possible networks more efficiently thanks to a new stochastic search algorithm (SGS), a new local operator (SWAP), and an extension of the classical operators that temporarily relaxes the acyclicity constraint of Bayesian networks. The second part focuses on learning gene regulatory networks. We propose a model within the Bayesian network framework that takes into account two kinds of information. The first, commonly used, is the gene expression levels. The second, more original, is the presence of mutations on the DNA sequence that can explain variations in expression. The use of these combined data, called genetical genomics, aims to improve the reconstruction. Our proposals proved effective on simulated genetical-genomics data and allowed a regulatory network to be reconstructed from data observed on the plant Arabidopsis thaliana.
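
    The SGS algorithm and the SWAP operator are only named above; as background, a generic score-based local search over DAGs with the classical add/delete/reverse edge moves (the kind of search those contributions extend) might look like the sketch below, with the scoring function left abstract.

```python
import itertools, random

def has_cycle(adj):
    """Detect a directed cycle in an adjacency dict {node: set(children)}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in adj}
    def visit(v):
        color[v] = GREY
        for w in adj[v]:
            if color[w] == GREY or (color[w] == WHITE and visit(w)):
                return True
        color[v] = BLACK
        return False
    return any(visit(v) for v in adj if color[v] == WHITE)

def hill_climb(nodes, score, max_steps=1000, rng=None):
    """Greedy local search over DAGs using add/delete/reverse edge moves;
    `score` maps an adjacency dict to a network score to maximize."""
    rng = rng or random.Random(0)
    adj = {v: set() for v in nodes}
    current = score(adj)
    for _ in range(max_steps):
        moves = []
        for a, b in itertools.permutations(nodes, 2):
            moves += [("del", a, b), ("rev", a, b)] if b in adj[a] else [("add", a, b)]
        rng.shuffle(moves)
        improved = False
        for op, a, b in moves:
            cand = {v: set(ch) for v, ch in adj.items()}
            if op == "add": cand[a].add(b)
            elif op == "del": cand[a].discard(b)
            else: cand[a].discard(b); cand[b].add(a)
            if has_cycle(cand):
                continue                          # keep only acyclic candidates
            s = score(cand)
            if s > current:
                adj, current, improved = cand, s, True
                break
        if not improved:
            break                                 # local optimum reached
    return adj, current
```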

    Adaptive learning algorithms for Bayesian network classifiers

    Doctorate in Mathematics. This thesis mainly addresses the development of adaptive learning algorithms for Bayesian network classifiers (BNCs) in an on-line learning scenario. In this scenario, data arrives at the learning system sequentially. The predictive model must first make a prediction and then update the current model with the new data. This scenario corresponds to Dawid's prequential approach for the statistical validation of models. An efficient adaptive algorithm in a prequential learning framework must be able, above all, to improve its predictive accuracy over time while reducing the cost of adaptation. However, in many real-world situations it may be difficult to improve and adapt to changing environments, a problem known as concept drift. In changing environments, learning algorithms should be provided with control and adaptation mechanisms that strive to adjust quickly to these changes. We have integrated all the adaptive algorithms into an adaptive prequential framework for supervised learning called AdPreqFr4SL, which attempts to handle the cost-performance trade-off and also to cope with concept drift. The cost-quality trade-off is approached through bias management and adaptation control. The rationale is as follows. Instead of selecting a particular class of BNCs and using it throughout the learning process, we use the class of k-Dependence Bayesian classifiers (k-DBCs) and start with the simple Naïve Bayes (by setting the maximum number of allowable attribute dependencies, k, to 0). We can then improve the performance of Naïve Bayes over time if we trade off bias reduction, achieved by adding new attribute dependencies, against variance reduction, achieved by accurately estimating the parameters. However, as the learning process advances we should place more focus on bias management. We reduce the bias resulting from the independence assumption by gradually adding dependencies between the attributes over time. To this end, we gradually increase k so that at each learning step we can use a class of k-DBCs that better suits the available data. Thus, we can avoid the problems caused by either too much bias (underfitting) or too much variance (overfitting). On the other hand, updating the structure of BNCs with new data is a very costly task. Hence some adaptation control is desirable to decide whether it is necessary to adapt the structure. We reduce the cost of updating by primarily using new data to adapt the parameters. Only when it is detected that the current structure no longer guarantees the desired improvement in performance do we adapt the structure. To handle concept drift, our framework includes a method based on Statistical Quality Control, which has been shown to be effective at recognizing concept changes. We experimentally evaluated AdPreqFr4SL on artificial domains and benchmark problems and show its advantages in comparison with its non-adaptive versions.
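
    The Statistical Quality Control procedure itself is not detailed in the abstract; the sketch below only illustrates the general idea of monitoring the classifier's prequential error rate and signalling a drift alarm when it rises significantly above the best level observed, in the spirit of a control chart. The three-sigma rule, warm-up length and reset policy are assumptions for the example, not the thesis's method.

```python
class ErrorRateMonitor:
    """Toy drift detector in the spirit of statistical quality control:
    track the running error rate of a classifier and raise an alarm when it
    exceeds the best observed level by more than n_sigma standard errors."""

    def __init__(self, n_sigma=3.0, warm_up=30):
        self.n_sigma, self.warm_up = n_sigma, warm_up
        self.reset()

    def reset(self):
        self.n = self.errors = 0
        self.best_p, self.best_std = 1.0, 0.0

    def update(self, was_error):
        """Feed one prequential outcome (True if the prediction was wrong)."""
        self.n += 1
        self.errors += int(was_error)
        p = self.errors / self.n                       # running error rate
        std = (p * (1 - p) / self.n) ** 0.5            # its standard error
        if self.n < self.warm_up:
            return "learning"
        if p + std < self.best_p + self.best_std:      # remember the best level seen
            self.best_p, self.best_std = p, std
        if p + std > self.best_p + self.n_sigma * self.best_std:
            self.reset()                               # alarm: adapt, then start over
            return "drift"
        return "in_control"
```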

    Learning Bayesian networks in the space of structures by estimation of distribution algorithms

    The induction of the optimal Bayesian network structure is NP-hard, justifying the use of search heuristics. Two novel population-based stochastic search approaches, the univariate marginal distribution algorithm (UMDA) and population-based incremental learning (PBIL), are used to learn a Bayesian network structure from a database of cases in a score-and-search framework. A comparison with a genetic algorithm (GA) approach is performed using three different scores: penalized maximum likelihood, marginal likelihood, and information-theory-based entropy. Experimental results show the interesting capabilities of both novel approaches with respect to the score value and the number of generations needed to converge.
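
    For contrast with the incremental PBIL update sketched earlier, a minimal UMDA refits the bit-wise marginal probabilities from the individuals selected in each generation. The encoding hinted at in the comments (a bit-string for the upper triangle of a connectivity matrix under a fixed node ordering) and the probability clipping are assumptions made for illustration; the score would be one of the metrics mentioned above, e.g. penalized maximum likelihood.

```python
import numpy as np

def umda(score, n_bits, pop_size=100, select_frac=0.5, iters=100, rng=None):
    """Univariate Marginal Distribution Algorithm over bit-strings.
    For structure learning, a bit-string could encode the upper triangle of a
    connectivity matrix under a fixed node ordering (an assumption here), and
    `score` would be a decomposable network score evaluated on the data."""
    rng = np.random.default_rng(rng)
    p = np.full(n_bits, 0.5)                    # independent bit-wise marginals
    n_sel = max(2, int(select_frac * pop_size))
    best, best_score = None, -np.inf
    for _ in range(iters):
        pop = (rng.random((pop_size, n_bits)) < p).astype(int)   # sample population
        scores = np.array([score(ind) for ind in pop])
        order = np.argsort(scores)[::-1]        # best individuals first
        if scores[order[0]] > best_score:
            best, best_score = pop[order[0]].copy(), scores[order[0]]
        p = pop[order[:n_sel]].mean(axis=0).clip(0.05, 0.95)     # refit marginals
    return best, best_score
```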