
    From 'tree' based Bayesian networks to mutual information classifiers : deriving a singly connected network classifier using an information theory based technique

    For reasoning under uncertainty the Bayesian network has become the representation of choice. However, except where models are considered 'simple', the tasks of construction and inference are provably NP-hard. For modelling larger 'real' world problems this computational complexity has been addressed by methods that approximate the model. The Naive Bayes classifier, which makes strong independence assumptions among features, is a common approach, whilst the class of trees is another, less extreme, example. In this thesis we propose the use of an information theory based technique as a mechanism for inference in Singly Connected Networks. We call this a Mutual Information Measure classifier, as it corresponds to the restricted class of trees built from mutual information. We show that the new approach provides an efficient and localised method of classification, with predictive accuracies comparable with those of the less restricted general Bayesian networks. To improve the performance of the classifier, we additionally investigate expanding the class Markov blanket by means of a Wrapper approach, and show that performance can be improved by focusing on the class Markov blanket, without this improvement coming at the expense of increased complexity. Finally, the two methods are applied to the task of diagnosis in a 'real' world medical domain, Acute Abdominal Pain, known to be a difficult and challenging domain to classify. The objective was to investigate the optimality claims that some researchers have made for the Naive Bayes classifier in this domain. Despite some loss of representational capability, we show that the Mutual Information Measure classifier can be applied effectively to the domain while providing a recognisable qualitative structure that does not violate 'real' world assertions. For its 'selective' variant we further show that the improved classifier achieves a predictive accuracy comparable to the Naive Bayes classifier, and that the Naive Bayes classifier's 'overall' performance is largely due to the contribution of the majority group, Non-Specific Abdominal Pain, a group of exclusion.
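
    The construction described above, a tree of variables selected by pairwise mutual information, follows the spirit of the Chow-Liu procedure and can be sketched as follows. This is an illustrative reconstruction, not the thesis's own code; all function and variable names are invented.

        # Sketch: build a tree skeleton over discrete variables as a maximum
        # spanning tree on pairwise mutual information (Kruskal + union-find).
        import numpy as np
        from collections import Counter
        from itertools import combinations

        def mutual_information(x, y):
            """Empirical mutual information (in nats) of two discrete columns."""
            n = len(x)
            pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
            return sum((c / n) * np.log(c * n / (px[a] * py[b]))
                       for (a, b), c in pxy.items())

        def mutual_information_tree(data):
            """data: 2-D integer array (rows = samples, columns = variables).
            Returns tree edges (i, j) with maximal total pairwise MI."""
            n_vars = data.shape[1]
            scored = sorted(((mutual_information(data[:, i], data[:, j]), i, j)
                             for i, j in combinations(range(n_vars), 2)),
                            reverse=True)
            parent = list(range(n_vars))

            def root(u):
                while parent[u] != u:
                    parent[u] = parent[parent[u]]
                    u = parent[u]
                return u

            edges = []
            for _, i, j in scored:
                ri, rj = root(i), root(j)
                if ri != rj:  # the edge joins two components, so keep it
                    parent[ri] = rj
                    edges.append((i, j))
            return edges

        # Example with random binary data over four variables:
        rng = np.random.default_rng(0)
        print(mutual_information_tree(rng.integers(0, 2, size=(200, 4))))

    Inference on the resulting singly connected structure stays local and efficient, which is the property the thesis exploits.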

    A probabilistic reasoning and learning system based on Bayesian belief networks

    SIGLE. Available from British Library Document Supply Centre (DSC:DX173015) / BLDSC - British Library Document Supply Centre, GB, United Kingdom

    Estimation of Distribution Algorithms and Minimum Relative Entropy

    In the field of optimization using probabilistic models of the search space, this thesis identifies and elaborates several advances in which the principles of maximum entropy and minimum relative entropy from information theory are used to estimate a probability distribution. The probability distribution within the search space is represented by a graphical model (a factorization, a Bayesian network or a junction tree). An estimation of distribution algorithm (EDA) is an evolutionary optimization algorithm which uses a graphical model to sample a population within the search space and then estimates a new graphical model from the selected individuals of that population. - So far, the Factorized Distribution Algorithm (FDA) has built a factorization or Bayesian network from a given additive structure of the objective function using a greedy algorithm which considers only a subset of the variable dependencies; important connections can be lost this way. This thesis presents a heuristic subfunction merge algorithm which is able to consider all dependencies between the variables (as long as the marginal distributions of the model do not become too large). On a 2-D grid structure, this algorithm builds a pentavariate factorization which makes it possible to solve the deceptive grid benchmark problem with a much smaller population size than the conventional factorization. Especially for small population sizes, calculating large marginal distributions from smaller ones using maximum entropy and iterative proportional fitting leads to a further improvement. - The second topic is the generalization of graphical models to loopy structures. Using the Bethe-Kikuchi approximation, the loopy graphical model (region graph) can learn the Boltzmann distribution of an objective function by a generalized belief propagation (GBP) algorithm, which minimizes the free energy, a notion adopted from statistical physics that is equivalent to the relative entropy to the Boltzmann distribution. Previous attempts to combine the Kikuchi approximation with EDAs have relied on an expensive Gibbs sampling procedure for generating a population from this loopy probabilistic model. In this thesis a combination with a factorization is presented which allows more efficient sampling. The free energy is generalized to incorporate the inverse temperature β. The factorization-building algorithm mentioned above can be employed here, too. The dynamics of GBP are investigated, and the method is applied to Ising spin glass ground state search. Small instances (7 x 7) are solved without difficulty. Larger instances (10 x 10 and 15 x 15) do not converge to the true optimum at large β, but sampling from the factorization can find the optimum within about 1000-10000 sampling attempts, depending on the instance. If GBP does not converge, it can be replaced by a concave-convex procedure which guarantees convergence. - Third, if no probabilistic structure is given for the objective function, a Bayesian network can be learned to capture the dependencies in the population. The relative entropy between the population-induced distribution and the Bayesian network distribution is equivalent to the log-likelihood of the model. The log-likelihood has been generalized to the BIC/MDL score, which reduces overfitting by penalizing complicated structures of the Bayesian network.
A previous information-theoretic analysis of BIC/MDL in the context of EDAs is continued, and empirical evidence is given that the method is able to learn the correct structure of an objective function, given a sufficiently large population. - Finally, a way to reduce the search space of an EDA is presented by combining it with a local search heuristic. The Kernighan-Lin hillclimber, known originally from the traveling salesman problem and graph bipartitioning, is generalized to arbitrary binary problems. It can be applied in a stand-alone manner, as an iterative 1+1 search algorithm, or combined with an EDA. On the MAXSAT problem it performs on a similar scale to the specialized SAT solver Walksat. An analysis of the Kernighan-Lin local optima indicates that the combination with an EDA is favorable. The thesis shows how evolutionary optimization can be improved using interdisciplinary results from information theory, statistics, probability calculus and statistical physics. The principles of information theory for estimating probability distributions are applicable in many areas. EDAs are a good application because an improved estimation directly affects the success of the optimization.
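
    The step of calculating large marginal distributions from smaller ones via maximum entropy and iterative proportional fitting can be sketched as follows. This is a minimal illustration with invented names, assuming binary variables; it is not the thesis's code.

        # Sketch: iterative proportional fitting (IPF). Start from the uniform
        # (maximum entropy) joint and rescale it to match each target marginal
        # in turn until they all agree.
        import numpy as np

        def ipf(targets, shape, iters=200):
            """targets: list of (axes, marginal array) pairs to be matched."""
            p = np.full(shape, 1.0 / np.prod(shape))
            for _ in range(iters):
                for axes, marg in targets:
                    others = tuple(a for a in range(p.ndim) if a not in axes)
                    current = p.sum(axis=others)      # current marginal on axes
                    ratio = np.where(current > 0, marg / current, 0.0)
                    p = p * np.expand_dims(ratio, others)
            return p

        # Two consistent pairwise marginals p(x0, x1) and p(x1, x2):
        m01 = np.array([[0.3, 0.2], [0.1, 0.4]])
        m12 = np.array([[0.1, 0.3], [0.35, 0.25]])
        joint = ipf([((0, 1), m01), ((1, 2), m12)], shape=(2, 2, 2))
        print(np.round(joint.sum(axis=2), 3))         # recovers m01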
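
    The generalized Kernighan-Lin hillclimber can be sketched in the same spirit: each pass flips every bit exactly once, always taking the best available flip even when it worsens the objective, and the pass is then truncated at the best configuration encountered. This rough sketch makes those assumptions explicit; the names are invented and the thesis's exact bookkeeping may differ.

        # Sketch: Kernighan-Lin style local search for arbitrary binary
        # objective functions (maximization).
        import numpy as np

        def kl_hillclimb(f, x, max_passes=50):
            x = x.copy()
            best_val = f(x)
            for _ in range(max_passes):
                y, free = x.copy(), set(range(len(x)))
                pass_best_val, pass_best = f(y), y.copy()
                while free:
                    # evaluate every unlocked single-bit flip
                    gains = {}
                    for i in free:
                        y[i] ^= 1
                        gains[i] = f(y)
                        y[i] ^= 1
                    i = max(gains, key=gains.get)
                    y[i] ^= 1                 # commit best flip, lock bit i
                    free.remove(i)
                    if gains[i] > pass_best_val:
                        pass_best_val, pass_best = gains[i], y.copy()
                if pass_best_val > best_val:  # keep best prefix of the pass
                    best_val, x = pass_best_val, pass_best
                else:
                    break                     # no improving pass: local optimum
            return x, best_val

        # Example: the onemax function is maximized at the all-ones string.
        x0 = np.random.randint(0, 2, 20)
        print(kl_hillclimb(lambda z: int(z.sum()), x0))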

    Generalized belief change with imprecise probabilities and graphical models

    We provide a theoretical investigation of probabilistic belief revision in complex frameworks, under extended conditions of uncertainty, inconsistency and imprecision. We motivate our kinematical approach by specializing our discussion to probabilistic reasoning with graphical models, whose modular representation allows for efficient inference. Most results in this direction derive from the relevant work of Chan and Darwiche (2005), which first proved the inter-reducibility of virtual and probabilistic evidence. These forms of information, deeply distinct in their meaning, are extended to the conditional and imprecise frameworks, allowing further generalizations, e.g. to experts' qualitative assessments. Belief aggregation and iterated revision of a rational agent's beliefs are also explored.
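
    The two forms of information contrasted above can be made concrete on a single discrete variable. The numbers below are invented; the last step illustrates the Chan and Darwiche (2005) inter-reducibility result in this simplest setting.

        # Virtual evidence (Pearl) multiplies the prior by likelihood ratios;
        # probabilistic evidence (Jeffrey) prescribes the revised marginal.
        import numpy as np

        prior = np.array([0.7, 0.3])        # p(X)

        likelihood = np.array([0.8, 0.4])   # virtual evidence L(e | x)
        posterior = prior * likelihood
        posterior /= posterior.sum()

        q = np.array([0.2, 0.8])            # probabilistic evidence: p'(X) = q

        # Inter-reducibility: the likelihood ratios q(x)/p(x) turn the
        # Jeffrey update into an equivalent virtual-evidence update.
        recovered = prior * (q / prior)
        recovered /= recovered.sum()
        assert np.allclose(recovered, q)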

    Efficient Learning of Markov Blanket and Markov Blanket Classifier

    Feature selection is a fundamental topic in data mining and machine learning. It addresses the issue of dimension reduction by removing non-relevant, or less relevant, attributes in model building. For the task of classification, a major milestone for feature selection was achieved by Koller and Sahami [1]. Building upon the work of Pearl on Bayesian networks (BN) [2], they proved that the Markov blanket (MB) of a variable is the optimal feature subset for class prediction. Deriving the MB of a class variable given a BN is a trivial problem; however, learning the structure of a BN from data is known to be NP-hard. For a large number of variables, learning the BN is impractical, not only because of the computational complexity, but also because of the data size requirement, one of the curses of high-dimensional feature spaces. Hence, simpler topologies are often assumed, such as the Naive Bayes (NB) approach [5, 6], probably the best known one due to its computational simplicity, requiring no structure learning, and its surprising effectiveness in many applications despite its unrealistic assumptions. One of its extensions, Tree-Augmented Naive Bayes (TAN) [7], is shown to perform better than NB by allowing limited additional dependencies among the features. However, because they make strong assumptions, these approaches may be flawed in general. By further relaxing the restriction on the dependencies, a BN is expected to show better classification accuracy than NB and TAN [8]. The question is whether we can derive an MB for the classification task without learning the full BN topology. Let us refer to an MB used for classification as a Markov Blanket Classifier (MBC). The MBC is expected to perform as well as the whole Bayesian network as a classifier, though it is generally much smaller than the whole network. This thesis addresses the problem of deriving the MBC effectively and efficiently from limited data. The goal is to outperform the simpler NB and TAN approaches, which rely on potentially invalid assumptions, while allowing MBC learning with limited data and low computational complexity. Our first contribution is a novel algorithm to filter out non-relevant attributes of an MBC. From our review, at least nine works on the learning of Markov blankets have been published since 1996, yet none offers a satisfactory tradeoff between correctness, data requirement and time efficiency. To address this tradeoff, we propose the IPC-MB algorithm [9-11]. IPC-MB performs an iterative search of the parents and children of a node of interest, minimizing the number of conditioning variables in its independence tests. We prove that the algorithm is sound in theory, and we compare it with the state of the art in MB learning: IAMB [12], PCMB [13] and PC [14]. Experiments are conducted using samples generated from known Bayesian networks: a small one, Asia, with eight nodes; medium ones, Alarm and PolyAlarm (a polytree version of Alarm), with 37 nodes; and larger ones, Hailfinder (56 nodes) and Test152 (152 nodes).
The results demonstrate that, given the same amount of observations, (1) IPC-MB achieves much higher accuracy than IAMB, with up to an 80% reduction in distance from the perfect result; (2) IPC-MB has slightly higher accuracy than PCMB and PC; and (3) IPC-MB may require up to 98% fewer conditional independence (CI) tests than PC, and 95% fewer than PCMB. Given the output of IPC-MB, conventional structure learning algorithms can be applied to recover the MBC without any modification, since the feature selection procedure is transparent to them. In fact, the output of IPC-MB can be viewed as the output of a general feature selection procedure and can be employed further by all kinds of classifiers. This algorithm was implemented by the author while working at SPSS and shipped with the software Clementine 12 in 2007. The second contribution is to extend IPC-MB to induce the MBC directly, without having to depend on an external structure learning algorithm; the proposed algorithm is named IPC-MBC (or IPC-BNC in one of our early publications) [15]. Similar to IPC-MB, IPC-MBC conducts a series of local searches to filter out false negatives, including nodes and arcs. However, it is more complex and requires greater computing resources than IPC-MB. IPC-MBC is also proved sound in theory. In our empirical studies, we compare the accuracy and time cost of IPC-MBC, PC, and IPC-MB plus PC (i.e. structure learning by PC on the features output by IPC-MB), using the same data as in the study of IPC-MB. It is observed that both IPC-MBC and IPC-MB plus PC are much more time efficient than PC, with up to 95% savings in CI tests, but with no loss of accuracy. This reflects the advantages of local search and feature selection, respectively.
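
    For contrast with the hard learning problem IPC-MB addresses, the 'trivial' direction mentioned above, reading a Markov blanket off an already known structure, amounts to collecting parents, children and spouses. The sketch below uses the standard edge list of the Asia benchmark named in the experiments; the function name is invented.

        # Sketch: the Markov blanket of a target in a known Bayesian network
        # is its parents, its children, and its children's other parents.
        def markov_blanket(edges, target):
            """edges: iterable of (parent, child) pairs."""
            parents = {p for p, c in edges if c == target}
            children = {c for p, c in edges if p == target}
            spouses = {p for p, c in edges if c in children and p != target}
            return parents | children | spouses

        asia = [("asia", "tub"), ("smoke", "lung"), ("smoke", "bronc"),
                ("tub", "either"), ("lung", "either"),
                ("either", "xray"), ("either", "dysp"), ("bronc", "dysp")]
        print(markov_blanket(asia, "lung"))  # {'smoke', 'either', 'tub'}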

    On quantum Bayesian networks

    Master's dissertation in Computer Science. As a compact representation of joint probability distributions over a dependence graph of random variables, and a tool for modeling and reasoning in the presence of uncertainty, Bayesian networks are becoming increasingly relevant to both the natural and the social sciences, for example to combine domain knowledge, capture causal relationships, or learn from incomplete datasets. Known to be an NP-hard problem in a classical setting, Bayesian inference stands out as a class of algorithms worth exploring in a quantum framework. The present dissertation explores this research field and extends previous algorithms by embedding them in decision-making processes. In this regard, several attempts were made to find new and enhanced ways of dealing with these processes. In a first attempt, the quantum device was used to run a subprocess of the decision-making process, resulting in a quadratic speed-up for that subprocess. Afterward, "decision networks" were taken into account, allowing a fully quantum implementation of a decision-making process that benefits from a quadratic speed-up during the whole process. Lastly, a solution was found that differs from the existing ones by its judicious use of the utility function in an entangled configuration; this algorithm exploits the structure of the input data to compute a solution efficiently. In addition, the computational complexity of each of the algorithms developed was determined, providing the information necessary to choose the most efficient one for a concrete decision problem. A prototype implementation in Qiskit (a Python-based program development framework for the IBM Q machines) was developed as a proof of concept. Where Qiskit offered a simulation platform for the algorithms considered in this dissertation, string diagrams provided the verification framework for algorithmic properties. Further, string diagrams were studied with the intention of obtaining formal proofs about the algorithms developed; this framework provided relevant examples and a proof that two different implementations of the same algorithm are equivalent.
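
    As a point of reference for the decision-making processes discussed above, the classical computation that the quantum algorithms accelerate is the maximization of expected utility over a probabilistic model. The brute-force sketch below is not the dissertation's Qiskit code; the model, names and numbers are all invented for illustration.

        # Sketch: choose the action with maximal expected utility under a
        # two-node Bayesian network, by exhaustive enumeration of worlds.
        from itertools import product

        p_cause = {True: 0.3, False: 0.7}

        def p_effect(effect, cause, action):
            """p(effect | cause, action) for a toy treat-or-wait decision."""
            base = 0.8 if cause else 0.2
            if action == "treat":
                base *= 0.5
            return base if effect else 1.0 - base

        utility = {(True, "treat"): -10, (True, "wait"): -20,
                   (False, "treat"): -2, (False, "wait"): 0}

        def expected_utility(action):
            return sum(p_cause[c] * p_effect(e, c, action) * utility[(e, action)]
                       for c, e in product([True, False], repeat=2))

        best = max(["treat", "wait"], key=expected_utility)
        print(best, expected_utility(best))

    Enumeration of this kind scales exponentially with the number of chance variables, which is what makes the quadratic quantum speed-up reported above worthwhile.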