
    Explainable AI using expressive Boolean formulas

    We propose and implement an interpretable machine learning classification model for Explainable AI (XAI) based on expressive Boolean formulas. Potential applications include credit scoring and diagnosis of medical conditions. The Boolean formula defines a rule with tunable complexity (or interpretability), according to which input data are classified. Such a formula can include any operator that can be applied to one or more Boolean variables, thus providing higher expressivity compared to more rigid rule-based and tree-based approaches. The classifier is trained using native local optimization techniques, efficiently searching the space of feasible formulas. Shallow rules can be determined by fast Integer Linear Programming (ILP) or Quadratic Unconstrained Binary Optimization (QUBO) solvers, potentially powered by special-purpose hardware or quantum devices. We combine the expressivity and efficiency of the native local optimizer with the fast operation of these devices by executing non-local moves that optimize over subtrees of the full Boolean formula. We provide extensive numerical benchmarking results featuring several baselines on well-known public datasets. Based on the results, we find that the native local rule classifier is generally competitive with the other classifiers. The addition of non-local moves achieves similar results with fewer iterations, and therefore using specialized or quantum hardware could lead to a speedup through fast proposal of non-local moves. Comment: 28 pages, 16 figures, 4 tables
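    As an illustration only (not the paper's implementation), the sketch below evaluates one expressive rule of the kind described above, an "at least k of m literals" clause over binarized features; the class and field names are hypothetical.

    ```python
    # Illustrative sketch: an expressive Boolean rule of the form
    # "AtLeast(k, literals)" applied to binarized features, as a stand-in for
    # the tunable-complexity formulas described in the abstract.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class AtLeastRule:
        k: int                             # how many literals must be satisfied
        literals: List[Tuple[int, bool]]   # (feature index, expected value)

        def predict(self, x: List[bool]) -> bool:
            hits = sum(1 for idx, val in self.literals if x[idx] == val)
            return hits >= self.k

        def complexity(self) -> int:
            # One simple interpretability proxy: the number of literals used.
            return len(self.literals)

    # Example: classify as positive if at least 2 of 3 conditions hold.
    rule = AtLeastRule(k=2, literals=[(0, True), (2, False), (3, True)])
    print(rule.predict([True, False, False, True]))   # True  (3 hits >= 2)
    print(rule.predict([False, False, True, False]))  # False (0 hits)
    ```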

    Active learning of link specifications using decision tree learning

    In this work we presented an implementation that uses decision trees to learn highly accurate link specifications. We compared our approach with three state-of-the-art classifiers on nine datasets and showed that it gives comparable results in a reasonable amount of time. We also showed that we outperform the state of the art on four datasets by up to 30%, while still remaining slightly behind on average. The effect of user feedback on the active learning variant was inspected with respect to the number of iterations needed to deliver good results; we showed that we can reach F-scores above 0.8 on most datasets after 14 iterations.
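    The following sketch shows a generic pool-based active-learning loop around a decision tree, in the spirit of the approach summarized above but not the authors' system; the oracle() callback standing in for user feedback, the feature matrix X_pool of similarity scores per candidate link pair, and all parameter values are assumptions.

    ```python
    # Hedged sketch of pool-based active learning with a decision tree.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def active_learn(X_pool, oracle, n_iterations=14, batch=10, seed=0):
        rng = np.random.default_rng(seed)
        # Seed the process with a small random batch labelled by the user.
        labeled_idx = list(rng.choice(len(X_pool), size=batch, replace=False))
        y_labeled = [oracle(i) for i in labeled_idx]
        clf = DecisionTreeClassifier(max_depth=5, random_state=seed)
        for _ in range(n_iterations):
            clf.fit(X_pool[labeled_idx], y_labeled)
            # Uncertainty sampling: ask for feedback on the least confident pairs.
            proba = clf.predict_proba(X_pool)
            uncertainty = 1.0 - proba.max(axis=1)
            uncertainty[labeled_idx] = -1.0        # do not re-query labelled pairs
            query = np.argsort(uncertainty)[-batch:]
            labeled_idx.extend(query.tolist())
            y_labeled.extend(oracle(i) for i in query)
        return clf
    ```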

    Advanced analysis of branch and bound algorithms

    If the code of a combination lock is lost, the lock can only be opened by trying every combination of digits; in the worst case, the last combination is the right one. If the code consists of ten digits, however, ten billion possibilities have to be examined. The so-called 'NP-hard' problems in Marcel Turkensteen's dissertation are comparable to this 'combination lock problem': here too, the number of possibilities is excessively large. The art is therefore to explore the search space in a clever way. The Branch and Bound (BnB) method does this by splitting the search space into smaller subregions. Turkensteen applies the BnB method, among other problems, to the travelling salesman problem, in which a shortest route through a set of places must be determined. This problem has still not been solved in its general form, and the economic consequences can be substantial; for instance, it is still not certain whether a route planner sends trucks around optimally. The dissertation improves current BnB methods mainly by looking not at the cost of a connection, but at the increase in cost when a connection is not used: the upper tolerance.
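    For readers unfamiliar with the method, here is a minimal generic branch-and-bound skeleton (an illustration of the scheme discussed above, not Turkensteen's tolerance-based algorithm); the branch, lower_bound, is_complete and cost callbacks are assumed to be supplied by the problem at hand.

    ```python
    # Generic best-first branch and bound for a minimization problem.
    import heapq

    def branch_and_bound(root, branch, lower_bound, is_complete, cost):
        best_value, best_solution = float("inf"), None
        frontier = [(lower_bound(root), 0, root)]   # best-first on the bound
        counter = 1                                 # tie-breaker for the heap
        while frontier:
            bound, _, node = heapq.heappop(frontier)
            if bound >= best_value:
                continue                            # prune: cannot beat the incumbent
            if is_complete(node):
                if cost(node) < best_value:
                    best_value, best_solution = cost(node), node
                continue
            for child in branch(node):              # split into smaller subproblems
                b = lower_bound(child)
                if b < best_value:
                    heapq.heappush(frontier, (b, counter, child))
                    counter += 1
        return best_value, best_solution
    ```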

    Nested Sampling Methods

    Nested sampling (NS) computes parameter posterior distributions and makes Bayesian model comparison computationally feasible. Its strengths are the unsupervised navigation of complex, potentially multi-modal posteriors until a well-defined termination point. A systematic literature review of nested sampling algorithms and variants is presented. We focus on complete algorithms, including solutions to likelihood-restricted prior sampling, parallelisation, termination and diagnostics. The relation between the number of live points, dimensionality and computational cost is studied for two complete algorithms. A new formulation of NS is presented, which casts the parameter space exploration as a search on a tree. Previously published ways of obtaining robust error estimates and dynamic variations of the number of live points are presented as special cases of this formulation. A new on-line diagnostic test is presented, based on previous work on insertion rank orders. The survey of nested sampling methods concludes with outlooks for future research. Comment: Updated version incorporating constructive input from four(!) positive reports (two referees, assistant editor and editor). The open-source UltraNest package and astrostatistics tutorials can be found at https://johannesbuchner.github.io/UltraNest
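    A schematic nested-sampling loop, to make the ingredients surveyed above concrete; this is a teaching sketch under simplifying assumptions (expected prior-volume shrinkage, a user-supplied likelihood-restricted sampler, no final live-point contribution), not UltraNest or any specific published algorithm.

    ```python
    # Schematic nested sampling: shrink the prior volume while accumulating evidence.
    import numpy as np

    def nested_sampling(log_likelihood, sample_prior, sample_above, n_live=400,
                        max_iter=100000, tol=1e-3):
        live = [sample_prior() for _ in range(n_live)]
        live_logl = np.array([log_likelihood(p) for p in live])
        log_z, log_x = -np.inf, 0.0             # log evidence, log prior volume
        for i in range(max_iter):
            worst = int(np.argmin(live_logl))
            log_x_new = -(i + 1) / n_live       # expected shrinkage per iteration
            # Width of the prior-volume shell assigned to the discarded point.
            log_width = log_x + np.log1p(-np.exp(log_x_new - log_x))
            log_z = np.logaddexp(log_z, log_width + live_logl[worst])
            # Replace the worst live point with a draw from the prior restricted
            # to higher likelihood (the hard step that complete algorithms solve).
            live[worst] = sample_above(live_logl[worst])
            live_logl[worst] = log_likelihood(live[worst])
            log_x = log_x_new
            # Stop once the remaining live points cannot change the evidence much.
            if np.max(live_logl) + log_x < log_z + np.log(tol):
                break
        return log_z
    ```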

    Optimization of Thermo-mechanical Conditions in Friction Stir Welding


    Deformation Correlations and Machine Learning: Microstructural inference and crystal plasticity predictions

    The present thesis makes a connection between spatially resolved strain correlations and material processing history. Such correlations can be used to infer and classify the prior deformation history of a sample at various strain levels with the use of Machine Learning approaches. A simple and concrete example of uniaxially compressed crystalline thin films of various sizes, generated by two-dimensional discrete dislocation plasticity simulations, is examined. At the nanoscale, thin films exhibit yield-strength size effects with noisy mechanical responses, which creates an interesting challenge for the application of Machine Learning techniques. Moreover, this thesis demonstrates the prediction of the average mechanical responses of thin films based on the classified prior deformation history and discusses the possible ramifications for modelling crystal plasticity behavior in extreme settings.
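    A hedged illustration of the general pipeline described above: reduce a strain field to a few spatial-correlation features and feed them to a standard classifier. This is purely schematic, not the discrete dislocation plasticity setup of the thesis, and fields and labels are hypothetical names.

    ```python
    # Turn a 2-D strain map into radially sampled autocorrelation features and
    # classify the prior deformation level with an off-the-shelf learner.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def correlation_features(strain_field, n_lags=8):
        """Strain autocorrelation at a few lags via the Wiener-Khinchin relation."""
        f = strain_field - strain_field.mean()
        corr = np.fft.ifft2(np.abs(np.fft.fft2(f)) ** 2).real
        corr /= corr[0, 0]                      # normalize so lag 0 equals 1
        return np.array([corr[lag, 0] for lag in range(1, n_lags + 1)])

    # Hypothetical usage: fields is a list of 2-D strain maps, labels encode the
    # prior strain level each sample was driven to.
    # X = np.stack([correlation_features(f) for f in fields])
    # clf = RandomForestClassifier(n_estimators=200).fit(X, labels)
    ```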

    Machine Learning for Disease Outbreak Detection Using Probabilistic Models

    The past decade has seen the emergence of new diseases and the expansion of old ones (such as Ebola), causing high human and financial costs; early detection of disease outbreaks is therefore crucial. In the field of syndromic surveillance, there has recently been a proliferation of outbreak detection algorithms. The choice of outbreak detection algorithm and its configuration can result in important variations in the performance of public health surveillance systems, but performance evaluations have not kept pace with algorithm development. Existing evaluations are usually based on a single data set that is not publicly available, so they are difficult to generalize or replicate, and the performance of different algorithms is influenced by the nature of the disease outbreak. As a result of this lack of thorough performance evaluations, one cannot determine which algorithm should be applied under what circumstances. This research has three general objectives: (1) characterize the dependence of the performance of detection algorithms on the type and severity of the outbreak, (2) aggregate the predictions of several outbreak detection algorithms, and (3) analyze outbreak detection methods from a cost-benefit point of view and develop a detection method that minimizes the total cost of missed outbreaks and false alarms. To achieve the first objective, we propose a Bayesian network model, learned from simulated outbreak data overlaid on real healthcare utilization data, which predicts detection performance as a function of outbreak characteristics and surveillance system parameters. This model predicts the performance of outbreak detection methods with high accuracy and can quantify the influence of different outbreak characteristics and detection methods on detection performance in a variety of practically relevant surveillance scenarios. In addition to identifying outbreak characteristics expected to have a strong influence on detection performance, the learned model suggests a role for other algorithm features, such as the alerting threshold and the handling of weekly patterns, which were previously not the focus of attention in the literature. To achieve the second objective, we use a Hierarchical Mixture of Experts (HME) to combine the responses of multiple experts (i.e., predictors), here outbreak detection methods; the contribution of each predictor to the final output is learned and depends on the input data. The resulting HME is competitive with the best individual detection algorithm in the experimental evaluation and is more robust under different circumstances, and the level of contamination of the surveillance time series does not influence its relative performance. The optimization of outbreak detection methods also relies on estimating the benefits of true alarms and the cost of false alarms. In the third part of the thesis, we therefore analyze some commonly used outbreak detection methods in terms of the cost of missed outbreaks and false alarms, using simulated outbreak data overlaid on real healthcare utilization data. We estimate the total cost of missed outbreaks and false alarms in addition to the accuracy of outbreak detection, fit a polynomial regression to estimate the cost of an outbreak as a function of the delay until it is detected, and then develop a cost-sensitive decision tree learner that predicts outbreaks from the predictions of commonly used detection methods. Experimental results show that the cost-sensitive decision tree decreases the total cost of outbreaks while keeping detection accuracy competitive with commonly used methods.
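    As a small illustration of the cost-sensitive idea in the third objective (not the thesis' decision tree learner), the sketch below picks an alerting threshold that minimizes a total cost of false alarms and missed outbreaks on labelled surveillance data; the cost values and variable names are assumptions.

    ```python
    # Choose the alarm threshold that minimizes total expected cost.
    import numpy as np

    def best_threshold(scores, is_outbreak, cost_false_alarm=1.0, cost_missed=20.0):
        scores = np.asarray(scores, dtype=float)
        is_outbreak = np.asarray(is_outbreak, dtype=bool)
        candidates = np.unique(scores)
        costs = []
        for t in candidates:
            alarm = scores >= t
            false_alarms = np.sum(alarm & ~is_outbreak)
            missed = np.sum(~alarm & is_outbreak)
            costs.append(false_alarms * cost_false_alarm + missed * cost_missed)
        return candidates[int(np.argmin(costs))]

    # Hypothetical usage: daily_scores from any detection algorithm, one value per
    # day, and daily_outbreak_flags from simulated or historical outbreak data.
    # threshold = best_threshold(daily_scores, daily_outbreak_flags)
    ```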

    Approximate Bayesian conditional copulas

    Copula models are flexible tools to represent complex structures of dependence for multivariate random variables. According to Sklar's theorem, any multidimensional absolutely continuous distribution function can be uniquely represented through a copula, i.e. a joint cumulative distribution function on the unit hypercube with uniform marginals, which captures the dependence structure among the vector components. In real data applications, the interest of the analyses often lies in specific functionals of the dependence, which quantify aspects of it in a few numerical values. A broad literature exists on such functionals; however, extensions to include covariates are still limited. This is mainly due to the lack of unbiased estimators of the conditional copula, especially when one does not have enough information to select the copula model. Several Bayesian methods to approximate the posterior distribution of functionals of the dependence varying with covariates are presented and compared; the main advantage of the investigated methods is that they use nonparametric models, avoiding the selection of the copula, which is usually a delicate aspect of copula modelling. These methods are compared in simulation studies and in two realistic applications, from civil engineering and astrophysics.
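    To make the target quantity concrete, the following rough sketch estimates Kendall's tau of (x, y) in a moving window over a covariate z; it is a simple frequentist stand-in for the covariate-dependent dependence functionals discussed above, not the Bayesian nonparametric methods of the paper, and the grid and bandwidth are illustrative parameters.

    ```python
    # Windowed estimate of a conditional dependence functional (Kendall's tau).
    import numpy as np
    from scipy.stats import kendalltau

    def conditional_kendall_tau(x, y, z, grid, bandwidth):
        x, y, z = map(np.asarray, (x, y, z))
        taus = []
        for z0 in grid:
            window = np.abs(z - z0) <= bandwidth   # observations with covariate near z0
            if window.sum() < 10:
                taus.append(np.nan)                # too few points for a stable estimate
                continue
            tau, _pvalue = kendalltau(x[window], y[window])
            taus.append(tau)
        return np.array(taus)
    ```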