Explainable AI using expressive Boolean formulas
We propose and implement an interpretable machine learning classification
model for Explainable AI (XAI) based on expressive Boolean formulas. Potential
applications include credit scoring and diagnosis of medical conditions. The
Boolean formula defines a rule with tunable complexity (or interpretability),
according to which input data are classified. Such a formula can include any
operator that can be applied to one or more Boolean variables, thus providing
higher expressivity compared to more rigid rule-based and tree-based
approaches. The classifier is trained using native local optimization
techniques, efficiently searching the space of feasible formulas. Shallow rules
can be determined by fast Integer Linear Programming (ILP) or Quadratic
Unconstrained Binary Optimization (QUBO) solvers, potentially powered by
special purpose hardware or quantum devices. We combine the expressivity and
efficiency of the native local optimizer with the fast operation of these
devices by executing non-local moves that optimize over subtrees of the full
Boolean formula. We provide extensive numerical benchmarking results featuring
several baselines on well-known public datasets. Based on the results, we find
that the native local rule classifier is generally competitive with the other
classifiers. The addition of non-local moves achieves similar results with
fewer iterations, and therefore using specialized or quantum hardware could
lead to a speedup by fast proposal of non-local moves. Comment: 28 pages, 16 figures, 4 tables
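As an illustration of the kind of rule such a formula defines, here is a minimal Python sketch (not the authors' implementation): an expressive Boolean rule mixing a threshold operator with an ordinary OR. The feature names and the formula itself are invented.

```python
# Hypothetical rule, for illustration only: feature names, thresholds and the
# formula itself are invented, not taken from the paper.

def at_least(k, *bits):
    """Threshold operator: true when at least k of the inputs are true."""
    return sum(bits) >= k

def rule(x):
    # "Positive" when at least 2 of 3 weak conditions hold, or one strong one does.
    return at_least(2, x["a"], x["b"], x["c"]) or bool(x["d"])

samples = [
    {"a": 1, "b": 1, "c": 0, "d": 0},  # two weak conditions -> positive
    {"a": 0, "b": 0, "c": 1, "d": 0},  # one weak condition  -> negative
    {"a": 0, "b": 0, "c": 0, "d": 1},  # strong condition    -> positive
]
preds = [int(rule(s)) for s in samples]
print(preds)  # [1, 0, 1]
```

A single threshold node compactly encodes what a pure AND/OR tree would need many clauses to express, which is the sense in which such operators raise expressivity; complexity (and hence interpretability) can be tuned by bounding the number of operators in the formula.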
Active learning of link specifications using decision tree learning
In this work we presented an implementation that uses decision trees to learn highly accurate link specifications. We compared our approach with three state-of-the-art classifiers on nine datasets and showed that it gives comparable results in a reasonable amount of time. We also showed that we outperform the state of the art on four datasets by up to 30%, although we remain slightly behind on average. For the active-learning variant, we examined how user feedback affects the number of iterations needed to deliver good results, and showed that F-scores above 0.8 can be reached on most datasets after 14 iterations.
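The active-learning loop described above can be sketched in a few lines of Python. This is an illustrative toy, not the thesis implementation: the "link specification" is reduced to a single similarity threshold, and the scores, seed labels, and oracle are all invented.

```python
# Toy pool-based active-learning loop, for illustration only: the "link
# specification" is a single similarity threshold, and scores, seed labels
# and the oracle are invented.

def fit_threshold(labeled):
    """Place the threshold midway between the highest negative and lowest positive score."""
    pos = [s for s, y in labeled if y == 1]
    neg = [s for s, y in labeled if y == 0]
    return (min(pos) + max(neg)) / 2

def oracle(score):
    """Simulated user feedback; the hidden true boundary is 0.55."""
    return 1 if score >= 0.55 else 0

pool = [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]   # similarity scores of candidate pairs
labeled = [(0.1, 0), (0.9, 1)]               # seed labels
for _ in range(3):                           # a few active-learning iterations
    t = fit_threshold(labeled)
    seen = {s for s, _ in labeled}
    # query the most uncertain pair: the score closest to the current threshold
    q = min((s for s in pool if s not in seen), key=lambda s: abs(s - t))
    labeled.append((q, oracle(q)))

print(round(fit_threshold(labeled), 2))  # converges toward the hidden 0.55 boundary
```

The point of querying near the decision boundary is that each answer from the user is maximally informative, which is why few iterations suffice.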
Advanced analysis of branch and bound algorithms
If the code of a combination lock is lost, the lock can only be opened by trying every combination of digits. In the worst case, the last combination is the right one. If the code consists of ten digits, however, ten billion possibilities must be examined. The so-called 'NP-hard' problems in Marcel Turkensteen's dissertation are comparable to this 'combination lock problem': here too, the number of possibilities is excessively large. The art, therefore, is to scan the search space in a clever way. The Branch and Bound (BnB) method does this by splitting the search space into smaller subregions. Turkensteen applies the BnB method to, among other problems, the travelling salesman problem, in which a shortest route through a set of locations must be determined. This problem is still unsolved in its general form, and the economic consequences can be large: it is, for example, still not certain whether a route planner sends trucks along optimal routes. This dissertation improves current BnB methods chiefly by looking not at the cost of a connection, but at the cost increase incurred when a connection is not used: the upper tolerance.
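The pruning idea behind Branch and Bound can be shown on a small standard problem. The following Python sketch is illustrative only (the dissertation works with the travelling salesman problem and tolerance-based bounds, not knapsack): it branches on items and prunes any subproblem whose optimistic bound cannot beat the best solution found so far.

```python
# Branch and bound for 0/1 knapsack, for illustration only: branch on items,
# prune with a fractional (linear-relaxation) upper bound.

def bound(i, value, cap, items):
    """Optimistic value: fill the remaining capacity fractionally."""
    for v, w in items[i:]:
        if w <= cap:
            value, cap = value + v, cap - w
        else:
            return value + v * cap / w
    return value

def knapsack(items, capacity):
    # Sort by value density so the fractional bound is tight.
    items = sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True)
    best = 0
    stack = [(0, 0, capacity)]  # (next item index, value so far, remaining capacity)
    while stack:
        i, value, cap = stack.pop()
        if i == len(items):
            best = max(best, value)
            continue
        if bound(i, value, cap, items) <= best:
            continue  # prune: this subtree cannot beat the incumbent
        v, w = items[i]
        if w <= cap:
            stack.append((i + 1, value + v, cap - w))  # branch: take item i
        stack.append((i + 1, value, cap))              # branch: skip item i
    return best

print(knapsack([(60, 10), (100, 20), (120, 30)], 50))  # 220
```

The quality of the bound determines how much of the search space is skipped; tolerance-based bounds, as studied in the dissertation, are one way to sharpen it.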
Nested Sampling Methods
Nested sampling (NS) computes parameter posterior distributions and makes
Bayesian model comparison computationally feasible. Its strengths are the
unsupervised navigation of complex, potentially multi-modal posteriors until a
well-defined termination point. A systematic literature review of nested
sampling algorithms and variants is presented. We focus on complete algorithms,
including solutions to likelihood-restricted prior sampling, parallelisation,
termination and diagnostics. The relation between number of live points,
dimensionality and computational cost is studied for two complete algorithms. A
new formulation of NS is presented, which casts the parameter space exploration
as a search on a tree. Previously published ways of obtaining robust error
estimates and dynamic variations of the number of live points are presented as
special cases of this formulation. A new on-line diagnostic test is presented
based on previous insertion rank order work. The survey of nested sampling
methods concludes with outlooks for future research. Comment: Updated version incorporating constructive input from four(!)
positive reports (two referees, assistant editor and editor). The open-source
UltraNest package and astrostatistics tutorials can be found at
https://johannesbuchner.github.io/UltraNest
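The basic nested-sampling iteration (replace the worst live point, shrink the prior volume, accumulate evidence) can be sketched in a few lines. This toy is not UltraNest: it uses a uniform prior on [0, 1] and the likelihood L(t) = 2t, whose true evidence is exactly 1, so the estimate can be sanity-checked.

```python
import math, random

# Toy nested-sampling run, for illustration only (not UltraNest): uniform
# prior on [0, 1], likelihood L(t) = 2t, so the true evidence is exactly 1.
random.seed(0)
L = lambda t: 2 * t
n_live = 200
live = [random.random() for _ in range(n_live)]

Z, X_prev = 0.0, 1.0
for i in range(1, 1501):
    worst = min(live, key=L)       # lowest-likelihood live point
    X = math.exp(-i / n_live)      # expected shrunken prior volume
    Z += L(worst) * (X_prev - X)   # accumulate evidence
    X_prev = X
    live.remove(worst)
    # likelihood-restricted prior sampling, done here by naive rejection
    while True:
        t = random.random()
        if L(t) > L(worst):
            live.append(t)
            break

print(round(Z, 2))  # should land close to the true evidence, 1.0
```

The rejection step is the simplest solution to likelihood-restricted prior sampling, but its acceptance rate decays with the shrinking volume; the more practical solutions to this subproblem are exactly what the survey reviews.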
Deformation Correlations and Machine Learning: Microstructural inference and crystal plasticity predictions
The present thesis makes a connection between spatially resolved strain correlations and material processing history. Such correlations can be used to infer and classify the prior deformation history of a sample at various strain levels with the use of Machine Learning approaches. A simple and concrete example of uniaxially compressed crystalline thin films of various sizes, generated by two-dimensional discrete dislocation plasticity simulations, is examined. At the nanoscale, thin films exhibit yield-strength size effects with noisy mechanical responses, which creates an interesting challenge for the application of Machine Learning techniques. Moreover, this thesis demonstrates the prediction of the average mechanical responses of thin films based on the classified prior deformation history and discusses the possible ramifications for modelling crystal plasticity behavior in extreme settings.
Machine Learning for Disease Outbreak Detection Using Probabilistic Models
ABSTRACT
The past decade has seen the emergence of new diseases or expansion of old ones (such as Ebola) causing high human and financial costs. Hence, early detection of disease outbreaks is
crucial. In the field of syndromic surveillance, there has recently been a proliferation of outbreak detection algorithms. The choice of outbreak detection algorithm and its configuration can result in important variations in the performance of public health surveillance systems. But performance evaluations have not kept pace with algorithm development. These evaluations are usually based on a single data set which is not publicly available, so the evaluations are difficult to generalize or replicate. Furthermore, the performance of different algorithms is influenced by the nature of the disease outbreak. As a result of the lack of thorough performance evaluations, one cannot determine which algorithm should be applied under what circumstances.
Briefly, this research has three general objectives: (1) characterize the dependence of the performance of detection algorithms on the type and severity of outbreak, (2) aggregate the predictions of several outbreak detection algorithms, (3) analyze outbreak detection methods from a cost-benefit point of view and develop a detection method which minimizes the total cost of missing outbreaks and false alarms. To achieve the first objective, we propose a Bayesian network model learned from simulated outbreak data overlaid on real healthcare utilization data which predicts detection performance as a function of outbreak characteristics and surveillance system parameters. This model predicts the performance of outbreak detection methods with high accuracy. The model can also quantify the influence of different outbreak characteristics and detection methods on detection performance in a variety of practically relevant surveillance scenarios. In addition to identifying outbreak characteristics expected to have a strong influence on detection performance, the learned model suggests a role for other algorithm features, such as alerting threshold and taking weekly patterns into account, which was previously not the focus of attention in the literature.
To achieve the second objective, we use a Hierarchical Mixture of Experts (HME) to combine the responses of multiple experts (i.e., predictors), which are outbreak detection methods. The contribution of each predictor to the final output is learned and depends on the input data. The developed HME algorithm is competitive with the best detection algorithm in the experimental evaluation, and is more robust under different circumstances. The level of contamination of the surveillance time series does not influence the relative performance of the HME. The optimization of outbreak detection methods also relies on estimating the future benefits of true alarms and the cost of false alarms. In the third part of the thesis, we analyze some commonly used outbreak detection methods in terms of the cost of missing outbreaks and false alarms, using simulated outbreak data overlaid on real healthcare utilization data. We estimate the total cost of missing outbreaks and false alarms, in addition to the accuracy of outbreak detection, and fit a polynomial regression function to estimate the cost of an outbreak based on the delay until it is detected. Then, we develop a cost-sensitive decision tree learner which predicts outbreaks by looking at the predictions of commonly used detection methods. Experimental results show that the developed cost-sensitive decision tree decreases the total cost of an outbreak while the accuracy of outbreak detection remains competitive with commonly used methods.
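The cost-based view of detection can be illustrated with a toy sketch (not the thesis's cost-sensitive decision tree learner): given assumed unit costs for false positives and false negatives, select the detector whose alarms minimize the total cost. The detector names, outputs, labels, and costs below are all invented.

```python
# Cost-sensitive model selection, for illustration only: all names, outputs
# and costs are invented. A missed outbreak (false negative) is assumed to
# be far more expensive than a false alarm (false positive).

COST_FP, COST_FN = 1.0, 20.0  # assumed unit costs

truth = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]          # 1 = outbreak day
detectors = {
    "cusum-like": [0, 1, 1, 1, 0, 1, 1, 0, 1, 0],  # noisy but sensitive
    "ewma-like":  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],  # quiet but misses outbreaks
}

def total_cost(pred, truth):
    fp = sum(p and not t for p, t in zip(pred, truth))  # false alarms
    fn = sum(t and not p for p, t in zip(pred, truth))  # missed outbreaks
    return COST_FP * fp + COST_FN * fn

best = min(detectors, key=lambda name: total_cost(detectors[name], truth))
print(best, total_cost(detectors[best], truth))
```

With these costs, the sensitive detector wins despite its extra false alarms, which is the essential asymmetry a cost-sensitive learner exploits.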
Approximate Bayesian conditional copulas
Copula models are flexible tools to represent complex structures of dependence for multivariate random variables. According to Sklar's theorem, any multidimensional absolutely continuous distribution function can be uniquely represented by a copula, i.e. a joint cumulative distribution function on the unit hypercube with uniform marginals, which captures the dependence structure among the vector components. In real data applications, the interest of the analysis often lies in specific functionals of the dependence, which quantify aspects of it in a few numerical values. A broad literature exists on such functionals; however, extensions to include covariates are still limited. This is mainly due to the lack of unbiased estimators of the conditional copula, especially when one does not have enough information to select the copula model. Several Bayesian methods to approximate the posterior distribution of functionals of the dependence varying according to covariates are presented and compared; the main advantage of the investigated methods is that they use nonparametric models, avoiding the selection of the copula, which is usually a delicate aspect of copula modelling. These methods are compared in simulation studies and in two realistic applications, from civil engineering and astrophysics.
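A concrete example of a functional of the dependence is Kendall's tau. The sketch below is illustrative only (it is unrelated to the paper's Bayesian nonparametric methods): it samples from a Gaussian copula with correlation rho and checks the empirical tau against the closed form tau = (2/pi) * arcsin(rho).

```python
import math, random

# Illustrative only: sample a Gaussian copula with correlation rho and compare
# the empirical Kendall's tau with the closed form tau = (2/pi) * arcsin(rho).
random.seed(1)
rho, n = 0.6, 1500
pairs = []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    # correlated bivariate normal; its copula is the Gaussian copula with rho
    pairs.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))

conc = disc = 0
for i in range(n):                 # O(n^2) concordance count, kept simple
    for j in range(i + 1, n):
        s = (pairs[i][0] - pairs[j][0]) * (pairs[i][1] - pairs[j][1])
        conc += s > 0
        disc += s < 0
tau_hat = (conc - disc) / (conc + disc)
tau_true = 2 / math.pi * math.asin(rho)
print(round(tau_hat, 2), round(tau_true, 2))
```

Kendall's tau is rank-based, so it depends only on the copula and not on the (here normal) marginals, which is why such functionals isolate the dependence structure.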
- âŠ