
    On the Geometry of Bayesian Inference

    We provide a geometric interpretation of Bayesian inference that allows us to introduce a natural measure of the level of agreement between priors, likelihoods, and posteriors. The starting point for the construction of our geometry is the simple observation that the marginal likelihood can be regarded as an inner product between the prior and the likelihood. A key concept in our geometry is compatibility, a measure based on the same construction principles as the Pearson correlation, which can be used to assess how much the prior agrees with the likelihood, to gauge the sensitivity of the posterior to the prior, and to quantify the coherency of the opinions of two experts. We discuss estimators for all the quantities involved in our geometric setup, which can be computed directly from posterior simulation output. Several examples illustrate our methods, including data related to on-the-job drug usage, midge wing length, and prostate cancer.
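    The inner-product view described above lends itself to a simple numerical illustration. The sketch below is not the paper's estimator (the paper computes these quantities from posterior simulation output); it evaluates a prior and a likelihood on a one-dimensional parameter grid, takes their inner product to obtain the marginal likelihood, and normalizes it Pearson-style to obtain a compatibility score. The grid, prior, and data are illustrative assumptions.

```python
import numpy as np

# Grid-based sketch of the inner-product view (not the paper's estimators).
theta = np.linspace(-5, 5, 2001)
d_theta = theta[1] - theta[0]

prior = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)               # N(0, 1) prior density
data = np.array([1.2, 0.8, 1.5])                                    # toy observations
lik = np.prod(np.exp(-0.5 * (data[:, None] - theta)**2), axis=0)    # Gaussian likelihood, unit variance

def inner(f, g):
    """Approximate the L2 inner product <f, g> on the grid."""
    return np.sum(f * g) * d_theta

marginal_lik = inner(prior, lik)        # marginal likelihood = <prior, likelihood>
compatibility = marginal_lik / np.sqrt(inner(prior, prior) * inner(lik, lik))
print(f"marginal likelihood: {marginal_lik:.4f}, compatibility: {compatibility:.3f}")
```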

    The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models

    Thompson sampling (TS) has been known for its outstanding empirical performance, supported by theoretical guarantees across various reward models in the classical stochastic multi-armed bandit problem. Nonetheless, its optimality is often restricted to specific priors due to the common observation that TS is fairly insensitive to the choice of prior when it comes to asymptotic regret bounds. However, when the model contains multiple parameters, the optimality of TS depends heavily on the choice of priors, which casts doubt on the generalizability of previous findings to other models. To address this gap, this study explores the impact of selecting noninformative priors, offering insights into the performance of TS when dealing with new models that lack theoretical understanding. We first extend the regret analysis of TS to the model of uniform distributions with unknown supports, which is arguably the simplest non-regular model. Our findings reveal that changing the noninformative prior can significantly affect the expected regret, in line with previously known results for other multiparameter bandit models. Although the uniform prior is shown to be optimal, we highlight the inherent limitation of this optimality, which holds only for specific parameterizations, and emphasize the significance of the invariance property of priors. In light of this limitation, we propose a slightly modified TS-based policy, called TS with Truncation (TS-T), which achieves asymptotic optimality for the Gaussian and uniform models by using the reference prior and the Jeffreys prior, both of which are invariant under one-to-one reparameterizations. This policy provides an alternative approach to achieving optimality by employing fine-tuned truncation, which would be much easier than hunting for optimal priors in practice.
    Comment: 55 pages, TBA AAAI202
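    For context, the sketch below shows a plain Thompson-sampling loop for Gaussian arms with unknown mean and variance under the improper prior p(mu, sigma^2) proportional to 1/sigma^2, the standard reference-prior choice mentioned in the abstract. It is not the TS-T policy from the paper (there is no truncation step), and the arm parameters and horizon are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal Thompson-sampling sketch for Gaussian arms with unknown mean and
# variance under p(mu, sigma^2) ∝ 1/sigma^2 (reference prior). Not the paper's
# TS-T policy: truncation and tuning are omitted; all values are toy choices.
true_means = [0.0, 0.5, 1.0]
true_sds = [1.0, 1.0, 1.0]
horizon = 2000

rewards = [[] for _ in true_means]

def pull(arm):
    rewards[arm].append(rng.normal(true_means[arm], true_sds[arm]))

# initialize with two pulls per arm so the posterior is proper
for a in range(len(true_means)):
    pull(a); pull(a)

for t in range(horizon):
    samples = []
    for obs in rewards:
        x = np.asarray(obs)
        n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)
        sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)    # draw sigma^2 | data (scaled inverse chi-square)
        mu = rng.normal(xbar, np.sqrt(sigma2 / n))      # draw mu | sigma^2, data
        samples.append(mu)
    pull(int(np.argmax(samples)))                       # play the arm with the largest posterior draw

print("pull counts:", [len(r) for r in rewards])
```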

    Regional adaptive smoothing in disease mapping models

    Disease mapping focuses on estimating spatial patterns and the evolution of disease risk based on measures of disease effect such as the incidence ratio. Mapping these measures provides a good visual representation of disease risk, revealing spatial heterogeneities and highlighting units or clusters of high risk. Some units may have higher numbers of disease incidences than others for a variety of reasons, for example differences in environmental exposure, deprived communities, different administrative structures, or a lack of awareness about the disease(s). To clearly differentiate areas of low or high risk, we must apply some type of smoothing over regions that are presumably similar to each other. Such local smoothing increases our ability to discern clusters in the spatial variation. In this thesis, we propose a novel approach to smoothing local spatial units based on their larger regional location. The degree of smoothing is constant within each region but differs between regions. For illustration, we consider German counties as the local spatial units, with the Federal states of Germany as the larger regions over which state-wise adaptive smoothing is applied. Chapters 4 and 5 propose univariate and multivariate spatial models of regional smoothing. We assume that the incidences in each county follow a binomial distribution, with the risk probability linked hierarchically to a spatial correlation matrix. The correlation matrix is partitioned into sub-matrices corresponding to the regions (Federal states), and smoothing parameters are introduced into the sub-matrices to smooth the regions locally. Appropriate prior assumptions are stated for the unknown parameters, and samples from the full conditional posterior densities are generated using MCMC. In Chapter 5, we adopt the coregionalization framework of MacNab (2016) to build multivariate GMRFs as linear combinations of latent independent univariate GMRFs. The smoothing parameters are first applied to each sample separately, in a similar fashion to the univariate regionalized spatial model, and then combined in the form of a joint correlation matrix. We use the approach of Anderson et al. (2014) to identify spatial units exhibiting similar disease risks. The approach first elicits cluster configurations based on past disease data. In a second step, it fits a Poisson log-linear model to current data to select the best configuration based on the deviance information criterion. The proposed smoothing method is illustrated using real data sets for oral cancer (univariate) and colon, lung, and pancreatic cancers (multivariate) on the spatial structure of German counties. We identify 13 clusters of oral cancer, 9 of colon, 6 of lung, and 8 of pancreatic cancer. The identified clusters are further ranked based on incidence ratios. The analysis of the real data and its comparison with a simple GMRF (the BYM model of Besag et al. (1991)) reveals that the novel method of incorporating smoothing parameters into the spatial correlation matrix performs equally well, if not better.
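    One plausible reading of the region-wise smoothing construction, offered only as an illustration and not as the thesis's exact model, is a CAR/GMRF-type precision matrix whose off-diagonal entries within each region are scaled by a region-specific smoothing parameter. The toy adjacency matrix, region labels, and parameter values below are assumptions.

```python
import numpy as np

# Illustrative region-wise adaptive smoothing in a CAR-type precision matrix.
# Counties are nodes of an adjacency matrix W; each county belongs to a region
# (e.g. a Federal state) with its own smoothing parameter rho[r] in (0, 1).
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)   # 5 toy counties
region = np.array([0, 0, 0, 1, 1])              # counties 0-2 in region 0, counties 3-4 in region 1
rho = np.array([0.9, 0.4])                      # region-specific smoothing parameters

deg = W.sum(axis=1)
# scale each off-diagonal entry by sqrt(rho_i * rho_j) so the matrix stays
# symmetric and strictly diagonally dominant, hence positive definite
scale = np.sqrt(rho[region][:, None] * rho[region][None, :])
Q = np.diag(deg) - scale * W                    # precision matrix of the GMRF
Sigma = np.linalg.inv(Q)                        # implied covariance
corr = Sigma / np.sqrt(np.outer(np.diag(Sigma), np.diag(Sigma)))
print(np.round(corr, 2))                        # stronger within-region correlation where rho is larger
```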

    Geometric Methods in Machine Learning and Data Mining

    In machine learning, the standard goal is to find an appropriate statistical model from a model space based on training data from a data space, while in data mining the goal is to find interesting patterns in the data from a data space. In both fields, these spaces carry geometric structures that can be exploited by methods which make use of them (we shall call them geometric methods), or the problems themselves can be formulated in a way that naturally appeals to these methods. In such cases, studying these geometric structures and then using appropriate geometric methods not only gives insight into existing algorithms, but also helps build new and better algorithms. In my research, I develop methods that exploit the geometric structure of problems for a variety of machine learning and data mining tasks, and provide strong theoretical and empirical evidence in favor of using them. My dissertation is divided into two parts. In the first part, I develop algorithms for a well-known problem in data mining, the distance embedding problem. In particular, I use tools from computational geometry to build a unified framework for solving a distance embedding problem known as multidimensional scaling (MDS). This geometry-inspired framework results in algorithms that solve different variants of MDS better than previous state-of-the-art methods. In addition, these algorithms come with many other attractive properties: they are simple, intuitive, easily parallelizable, scalable, and can handle missing data. Furthermore, I extend the unified MDS framework to build scalable algorithms for dimensionality reduction and to solve a sensor network localization problem for mobile sensors. Experimental results show the effectiveness of this framework across all problems. In the second part of my dissertation, I turn to problems in machine learning; in particular, I use geometry to reason about conjugate priors, develop a model that hybridizes discriminative and generative frameworks, and build a new set of generative-process-driven kernels. More specifically, this part of my dissertation is devoted to the study of the geometry of the space of probabilistic models associated with statistical generative processes. This study, grounded in the theory of information geometry, allows me to reason about the appropriateness of conjugate priors from a geometric perspective, and hence to gain insight into the large number of existing models that rely on these priors. Furthermore, I use this study to build hybrid models more naturally, by combining discriminative and generative methods using the geometry underlying them, and to build a family of kernels called generative kernels that can be used as an off-the-shelf tool in any kernel learning method such as support vector machines. My experiments with generative kernels demonstrate their effectiveness, providing further evidence in favor of using geometric methods.
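    As context for the MDS work discussed above, the following is the textbook classical-MDS baseline (double centering plus an eigendecomposition), not the dissertation's geometry-inspired framework; it recovers a k-dimensional embedding whose pairwise distances approximate a given distance matrix.

```python
import numpy as np

# Classical MDS via double centering: a standard baseline, not the
# dissertation's algorithms.
def classical_mds(D, k=2):
    """D: (n, n) matrix of pairwise distances; returns an (n, k) embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D**2) @ J                    # doubly centered squared distances
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:k]         # keep the largest k eigenvalues
    L = np.sqrt(np.clip(eigval[order], 0, None))
    return eigvec[:, order] * L

# toy check: points on a unit square are recovered up to rotation/translation
X = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, k=2)
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D_hat, atol=1e-8))          # True: pairwise distances are preserved
```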

    Optimization tools for non-asymptotic statistics in exponential families

    Exponential families are a ubiquitous class of models in statistics. On the one hand, they can model any data type; in fact, most common distributions are exponential families: Gaussian, categorical, Poisson, Gamma, Wishart, and Dirichlet. On the other hand, they sit at the core of generalized linear models (GLMs), a foundational class of models in machine learning. They are also supported by beautiful mathematics thanks to their connection with convex duality and the Laplace transform; this beauty is largely responsible for the existence of this thesis. In this manuscript, we make three contributions at the intersection of optimization and statistics, all revolving around exponential families. The first contribution adapts and improves a variance-reduction optimization algorithm called stochastic dual coordinate ascent (SDCA) to train a particular class of GLM called conditional random fields (CRFs). CRFs are one of the cornerstones of structured prediction. They were notoriously hard to train until the advent of variance-reduction techniques, and our improved version of SDCA performs favorably compared to the previous state of the art. The second contribution focuses on causal discovery. Exponential families are widely used in graphical models, and in particular in causal graphical models. This contribution investigates a specific conjecture that gained some traction in previous work: causal models adapt faster to perturbations of the environment. Using results from optimization, we find strong support for this conjecture when the perturbation comes from an intervention on a cause, and evidence against it when the perturbation comes from an intervention on an effect. These findings call for a refinement of the conjecture. The third contribution addresses a fundamental property of exponential families. One of their most appealing properties is the closed-form maximum likelihood estimate (MLE) and maximum a posteriori (MAP) estimate for a natural choice of conjugate prior. These two estimators are used almost everywhere, often unknowingly: how often are a mean and a variance computed for bell-shaped data without thinking about the underlying Gaussian model? Nevertheless, the literature to date lacks finite-sample convergence results for the information (Kullback-Leibler) divergence between these estimators and the true distribution. Drawing on a parallel with optimization, we take some steps towards such a result and highlight directions for progress both in statistics and in optimization. These three contributions all put tools from optimization at the service of statistics in exponential families: improving an algorithm to learn GLMs for structured prediction, characterizing the adaptation speed of causal models, and estimating the learning speed of ubiquitous models. By tying together optimization and statistics, this thesis takes a step towards a better understanding of the fundamentals of machine learning.
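    The closed-form MLE and MAP estimators mentioned in the third contribution can be made concrete with the simplest exponential family. The sketch below, an illustration rather than anything from the thesis, estimates a Bernoulli parameter by MLE and by MAP under a conjugate Beta prior and then reports the Kullback-Leibler divergence from the true distribution to each estimate; all parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Closed-form estimators for a Bernoulli model with a conjugate Beta prior.
p_true, n = 0.3, 50
x = rng.binomial(1, p_true, size=n)

p_mle = x.mean()                                           # closed-form MLE: sample mean
alpha, beta = 2.0, 2.0                                     # conjugate Beta(alpha, beta) prior
p_map = (x.sum() + alpha - 1) / (n + alpha + beta - 2)     # closed-form MAP: posterior mode

def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

print(f"MLE: {p_mle:.3f}  KL: {kl_bernoulli(p_true, p_mle):.4f}")
print(f"MAP: {p_map:.3f}  KL: {kl_bernoulli(p_true, p_map):.4f}")
```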

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II