377 research outputs found

    Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes

    We define a family of probability distributions for random count matrices with a potentially unbounded number of rows and columns. The three distributions we consider are derived from the gamma-Poisson, gamma-negative binomial, and beta-negative binomial processes. Because the models lead to closed-form Gibbs sampling update equations, they are natural candidates for nonparametric Bayesian priors over count matrices. A key aspect of our analysis is the recognition that, although the random count matrices within the family are defined by a row-wise construction, their columns can be shown to be i.i.d. This fact is used to derive explicit formulas for drawing all the columns at once. Moreover, by analyzing these matrices' combinatorial structure, we describe how to sequentially construct a column-i.i.d. random count matrix one row at a time, and derive the predictive distribution of a new row count vector with previously unseen features. We describe the similarities and differences between the three priors, and argue that the greater flexibility of the gamma- and beta- negative binomial processes, especially their ability to model over-dispersed, heavy-tailed count data, makes these well suited to a wide variety of real-world applications. As an example of our framework, we construct a naive-Bayes text classifier to categorize a count vector to one of several existing random count matrices of different categories. The classifier supports an unbounded number of features, and unlike most existing methods, it does not require a predefined finite vocabulary to be shared by all the categories, and needs neither feature selection nor parameter tuning. 
Both the gamma- and beta-negative binomial processes are shown to significantly outperform the gamma-Poisson process for document categorization, with comparable performance to other state-of-the-art supervised text classification algorithms. Comment: To appear in Journal of the American Statistical Association (Theory and Methods). 31 pages + 11 page supplement, 5 figure
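The over-dispersion mentioned in this abstract can be illustrated with the standard gamma-Poisson mixture. The block below is a textbook identity, not the paper's own construction; the symbols $r$ (gamma shape) and $p$ (negative binomial probability parameter) are notation assumed here for illustration.

```latex
% Assumed parameterization: \lambda \sim \mathrm{Gamma}(r,\ \text{scale} = p/(1-p)),
% x \mid \lambda \sim \mathrm{Poisson}(\lambda).
\begin{align*}
P(x = k) &= \int_0^\infty \frac{\lambda^k e^{-\lambda}}{k!}\,
            \frac{\lambda^{r-1} e^{-\lambda (1-p)/p}}{\Gamma(r)\,\bigl(p/(1-p)\bigr)^{r}}\, d\lambda
          = \frac{\Gamma(k+r)}{k!\,\Gamma(r)}\, p^{k} (1-p)^{r}, \\
\mathbb{E}[x] &= \frac{r p}{1-p}, \qquad
\operatorname{Var}[x] = \frac{r p}{(1-p)^{2}}
          = \mathbb{E}[x] + \frac{\mathbb{E}[x]^{2}}{r} \;>\; \mathbb{E}[x].
\end{align*}
```

Under these assumptions the marginal count is negative binomial, and its variance always exceeds its mean, whereas a Poisson count with fixed rate has variance equal to its mean.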

    Nonparametric Network Models for Link Prediction

    Many data sets can be represented as a sequence of interactions between entities, for example communications between individuals in a social network, protein-protein interactions or DNA-protein interactions in a biological context, or vehicles' journeys between cities. In these contexts, there is often interest in making predictions about future interactions, such as who will message whom. A popular approach to network modeling in a Bayesian context is to assume that the observed interactions can be explained in terms of some latent structure. For example, traffic patterns might be explained by the size and importance of cities, and social network interactions might be explained by the social groups and interests of individuals. Unfortunately, while elucidating this structure can be useful, it often does not directly translate into an effective predictive tool. Further, many existing approaches are not appropriate for sparse networks, a class that includes many interesting real-world situations. In this paper, we develop models for sparse networks that combine structure elucidation with predictive performance. We use a Bayesian nonparametric approach, which allows us to predict interactions with entities outside our training set, and allows both the latent dimensionality of the model and the number of nodes in the network to grow in expectation as we see more data. We demonstrate that we can capture latent structure while maintaining predictive power, and discuss possible extensions.

    Bayesian analysis of hierarchical models for polychotomous data from a multi-stage cluster sample

    In this thesis we present a hierarchical Bayesian methodology for analyzing polychotomous data from multi-stage cluster samples. We begin with a model for multinomial data drawn from a two-stage cluster sample of a finite population. This model is then extended to incorporate partially observed data, assuming that the data are missing at random (MAR) in the terminology of Little and Rubin (1987). We next develop a model for polychotomous data collected via a three-stage cluster sample. As with the two-stage model, we describe the methodology for dealing with partially observed data assuming they are MAR. We apply these two methodologies to the 1990 Slovenian Public Opinion Survey and present the results of these analyses. Finally, we fashion a multivariate probit model for a special type of multinomial data, multivariate binary data. We then extend this model to incorporate covariate information, outlining the methodology for the case of a two-stage cluster sample. This approach also allows for the integration of missing data into the analysis if the data are MAR. For all of the above models we use Markov chain Monte Carlo techniques to simulate samples from the posterior distribution. These samples are then used to make inferences from the models.

    HOW MANY WORDS ARE THERE?

    The commonsensical assumption that any language has only finitely many words is shown to be false by a combination of formal and empirical arguments. Zipf's Law and related formulas are investigated and a more complex model is offered.

    Bayesian nonparametric models for data exploration

    International Mention in the doctoral degree. Making sense out of data is one of the biggest challenges of our time. With the emergence of technologies such as the Internet, sensor networks or deep genome sequencing, a true data explosion has been unleashed that affects all fields of science and our everyday life. Recent breakthroughs, such as self-driving cars or champion-level Go-playing programs, have demonstrated the potential benefits of exploiting data, mostly in well-defined supervised tasks. However, we have barely started to actually explore and truly understand data. In fact, data holds valuable information for answering some of the most important questions for humanity: How does aging impact our physical capabilities? What are the underlying mechanisms of cancer? Which factors make some countries wealthier than others? Most of these questions cannot be stated as well-defined supervised problems, and might benefit enormously from multidisciplinary research efforts involving easy-to-interpret models and rigorous exploratory data analyses. Efficient data exploration might lead to life-changing scientific discoveries, which can later be turned into a more impactful exploitation phase, to put forward more informed policy recommendations, decision-making systems, medical protocols or improved models for highly accurate predictions. This thesis proposes tailored Bayesian nonparametric (BNP) models to solve specific data exploratory tasks across different scientific areas including sport sciences, cancer research, and economics. We resort to BNP approaches to facilitate the discovery of unexpected hidden patterns within data. BNP models place a prior distribution over an infinite-dimensional parameter space, which makes them particularly useful in probabilistic models where the number of hidden parameters is unknown a priori. 
Under this prior distribution, the posterior distribution of the hidden parameters given the data will assign high probability mass to those configurations that best explain the observations. Hence, inference over the hidden variables can be performed using standard Bayesian inference techniques, thereby avoiding expensive model selection steps. This thesis is application-focused and highly multidisciplinary. More precisely, we propose an automatic grading system for sports competitions to compare athletic performance regardless of age, gender and environmental aspects; we develop BNP models to perform genetic association and biomarker discovery in cancer research, using either genetic information and Electronic Health Records or clinical trial data; finally, we present a flexible infinite latent factor model of international trade data to understand the underlying economic structure of countries and their evolution over time. Official Doctoral Programme in Multimedia and Communications. President: Joaquín Míguez Arenas. Secretary: Daniel Hernández Lobato. Member: Cédric Archambea

    Consistent estimation of small masses in feature sampling

    Consider an (observable) random sample of size $n$ from an infinite population of individuals, each individual being endowed with a finite set of features from a collection of features $(F_j)_{j\geq 1}$ with unknown probabilities $(p_j)_{j\geq 1}$, i.e., $p_j$ is the probability that an individual displays feature $F_j$. Under this feature sampling framework, in recent years there has been a growing interest in estimating the sum of the probability masses $p_j$ of features observed with frequency $r\geq 0$ in the sample, here denoted by $M_{n,r}$. This is the natural feature sampling counterpart of the classical problem of estimating small probabilities in the species sampling framework, where each individual is endowed with only one feature (or "species"). In this paper we study the problem of consistent estimation of the small mass $M_{n,r}$. We first show that there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass $M_{n,0}$. Then, we introduce an estimator of $M_{n,r}$ and identify sufficient conditions under which the estimator is consistent. In particular, we propose a nonparametric estimator $\hat{M}_{n,r}$ of $M_{n,r}$ which has the same analytic form as the celebrated Good-Turing estimator for small probabilities, with the sole difference that the two estimators have different ranges (supports). Then, we show that $\hat{M}_{n,r}$ is strongly consistent, in the multiplicative sense, under the assumption that $(p_j)_{j\geq 1}$ has regularly varying heavy tails.
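A Good-Turing-form estimator of the kind this abstract describes can be sketched in a few lines of Python. The abstract does not print the formula, so the expression used here, (r + 1) times the number of features seen exactly r + 1 times, divided by n, is an assumption based on the classical Good-Turing form. The function name `good_turing_feature_mass` and the toy data are hypothetical.

```python
from collections import Counter


def good_turing_feature_mass(samples, r):
    """Good-Turing-form estimate of the total mass of features seen exactly r times.

    Assumed form (not quoted from the paper): (r + 1) * K_{n, r+1} / n, where
    K_{n, r+1} is the number of distinct features displayed by exactly r + 1
    of the n sampled individuals.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sampled individual")
    if r < 0:
        raise ValueError("r must be non-negative")

    # Count, for each feature, how many individuals display it.
    # set() guards against a feature listed twice for the same individual.
    freq = Counter()
    for features in samples:
        freq.update(set(features))

    # K_{n, r+1}: number of features with frequency exactly r + 1.
    k_next = sum(1 for count in freq.values() if count == r + 1)
    return (r + 1) * k_next / n


# Toy data: four individuals, each with a finite set of features.
toy = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
missing_mass = good_turing_feature_mass(toy, 0)  # uses features seen once ("c")
```

The result is not capped at 1: an individual may display several features, so the feature probabilities need not sum to one, which is consistent with the abstract's remark that the two estimators have different ranges.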

    Learning Non-Parametric and High-Dimensional Distributions via Information-Theoretic Methods

    Learning distributions that govern the generation of data and estimating related functionals are the foundations of many classical statistical problems. In this dissertation we investigate such topics when either the hypothesized model is non-parametric or the number of free parameters in the model grows along with the sample size. In particular, we study the above scenarios for the following class of problems, with the goal of obtaining minimax rate-optimal methods for learning the target distributions when the sample size is finite. Our techniques are based on information-theoretic divergences and related mutual-information based methods. (i) Estimation in compound decision and empirical Bayes settings: To estimate the data-generating distribution, one often takes the following two-step approach. In the first step the statistician estimates the distribution of the parameters, either the empirical distribution or the postulated prior, and then in the second step plugs in the estimate to approximate the target of interest. In the literature, the estimation of the empirical distribution is known as the compound decision problem and the estimation of the prior is known as the problem of empirical Bayes. In our work we use the method of minimum-distance estimation for approximating these distributions. Considering certain discrete data setups, we show that the minimum-distance based method provides theoretically and practically sound choices for estimation. The computational and algorithmic aspects of the estimators are also analyzed. (ii) Prediction with Markov chains: Given observations from an unknown Markov chain, we study the problem of predicting the next entry in the trajectory. Existing analysis for such a dependent setup usually centers around concentration inequalities that use various extraneous conditions on the mixing properties. This makes it difficult to achieve results independent of such restrictions. 
We introduce information-theoretic techniques to bypass such issues and obtain fundamental limits for the related minimax problems. We also analyze conditions on the mixing properties that produce a parametric rate of prediction errors.

    Semi-continuous hidden Markov models for speech recognition
