Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes
We define a family of probability distributions for random count matrices
with a potentially unbounded number of rows and columns. The three
distributions we consider are derived from the gamma-Poisson, gamma-negative
binomial, and beta-negative binomial processes. Because the models lead to
closed-form Gibbs sampling update equations, they are natural candidates for
nonparametric Bayesian priors over count matrices. A key aspect of our analysis
is the recognition that, although the random count matrices within the family
are defined by a row-wise construction, their columns can be shown to be i.i.d.
This fact is used to derive explicit formulas for drawing all the columns at
once. Moreover, by analyzing these matrices' combinatorial structure, we
describe how to sequentially construct a column-i.i.d. random count matrix one
row at a time, and derive the predictive distribution of a new row count vector
with previously unseen features. We describe the similarities and differences
between the three priors, and argue that the greater flexibility of the gamma-
and beta-negative binomial processes, especially their ability to model
over-dispersed, heavy-tailed count data, makes them well suited to a wide
variety of real-world applications. As an example of our framework, we
construct a naive-Bayes text classifier to categorize a count vector to one of
several existing random count matrices of different categories. The classifier
supports an unbounded number of features, and unlike most existing methods, it
does not require a predefined finite vocabulary to be shared by all the
categories, and needs neither feature selection nor parameter tuning. Both the
gamma- and beta-negative binomial processes are shown to significantly
outperform the gamma-Poisson process for document categorization, with
comparable performance to other state-of-the-art supervised text classification
algorithms.

Comment: To appear in the Journal of the American Statistical Association (Theory and Methods). 31 pages + 11-page supplement, 5 figures.
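As a minimal sketch of the column structure described above, consider a finite truncation of the gamma-Poisson construction: each column has its own gamma-distributed rate, and the entries of that column are i.i.d. Poisson draws given the rate, so the columns themselves are i.i.d. This is only an illustration of the column-i.i.d. property under assumed shape and scale parameters, not the paper's full nonparametric construction.

```python
# Finite-truncation sketch of a gamma-Poisson random count matrix.
# Each of the n_cols columns gets its own gamma-distributed rate;
# entries within a column are i.i.d. Poisson given that rate, so
# columns are i.i.d. (the property exploited in the paper).
import numpy as np

rng = np.random.default_rng(0)

def gamma_poisson_matrix(n_rows, n_cols, shape=1.0, scale=1.0, rng=rng):
    """Draw an n_rows x n_cols count matrix: one gamma rate per column,
    Poisson counts within each column."""
    rates = rng.gamma(shape, scale, size=n_cols)   # column-specific rates
    return rng.poisson(rates, size=(n_rows, n_cols))

X = gamma_poisson_matrix(5, 8)
```

Because the columns are exchangeable draws from the same mixture, all columns can indeed be generated at once, as the abstract notes.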
Nonparametric Network Models for Link Prediction
Many data sets can be represented as a sequence of interactions between entities: for example, communications between individuals in a social network, protein-protein interactions or DNA-protein interactions in a biological context, or vehicles' journeys between cities. In these contexts, there is often interest in making predictions about future interactions, such as who will message whom. A popular approach to network modeling in a Bayesian context is to assume that the observed interactions can be explained in terms of some latent structure. For example, traffic patterns might be explained by the size and importance of cities, and social network interactions might be explained by the social groups and interests of individuals. Unfortunately, while elucidating this structure can be useful, it often does not directly translate into an effective predictive tool. Further, many existing approaches are not appropriate for sparse networks, a class that includes many interesting real-world situations. In this paper, we develop models for sparse networks that combine structure elucidation with predictive performance. We use a Bayesian nonparametric approach, which allows us to predict interactions with entities outside our training set, and allows both the latent dimensionality of the model and the number of nodes in the network to grow in expectation as we see more data. We demonstrate that we can capture latent structure while maintaining predictive power, and discuss possible extensions.
Bayesian analysis of hierarchical models for polychotomous data from a multi-stage cluster sample
In this thesis we present a hierarchical Bayesian methodology for analyzing polychotomous data from multi-stage cluster samples. We begin with a model for multinomial data drawn from a two-stage cluster sample of a finite population. This model is then extended to incorporate partially observed data assuming that the data are missing at random (MAR), in the terminology of Little and Rubin (1987). We next develop a model for polychotomous data collected via a three-stage cluster sample. As with the two-stage model, we describe the methodology for dealing with partially observed data assuming they are MAR. We apply these two methodologies to the 1990 Slovenian Public Opinion Survey and present the results of these analyses. Finally, we fashion a multivariate probit model for a special type of multinomial data, multivariate binary data, and extend it to incorporate covariate information for the case of a two-stage cluster sample. This approach also allows for the integration of missing data into the analysis if the data are MAR. For all of the above models we use Markov chain Monte Carlo techniques to simulate samples from the posterior distribution. These samples are then utilized in making inference from the models.
HOW MANY WORDS ARE THERE?
The commonsensical assumption that any language has only finitely many words is shown to be false by a combination of formal and empirical arguments. Zipf's Law and related formulas are investigated and a more complex model is offered.
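The Zipf's-law argument above can be illustrated numerically: if word probabilities decay like a power law over an effectively unbounded rank range, the number of distinct words observed keeps growing with corpus size instead of saturating at a fixed vocabulary. The support size and exponent below are illustrative assumptions, not values from the paper.

```python
# Numerical sketch: under a power-law (Zipf-like) word distribution,
# the observed vocabulary keeps growing as the corpus grows.
import numpy as np

rng = np.random.default_rng(1)

def distinct_words(n_tokens, support=10**6, exponent=1.1, rng=rng):
    """Count distinct 'words' in a sample of n_tokens drawn from a
    truncated power-law rank-frequency distribution."""
    ranks = np.arange(1, support + 1)
    probs = ranks ** (-exponent)
    probs /= probs.sum()
    sample = rng.choice(support, size=n_tokens, p=probs)
    return len(np.unique(sample))

v_small = distinct_words(1_000)
v_large = distinct_words(100_000)
# v_large substantially exceeds v_small: the vocabulary does not saturate.
```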
Bayesian nonparametric models for data exploration
International Mention in the doctoral degree.

Making sense out of data is one of the biggest challenges of our time. With the emergence of
technologies such as the Internet, sensor networks or deep genome sequencing, a true data explosion
has been unleashed that affects all fields of science and our everyday life. Recent breakthroughs, such
as self-driven cars or champion-level Go player programs, have demonstrated the potential benefits
from exploiting data, mostly in well-defined supervised tasks. However, we have barely started to
actually explore and truly understand data.
In fact, data holds valuable information for answering some of the most important questions facing humanity:
How does aging impact our physical capabilities? What are the underlying mechanisms of cancer?
Which factors make countries wealthier than others? Most of these questions cannot be stated as
well-defined supervised problems, and might benefit enormously from multidisciplinary research
efforts involving easy-to-interpret models and rigorous data exploratory analyses. Efficient data exploration
might lead to life-changing scientific discoveries, which can later be turned into a more impactful
exploitation phase, to put forward more informed policy recommendations, decision-making
systems, medical protocols or improved models for highly accurate predictions.
This thesis proposes tailored Bayesian nonparametric (BNP) models to solve specific data exploratory
tasks across different scientific areas including sport sciences, cancer research, and economics.
We resort to BNP approaches to facilitate the discovery of unexpected hidden patterns
within data. BNP models place a prior distribution over an infinite-dimensional parameter space,
which makes them particularly useful in probabilistic models where the number of hidden parameters
is unknown a priori. Under this prior distribution, the posterior distribution of the hidden parameters
given the data will assign high probability mass to those configurations that best explain the
observations. Hence, inference over the hidden variables can be performed using standard Bayesian
inference techniques, therefore avoiding expensive model selection steps.
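A minimal illustration of the BNP idea described above is the Chinese restaurant process: the number of clusters is not fixed in advance but grows with the data, so no separate model-selection step over the number of components is needed. The concentration parameter alpha below is an illustrative choice, not one used in the thesis.

```python
# Chinese-restaurant-process sketch: customers are seated sequentially,
# joining an existing cluster with probability proportional to its size
# or opening a new cluster with probability proportional to alpha.
import numpy as np

rng = np.random.default_rng(2)

def crp_assignments(n, alpha=1.0, rng=rng):
    """Seat n customers; return per-customer cluster labels and the
    number of clusters that emerged."""
    labels = [0]
    counts = [1]               # customers per existing cluster
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):   # open a new cluster
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, len(counts)

labels, n_clusters = crp_assignments(200)
```

The number of clusters is random and data-driven, which is exactly what lets posterior inference sidestep an explicit model-selection step.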
This thesis is application-focused and highly multidisciplinary. More precisely, we propose an
automatic grading system for sporting competitions to compare athletic performance regardless of
age, gender and environmental aspects; we develop BNP models to perform genetic association
and biomarker discovery in cancer research, either using genetic information and Electronic Health
Records or clinical trial data; finally, we present a flexible infinite latent factor model of international
trade data to understand the underlying economic structure of countries and their evolution over time.

Official Doctoral Program in Multimedia and Communications. President: Joaquín Míguez Arenas. Secretary: Daniel Hernández Lobato. Member: Cédric Archambea
Consistent estimation of small masses in feature sampling
Consider an (observable) random sample of size n from an infinite population of individuals, each individual endowed with a finite set of features from a collection (F_j)_{j≥1} with unknown probabilities (p_j)_{j≥1}; that is, p_j is the probability that an individual displays feature F_j. Under this feature sampling framework, there has been growing interest in recent years in estimating the sum of the probability masses p_j of features observed with frequency r ≥ 0 in the sample, denoted here by M_{n,r}. This is the natural feature sampling counterpart of the classical problem of estimating small probabilities in the species sampling framework, where each individual is endowed with only one feature (or "species"). In this paper we study the problem of consistent estimation of the small mass M_{n,r}. We first show that there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass M_{n,0}. Then, we introduce an estimator of M_{n,r} and identify sufficient conditions under which it is consistent. In particular, we propose a nonparametric estimator \hat{M}_{n,r} of M_{n,r} which has the same analytic form as the celebrated Good–Turing estimator for small probabilities, with the sole difference that the two estimators have different ranges (supports). Finally, we show that \hat{M}_{n,r} is strongly consistent, in the multiplicative sense, under the assumption that (p_j)_{j≥1} has regularly varying heavy tails.
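The Good–Turing form referenced above estimates the total mass of features seen exactly r times as (r+1) K_{r+1} / n, where K_{r+1} is the number of features observed r+1 times. Below is a hedged sketch of that analytic form on toy counts; the counts, feature names, and the use of the total observation count as n are illustrative assumptions, not data or notation from the paper.

```python
# Good--Turing-style estimator of the mass of features seen r times:
#   (r + 1) * K_{r+1} / n,
# where K_{r+1} is the number of features with frequency r+1 and n is
# the sample size (here a toy stand-in: total observations).
from collections import Counter

def good_turing_mass(feature_counts, r, n):
    """Estimate the total probability mass of features seen exactly r times."""
    freq_of_freq = Counter(feature_counts.values())  # K_j for each frequency j
    return (r + 1) * freq_of_freq.get(r + 1, 0) / n

# Toy sample: two singleton features, one seen twice, one seen thrice.
counts = Counter({"f1": 1, "f2": 1, "f3": 2, "f4": 3})
n = sum(counts.values())               # 7 observations in this toy sample
m0 = good_turing_mass(counts, 0, n)    # missing mass: 2 singletons -> 2/7
```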
Learning Non-Parametric and High-Dimensional Distributions via Information-Theoretic Methods
Learning the distributions that govern the generation of data, and estimating related functionals, are the foundations of many classical statistical problems. In this dissertation we investigate such topics when either the hypothesized model is nonparametric or the number of free parameters in the model grows with the sample size. In particular, we study the scenarios above for the following class of problems, with the goal of obtaining minimax rate-optimal methods for learning the target distributions when the sample size is finite. Our techniques are based on information-theoretic divergences and related mutual-information-based methods. (i) Estimation in compound decision and empirical Bayes settings: To estimate the data-generating distribution, one often takes the following two-step approach. In the first step the statistician estimates the distribution of the parameters, either the empirical distribution or the postulated prior, and in the second step plugs in the estimate to approximate the target of interest. In the literature, estimating the empirical distribution is known as the compound decision problem, and estimating the prior is known as the empirical Bayes problem. In our work we use minimum-distance estimation to approximate these distributions. For certain discrete data setups, we show that the minimum-distance method provides theoretically and practically sound choices for estimation. The computational and algorithmic aspects of the estimators are also analyzed. (ii) Prediction with Markov chains: Given observations from an unknown Markov chain, we study the problem of predicting the next entry in the trajectory. Existing analyses for such a dependent setup usually center around concentration inequalities that impose various extraneous conditions on the mixing properties, making it difficult to obtain results free of such restrictions. We introduce information-theoretic techniques to bypass these issues and obtain fundamental limits for the related minimax problems. We also analyze conditions on the mixing properties that produce a parametric rate of prediction error.