1,488 research outputs found
A Bayesian nonparametric approach for the analysis of multiple categorical item responses
We develop a modeling framework for joint factor and cluster analysis of datasets where multiple categorical response items are collected on a heterogeneous population of individuals. We introduce a latent factor multinomial probit model and employ prior constructions that allow inference on the number of factors as well as clustering of the subjects into homogeneous groups according to their relevant factors. Clustering, in particular, allows us to borrow strength across subjects, therefore helping in the estimation of the model parameters, particularly when the number of observations is small. We employ Markov chain Monte Carlo techniques and obtain tractable posterior inference for our objectives, including sampling of missing data. We demonstrate the effectiveness of our method on simulated data. We also analyze two real-world educational datasets and show that our method outperforms state-of-the-art methods. In the analysis of the real-world data, we uncover hidden relationships between the questions and the underlying educational concepts, while simultaneously partitioning the students into groups of similar educational mastery
Bayesian nonparametric models for data exploration
Mención Internacional en el título de doctorMaking sense out of data is one of the biggest challenges of our time. With the emergence of
technologies such as the Internet, sensor networks or deep genome sequencing, a true data explosion
has been unleashed that affects all fields of science and our everyday life. Recent breakthroughs, such
as self-driven cars or champion-level Go player programs, have demonstrated the potential benefits
from exploiting data, mostly in well-defined supervised tasks. However, we have barely started to
actually explore and truly understand data.
In fact, data holds valuable information for answering most important questions for humanity:
How does aging impact our physical capabilities? What are the underlying mechanisms of cancer?
Which factors make countries wealthier than others? Most of these questions cannot be stated as
well-defined supervised problems, and might benefit enormously from multidisciplinary research
efforts involving easy-to-interpret models and rigorous data exploratory analyses. Efficient data exploration
might lead to life-changing scientific discoveries, which can later be turned into a more impactful
exploitation phase, to put forward more informed policy recommendations, decision-making
systems, medical protocols or improved models for highly accurate predictions.
This thesis proposes tailored Bayesian nonparametric (BNP) models to solve specific data exploratory
tasks across different scientific areas including sport sciences, cancer research, and economics.
We resort to BNP approaches to facilitate the discovery of unexpected hidden patterns
within data. BNP models place a prior distribution over an infinite-dimensional parameter space,
which makes them particularly useful in probabilistic models where the number of hidden parameters
is unknown a priori. Under this prior distribution, the posterior distribution of the hidden parameters
given the data will assign high probability mass to those configurations that best explain the
observations. Hence, inference over the hidden variables can be performed using standard Bayesian
inference techniques, therefore avoiding expensive model selection steps.
This thesis is application-focused and highly multidisciplinary. More precisely, we propose an
automatic grading system for sportive competitions to compare athletic performance regardless of
age, gender and environmental aspects; we develop BNP models to perform genetic association
and biomarker discovery in cancer research, either using genetic information and Electronic Health
Records or clinical trial data; finally, we present a flexible infinite latent factor model of international
trade data to understand the underlying economic structure of countries and their evolution over time.Uno de los principales desafíos de nuestro tiempo es encontrar sentido dentro de los datos. Con la aparición de tecnologías como Internet, redes de sensores, o métodos de secuenciación profunda
del genoma, una verdadera explosión digital se ha visto desencadenada, afectando todos los campos científicos, así como nuestra vida diaria. Logros recientes como pueden ser los coches auto-dirigidos o programas que ganan a los seres humanos al milenario juego del Go, han demostrado con creces los posibles beneficios que podemos obtener de la explotación de datos, mayoritariamente en tareas
supervisadas bien definidas. No obstante, apenas hemos empezado con la exploración de datos y su verdadero entendimiento.
En verdad, los datos encierran información muy valiosa para responder a muchas de las preguntas
más importantes para la humanidad: ¿Cómo afecta el envejecimiento a nuestras aptitudes físicas?
¿Cuáles son los mecanismos subyacentes del cáncer? ¿Qué factores explican la riqueza de ciertos
países frente a otros? Si bien la mayoría de estas preguntas no pueden formularse como problemas
supervisados bien definidos, éstas pueden ser abordadas mediante esfuerzos de investigación
multidisciplinar que involucren modelos fáciles de interpretar y análisis exploratorios rigurosos. Explorar los datos de manera eficiente abre potencialmente la puerta a un sinnúmero de descubrimientos
científicos en diversas áreas con impacto real en nuestras vidas, descubrimientos que a su vez pueden llevarnos a una mejor explotación de los datos, resultando en recomendaciones políticas adecuadas, sistemas precisos de toma de decisión, protocolos médicos optimizados o modelos con mejores capacidades
predictivas. Esta tesis propone modelos Bayesianos no-paramétricos (BNP) adecuados para la resolución específica de tareas explorativas de los datos en diversos ámbitos científicos incluyendo ciencias del
deporte, investigación contra el cáncer, o economía. Recurrimos a un planteamiento BNP para facilitar el descubrimiento de patrones ocultos inesperados subyacentes en los datos. Los modelos
BNP definen una distribución a priori sobre un espacio de parámetros de dimensión infinita, lo cual los hace especialmente atractivos para enfoques probabilísticos donde el número de parámetros latentes
es en principio desconocido. Bajo dicha distribución a priori, la distribución a posteriori de los parámetros ocultos dados los datos asignará mayor probabilidad a aquellas configuraciones que
mejor explican las observaciones. De esta manera, la inferencia sobre el espacio de variables ocultas puede realizarse mediante técnicas estándar de inferencia Bayesiana, evitando el proceso de selección
de modelos. Esta tesis se centra en el ámbito de las aplicaciones, y es de naturaleza multidisciplinar. En
concreto, proponemos un sistema de gradación automática para comparar el rendimiento deportivo
de atletas independientemente de su edad o género, así como de otros factores del entorno. Desarrollamos
modelos BNP para descubrir asociaciones genéticas y biomarcadores dentro de la investigación contra el cáncer, ya sea contrastando información genética con la historia clínica electrónica
de los pacientes, o utilizando datos de ensayos clínicos; finalmente, presentamos un modelo flexible
de factores latentes infinito para datos de comercio internacional, con el objetivo de entender la
estructura económica de los distintos países y su correspondiente evolución a lo largo del tiempo.Programa Oficial de Doctorado en Multimedia y ComunicacionesPresidente: Joaquín Míguez Arenas.- Secretario: Daniel Hernández Lobato.- Vocal: Cédric Archambea
Infinite Factorial Finite State Machine for Blind Multiuser Channel Estimation
New communication standards need to deal with machine-to-machine
communications, in which users may start or stop transmitting at any time in an
asynchronous manner. Thus, the number of users is an unknown and time-varying
parameter that needs to be accurately estimated in order to properly recover
the symbols transmitted by all users in the system. In this paper, we address
the problem of joint channel parameter and data estimation in a multiuser
communication channel in which the number of transmitters is not known. For
that purpose, we develop the infinite factorial finite state machine model, a
Bayesian nonparametric model based on the Markov Indian buffet that allows for
an unbounded number of transmitters with arbitrary channel length. We propose
an inference algorithm that makes use of slice sampling and particle Gibbs with
ancestor sampling. Our approach is fully blind as it does not require a prior
channel estimation step, prior knowledge of the number of transmitters, or any
signaling information. Our experimental results, loosely based on the LTE
random access channel, show that the proposed approach can effectively recover
the data-generating process for a wide range of scenarios, with varying number
of transmitters, number of receivers, constellation order, channel length, and
signal-to-noise ratio.Comment: 15 pages, 15 figure
A survey on Bayesian nonparametric learning
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. Bayesian (machine) learning has been playing a significant role in machine learning for a long time due to its particular ability to embrace uncertainty, encode prior knowledge, and endow interpretability. On the back of Bayesian learning's great success, Bayesian nonparametric learning (BNL) has emerged as a force for further advances in this field due to its greater modelling flexibility and representation power. Instead of playing with the fixed-dimensional probabilistic distributions of Bayesian learning, BNL creates a new “game” with infinite-dimensional stochastic processes. BNL has long been recognised as a research subject in statistics, and, to date, several state-of-the-art pilot studies have demonstrated that BNL has a great deal of potential to solve real-world machine-learning tasks. However, despite these promising results, BNL has not created a huge wave in the machine-learning community. Esotericism may account for this. The books and surveys on BNL written by statisticians are overcomplicated and filled with tedious theories and proofs. Each is certainly meaningful but may scare away new researchers, especially those with computer science backgrounds. Hence, the aim of this article is to provide a plain-spoken, yet comprehensive, theoretical survey of BNL in terms that researchers in the machine-learning community can understand. It is hoped this survey will serve as a starting point for understanding and exploiting the benefits of BNL in our current scholarly endeavours. To achieve this goal, we have collated the extant studies in this field and aligned them with the steps of a standard BNL procedure-from selecting the appropriate stochastic processes through manipulation to executing the model inference algorithms. At each step, past efforts have been thoroughly summarised and discussed. In addition, we have reviewed the common methods for implementing BNL in various machine-learning tasks along with its diverse applications in the real world as examples to motivate future studies
- …