Computational aspects of DNA mixture analysis
Statistical analysis of DNA mixtures is known to pose computational
challenges due to the enormous state space of possible DNA profiles. We propose
a Bayesian network representation for genotypes, allowing computations to be
performed locally involving only a few alleles at each step. In addition, we
describe a general method for computing the expectation of a product of
discrete random variables using auxiliary variables and probability propagation
in a Bayesian network, which in combination with the genotype network allows
efficient computation of the likelihood function and various other quantities
relevant to the inference. Lastly, we introduce a set of diagnostic tools for
assessing the adequacy of the model for describing a particular dataset.
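The local-computation idea can be illustrated outside the genotype setting. The sketch below is not the paper's auxiliary-variable construction, just a minimal chain-structured analogue: it computes the expectation of a product of discrete variables by a forward pass that involves only one variable at each step, and checks the result against brute-force enumeration over the full joint state space.

```python
import itertools
import numpy as np

def expected_product_chain(p1, T, n):
    """E[X_1 * ... * X_n] for a Markov chain on states {0,...,K-1},
    computed by a forward recursion that touches one variable at a time."""
    states = np.arange(len(p1), dtype=float)
    m = p1 * states                      # m_1(x) = P(X_1 = x) * x
    for _ in range(n - 1):
        m = (m @ T) * states             # fold in one more variable locally
    return m.sum()

def expected_product_bruteforce(p1, T, n):
    """Reference implementation: explicit sum over the exponential state space."""
    K = len(p1)
    total = 0.0
    for xs in itertools.product(range(K), repeat=n):
        p = p1[xs[0]]
        for a, b in zip(xs, xs[1:]):
            p *= T[a, b]
        total += p * np.prod(xs)
    return total
```

The forward pass costs O(n K^2) instead of O(K^n), which is the kind of saving local propagation buys in the genotype network.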
Statistical methods to study transposon sequencing data: nonparametric Bayesian models with sampling algorithms
With the development of Next Generation Sequencing (NGS) technology, researchers can easily obtain data from millions of cells (bulk samples) or from a single cell. However, while bulk samples can capture broad changes, they risk providing an average measurement that is not representative of the genetic state of any individual cell. Single-cell experiments, by contrast, capture the genetic state of an individual cell, but they carry greater uncertainty, and sampling enough cells to obtain a representative sample of the population is expensive. Therefore, there is a need to integrate information from both bulk and single-cell data to obtain a comprehensive understanding of subclonal populations within an individual tumor as well as across individuals. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement, and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy and use this representation to derive a fast Gibbs sampling inference algorithm via the augment-and-marginalize method. Experiments show that our model outperforms state-of-the-art methods.
Another goal in analyzing genomic data is to understand which genes are essential and under what environmental conditions they are essential. Transposon sequencing provides a powerful tool for finding conditionally essential genes. However, methods are needed that go beyond a one-at-a-time analysis of conditionally essential genes and learn higher-order representations that identify conditionally essential networks of genes. While existing methods do identify essential genes from transposon sequencing data, they do not provide a representation of the space of essential genes. For example, if two genes share the same pattern of essentiality across all conditions, there is a higher-level representation that couples those genes into a network. The goal of this work is to build such higher-level representations of the set of essential genes and identify genes that share essentiality patterns across conditions. To address this need, we develop a novel, computationally efficient hierarchical nonparametric Bayesian model: the hierarchical Gamma-Poisson process (hGP).
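The Gamma-Poisson route to Dirichlet-type priors can already be seen in the finite-dimensional case: independent Gamma draws, once normalized, are Dirichlet distributed. The sketch below illustrates only this building block, not the thesis's hGP model or its Gibbs sampler.

```python
import numpy as np

def dirichlet_via_gamma(alpha, size, rng):
    """Finite-dimensional analogue of the Gamma construction:
    independent Gamma(alpha_k, 1) draws, normalized, are Dirichlet(alpha)."""
    g = rng.gamma(shape=alpha, size=(size, len(alpha)))
    return g / g.sum(axis=1, keepdims=True)
```

The hierarchical version stacks such constructions, and the Poisson half of the hierarchy is what makes the augment-and-marginalize Gibbs updates conjugate.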
Bayesian Cointegrated Vector Autoregression models incorporating Alpha-stable noise for inter-day price movements via Approximate Bayesian Computation
We consider a statistical model for pairs of traded assets, based on a
Cointegrated Vector Auto Regression (CVAR) Model. We extend standard CVAR
models to incorporate estimation of model parameters in the presence of price
series level shifts which are not accurately modeled in the standard Gaussian
error correction model (ECM) framework. This involves developing a novel matrix
variate Bayesian CVAR mixture model comprised of Gaussian errors intra-day and
Alpha-stable errors inter-day in the ECM framework. To achieve this we derive a
novel conjugate posterior model for the Scaled Mixtures of Normals (SMiN CVAR)
representation of Alpha-stable inter-day innovations. These results are
generalized to asymmetric models for the innovation noise at inter-day
boundaries allowing for skewed Alpha-stable models.
Our proposed model and sampling methodology is general, incorporating the
current literature on Gaussian models as a special subclass and also allowing
for price series level shifts either at random estimated time points or known a
priori time points. We focus analysis on regularly observed non-Gaussian level
shifts that can have significant effect on estimation performance in
statistical models failing to account for such level shifts, such as at the
close and open of markets. We compare the estimation accuracy of our model and
estimation approach to standard frequentist and Bayesian procedures for CVAR
models when non-Gaussian price series level shifts are present in the
individual series, such as inter-day boundaries. We fit a bi-variate
Alpha-stable model to the inter-day jumps and model the effect of such jumps on
estimation of matrix-variate CVAR model parameters using the likelihood based
Johansen procedure and a Bayesian estimation. We illustrate our model and the
corresponding estimation procedures we develop on both synthetic and actual
data. Comment: 30 pages.
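To get a feel for what Alpha-stable inter-day innovations look like, symmetric alpha-stable variates can be simulated with the standard Chambers-Mallows-Stuck transform. The sketch below is generic (unit scale, symmetric case beta = 0), not the paper's SMiN representation or its skewed extension.

```python
import numpy as np

def sym_alpha_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for symmetric alpha-stable draws
    (beta = 0, unit scale). alpha = 2 recovers N(0, 2); alpha = 1 is Cauchy."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    if alpha == 1.0:
        return np.tan(U)                 # Cauchy special case of the transform
    a = alpha
    return (np.sin(a * U) / np.cos(U) ** (1 / a)
            * (np.cos((1 - a) * U) / W) ** ((1 - a) / a))
```

For alpha below 2 the variance is infinite, which is exactly why Gaussian ECM estimation is distorted by such jumps at inter-day boundaries.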
Bayesian statistics and modelling
Bayesian statistics is an approach to data analysis based on Bayes’ theorem, where available knowledge about parameters in a statistical model is updated with the information in observed data. The background knowledge is expressed as a prior distribution and combined with observational data in the form of a likelihood function to determine the posterior distribution. The posterior can also be used for making predictions about future events. This Primer describes the stages involved in Bayesian analysis, from specifying the prior and data models to deriving inference, model checking and refinement. We discuss the importance of prior and posterior predictive checking, selecting a proper technique for sampling from a posterior distribution, variational inference and variable selection. Examples of successful applications of Bayesian analysis across various research fields are provided, including in social sciences, ecology, genetics, medicine and more. We propose strategies for reproducibility and reporting standards, outlining an updated WAMBS (when to Worry and how to Avoid the Misuse of Bayesian Statistics) checklist. Finally, we outline the impact of Bayesian analysis on artificial intelligence, a major goal in the next decade.
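The prior-to-posterior update described above is easiest to see in a conjugate case. The sketch below uses the Beta-Binomial pair with hypothetical counts (not an example from the Primer): a Beta prior combined with binomial data yields a Beta posterior by simply adding the observed counts to the prior's pseudo-counts.

```python
from fractions import Fraction

def beta_binomial_update(a, b, successes, failures):
    """Conjugate update: a Beta(a, b) prior combined with a binomial
    likelihood gives a Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution, kept exact with Fractions."""
    return Fraction(a, a + b)
```

Starting from the uniform prior Beta(1, 1) and observing 7 successes and 3 failures gives a Beta(8, 4) posterior with mean 2/3; the same posterior then serves as the prior for the next batch of data.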
Bayesian nonparametric models for data exploration
Making sense out of data is one of the biggest challenges of our time. With the emergence of
technologies such as the Internet, sensor networks or deep genome sequencing, a true data explosion
has been unleashed that affects all fields of science and our everyday life. Recent breakthroughs, such
as self-driving cars or champion-level Go-playing programs, have demonstrated the potential benefits
from exploiting data, mostly in well-defined supervised tasks. However, we have barely started to
actually explore and truly understand data.
In fact, data holds valuable information for answering some of the most important questions for humanity:
How does aging impact our physical capabilities? What are the underlying mechanisms of cancer?
Which factors make countries wealthier than others? Most of these questions cannot be stated as
well-defined supervised problems, and might benefit enormously from multidisciplinary research
efforts involving easy-to-interpret models and rigorous exploratory data analyses. Efficient data exploration
might lead to life-changing scientific discoveries, which can later be turned into a more impactful
exploitation phase, to put forward more informed policy recommendations, decision-making
systems, medical protocols or improved models for highly accurate predictions.
This thesis proposes tailored Bayesian nonparametric (BNP) models to solve specific data exploratory
tasks across different scientific areas including sport sciences, cancer research, and economics.
We resort to BNP approaches to facilitate the discovery of unexpected hidden patterns
within data. BNP models place a prior distribution over an infinite-dimensional parameter space,
which makes them particularly useful in probabilistic models where the number of hidden parameters
is unknown a priori. Under this prior distribution, the posterior distribution of the hidden parameters
given the data will assign high probability mass to those configurations that best explain the
observations. Hence, inference over the hidden variables can be performed using standard Bayesian
inference techniques, therefore avoiding expensive model selection steps.
This thesis is application-focused and highly multidisciplinary. More precisely, we propose an
automatic grading system for sportive competitions to compare athletic performance regardless of
age, gender and environmental aspects; we develop BNP models to perform genetic association
and biomarker discovery in cancer research, either using genetic information and Electronic Health
Records or clinical trial data; finally, we present a flexible infinite latent factor model of international
trade data to understand the underlying economic structure of countries and their evolution over time.
Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes
The vast amount of biological knowledge accumulated over the years has
allowed researchers to identify various biochemical interactions and define
different families of pathways. There is an increased interest in identifying
pathways and pathway elements involved in particular biological processes. Drug
discovery efforts, for example, are focused on identifying biomarkers as well
as pathways related to a disease. We propose a Bayesian model that addresses
this question by incorporating information on pathways and gene networks in the
analysis of DNA microarray data. Such information is used to define pathway
summaries, specify prior distributions, and structure the MCMC moves to fit the
model. We illustrate the method with an application to gene expression data
with censored survival outcomes. In addition to identifying markers that would
have been missed otherwise and improving prediction accuracy, the integration
of existing biological knowledge into the analysis provides a better
understanding of underlying molecular processes.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/), http://dx.doi.org/10.1214/11-AOAS463, by the Institute of Mathematical Statistics.
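The paper defines its pathway summaries within the model itself; as a generic illustration of the idea, one common choice of summary is the first principal-component score of a pathway's member genes across samples. The function and data below are hypothetical, not the paper's construction.

```python
import numpy as np

def pathway_summary(expr, gene_idx):
    """A simple 'pathway summary': the first principal-component score of
    the member genes' (column-centered) expression, one value per sample."""
    X = expr[:, gene_idx]
    X = X - X.mean(axis=0)               # center each gene across samples
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, 0] * s[0]                # leading PC scores (sign-ambiguous)
```

A summary like this collapses a pathway's correlated genes into one covariate, which is the kind of dimension reduction that makes pathway-level selection tractable in a regression model.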
Algorithms and architectures for MCMC acceleration in FPGAs
Markov Chain Monte Carlo (MCMC) is a family of stochastic algorithms which are used to draw random samples from arbitrary probability distributions. This task is necessary to solve a variety of problems in Bayesian modelling, e.g. prediction and model comparison, making MCMC a fundamental tool in modern statistics. Nevertheless, due to the increasing complexity of Bayesian models, the explosion in the amount of data they need to handle and the computational intensity of many MCMC algorithms, performing MCMC-based inference is often impractical in real applications. This thesis tackles this computational problem by proposing Field Programmable Gate Array (FPGA) architectures for accelerating MCMC and by designing novel MCMC algorithms and optimization methodologies which are tailored for FPGA implementation. The contributions of this work include: 1) An FPGA architecture for the Population-based MCMC algorithm, along with two modified versions of the algorithm which use custom arithmetic precision in large parts of the implementation without introducing error in the output. Mapping the two modified versions to an FPGA allows for more parallel modules to be instantiated in the same chip area. 2) An FPGA architecture for the Particle MCMC algorithm, along with a novel algorithm which combines Particle MCMC and Population-based MCMC to tackle multi-modal distributions. A proposed FPGA architecture for the new algorithm achieves higher datapath utilization than the Particle MCMC architecture. 3) A generic method to optimize the arithmetic precision of any MCMC algorithm that is implemented on FPGAs. The method selects the minimum precision among a given set of precisions, while guaranteeing a user-defined bound on the output error. 
By applying the above techniques to large-scale Bayesian problems, it is shown that significant speedups (one or two orders of magnitude) are possible compared to state-of-the-art MCMC algorithms implemented on CPUs and GPUs, opening the way for handling complex statistical analyses in the era of ubiquitous, ever-increasing data.
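The propose-evaluate-accept arithmetic pattern that such FPGA datapaths parallelize is the core of any Metropolis-type kernel. The sketch below is a minimal CPU-side random-walk Metropolis sampler, not one of the thesis's architectures or algorithms.

```python
import numpy as np

def random_walk_metropolis(log_target, x0, n_steps, step, rng):
    """Minimal random-walk Metropolis kernel: propose a Gaussian step,
    evaluate the log target, accept or reject. This loop body is the
    datapath that hardware implementations replicate and pipeline."""
    x, lp = x0, log_target(x0)
    out = np.empty(n_steps)
    for i in range(n_steps):
        prop = x + step * rng.normal()
        lp_prop = log_target(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis acceptance
            x, lp = prop, lp_prop
        out[i] = x
    return out
```

Because each iteration is dominated by the log-target evaluation, reducing the arithmetic precision of that evaluation (as the thesis's optimization method does, with a bound on output error) directly shrinks the critical path.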
Probabilistic analysis of the human transcriptome with side information
Understanding functional organization of genetic information is a major
challenge in modern biology. Following the initial publication of the human
genome sequence in 2001, advances in high-throughput measurement technologies
and efficient sharing of research material through community databases have
opened up new views to the study of living organisms and the structure of life.
In this thesis, novel computational strategies have been developed to
investigate a key functional layer of genetic information, the human
transcriptome, which regulates the function of living cells through protein
synthesis. The key contributions of the thesis are general exploratory tools
for high-throughput data analysis that have provided new insights to
cell-biological networks, cancer mechanisms and other aspects of genome
function.
A central challenge in functional genomics is that high-dimensional genomic
observations are associated with high levels of complex and largely unknown
sources of variation. By combining statistical evidence across multiple
measurement sources and the wealth of background information in genomic data
repositories, it has been possible to resolve some of the uncertainties associated
with individual observations and to identify functional mechanisms that could
not be detected based on individual measurement sources. Statistical learning
and probabilistic models provide a natural framework for such modeling tasks.
Open source implementations of the key methodological contributions have been
released to facilitate further adoption of the developed methods by the
research community.
Comment: Doctoral thesis. 103 pages, 11 figures.