Computational aspects of DNA mixture analysis
Statistical analysis of DNA mixtures is known to pose computational
challenges due to the enormous state space of possible DNA profiles. We propose
a Bayesian network representation for genotypes, allowing computations to be
performed locally involving only a few alleles at each step. In addition, we
describe a general method for computing the expectation of a product of
discrete random variables using auxiliary variables and probability propagation
in a Bayesian network, which in combination with the genotype network allows
efficient computation of the likelihood function and various other quantities
relevant to the inference. Lastly, we introduce a set of diagnostic tools for
assessing the adequacy of the model for describing a particular dataset.
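The local-computation idea can be illustrated outside the genotype setting. The sketch below is not the paper's auxiliary-variable construction, just a minimal chain-structured analogue: it computes the expectation of a product of discrete variables by a forward pass that involves only one variable at each step, and checks the result against brute-force enumeration over the full joint state space.

```python
import itertools
import numpy as np

def expected_product_chain(p1, T, n):
    """E[X_1 * ... * X_n] for a Markov chain on states {0,...,K-1},
    computed by a forward recursion that touches one variable at a time."""
    states = np.arange(len(p1), dtype=float)
    m = p1 * states                      # m_1(x) = P(X_1 = x) * x
    for _ in range(n - 1):
        m = (m @ T) * states             # fold in one more variable locally
    return m.sum()

def expected_product_bruteforce(p1, T, n):
    """Reference implementation: explicit sum over the exponential state space."""
    K = len(p1)
    total = 0.0
    for xs in itertools.product(range(K), repeat=n):
        p = p1[xs[0]]
        for a, b in zip(xs, xs[1:]):
            p *= T[a, b]
        total += p * np.prod(xs)
    return total
```

The forward pass costs O(n K^2) instead of O(K^n), which is the kind of saving local propagation buys in the genotype network.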
Statistical methods to study transposon sequencing data: nonparametric Bayesian models with sampling algorithms
With the development of Next Generation Sequencing (NGS) technology, researchers can easily obtain data from millions of cells (bulk samples) or from a single cell. However, while bulk samples can capture broad changes, they risk providing an average measurement that is not representative of the genetic state of any individual cell. Single-cell experiments, by contrast, capture the genetic state of an individual cell, but they carry greater uncertainty, and sampling enough cells to obtain a representative sample of the population is expensive. Therefore, there is a need to integrate information from both bulk and single-cell data to obtain a comprehensive understanding of subclonal populations within an individual tumor as well as across individuals. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement, and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy and use this representation to derive a fast Gibbs sampling inference algorithm via the augment-and-marginalize method. Experiments show that our model outperforms state-of-the-art methods.
Another goal in analyzing genomic data is to understand which genes are essential and under what environmental conditions they are essential. Transposon sequencing provides a powerful tool for finding conditionally essential genes. However, methods are needed that go beyond a one-at-a-time analysis of conditionally essential genes and learn higher-order representations that identify conditionally essential networks of genes. While existing methods do identify essential genes from transposon sequencing data, they do not provide a representation of the space of essential genes. For example, if two genes share the same pattern of essentiality across all conditions, there is a higher-level representation that couples those genes into a network. The goal of this work is to build such higher-level representations of the set of essential genes and identify genes that share essentiality patterns across conditions. To address this need, we develop a novel, computationally efficient hierarchical nonparametric Bayesian model: the hierarchical Gamma-Poisson process (hGP).
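The Gamma-Poisson route to Dirichlet-type priors can already be seen in the finite-dimensional case: independent Gamma draws, once normalized, are Dirichlet distributed. The sketch below illustrates only this building block, not the thesis's hGP model or its Gibbs sampler.

```python
import numpy as np

def dirichlet_via_gamma(alpha, size, rng):
    """Finite-dimensional analogue of the Gamma construction:
    independent Gamma(alpha_k, 1) draws, normalized, are Dirichlet(alpha)."""
    g = rng.gamma(shape=alpha, size=(size, len(alpha)))
    return g / g.sum(axis=1, keepdims=True)
```

The hierarchical version stacks such constructions, and the Poisson half of the hierarchy is what makes the augment-and-marginalize Gibbs updates conjugate.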
Bayesian Cointegrated Vector Autoregression models incorporating Alpha-stable noise for inter-day price movements via Approximate Bayesian Computation
We consider a statistical model for pairs of traded assets, based on a
Cointegrated Vector Auto Regression (CVAR) Model. We extend standard CVAR
models to incorporate estimation of model parameters in the presence of price
series level shifts which are not accurately modeled in the standard Gaussian
error correction model (ECM) framework. This involves developing a novel matrix
variate Bayesian CVAR mixture model comprised of Gaussian errors intra-day and
Alpha-stable errors inter-day in the ECM framework. To achieve this we derive a
novel conjugate posterior model for the Scaled Mixtures of Normals (SMiN CVAR)
representation of Alpha-stable inter-day innovations. These results are
generalized to asymmetric models for the innovation noise at inter-day
boundaries allowing for skewed Alpha-stable models.
Our proposed model and sampling methodology is general, incorporating the
current literature on Gaussian models as a special subclass and also allowing
for price series level shifts either at random estimated time points or known a
priori time points. We focus analysis on regularly observed non-Gaussian level
shifts that can have significant effect on estimation performance in
statistical models failing to account for such level shifts, such as at the
close and open of markets. We compare the estimation accuracy of our model and
estimation approach to standard frequentist and Bayesian procedures for CVAR
models when non-Gaussian price series level shifts are present in the
individual series, such as inter-day boundaries. We fit a bi-variate
Alpha-stable model to the inter-day jumps and model the effect of such jumps on
estimation of matrix-variate CVAR model parameters using the likelihood based
Johansen procedure and a Bayesian estimation. We illustrate our model and the
corresponding estimation procedures we develop on both synthetic and actual
data. Comment: 30 pages.
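To get a feel for what Alpha-stable inter-day innovations look like, symmetric alpha-stable variates can be simulated with the standard Chambers-Mallows-Stuck transform. The sketch below is generic (unit scale, symmetric case beta = 0), not the paper's SMiN representation or its skewed extension.

```python
import numpy as np

def sym_alpha_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for symmetric alpha-stable draws
    (beta = 0, unit scale). alpha = 2 recovers N(0, 2); alpha = 1 is Cauchy."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    if alpha == 1.0:
        return np.tan(U)                 # Cauchy special case of the transform
    a = alpha
    return (np.sin(a * U) / np.cos(U) ** (1 / a)
            * (np.cos((1 - a) * U) / W) ** ((1 - a) / a))
```

For alpha below 2 the variance is infinite, which is exactly why Gaussian ECM estimation is distorted by such jumps at inter-day boundaries.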
Bayesian statistics and modelling
Bayesian statistics is an approach to data analysis based on Bayes’ theorem, where available knowledge about parameters in a statistical model is updated with the information in observed data. The background knowledge is expressed as a prior distribution and combined with observational data in the form of a likelihood function to determine the posterior distribution. The posterior can also be used for making predictions about future events. This Primer describes the stages involved in Bayesian analysis, from specifying the prior and data models to deriving inference, model checking and refinement. We discuss the importance of prior and posterior predictive checking, selecting a proper technique for sampling from a posterior distribution, variational inference and variable selection. Examples of successful applications of Bayesian analysis across various research fields are provided, including in social sciences, ecology, genetics, medicine and more. We propose strategies for reproducibility and reporting standards, outlining an updated WAMBS (when to Worry and how to Avoid the Misuse of Bayesian Statistics) checklist. Finally, we outline the impact of Bayesian analysis on artificial intelligence, a major goal in the next decade.
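The prior-to-posterior update described above is easiest to see in a conjugate case. The sketch below uses the Beta-Binomial pair with hypothetical counts (not an example from the Primer): a Beta prior combined with binomial data yields a Beta posterior by simply adding the observed counts to the prior's pseudo-counts.

```python
from fractions import Fraction

def beta_binomial_update(a, b, successes, failures):
    """Conjugate update: a Beta(a, b) prior combined with a binomial
    likelihood gives a Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution, kept exact with Fractions."""
    return Fraction(a, a + b)
```

Starting from the uniform prior Beta(1, 1) and observing 7 successes and 3 failures gives a Beta(8, 4) posterior with mean 2/3; the same posterior then serves as the prior for the next batch of data.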
Bayesian nonparametric models for data exploration
Making sense out of data is one of the biggest challenges of our time. With the emergence of
technologies such as the Internet, sensor networks or deep genome sequencing, a true data explosion
has been unleashed that affects all fields of science and our everyday life. Recent breakthroughs, such
as self-driving cars or champion-level Go-playing programs, have demonstrated the potential benefits
from exploiting data, mostly in well-defined supervised tasks. However, we have barely started to
actually explore and truly understand data.
In fact, data holds valuable information for answering some of the most important questions for humanity:
How does aging impact our physical capabilities? What are the underlying mechanisms of cancer?
Which factors make countries wealthier than others? Most of these questions cannot be stated as
well-defined supervised problems, and might benefit enormously from multidisciplinary research
efforts involving easy-to-interpret models and rigorous exploratory data analyses. Efficient data exploration
might lead to life-changing scientific discoveries, which can later be turned into a more impactful
exploitation phase, to put forward more informed policy recommendations, decision-making
systems, medical protocols or improved models for highly accurate predictions.
This thesis proposes tailored Bayesian nonparametric (BNP) models to solve specific data exploratory
tasks across different scientific areas including sport sciences, cancer research, and economics.
We resort to BNP approaches to facilitate the discovery of unexpected hidden patterns
within data. BNP models place a prior distribution over an infinite-dimensional parameter space,
which makes them particularly useful in probabilistic models where the number of hidden parameters
is unknown a priori. Under this prior distribution, the posterior distribution of the hidden parameters
given the data will assign high probability mass to those configurations that best explain the
observations. Hence, inference over the hidden variables can be performed using standard Bayesian
inference techniques, therefore avoiding expensive model selection steps.
This thesis is application-focused and highly multidisciplinary. More precisely, we propose an
automatic grading system for sportive competitions to compare athletic performance regardless of
age, gender and environmental aspects; we develop BNP models to perform genetic association
and biomarker discovery in cancer research, either using genetic information and Electronic Health
Records or clinical trial data; finally, we present a flexible infinite latent factor model of international
trade data to understand the underlying economic structure of countries and their evolution over time.
Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes
The vast amount of biological knowledge accumulated over the years has
allowed researchers to identify various biochemical interactions and define
different families of pathways. There is an increased interest in identifying
pathways and pathway elements involved in particular biological processes. Drug
discovery efforts, for example, are focused on identifying biomarkers as well
as pathways related to a disease. We propose a Bayesian model that addresses
this question by incorporating information on pathways and gene networks in the
analysis of DNA microarray data. Such information is used to define pathway
summaries, specify prior distributions, and structure the MCMC moves to fit the
model. We illustrate the method with an application to gene expression data
with censored survival outcomes. In addition to identifying markers that would
have been missed otherwise and improving prediction accuracy, the integration
of existing biological knowledge into the analysis provides a better
understanding of underlying molecular processes.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/), http://dx.doi.org/10.1214/11-AOAS463, by the Institute of Mathematical Statistics.
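The paper defines its pathway summaries within the model itself; as a generic illustration of the idea, one common choice of summary is the first principal-component score of a pathway's member genes across samples. The function and data below are hypothetical, not the paper's construction.

```python
import numpy as np

def pathway_summary(expr, gene_idx):
    """A simple 'pathway summary': the first principal-component score of
    the member genes' (column-centered) expression, one value per sample."""
    X = expr[:, gene_idx]
    X = X - X.mean(axis=0)               # center each gene across samples
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, 0] * s[0]                # leading PC scores (sign-ambiguous)
```

A summary like this collapses a pathway's correlated genes into one covariate, which is the kind of dimension reduction that makes pathway-level selection tractable in a regression model.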
Algorithms and architectures for MCMC acceleration in FPGAs
Markov Chain Monte Carlo (MCMC) is a family of stochastic algorithms which are used to draw random samples from arbitrary probability distributions. This task is necessary to solve a variety of problems in Bayesian modelling, e.g. prediction and model comparison, making MCMC a fundamental tool in modern statistics. Nevertheless, due to the increasing complexity of Bayesian models, the explosion in the amount of data they need to handle and the computational intensity of many MCMC algorithms, performing MCMC-based inference is often impractical in real applications. This thesis tackles this computational problem by proposing Field Programmable Gate Array (FPGA) architectures for accelerating MCMC and by designing novel MCMC algorithms and optimization methodologies which are tailored for FPGA implementation. The contributions of this work include: 1) An FPGA architecture for the Population-based MCMC algorithm, along with two modified versions of the algorithm which use custom arithmetic precision in large parts of the implementation without introducing error in the output. Mapping the two modified versions to an FPGA allows for more parallel modules to be instantiated in the same chip area. 2) An FPGA architecture for the Particle MCMC algorithm, along with a novel algorithm which combines Particle MCMC and Population-based MCMC to tackle multi-modal distributions. A proposed FPGA architecture for the new algorithm achieves higher datapath utilization than the Particle MCMC architecture. 3) A generic method to optimize the arithmetic precision of any MCMC algorithm that is implemented on FPGAs. The method selects the minimum precision among a given set of precisions, while guaranteeing a user-defined bound on the output error. 
By applying the above techniques to large-scale Bayesian problems, it is shown that significant speedups (one or two orders of magnitude) are possible compared to state-of-the-art MCMC algorithms implemented on CPUs and GPUs, opening the way for handling complex statistical analyses in the era of ubiquitous, ever-increasing data.
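The propose-evaluate-accept arithmetic pattern that such FPGA datapaths parallelize is the core of any Metropolis-type kernel. The sketch below is a minimal CPU-side random-walk Metropolis sampler, not one of the thesis's architectures or algorithms.

```python
import numpy as np

def random_walk_metropolis(log_target, x0, n_steps, step, rng):
    """Minimal random-walk Metropolis kernel: propose a Gaussian step,
    evaluate the log target, accept or reject. This loop body is the
    datapath that hardware implementations replicate and pipeline."""
    x, lp = x0, log_target(x0)
    out = np.empty(n_steps)
    for i in range(n_steps):
        prop = x + step * rng.normal()
        lp_prop = log_target(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis acceptance
            x, lp = prop, lp_prop
        out[i] = x
    return out
```

Because each iteration is dominated by the log-target evaluation, reducing the arithmetic precision of that evaluation (as the thesis's optimization method does, with a bound on output error) directly shrinks the critical path.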
Probabilistic analysis of the human transcriptome with side information
Understanding functional organization of genetic information is a major
challenge in modern biology. Following the initial publication of the human
genome sequence in 2001, advances in high-throughput measurement technologies
and efficient sharing of research material through community databases have
opened up new views to the study of living organisms and the structure of life.
In this thesis, novel computational strategies have been developed to
investigate a key functional layer of genetic information, the human
transcriptome, which regulates the function of living cells through protein
synthesis. The key contributions of the thesis are general exploratory tools
for high-throughput data analysis that have provided new insights to
cell-biological networks, cancer mechanisms and other aspects of genome
function.
A central challenge in functional genomics is that high-dimensional genomic
observations are associated with high levels of complex and largely unknown
sources of variation. By combining statistical evidence across multiple
measurement sources and the wealth of background information in genomic data
repositories, it has been possible to resolve some of the uncertainties associated
with individual observations and to identify functional mechanisms that could
not be detected based on individual measurement sources. Statistical learning
and probabilistic models provide a natural framework for such modeling tasks.
Open source implementations of the key methodological contributions have been
released to facilitate further adoption of the developed methods by the
research community.
Comment: Doctoral thesis. 103 pages, 11 figures.