256 research outputs found

    A reversible infinite HMM using normalised random measures

    We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one ion channel recording. Comment: 9 pages, 6 figures
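The link between symmetric edge weights and reversibility can be illustrated with a small finite sketch. This is not the paper's construction: it assumes a truncation to a handful of states with independent gamma-distributed weights (the function name `reversible_chain` and its parameters are hypothetical), whereas the abstract describes a countably infinite graph built from gamma processes. The sketch only shows that a random walk on an undirected weighted graph satisfies detailed balance with a stationary distribution proportional to the row sums of the weights.

```python
import random


def reversible_chain(n_states, shape=1.0, seed=0):
    """Random walk on an undirected graph with gamma-distributed edge weights.

    Assumption: finite truncation with i.i.d. weights, for illustration only.
    """
    rng = random.Random(seed)
    w = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        for j in range(i, n_states):
            # Symmetric weights make the edges undirected.
            w[i][j] = w[j][i] = rng.gammavariate(shape, 1.0)
    row_sums = [sum(row) for row in w]
    total = sum(row_sums)
    # Transition probabilities: move along an edge in proportion to its weight.
    P = [[w[i][j] / row_sums[i] for j in range(n_states)] for i in range(n_states)]
    # Stationary distribution is proportional to the total weight at each node.
    pi = [r / total for r in row_sums]
    return P, pi


P, pi = reversible_chain(4)
# Detailed balance: pi_i * P_ij should equal pi_j * P_ji for every pair (i, j).
max_violation = max(
    abs(pi[i] * P[i][j] - pi[j] * P[j][i]) for i in range(4) for j in range(4)
)
print(max_violation)
```

Detailed balance holds here because `pi[i] * P[i][j]` reduces to `w[i][j] / total`, which is symmetric in `i` and `j`.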

    Some contributions to model selection and statistical inference in Markovian models

    The general theme of this thesis is to provide and study a new understanding of some statistical models and computational methods based on Markov processes and chains. Sections 1-4 review the literature, both for completeness and to support Sections 5-7, which present our original work. Section 1 covers Markov processes, since both continuous-time and discrete-time Markov processes are central to the thesis. In particular, we study basic and advanced results on Markov chains and Ito diffusions. Ergodic properties of these processes are also documented. In Section 2 we first study the Metropolis-Hastings algorithm, since it is the basis of other MCMC methods. We then study more advanced methods such as Reversible Jump MCMC, the Metropolis-adjusted Langevin algorithm, pseudo-marginal MCMC and Hamiltonian Monte Carlo. These MCMC methods appear in Sections 3, 4 and 7. In Section 3 we consider another type of Monte Carlo method called sequential Monte Carlo (SMC). Unlike MCMC methods, SMC methods often give us on-line ways to approximate intractable objects. Therefore, these methods are particularly useful when one needs to experiment with models at a scalable computational cost. Some mathematical analysis of SMC is also included. These SMC methods appear in Sections 4, 5, 6 and 7. In Section 4 we first discuss hidden Markov models (HMMs), since all statistical models that we consider in the thesis can be treated as HMMs or their generalisations. Since, in general, HMMs involve intractable objects, we then study approximations for them based on SMC methods. Statistical inference for HMMs is also considered. These topics appear in Sections 5, 6 and 7. Section 5 is largely based on a submitted paper titled Asymptotic Analysis of Model Selection Criteria for General Hidden Markov Models with Alexandros Beskos and Sumeetpal Sidhu Singh, https://arxiv.org/abs/1811.11834v3.
    In this section, we study the asymptotic behaviour of some information criteria in the context of hidden Markov models, or state space models. In particular, we prove the strong consistency of BIC and evidence for general HMMs. Section 6 is largely based on a submitted paper titled Online Smoothing for Diffusion Processes Observed with Noise with Alexandros Beskos, https://arxiv.org/abs/2003.12247. In this section, we develop sequential Monte Carlo methods to estimate parameters of (jump) diffusion models. Section 7 is largely based on an ongoing paper titled Adaptive Bayesian Model Selection for Diffusion Models with Alexandros Beskos. In this section, we develop adaptive computational methods, based on sequential Monte Carlo samplers and Hamiltonian Monte Carlo on a functional space, for Bayesian model selection.
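The Metropolis-Hastings algorithm that the abstract names as the basis of other MCMC methods can be sketched in a few lines. This is a generic random-walk variant and is not taken from the thesis. The standard normal target, the step size and the names `metropolis_hastings` and `log_std_normal` are assumptions chosen for illustration.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        logp_proposal = log_target(proposal)
        # The proposal is symmetric, so the acceptance probability is
        # min(1, target(proposal) / target(x)).
        accept_prob = math.exp(min(0.0, logp_proposal - logp))
        if rng.random() < accept_prob:
            x, logp = proposal, logp_proposal
        samples.append(x)
    return samples


def log_std_normal(x):
    # Log density of N(0, 1) up to an additive constant (assumed toy target).
    return -0.5 * x * x


samples = metropolis_hastings(log_std_normal, x0=0.0, n_steps=20000)
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```

Only the ratio of target densities enters the acceptance step, so the normalising constant of the target is never needed.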

    Improved Bayesian methods for detecting recombination and rate heterogeneity in DNA sequence alignments

    DNA sequence alignments are usually not homogeneous. Mosaic structures may result as a consequence of recombination or rate heterogeneity. Interspecific recombination, in which DNA subsequences are transferred between different (typically viral or bacterial) strains, may result in a change of the topology of the underlying phylogenetic tree. Rate heterogeneity corresponds to a change of the nucleotide substitution rate. Various methods for simultaneously detecting recombination and rate heterogeneity in DNA sequence alignments have recently been proposed, based on complex probabilistic models that combine phylogenetic trees with factorial hidden Markov models or multiple changepoint processes. The objective of my thesis is to identify potential shortcomings of these models and explore ways to improve them. One shortcoming that I have identified is related to an approximation made in various recently proposed Bayesian models. The Bayesian paradigm requires the solution of an integral over the space of parameters. To render this integration analytically tractable, these models assume that the vectors of branch lengths of the phylogenetic tree are independent among sites. While this approximation reduces the computational complexity considerably, I show that it leads to the systematic prediction of spurious topology changes in the Felsenstein zone, that is, the area in the branch lengths configuration space where maximum parsimony consistently infers the wrong topology due to long-branch attraction. I demonstrate these failures by using two Bayesian hypothesis tests, based on an inter- and an intra-model approach to estimating the marginal likelihood. I then propose a revised model that addresses these shortcomings, and demonstrate its improved performance on a set of synthetic DNA sequence alignments systematically generated around the Felsenstein zone.
    The core model explored in my thesis is a phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to recombination and rate heterogeneity. The focus of my work is on improving the modelling of the latter aspect. Earlier research efforts by other authors have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. Their work fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-term periodic patterns within the codons, which merely capture the signature of the genetic code. I have improved these earlier phylogenetic FHMMs in two respects. Firstly, by sampling the rate vector from the posterior distribution with RJMCMC I have made the modelling of regional rate heterogeneity more flexible, and I infer the number of different degrees of divergence directly from the DNA sequence alignment, thereby dispensing with the need to arbitrarily select this quantity in advance. Secondly, I explicitly model within-codon rate heterogeneity via a separate rate modification vector. In this way, the within-codon effect of rate heterogeneity is imposed on the model a priori, which facilitates the learning of the biologically more interesting effect of regional rate heterogeneity a posteriori. I have carried out simulations on synthetic DNA sequence alignments, which have borne out my conjecture. The existing model, which does not explicitly include the within-codon rate variation, has to model both effects with the same modelling mechanism. As expected, it was found to fail to disentangle these two effects. By contrast, I have found that my new model clearly separates within-codon rate variation from regional rate heterogeneity, resulting in more accurate predictions.

    Bayesian regularization of the length of memory in reversible sequences

    Variable order Markov chains have been used to model discrete sequential data in a variety of fields. A host of methods exist to estimate the history-dependent lengths of memory which characterize these models and to predict new sequences. In several applications, the data-generating mechanism is known to be reversible, but combining this information with the procedures mentioned is far from trivial. We introduce a Bayesian analysis for reversible dynamics, which takes into account uncertainty in the lengths of memory. The model proposed is applied to the analysis of molecular dynamics simulations and compared with several popular algorithms. SF is supported by the European Research Council through grant StG N-BNP 306406, LT has been supported by the Claudia Adams Barr Program in Innovative Cancer Research and SB received funding from the Stein Fellowship. This is the author accepted manuscript. The final version is available from Wiley via http://dx.doi.org/10.1111/rssb.1214

    On improving the forecast accuracy of the hidden Markov model

    The forecast accuracy of a hidden Markov model (HMM) may be low due, first, to the measure of forecast accuracy being ignored in the parameter estimation method and, second, to overfitting caused by the large number of parameters that must be estimated. A general approach to forecasting is described which aims to resolve these two problems and so improve the forecast accuracy of the HMM. First, the application of extremum estimators to the HMM is proposed. Extremum estimators aim to improve the forecast accuracy of the HMM by minimising an estimate of the forecast error on the observed data. The forecast accuracy is measured by a score function and the use of some general classes of score functions is proposed. This approach contrasts with the standard use of a minus log-likelihood score function. Second, penalised estimation for the HMM is described. The aim of penalised estimation is to reduce overfitting and so increase the forecast accuracy of the HMM. Penalties on both the state-dependent distribution parameters and transition probability matrix are proposed. In addition, a number of cross-validation approaches for tuning the penalty function are investigated. Empirical assessment of the proposed approach on both simulated and real data demonstrated that, in terms of forecast accuracy, penalised HMMs fitted using extremum estimators generally outperformed unpenalised HMMs fitted using maximum likelihood.
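The idea of scoring an HMM by its forecasts can be made concrete with a minimal sketch of one-step-ahead forecasting for a discrete-emission HMM, scored with the minus log-likelihood score that the abstract calls the standard choice. The two-state model, its parameter values and the function names `one_step_forecasts` and `mean_log_score` are assumptions for illustration. The paper's extremum estimators and penalties are not reproduced here.

```python
import math


def one_step_forecasts(init, trans, emit, obs):
    """Forecast distribution over the next symbol, made before each observation."""
    n_states = len(init)
    n_symbols = len(emit[0])
    pred_state = list(init)  # P(state_t | observations before t)
    forecasts = []
    for y in obs:
        # Forecast: mix the emission distributions by the predicted state weights.
        forecasts.append(
            [sum(pred_state[i] * emit[i][k] for i in range(n_states))
             for k in range(n_symbols)]
        )
        # Filtering update: condition the state distribution on the observed symbol.
        filt = [pred_state[i] * emit[i][y] for i in range(n_states)]
        norm = sum(filt)
        filt = [f / norm for f in filt]
        # Prediction step: push the filtered distribution through the transitions.
        pred_state = [
            sum(filt[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return forecasts


def mean_log_score(forecasts, obs):
    """Average minus log forecast probability of the observed symbols."""
    return -sum(math.log(f[y]) for f, y in zip(forecasts, obs)) / len(obs)


# Hypothetical two-state, two-symbol model and a short observation sequence.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 0, 1, 1, 0]

forecasts = one_step_forecasts(init, trans, emit, obs)
score = mean_log_score(forecasts, obs)
print(forecasts[0], score)
```

Replacing `mean_log_score` with a different score function changes how forecasts are evaluated while leaving the forecasting recursion itself unchanged.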

    Simultaneous characterization of sense and antisense genomic processes by the double-stranded hidden Markov model

    Hidden Markov models (HMMs) have been extensively used to dissect the genome into functionally distinct regions using data such as RNA expression or DNA binding measurements. It is a challenge to disentangle processes occurring on complementary strands of the same genomic region. We present the double-stranded HMM (dsHMM), a model for the strand-specific analysis of genomic processes. We applied dsHMM to yeast using strand-specific transcription data, nucleosome data, and protein binding data for a set of 11 factors associated with the regulation of transcription. The resulting annotation recovers the mRNA transcription cycle (initiation, elongation, termination) while correctly predicting strand-specificity and directionality of the transcription process. We find that pre-initiation complex formation is an essentially undirected process, giving rise to a large number of bidirectional promoters and to pervasive antisense transcription. Notably, 12% of all transcriptionally active positions showed simultaneous activity on both strands. Furthermore, dsHMM reveals that antisense transcription is specifically suppressed by Nrd1, a yeast termination factor.