
    Measuring the Influence of Observations in HMMs through the Kullback-Leibler Distance

    We measure the influence of individual observations on the sequence of the hidden states of the Hidden Markov Model (HMM) by means of the Kullback-Leibler distance (KLD). Namely, we consider the KLD between the conditional distribution of the hidden states' chain given the complete sequence of observations and the conditional distribution of the hidden chain given all the observations but the one under consideration. We introduce a linear complexity algorithm for computing the influence of all the observations. As an illustration, we investigate the application of our algorithm to the problem of detecting outliers in HMM data series.
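The influence measure described in this abstract can be illustrated on a toy model. The sketch below is not the linear-complexity algorithm of the paper: it is a brute-force enumeration over all hidden paths, usable only for very short sequences. The function names, the discrete-emission HMM parameterisation (`pi`, `A`, `B`) and the direction of the divergence (full posterior relative to leave-one-out posterior) are assumptions made for illustration.

```python
import itertools
import math


def posterior(pi, A, B, obs, skip=None):
    """Posterior over hidden paths given obs, optionally ignoring obs[skip].

    pi[s] is the initial probability, A[s][t] the transition probability,
    B[s][x] the emission probability (hypothetical parameter names).
    """
    n, K = len(obs), len(pi)
    weights = {}
    for path in itertools.product(range(K), repeat=n):
        w = pi[path[0]]
        for t in range(n):
            if t > 0:
                w *= A[path[t - 1]][path[t]]
            if t != skip:  # leaving an observation out drops its emission term
                w *= B[path[t]][obs[t]]
        weights[path] = w
    total = sum(weights.values())
    return {path: w / total for path, w in weights.items()}


def influence(pi, A, B, obs, i):
    """KL divergence between the full posterior and the one without obs[i]."""
    p = posterior(pi, A, B, obs)
    q = posterior(pi, A, B, obs, skip=i)
    return sum(pv * math.log(pv / q[path]) for path, pv in p.items() if pv > 0)


pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]  # symbol 2 is equally likely in both states
obs = [0, 0, 2, 1]
scores = [influence(pi, A, B, obs, i) for i in range(len(obs))]
```

In this toy setting, an observation whose emission probability is the same in every state carries no information about the hidden path, so its influence is zero, while informative observations receive a strictly positive score.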

    An adaptive Ridge procedure for L0 regularization

    Penalized selection criteria like AIC or BIC are among the most popular methods for variable selection. Their theoretical properties have been studied intensively and are well understood, but making use of them in the case of high-dimensional data is difficult due to the non-convex optimization problem induced by L0 penalties. An elegant solution to this problem is provided by the multi-step adaptive lasso, where iteratively weighted lasso problems are solved, whose weights are updated in such a way that the procedure converges towards selection with L0 penalties. In this paper we introduce an adaptive ridge procedure (AR) which mimics the adaptive lasso, but is based on weighted Ridge problems. After introducing AR, its theoretical properties are studied in the particular case of orthogonal linear regression. For the non-orthogonal case, extensive simulations are performed to assess the performance of AR. In the case of Poisson regression and logistic regression, it is illustrated how the iterative procedure of AR can be combined with iterative maximization procedures. The paper ends with an efficient implementation of AR in the context of least-squares segmentation.
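The iteratively reweighted ridge idea in this abstract can be sketched for the orthogonal case it mentions. The following is a minimal illustration, not the authors' implementation: it assumes an orthonormal design (so each weighted ridge step has the closed form `z_j / (1 + lam * w_j)` with `z = X^T y`), and it assumes the weight update `w_j = 1 / (beta_j^2 + delta^2)`, under which the penalty `lam * w_j * beta_j^2` approaches `lam` for nonzero coefficients and 0 otherwise. The function name and the parameters `lam`, `delta`, `n_iter`, `tol` are hypothetical.

```python
def adaptive_ridge_orthonormal(z, lam, delta=1e-5, n_iter=200, tol=1e-10):
    """Iteratively reweighted ridge for an orthonormal design.

    z is X^T y. Each pass solves the weighted ridge problem coordinate-wise,
    then refreshes the weights from the new coefficients.
    """
    w = [1.0] * len(z)      # start from an ordinary ridge problem
    beta = [0.0] * len(z)
    for _ in range(n_iter):
        new_beta = [zj / (1.0 + lam * wj) for zj, wj in zip(z, w)]
        # Assumed weight update: large weights push small coefficients to zero.
        w = [1.0 / (b * b + delta * delta) for b in new_beta]
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


beta = adaptive_ridge_orthonormal([3.0, 0.1, -4.0, 0.05], lam=1.0)
```

With these inputs the two large entries are only mildly shrunk, while the two small entries collapse to values that are numerically indistinguishable from zero, which is the selection behaviour an L0-type penalty is meant to produce.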

    Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source

    We present two novel approaches for the computation of the exact distribution of a pattern in a long sequence. Both approaches take into account the sparse structure of the problem and are two-part algorithms. The first approach relies on a partial recursion after a fast computation of the second largest eigenvalue of the transition matrix of a Markov chain embedding. The second approach uses fast Taylor expansions of an exact bivariate rational reconstruction of the distribution. We illustrate the interest of both approaches on a simple toy example and two biological applications: the transcription factors of the Human Chromosome 5 and the PROSITE signatures of functional motifs in proteins. On these examples, our methods demonstrate their complementarity and their ability to extend the domain of feasibility for exact computations in pattern problems to a new level.
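As background for the Markov chain embedding mentioned in this abstract, the sketch below computes the exact distribution of the number of (overlapping) occurrences of a single word by plain dynamic programming over a pattern-matching automaton. It is neither of the two sparse approaches of the paper, and it simplifies the source to independent letters rather than a Markov source; the function name and arguments are hypothetical.

```python
def pattern_count_distribution(pattern, letter_probs, n):
    """Exact distribution of the overlapping occurrence count of `pattern`
    in a sequence of n independent letters drawn from `letter_probs`."""
    m = len(pattern)

    def step(q, a):
        # Longest prefix of the pattern that is a suffix of pattern[:q] + a.
        s = pattern[:q] + a
        for k in range(min(m, len(s)), 0, -1):
            if s.endswith(pattern[:k]):
                return k
        return 0

    delta = [{a: step(q, a) for a in letter_probs} for q in range(m + 1)]

    # Embedded chain: state = (automaton state, occurrences seen so far).
    dist = {(0, 0): 1.0}
    for _ in range(n):
        nxt = {}
        for (q, c), pr in dist.items():
            for a, pa in letter_probs.items():
                q2 = delta[q][a]
                c2 = c + (1 if q2 == m else 0)  # reaching state m = one occurrence
                nxt[(q2, c2)] = nxt.get((q2, c2), 0.0) + pr * pa
        dist = nxt

    counts = {}
    for (_, c), pr in dist.items():
        counts[c] = counts.get(c, 0.0) + pr
    return counts


d = pattern_count_distribution("aa", {"a": 0.5, "b": 0.5}, 3)
```

The cost of this naive recursion grows with both the sequence length and the number of occurrences tracked, which is the kind of limitation that motivates sparse or eigenvalue-based shortcuts for long sequences.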

    Waiting Time Distribution for Pattern Occurrence in a Constrained Sequence: an Embedding Markov Chain Approach

    Analysis of Algorithm

    Alternative Methods for H1 Simulations in Genome Wide Association Studies

    Assessing the statistical power to detect susceptibility variants plays a critical role in GWA studies both from the prospective and retrospective points of view. Power is empirically estimated by simulating phenotypes under a disease model H1. For this purpose, the "gold" standard consists of simulating genotypes given the phenotypes (e.g. Hapgen). We introduce here an alternative approach for simulating phenotypes under H1 that does not require generating new genotypes for each simulation. In order to simulate phenotypes with a fixed total number of cases and under a given disease model, we suggest three algorithms: i) a simple rejection algorithm; ii) a numerical Markov Chain Monte-Carlo (MCMC) approach; and iii) an exact and efficient backward sampling algorithm. In our study, we validated the three algorithms both on a toy dataset and by comparing them with Hapgen on a more realistic dataset. As an application, we then conducted a simulation study on a 1000 Genomes Project dataset consisting of 629 individuals (314 cases) and 8,048 SNPs from Chromosome X. We arbitrarily defined an additive disease model with two susceptibility SNPs and an epistatic effect. The three algorithms are consistent, but backward sampling is dramatically faster than the other two. Our approach also gives results consistent with Hapgen. Using our application data, we showed that our limited design requires prior biological knowledge to limit the investigated region. We also proved that epistatic effects can play a significant role even when simple marker statistics (e.g. trend) are used. We finally showed that the overall performance of a GWA study strongly depends on the prevalence of the disease: the larger the prevalence, the better the power.
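One plausible reading of the exact sampling step in this abstract is drawing independent case/control indicators conditioned on a fixed total number of cases. The sketch below is written under that assumption and is not taken from the paper: it assumes each individual `i` has a disease probability `p[i]` given by the disease model, computes forward sums `F[i][k] = P(Y_0 + ... + Y_{i-1} = k)`, then samples the indicators from the last individual back to the first. The function name and arguments are hypothetical.

```python
import random


def sample_fixed_cases(p, m, rng=random):
    """Draw y in {0,1}^n with independent P(y_i = 1) = p[i],
    conditioned on sum(y) == m (exact, O(n * m) time)."""
    n = len(p)
    # Forward pass: F[i][k] = probability that the first i individuals hold k cases.
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 1.0
    for i in range(n):
        for k in range(m + 1):
            F[i + 1][k] = (1.0 - p[i]) * F[i][k]
            if k > 0:
                F[i + 1][k] += p[i] * F[i][k - 1]
    if F[n][m] == 0.0:
        raise ValueError("the requested number of cases has probability zero")

    # Backward pass: decide each individual given the cases still to place.
    y = [0] * n
    remaining = m
    for i in range(n - 1, -1, -1):
        if remaining > 0:
            prob_case = p[i] * F[i][remaining - 1] / F[i + 1][remaining]
            if rng.random() < prob_case:
                y[i] = 1
                remaining -= 1
    return y


random.seed(0)
draw = sample_fixed_cases([0.1, 0.5, 0.9, 0.3, 0.7], 2)
```

Unlike a rejection scheme, which would redraw all indicators until the total happens to match, every draw here satisfies the constraint by construction, which is consistent with the speed advantage the abstract reports for backward sampling.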