Measuring the Influence of Observations in HMMs through the Kullback-Leibler Distance
We measure the influence of individual observations on the sequence of the
hidden states of the Hidden Markov Model (HMM) by means of the Kullback-Leibler
distance (KLD). Namely, we consider the KLD between the conditional
distribution of the hidden states' chain given the complete sequence of
observations and the conditional distribution of the hidden chain given all the
observations but the one under consideration. We introduce a linear-complexity
algorithm for computing the influence of all the observations. As an
illustration, we investigate the application of our algorithm to the problem of
detecting outliers in HMM data series.
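The influence measure described above can be illustrated on a toy model. The following is a minimal brute-force sketch, not the linear-complexity algorithm of the paper: it enumerates every hidden path, so its cost is exponential in the sequence length. The function names, the discrete-emission model and the direction of the divergence (full posterior relative to the leave-one-out posterior) are assumptions made for illustration.

```python
from itertools import product
from math import log


def path_posterior(init, trans, emit, obs, skip=None):
    """Posterior over all hidden paths given obs, ignoring position `skip` if set."""
    n_states = len(init)
    weights = {}
    for path in product(range(n_states), repeat=len(obs)):
        w = init[path[0]]
        for t in range(1, len(obs)):
            w *= trans[path[t - 1]][path[t]]
        for t, (state, symbol) in enumerate(zip(path, obs)):
            if t != skip:  # leave out the emission term of the skipped observation
                w *= emit[state][symbol]
        weights[path] = w
    total = sum(weights.values())
    return {path: w / total for path, w in weights.items()}


def influence(init, trans, emit, obs):
    """KLD between the full posterior and each leave-one-out posterior."""
    full = path_posterior(init, trans, emit, obs)
    scores = []
    for i in range(len(obs)):
        reduced = path_posterior(init, trans, emit, obs, skip=i)
        kld = sum(p * log(p / reduced[path]) for path, p in full.items() if p > 0)
        scores.append(kld)
    return scores


# Hypothetical two-state HMM with three emission symbols.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.4, 0.5]]
print(influence(init, trans, emit, [0, 1, 2]))
```

In this toy model symbol 1 is emitted with the same probability in both states, so removing it leaves the posterior unchanged and its influence is zero up to rounding, while the two informative observations receive strictly positive scores.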
An adaptive Ridge procedure for L0 regularization
Penalized selection criteria like AIC or BIC are among the most popular
methods for variable selection. Their theoretical properties have been studied
intensively and are well understood, but making use of them in the case of
high-dimensional data is difficult due to the non-convex optimization problem
induced by L0 penalties. An elegant solution to this problem is provided by the
multi-step adaptive lasso, where iteratively weighted lasso problems are
solved, whose weights are updated in such a way that the procedure converges
towards selection with L0 penalties. In this paper we introduce an adaptive
ridge procedure (AR) which mimics the adaptive lasso but is based on weighted
ridge problems. After introducing AR, we study its theoretical properties in
the particular case of orthogonal linear regression. For the non-orthogonal
case, extensive simulations are performed to assess the performance of AR. For
Poisson regression and logistic regression, we illustrate how the
iterative procedure of AR can be combined with iterative maximization
procedures. The paper ends with an efficient implementation of AR in the
context of least-squares segmentation.
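The iteratively reweighted ridge idea can be sketched in the orthonormal case, where each weighted ridge problem has a closed-form, coordinate-wise solution. This is a minimal sketch under stated assumptions rather than the authors' implementation: the design is assumed orthonormal (X^T X = I), the penalty is written as lam * sum_j w_j * b_j^2, and the weight update w_j = 1 / (b_j^2 + delta^2), the value of delta and the function name are assumptions chosen for illustration.

```python
def adaptive_ridge_orthonormal(z, lam, delta=1e-5, n_iter=200):
    """Iteratively reweighted ridge for an orthonormal design.

    z is X^T y. Each step minimises ||y - X b||^2 + lam * sum_j w_j * b_j^2,
    whose solution is b_j = z_j / (1 + lam * w_j) when X^T X = I.
    """
    weights = [1.0] * len(z)
    beta = list(z)
    for _ in range(n_iter):
        # Weighted ridge step (closed form in the orthonormal case).
        beta = [zj / (1.0 + lam * wj) for zj, wj in zip(z, weights)]
        # Assumed reweighting: w_j * b_j^2 then approximates the indicator b_j != 0.
        weights = [1.0 / (bj * bj + delta * delta) for bj in beta]
    return beta


# Hypothetical input: two large and two small components.
print(adaptive_ridge_orthonormal([3.0, 0.1, -4.0, 0.05], lam=1.0))
```

On this input the two small components are driven to essentially zero while the two large ones are kept with some shrinkage, which is the selection-like behaviour that an L0-type penalty is meant to produce.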
Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source
We present two novel approaches for the computation of the exact distribution
of a pattern in a long sequence. Both approaches take into account the sparse
structure of the problem and are two-part algorithms. The first approach relies
on a partial recursion after a fast computation of the second largest
eigenvalue of the transition matrix of a Markov chain embedding. The second
approach uses fast Taylor expansions of an exact bivariate rational
reconstruction of the distribution. We illustrate the interest of both
approaches on a simple toy-example and two biological applications: the
transcription factors of the Human Chromosome 5 and the PROSITE signatures of
functional motifs in proteins. On these examples, our methods demonstrate their
complementarity and their ability to extend the domain of feasibility for
exact computations in pattern problems to a new level.
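For readers unfamiliar with Markov chain embedding, the exact distribution of a pattern count can be computed on small cases by a plain dynamic programme over a pattern-matching automaton. The sketch below is a dense baseline, not either of the sparse approaches of the paper; it assumes a single word as the pattern, overlapping occurrences, and a first-order Markov source, and all function names are hypothetical.

```python
from collections import defaultdict


def build_automaton(pattern, alphabet):
    """delta[q][a]: longest prefix of `pattern` that is a suffix after reading a in state q."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            s = pattern[:q] + a
            k = min(m, len(s))
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


def count_distribution(pattern, n, init, trans):
    """Exact distribution of the number of occurrences in a sequence of length n."""
    alphabet = sorted(init)
    delta = build_automaton(pattern, alphabet)
    m = len(pattern)
    # Embedded state: (last letter, automaton state, occurrences so far).
    dist = defaultdict(float)
    for a in alphabet:
        q = delta[0][a]
        dist[(a, q, int(q == m))] += init[a]
    for _ in range(n - 1):
        nxt = defaultdict(float)
        for (a, q, c), p in dist.items():
            for b in alphabet:
                q2 = delta[q][b]
                nxt[(b, q2, c + int(q2 == m))] += p * trans[a][b]
        dist = nxt
    counts = defaultdict(float)
    for (_, _, c), p in dist.items():
        counts[c] += p
    return dict(counts)


# Hypothetical uniform source on {a, b}.
init = {"a": 0.5, "b": 0.5}
trans = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
print(count_distribution("aa", 3, init, trans))
```

The cost of this baseline grows with the sequence length times the number of embedded states times the number of attainable counts, which is the kind of limitation that motivates faster methods for long sequences.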
Waiting Time Distribution for Pattern Occurrence in a Constrained Sequence: an Embedding Markov Chain Approach
Analysis of Algorithm
Alternative Methods for H1 Simulations in Genome Wide Association Studies
Assessing the statistical power to detect susceptibility variants plays a
critical role in GWA studies both from the prospective and retrospective points
of view. Power is empirically estimated by simulating phenotypes under a
disease model H1. For this purpose, the "gold standard" consists of simulating
genotypes given the phenotypes (e.g. Hapgen). We introduce here an alternative
approach for simulating phenotypes under H1 that does not require generating
new genotypes for each simulation. In order to simulate phenotypes with a fixed
total number of cases and under a given disease model, we suggest three
algorithms: i) a simple rejection algorithm; ii) a numerical Markov Chain
Monte-Carlo (MCMC) approach; and iii) an exact and efficient backward sampling
algorithm. In our study, we validated the three algorithms both on a
toy-dataset and by comparing them with Hapgen on a more realistic dataset. As
an application, we then conducted a simulation study on a 1000 Genomes Project
dataset consisting of 629 individuals (314 cases) and 8,048 SNPs from
Chromosome X. We arbitrarily defined an additive disease model with two
susceptibility SNPs and an epistatic effect. The three algorithms are
consistent, but backward sampling is dramatically faster than the other two.
Our approach also gives consistent results with Hapgen. Using our application
data, we showed that our limited design requires biological prior knowledge to limit
the investigated region. We also found that epistatic effects can play a
significant role even when simple marker statistics (e.g. trend) are used. We
finally showed that the overall performance of a GWA study strongly depends on
the prevalence of the disease: the larger the prevalence, the better the power
- …
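The idea of sampling phenotypes with a fixed total number of cases can be sketched as a backward recursion followed by a forward pass. This is a minimal sketch under stated assumptions, not the authors' code: each individual is assumed to be a case independently with a probability given by the disease model, every such probability is assumed to lie strictly between 0 and 1, and the names `probs`, `backward_table` and `sample_phenotypes` are hypothetical.

```python
import random


def backward_table(probs, n_cases):
    """B[i][m] = P(Y_i + ... + Y_{n-1} = m) for independent Bernoulli(probs[j])."""
    n = len(probs)
    B = [[0.0] * (n_cases + 1) for _ in range(n + 1)]
    B[n][0] = 1.0
    for i in range(n - 1, -1, -1):
        for m in range(n_cases + 1):
            B[i][m] = (1.0 - probs[i]) * B[i + 1][m]
            if m > 0:
                B[i][m] += probs[i] * B[i + 1][m - 1]
    return B


def sample_phenotypes(probs, n_cases, rng=random):
    """Draw 0/1 phenotypes conditional on exactly n_cases cases (n_cases <= len(probs))."""
    B = backward_table(probs, n_cases)
    remaining = n_cases
    y = []
    for i, p in enumerate(probs):
        if remaining == 0:
            case = 0
        else:
            # P(Y_i = 1 | `remaining` cases still to place among individuals i, i+1, ...).
            p_case = p * B[i + 1][remaining - 1] / B[i][remaining]
            case = int(rng.random() < p_case)
        y.append(case)
        remaining -= case
    return y


# Hypothetical per-individual case probabilities from some disease model.
print(sample_phenotypes([0.1, 0.5, 0.9, 0.3, 0.7], n_cases=2, rng=random.Random(0)))
```

Unlike a rejection scheme, every draw produced this way has the required number of cases, at the price of one table of size (number of individuals + 1) times (number of cases + 1).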