Semi-automatic selection of summary statistics for ABC model choice
A central statistical goal is to choose between alternative explanatory models of data. In many modern applications, such as population genetics, it is not possible to apply standard methods based on evaluating the likelihood functions of the models, as these are numerically intractable. Approximate Bayesian computation (ABC) is a commonly used alternative in such situations. ABC simulates data x for many parameter values under each model and compares them to the observed data x_obs. More weight is placed on models under which S(x) is close to S(x_obs), where S maps data to a vector of summary statistics. Previous work has shown that the choice of S is crucial to the efficiency and accuracy of ABC. This paper provides a method to select good summary statistics for model choice. It uses a preliminary step that simulates many x values from all models and fits regressions to these simulations with the model as response. The resulting estimators of the model weights are used as S in an ABC analysis. Theoretical results are given to justify this as approximating low-dimensional sufficient statistics. A substantive application is presented: choosing between competing coalescent models of demographic growth for Campylobacter jejuni in New Zealand using multi-locus sequence typing data.
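As a rough illustration of the pipeline described above: simulate pilot data from every model, fit a regression with the model index as response, and then use the fitted model probabilities as the summary statistic S in a rejection-ABC step. The toy simulators, the use of scikit-learn's logistic regression, and the acceptance quantile below are assumptions made for this sketch, not the paper's implementation.

```python
# Sketch: regression-based summary statistics for ABC model choice.
# Assumptions (not from the abstract): two toy models, a logistic-regression
# pilot fit, and a simple rejection-ABC step on the model index.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate(model, n_obs=50):
    """Toy simulators standing in for the real models."""
    if model == 0:
        return rng.normal(0.0, 1.0, n_obs)                # model 0: N(0, 1)
    return rng.normal(rng.normal(0, 1), 2.0, n_obs)       # model 1: wider, random mean

def raw_features(x):
    # Low-level statistics fed to the pilot regression.
    return np.array([x.mean(), x.std(), np.abs(x).mean()])

# Step 1: pilot simulations from all models, with the model index as response.
models = rng.integers(0, 2, size=5000)
X = np.vstack([raw_features(simulate(m)) for m in models])
pilot = LogisticRegression().fit(X, models)

# Step 2: use the fitted model probabilities as the summary statistic S.
def S(x):
    return pilot.predict_proba(raw_features(x).reshape(1, -1))[0]

x_obs = rng.normal(0.0, 1.0, 50)
s_obs = S(x_obs)

# Step 3: rejection ABC on model choice, keeping simulations with S close to S(x_obs).
sim_models = rng.integers(0, 2, size=20000)
dist = np.array([np.linalg.norm(S(simulate(m)) - s_obs) for m in sim_models])
accepted = sim_models[dist <= np.quantile(dist, 0.01)]
print("posterior model probabilities:", np.bincount(accepted, minlength=2) / len(accepted))
```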
Optimal detection of changepoints with a linear computational cost
We consider the problem of detecting multiple changepoints in large data
sets. Our focus is on applications where the number of changepoints will
increase as we collect more data: for example in genetics as we analyse larger
regions of the genome, or in finance as we observe time-series over longer
periods. We consider the common approach of detecting changepoints through
minimising a cost function over possible numbers and locations of changepoints.
This includes several established procedures for detecting changepoints,
such as penalised likelihood and minimum description length. We introduce a new
method for finding the minimum of such cost functions and hence the optimal
number and location of changepoints that has a computational cost which, under
mild conditions, is linear in the number of observations. This compares
favourably with existing methods for the same problem whose computational cost
can be quadratic or even cubic. In simulation studies we show that our new
method can be orders of magnitude faster than these alternative exact methods.
We also compare with the Binary Segmentation algorithm for identifying
changepoints, showing that the exactness of our approach can lead to
substantial improvements in the accuracy of the inferred segmentation of the
data.
Comment: 25 pages, 4 figures. To appear in the Journal of the American Statistical Association.
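The abstract describes minimising a penalised cost over the number and locations of changepoints with a search whose cost is, under mild conditions, linear in the number of observations. Below is a minimal sketch of a pruned dynamic-programming search of this general kind, assuming a Gaussian change-in-mean cost and an illustrative penalty; it is not the paper's exact algorithm or tuning.

```python
# Sketch: penalised-cost changepoint detection with candidate pruning,
# for a Gaussian change-in-mean cost. Illustrative only.
import numpy as np

def segment_cost(cumsum, cumsum2, s, t):
    """Negative twice log-likelihood (up to constants) of y[s:t] as one segment."""
    n = t - s
    total, total2 = cumsum[t] - cumsum[s], cumsum2[t] - cumsum2[s]
    return total2 - total**2 / n

def pruned_search(y, penalty):
    n = len(y)
    cumsum = np.concatenate([[0.0], np.cumsum(y)])
    cumsum2 = np.concatenate([[0.0], np.cumsum(y**2)])
    F = np.full(n + 1, np.inf)          # F[t]: optimal penalised cost of y[:t]
    F[0] = -penalty
    last_cp = np.zeros(n + 1, dtype=int)
    candidates = [0]                    # candidate last-changepoint positions (pruned set)
    for t in range(1, n + 1):
        costs = [F[s] + segment_cost(cumsum, cumsum2, s, t) + penalty for s in candidates]
        best = int(np.argmin(costs))
        F[t], last_cp[t] = costs[best], candidates[best]
        # Prune candidates that can never be optimal at any future time.
        candidates = [s for s, c in zip(candidates, costs) if c - penalty <= F[t]]
        candidates.append(t)
    # Backtrack the optimal changepoints.
    cps, t = [], n
    while t > 0:
        t = last_cp[t]
        if t > 0:
            cps.append(t)
    return sorted(cps)

y = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(3, 1, 200)])
print(pruned_search(y, penalty=3 * np.log(len(y))))
```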
Particle Approximations of the Score and Observed Information Matrix for Parameter Estimation in State Space Models With Linear Computational Cost
Poyiadjis et al. (2011) show how particle methods can be used to estimate both the score and the observed information matrix for state space models. These methods either suffer from a computational cost that is quadratic in the number of particles, or produce estimates whose variance increases quadratically with the amount of data. This paper introduces an alternative approach for estimating these terms at a computational cost that is linear in the number of particles. The method is derived using a combination of kernel density estimation, to avoid the particle degeneracy that causes the quadratically increasing variance, and Rao-Blackwellisation. Crucially, we show the method is robust to the choice of bandwidth within the kernel density estimation, as it has good asymptotic properties regardless of this choice. Our estimates of the score and observed information matrix can be used within both online and batch procedures for estimating parameters for state space models. Empirical results show improved parameter estimates compared to existing methods at a significantly reduced computational cost. Supplementary materials including code are available.
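A hedged sketch of the kind of O(N) recursion the abstract describes, combining a shrinkage of per-particle score statistics (the kernel step) with a Rao-Blackwellised average. The AR(1)-plus-noise model, the shrinkage weight lam, and all parameter values are assumptions chosen only for illustration.

```python
# Sketch: O(N) particle estimate of the score d/dphi log p(y_{1:T} | phi)
# for an AR(1)-plus-noise model, via a shrinkage recursion on per-particle
# score statistics. Illustrative only; model and constants are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def score_estimate(y, phi, sig_x=1.0, sig_y=1.0, N=1000, lam=0.95):
    x = rng.normal(0.0, sig_x / np.sqrt(1 - phi**2), N)    # particles from the stationary dist.
    alpha = np.zeros(N)                                     # per-particle score statistics
    w = np.full(N, 1.0 / N)
    for yt in y:
        mean_alpha = np.sum(w * alpha)                      # Rao-Blackwellised shrinkage target
        anc = rng.choice(N, size=N, p=w)                    # multinomial resampling
        x_prev, alpha_prev = x[anc], alpha[anc]
        x = phi * x_prev + rng.normal(0.0, sig_x, N)        # propagate through the transition
        # Gradient of the log transition density w.r.t. phi (observation term is free of phi).
        grad = (x - phi * x_prev) * x_prev / sig_x**2
        alpha = lam * alpha_prev + (1 - lam) * mean_alpha + grad
        logw = -0.5 * ((yt - x) / sig_y) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
    return np.sum(w * alpha)                                # score estimate at phi

# Simulate data and evaluate the score at the true parameter value.
phi_true, T = 0.8, 500
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + rng.normal()
y = x + rng.normal(size=T)
print(score_estimate(y, phi_true))
```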
The Time Machine: A Simulation Approach for Stochastic Trees
We consider a simulation technique for stochastic trees. One of the most important problems in computational genetics is the calculation, and subsequent maximization, of the likelihood function associated with such models. This is typically tackled using importance sampling (IS) and sequential Monte Carlo (SMC) techniques, which proceed by simulating the tree backward in time from the observed data to the most recent common ancestor (MRCA). In many cases, however, the computational time and the variance of the estimators are too high for the standard approaches to be useful. In this paper we propose to stop the simulation before the MRCA is reached, which yields biased estimates of the likelihood surface. The bias is investigated from a theoretical point of view. Results from simulation studies are also given to investigate the balance between loss of accuracy, savings in computing time, and variance reduction.
Comment: 22 pages, 5 figures.
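A toy illustration of the stopping idea, ignoring the mutation model and importance weights that a real likelihood calculation would need: simulate a Kingman coalescent backward from n lineages and stop once m > 1 lineages remain rather than continuing to the MRCA. The stopping rule and parameter values are assumptions for this sketch; it only shows how much of the simulated tree depth lies near the root.

```python
# Sketch: backward Kingman coalescent simulation truncated before the MRCA.
import numpy as np

rng = np.random.default_rng(2)

def coalescent_depth(n, stop_at=1):
    """Simulate coalescence times backward from n lineages, stopping at `stop_at` lineages."""
    depth, k = 0.0, n
    while k > stop_at:
        depth += rng.exponential(2.0 / (k * (k - 1)))   # rate k(k-1)/2 while k lineages remain
        k -= 1
    return depth

n, reps = 50, 10000
full = np.mean([coalescent_depth(n) for _ in range(reps)])                 # down to the MRCA
stopped = np.mean([coalescent_depth(n, stop_at=5) for _ in range(reps)])   # stopped early
print(f"mean depth to MRCA: {full:.3f}, mean depth stopped at 5 lineages: {stopped:.3f}")
```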
Bayesian computation via empirical likelihood
Approximate Bayesian computation (ABC) has become an essential tool for the
analysis of complex stochastic models when the likelihood function is
numerically unavailable. However, the well-established statistical method of
empirical likelihood provides another route to such settings that bypasses
simulations from the model and the choices of the ABC parameters (summary
statistics, distance, tolerance), while being convergent in the number of
observations. Furthermore, bypassing model simulations may lead to significant
time savings in complex models, for instance those found in population
genetics. The BCel algorithm we develop in this paper also provides an
evaluation of its own performance through an associated effective sample size.
The method is illustrated using several examples, including estimation of
standard distributions, time series, and population genetics models.
Comment: 21 pages, 12 figures. Revised version of the previous version, with a new title.
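A minimal sketch of the empirical-likelihood route for a toy problem, estimating a mean: parameter values are drawn from the prior and each draw is weighted by its empirical likelihood, with an effective sample size reported as the performance diagnostic. The estimating equation, the uniform prior, and the plain importance-sampling scheme are assumptions for this illustration, not the paper's general BCel implementation.

```python
# Sketch: Bayesian computation via empirical likelihood (toy mean-estimation example).
import numpy as np

rng = np.random.default_rng(3)

def log_empirical_likelihood(x, theta):
    """log EL ratio for the mean: sum_i log(n p_i) at the optimal weights p_i."""
    z = x - theta
    if z.min() >= 0 or z.max() <= 0:
        return -np.inf                        # theta outside the convex hull of the data
    lo, hi = -1.0 / z.max() + 1e-8, -1.0 / z.min() - 1e-8
    for _ in range(100):                      # bisection on the Lagrange multiplier
        lam = 0.5 * (lo + hi)
        g = np.sum(z / (1.0 + lam * z))       # stationarity condition; decreasing in lam
        if g > 0:
            lo = lam
        else:
            hi = lam
    return -np.sum(np.log(1.0 + lam * z))

# Prior sampling plus empirical-likelihood weighting.
x = rng.normal(1.5, 1.0, 100)                            # observed data
thetas = rng.uniform(-5, 5, 5000)                        # assumed uniform prior
logw = np.array([log_empirical_likelihood(x, t) for t in thetas])
w = np.exp(logw - logw.max())
w /= w.sum()
ess = 1.0 / np.sum(w**2)                                 # effective sample size diagnostic
print("posterior mean:", np.sum(w * thetas), "ESS:", ess)
```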
INTEGRAL/SPI data segmentation to retrieve sources intensity variations
Context. The INTEGRAL/SPI X/γ-ray spectrometer (20 keV–8 MeV) is an instrument for which recovering source intensity variations is not straightforward and can constitute a difficulty for data analysis. In most cases, determining the source intensity changes between exposures is largely based on a priori information.
Aims. We propose techniques that help to overcome this difficulty and make this step of the analysis more rational. In addition, the constructed “synthetic” light curves should permit us to obtain a sky model that describes the data better and optimizes the source signal-to-noise ratios.
Methods. For this purpose, the time intensity variation of each source was modeled as a combination of piecewise segments of time during which a given source exhibits a constant intensity. To optimize the signal-to-noise ratios, the number of segments was minimized. We present a first method that takes advantage of previous time series that can be obtained from another instrument on board the INTEGRAL observatory. A data segmentation algorithm was then used to synthesize the time series into segments. The second method no longer needs external light curves, but relies solely on SPI raw data. For this, we developed a specific algorithm that involves the SPI transfer function.
Results. The time segmentation algorithms that were developed solve a difficulty inherent to the SPI instrument, namely the intensity variations of sources between exposures, and they allow us to obtain more information about the sources' behavior.
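As a hedged illustration of segmenting a light curve into piecewise-constant intensity segments with a penalty on the number of segments, here is a sketch using the third-party ruptures package on synthetic count data. The package, the least-squares cost model, and the penalty value are assumptions for illustration; they are not the SPI-specific algorithms developed in the paper.

```python
# Sketch: piecewise-constant segmentation of a synthetic light curve.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(4)

# Synthetic "light curve": counts per exposure with a few intensity changes.
rates = np.concatenate([np.full(80, 20.0), np.full(40, 35.0), np.full(80, 15.0)])
counts = rng.poisson(rates)

# Penalised search for a small number of piecewise-constant segments.
algo = rpt.Pelt(model="l2", min_size=5).fit(counts.astype(float))
breakpoints = algo.predict(pen=3 * np.log(len(counts)) * counts.var())
print("segment boundaries (exposure indices):", breakpoints)
```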
Sequential quasi-Monte Carlo: Introduction for Non-Experts, Dimension Reduction, Application to Partly Observed Diffusion Processes
SMC (Sequential Monte Carlo) is a class of Monte Carlo algorithms for
filtering and related sequential problems. Gerber and Chopin (2015) introduced
SQMC (Sequential quasi-Monte Carlo), a QMC version of SMC. This paper has two
objectives: (a) to introduce Sequential Monte Carlo to the QMC community, whose
members are usually less familiar with state-space models and particle
filtering; (b) to extend SQMC to the filtering of continuous-time state-space
models, where the latent process is a diffusion. A recurring point in the paper
will be the notion of dimension reduction, that is, how to implement SQMC in such a way that it provides good performance despite the high dimension of the problem.
Comment: To be published in the proceedings of MCQMC 201
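Since objective (a) is to introduce SMC itself, here is a minimal bootstrap particle filter (plain Monte Carlo, not the quasi-Monte Carlo variant) for a scalar state-space model. The stochastic-volatility-style model and all parameter values are assumptions chosen only to illustrate what SMC computes.

```python
# Sketch: a bootstrap particle filter for a stochastic-volatility-like model.
import numpy as np

rng = np.random.default_rng(5)

def bootstrap_filter(y, N=2000, rho=0.9, sigma=0.3):
    """Return a log-likelihood estimate (up to a constant) and filtering means E[x_t | y_{1:t}]."""
    x = rng.normal(0.0, sigma / np.sqrt(1 - rho**2), N)       # stationary initial particles
    loglik, filt_means = 0.0, []
    for yt in y:
        x = rho * x + sigma * rng.normal(size=N)              # propagate through the prior
        logw = -0.5 * (yt**2 * np.exp(-x) + x)                # y_t | x_t ~ N(0, exp(x_t)), up to a constant
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())                        # running log-likelihood estimate
        w /= w.sum()
        filt_means.append(np.sum(w * x))
        x = x[rng.choice(N, size=N, p=w)]                     # multinomial resampling
    return loglik, np.array(filt_means)

# Simulate data from the same model and run the filter on it.
T, rho, sigma = 200, 0.9, 0.3
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t - 1] + sigma * rng.normal()
y = np.exp(x / 2) * rng.normal(size=T)
print(bootstrap_filter(y)[0])
```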
A computationally efficient, high-dimensional multiple changepoint procedure with application to global terrorism incidence
Detecting changepoints in datasets with many variates is a data science challenge of increasing importance. Motivated by the problem of detecting changes in the incidence of terrorism from a global terrorism database, we propose a novel approach to multiple changepoint detection in multivariate time series. Our method, which we call SUBSET, is a model-based approach that uses a penalised likelihood to detect changes for a wide class of parametric settings. We provide theory that guides the choice of penalties to use for SUBSET, and that shows it has high power to detect changes regardless of whether only a few variates or many variates change. Empirical results show that SUBSET outperforms many existing approaches for detecting changes in mean in Gaussian data; additionally, unlike these alternative methods, it can easily be extended to non-Gaussian settings, such as those appropriate for modelling counts of terrorist events.
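A hedged sketch in the spirit of penalised multivariate tests that stay powerful whether few or many variates change: each variate contributes its change-in-mean evidence only after paying a per-variate penalty, and the thresholded contributions are summed against a global penalty. The thresholding rule, the penalty values, and the single known candidate changepoint are assumptions for illustration, not the SUBSET procedure itself (which would also search over candidate locations).

```python
# Sketch: combining per-variate evidence for a multivariate change in mean.
import numpy as np

rng = np.random.default_rng(6)

def per_variate_evidence(Y, tau):
    """Per-variate twice-log-likelihood-ratio for a mean change at time tau (unit-variance Gaussian)."""
    def rss(Z):
        return ((Z - Z.mean(axis=0)) ** 2).sum(axis=0)
    return rss(Y) - rss(Y[:tau]) - rss(Y[tau:])

def penalised_test(Y, tau, alpha, beta):
    """Declare a change at tau if the penalised, thresholded evidence exceeds beta."""
    contribution = np.maximum(per_variate_evidence(Y, tau) - alpha, 0.0)
    return contribution.sum() > beta

# Example: 100 variates, only 3 of which shift in mean halfway through the series.
n, d = 200, 100
Y = rng.normal(size=(n, d))
Y[n // 2:, :3] += 1.5
alpha, beta = 2 * np.log(d), 10 * np.log(n)   # illustrative penalty choices, not the paper's
print("change detected:", penalised_test(Y, tau=n // 2, alpha=alpha, beta=beta))
```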