Why your model parameter confidences might be too optimistic -- unbiased estimation of the inverse covariance matrix
AIMS. The maximum-likelihood method is the standard approach to obtain model
fits to observational data and the corresponding confidence regions. We
investigate possible sources of bias in the log-likelihood function and its
subsequent analysis, focusing on estimators of the inverse covariance matrix.
Furthermore, we study under which circumstances the estimated covariance matrix
is invertible. METHODS. We perform Monte-Carlo simulations to investigate the
behaviour of estimators for the inverse covariance matrix, depending on the
number of independent data sets and the number of variables of the data
vectors. RESULTS. We find that the inverse of the maximum-likelihood estimator
of the covariance is biased, the amount of bias depending on the ratio of the
number of bins (data vector variables), P, to the number of data sets, N. This
bias inevitably leads to an -- in extreme cases catastrophic -- underestimation
of the size of confidence regions. We report on a method to remove this bias
for the idealised case of Gaussian noise and statistically independent data
vectors. Moreover, we demonstrate that marginalisation over parameters
introduces a bias into the marginalised log-likelihood function. Measures of
the sizes of confidence regions suffer from the same problem. Furthermore, we
give an analytic proof for the fact that the estimated covariance matrix is
singular if P>N. Comment: 6 pages, 3 figures, A&A, in press, shortened versio
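The bias described in this abstract can be illustrated numerically. The sketch below assumes Gaussian data with identity covariance and relies on the standard inverse-Wishart result that the inverse of the sample covariance (1/(N-1) normalisation, mean estimated from the data) is too large on average by a factor (N-1)/(N-P-2). The abstract does not spell out the correction, so this factor and all function names are assumptions made for illustration, not necessarily the authors' procedure.

```python
import numpy as np


def debiased_precision(samples):
    """Inverse sample covariance rescaled by (N - P - 2) / (N - 1).

    samples: array of shape (N, P), one independent data vector per row.
    """
    n, p = samples.shape
    if n - p - 2 <= 0:
        raise ValueError("correction factor requires N > P + 2")
    cov = np.cov(samples, rowvar=False)  # 1/(N-1) normalisation
    return (n - p - 2) / (n - 1) * np.linalg.inv(cov)


def mean_precision_diagonal(n, p, trials, rng, debias):
    """Average diagonal of the estimated precision; the true value is 1 here."""
    total = 0.0
    for _ in range(trials):
        x = rng.standard_normal((n, p))  # true covariance = identity
        if debias:
            prec = debiased_precision(x)
        else:
            prec = np.linalg.inv(np.cov(x, rowvar=False))
        total += np.trace(prec) / p
    return total / trials


def sample_covariance_rank(n, p, rng):
    """Numerical rank of the sample covariance of N vectors with P bins."""
    x = rng.standard_normal((n, p))
    return np.linalg.matrix_rank(np.cov(x, rowvar=False))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print("naive   :", mean_precision_diagonal(20, 10, 2000, rng, debias=False))
    print("debiased:", mean_precision_diagonal(20, 10, 2000, rng, debias=True))
    print("rank for N=5, P=8:", sample_covariance_rank(5, 8, rng))
```

With N=20 and P=10 the naive average should sit well above 1, which translates into confidence regions that are too small, while the rescaled estimator should average close to 1. The rank check illustrates why the estimated covariance cannot be inverted when P>N.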
Dimensionality reduction of clustered data sets
We present a novel probabilistic latent variable model to perform linear dimensionality reduction on data sets that contain clusters. We prove that the maximum likelihood solution of the model is an unsupervised generalisation of linear discriminant analysis. This provides a completely new approach to one of the most established and widely used classification algorithms. The performance of the model is then demonstrated on a number of real and artificial data sets.
Some Aspects of Measurement Error in Linear Regression of Astronomical Data
I describe a Bayesian method to account for measurement errors in linear
regression of astronomical data. The method allows for heteroscedastic and
possibly correlated measurement errors, and intrinsic scatter in the regression
relationship. The method is based on deriving a likelihood function for the
measured data, and I focus on the case when the intrinsic distribution of the
independent variables can be approximated using a mixture of Gaussians. I
generalize the method to incorporate multiple independent variables,
non-detections, and selection effects (e.g., Malmquist bias). A Gibbs sampler
is described for simulating random draws from the probability distribution of
the parameters, given the observed data. I use simulation to compare the method
with other common estimators. The simulations illustrate that the Gaussian
mixture model outperforms other common estimators and can effectively give
constraints on the regression parameters, even when the measurement errors
dominate the observed scatter, source detection fraction is low, or the
intrinsic distribution of the independent variables is not a mixture of
Gaussians. I conclude by using this method to fit the X-ray spectral slope as a
function of Eddington ratio using a sample of 39 z < 0.8 radio-quiet quasars. I
confirm the correlation seen by other authors between the radio-quiet quasar
X-ray spectral slope and the Eddington ratio, where the X-ray spectral slope
softens as the Eddington ratio increases. Comment: 39 pages, 11 figures, 1 table, accepted by ApJ. IDL routines
(linmix_err.pro) for performing the Markov Chain Monte Carlo are available at
the IDL astronomy user's library, http://idlastro.gsfc.nasa.gov/homepage.htm
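To see why measurement errors in the independent variable matter, the following sketch simulates a linear relation and compares ordinary least squares with a simple method-of-moments correction. This is not the Bayesian Gaussian-mixture method of the abstract and not a port of linmix_err.pro; the function names, the homoscedastic known error variance, and the absence of intrinsic scatter are all simplifying assumptions made for illustration.

```python
import numpy as np


def simulate(n, slope, intercept, sigma_xi, sigma_xerr, sigma_yerr, rng):
    """Draw noisy (x, y) pairs from a linear relation between true values."""
    xi = rng.normal(0.0, sigma_xi, n)        # true independent variable
    eta = intercept + slope * xi             # true dependent variable
    x = xi + rng.normal(0.0, sigma_xerr, n)  # measured with error
    y = eta + rng.normal(0.0, sigma_yerr, n)
    return x, y


def ols_slope(x, y):
    """Ordinary least-squares slope, ignoring measurement errors."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def moment_corrected_slope(x, y, sigma_xerr):
    """Subtract the known x-error variance from the observed variance of x."""
    return np.cov(x, y)[0, 1] / (np.var(x, ddof=1) - sigma_xerr**2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, y = simulate(50_000, 2.0, 0.5, 1.0, 1.0, 0.5, rng)
    print("OLS slope      :", ols_slope(x, y))
    print("corrected slope:", moment_corrected_slope(x, y, 1.0))
```

When the error variance in x equals the intrinsic variance, the OLS slope is attenuated to roughly half the true value, whereas the corrected estimate should recover it approximately. The moment correction breaks down for small samples or dominant errors, which is the regime the abstract's likelihood-based method targets.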
Binary Models for Marginal Independence
Log-linear models are a classical tool for the analysis of contingency
tables. In particular, the subclass of graphical log-linear models provides a
general framework for modelling conditional independences. However, with the
exception of special structures, marginal independence hypotheses cannot be
accommodated by these traditional models. Focusing on binary variables, we
present a model class that provides a framework for modelling marginal
independences in contingency tables. The approach taken is graphical and draws
on analogies to multivariate Gaussian models for marginal independence. For the
graphical model representation we use bi-directed graphs, which are in the
tradition of path diagrams. We show how the models can be parameterized in a
simple fashion, and how maximum likelihood estimation can be performed using a
version of the Iterated Conditional Fitting algorithm. Finally, we consider
combining these models with symmetry restrictions.
High-Dimensional Bayesian Geostatistics
With the growing capabilities of Geographic Information Systems (GIS) and
user-friendly software, statisticians today routinely encounter geographically
referenced data containing observations from a large number of spatial
locations and time points. Over the last decade, hierarchical spatiotemporal
process models have become widely deployed statistical tools for researchers to
better understand the complex nature of spatial and temporal variability.
However, fitting hierarchical spatiotemporal models often involves expensive
matrix computations with complexity increasing in cubic order for the number of
spatial locations and temporal points. This renders such models unfeasible for
large data sets. This article offers a focused review of two methods for
constructing well-defined highly scalable spatiotemporal stochastic processes.
Both these processes can be used as "priors" for spatiotemporal random fields.
The first approach constructs a low-rank process operating on a
lower-dimensional subspace. The second approach constructs a Nearest-Neighbor
Gaussian Process (NNGP) that ensures sparse precision matrices for its finite
realizations. Both processes can be exploited as a scalable prior embedded
within a rich hierarchical modeling framework to deliver full Bayesian
inference. These approaches can be described as model-based solutions for big
spatiotemporal datasets. The models ensure that the algorithmic complexity is
~n floating point operations (flops) per iteration, where n is the number of
spatial locations. We compare these methods and provide some insight
into their methodological underpinnings.
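The computational gain of a low-rank construction can be sketched with the Woodbury identity. The example below assumes a one-dimensional domain, an exponential covariance function, k knots, and an independent noise (nugget) term; none of these choices come from the abstract, and the function names are hypothetical. It solves a linear system in the low-rank covariance by factorising only a k x k matrix and checks the result against a dense solve.

```python
import numpy as np


def exp_kernel(a, b, length_scale=0.3):
    """Exponential covariance between 1-D location arrays a and b."""
    return np.exp(-np.abs(a[:, None] - b[None, :]) / length_scale)


def low_rank_solve(locs, knots, y, noise_var=0.1):
    """Solve (C_nk C_kk^{-1} C_kn + noise_var * I) v = y via Woodbury.

    Only a k x k system is solved, so the cost is O(n k^2).
    """
    c_nk = exp_kernel(locs, knots)
    c_kk = exp_kernel(knots, knots)
    small = c_kk + c_nk.T @ c_nk / noise_var
    correction = c_nk @ np.linalg.solve(small, c_nk.T @ y)
    return y / noise_var - correction / noise_var**2


def dense_solve(locs, knots, y, noise_var=0.1):
    """Reference O(n^3) solve with the full n x n covariance matrix."""
    c_nk = exp_kernel(locs, knots)
    c_kk = exp_kernel(knots, knots)
    full = c_nk @ np.linalg.solve(c_kk, c_nk.T) + noise_var * np.eye(len(locs))
    return np.linalg.solve(full, y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    locs = np.sort(rng.uniform(0.0, 1.0, 500))
    knots = np.linspace(0.0, 1.0, 8)
    y = rng.standard_normal(500)
    diff = np.max(np.abs(low_rank_solve(locs, knots, y) - dense_solve(locs, knots, y)))
    print("max abs difference:", diff)
```

For fixed k the Woodbury route scales linearly in n, which is the kind of scaling the abstract attributes to these models. The NNGP approach reaches similar scaling by a different route, through sparse precision matrices, and is not shown here.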
Simultaneous likelihood-based bootstrap confidence sets for a large number of models
The paper studies the problem of constructing simultaneous likelihood-based
confidence sets. We consider a simultaneous multiplier bootstrap procedure for
estimating the quantiles of the joint distribution of the likelihood ratio
statistics, and for adjusting the confidence level for multiplicity.
Theoretical results state the bootstrap validity in the following setting: the
sample size n is fixed, and the maximal parameter dimension p_max
and the number of considered parametric models K are
such that (log K)^12 p_max^3 / n is small. We also consider the situation
when the parametric models are misspecified. If the models' misspecification is
significant, then the bootstrap critical values exceed the true ones and the
simultaneous bootstrap confidence set becomes conservative. Numerical
experiments for local constant and local quadratic regressions illustrate the
theoretical results.
Bootstrap confidence sets under model misspecification
A multiplier bootstrap procedure for construction of likelihood-based
confidence sets is considered for finite samples and a possible model
misspecification. Theoretical results justify the bootstrap validity for a
small or moderate sample size and make it possible to control the impact of the parameter
dimension p: the bootstrap approximation works if p^3/n is small. The main
result about bootstrap validity continues to apply even if the underlying
parametric model is misspecified under the so-called small modelling bias
condition. In the case when the true model deviates significantly from the
considered parametric family, the bootstrap procedure is still applicable but
it becomes a bit conservative: the size of the constructed confidence sets is
increased by the modelling bias. We illustrate the results with numerical
examples for misspecified linear and logistic regressions. Comment: Published at http://dx.doi.org/10.1214/15-AOS1355 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
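A multiplier bootstrap for a likelihood-based confidence set can be sketched in the simplest possible setting. The example below assumes a working model y_i ~ N(theta, 1) and i.i.d. exponential multipliers with mean 1 and variance 1; the choice of model, the weight distribution, and the function names are assumptions for illustration rather than the construction used in the paper. Each bootstrap draw reweights the per-observation log-likelihood terms, and the quantile of the resulting likelihood-ratio statistic serves as the critical value.

```python
import numpy as np


def bootstrap_critical_value(y, level=0.95, draws=2000, rng=None):
    """Quantile of the weighted likelihood-ratio statistic for a Gaussian mean."""
    rng = np.random.default_rng() if rng is None else rng
    n = y.size
    theta_hat = y.mean()                       # maximiser of the unweighted likelihood
    w = rng.exponential(1.0, size=(draws, n))  # multipliers: mean 1, variance 1
    w_sum = w.sum(axis=1)
    theta_boot = (w @ y) / w_sum               # maximiser of the weighted likelihood
    # Weighted log-likelihood at theta_boot minus its value at theta_hat.
    stat = 0.5 * w_sum * (theta_boot - theta_hat) ** 2
    return np.quantile(stat, level)


def confidence_interval(y, level=0.95, draws=2000, rng=None):
    """Set of theta whose log-likelihood deficit is at most the bootstrap quantile."""
    z = bootstrap_critical_value(y, level, draws, rng)
    half_width = np.sqrt(2.0 * z / y.size)     # solves 0.5 * n * (theta_hat - theta)^2 = z
    return y.mean() - half_width, y.mean() + half_width


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(1.0, 1.0, 200)
    print("critical value:", bootstrap_critical_value(y, rng=rng))
    print("95% interval  :", confidence_interval(y, rng=rng))
```

If the data have a larger variance than the working model assumes, the bootstrap statistic spreads out with the sample variance and the interval widens accordingly, which is one way to picture the conservative behaviour under misspecification that the abstract describes.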