Multiple Outlier Detection: Hypothesis Tests versus Model Selection by Information Criteria
The detection of multiple outliers can be interpreted as a model selection problem. The models that can be selected are the null model, which indicates an outlier-free set of observations, and a class of alternative models, which contain a set of additional bias parameters. A common way to select the right model is a statistical hypothesis test; in geodesy, data snooping is the most popular choice. Another approach arises from information theory: here, the Akaike information criterion (AIC) is used to select an appropriate model for a given set of observations. The AIC is based on the Kullback-Leibler divergence, which describes the discrepancy between the model candidates. Both approaches are discussed and applied to two test problems: fitting a straight line and a geodetic network. Some relationships between data snooping and information criteria are discussed. When compared, the information-criteria approach turns out to be simpler and more elegant. Along with the AIC there are many alternative information criteria for selecting different outliers, and it is not clear which one is optimal.
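The abstract's model-selection view of outlier detection can be sketched in a few lines: compare the AIC of the null (outlier-free) straight-line model against alternatives that each add one bias parameter for a single observation, and pick the minimum. This is an illustrative toy, not the paper's geodetic implementation; the data, noise level, and injected outlier are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = np.linspace(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, n)
y[7] += 1.5                      # inject one gross error at index 7

def aic(X, y):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2k."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1           # regression coefficients plus the noise variance
    return len(y) * np.log(rss / len(y)) + 2 * k

X0 = np.column_stack([np.ones(n), x])     # null model: straight line, no outlier
scores = {None: aic(X0, y)}
for i in range(n):                        # alternatives: one extra bias parameter
    e = np.zeros((n, 1)); e[i, 0] = 1.0
    scores[i] = aic(np.hstack([X0, e]), y)

best = min(scores, key=scores.get)        # None => no outlier; else its index
```

The extra bias column removes the gross error from the residuals, so the large drop in RSS outweighs the 2-point penalty for the added parameter only at the contaminated observation.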
Randomization tests for experiments embedded in complex surveys
Includes bibliographical references. 2022 Fall. Embedding experiments in complex surveys has become increasingly important. For scientific questions, such embedding allows researchers to take advantage of both the internal validity of controlled experiments and the external validity of probability-based samples of a population. Within survey statistics, declining response rates have led to the development of new methods, known as adaptive and responsive survey designs, that try to increase or maintain response rates without negatively impacting survey quality. Such methodologies are assessed experimentally. Examples include a series of embedded experiments in the 2019 Triennial Community Health Survey (TCHS), conducted by the Health District of Northern Larimer County in collaboration with the Department of Statistics at Colorado State University, to determine the effects of monetary incentives, targeted mailing of reminders, and double-stuffed envelopes (including both English and Spanish versions of the survey) on response rates, cost, and representativeness of the sample. This dissertation develops methodology and theory for randomization-based tests embedded in complex surveys, assesses the methodology via simulation, and applies the methods to data from the 2019 TCHS. An important consideration in experiments to increase response rates is the overall balance of the sample, because a higher overall response might still underrepresent important groups. Recent years have seen advances in methods for assessing the representativeness of samples, including application of the dissimilarity index (DI) to help evaluate the representativeness of a sample under the different conditions of an incentive experiment (Biemer et al. 2018). We develop theory and methodology for design-based inference for the DI when it is used in a complex survey.
Simulation studies show that the linearization method has good properties, with good confidence interval coverage even in cases where the true DI is close to zero, even though point estimates may be biased. We then develop a class of randomization tests for evaluating experiments embedded in complex surveys. We consider a general parametric contrast, estimated using the design-weighted Narain-Horvitz-Thompson (NHT) approach, in either a completely randomized design or a randomized complete block design embedded in a complex survey. We derive asymptotic normal approximations for the randomization distribution of a general contrast, from which critical values can be obtained for testing the null hypothesis that the contrast is zero. The asymptotic results are conditional on the complex sample, but we include results showing that, under mild conditions, the inference extends to the finite population. Further, we develop asymptotic power properties of the tests under moderate conditions. Through simulation, we illustrate asymptotic properties of the randomization tests and compare the normal approximations of the randomization tests with corresponding Monte Carlo tests, with a design-based test developed by van den Brakel, and with randomization tests developed by Fisher-Pitman-Welch and Neyman. The randomization approach generalizes broadly to other kinds of embedded experimental designs and null hypothesis testing problems, for very general survey designs. The randomization approach is then extended from NHT estimators to generalized regression estimators that incorporate auxiliary information, and from linear contrasts to comparisons of nonlinear functions.
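The core idea of a randomization test for an embedded experiment can be illustrated with a small Monte Carlo: hold the drawn sample (and its weights) fixed, re-randomize the treatment assignment many times, and compare the observed design-weighted contrast to its randomization distribution. This sketch uses a Hajek-style weighted mean difference as a simple stand-in for the dissertation's NHT contrast and a Monte Carlo rather than a normal approximation; the weights, sample size, and response probabilities are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
w = rng.uniform(1.0, 5.0, n)            # hypothetical survey design weights
treat = rng.permutation(np.repeat([0, 1], n // 2))   # completely randomized design
y = rng.binomial(1, 0.35 + 0.15 * treat)             # 1 = responded to the survey

def contrast(y, w, t):
    """Design-weighted (Hajek-style) difference of mean response between arms."""
    m1 = np.sum(w * y * (t == 1)) / np.sum(w * (t == 1))
    m0 = np.sum(w * y * (t == 0)) / np.sum(w * (t == 0))
    return m1 - m0

obs = contrast(y, w, treat)
# Randomization distribution: re-randomize treatment, holding the sample fixed
null = np.array([contrast(y, w, rng.permutation(treat)) for _ in range(2000)])
p_value = np.mean(np.abs(null) >= abs(obs))          # two-sided Monte Carlo p-value
```

The dissertation's contribution is, in part, replacing the inner Monte Carlo loop with asymptotic normal critical values for such contrasts under general survey designs.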
ardl: Estimating autoregressive distributed lag and equilibrium correction models
We present a command, ardl, for the estimation of autoregressive distributed lag (ARDL) models in a time-series context. The ardl command can be used to fit an ARDL model with the optimal number of autoregressive and distributed lags based on the Akaike or Bayesian (Schwarz) information criterion. The regression results can be displayed in the ARDL levels form or in the error-correction representation of the model. The latter separates long-run and short-run effects and is available in two different parameterizations of the long-run (cointegrating) relationship. The popular bounds-testing procedure for the existence of a long-run levels relationship is implemented as a postestimation feature. Comprehensive critical values and approximate p-values obtained from response-surface regressions facilitate statistical inference.
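The ardl command itself is Stata and is not reproduced here; as a language-neutral sketch of the lag-selection idea it implements, the following numpy snippet fits candidate ARDL(p, q) specifications by OLS on a common estimation sample and picks the orders that minimize the Akaike criterion. The data-generating process and all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 300
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):                       # true ARDL(1, 1) process
    y[t] = 0.5 * y[t - 1] + 1.0 * x[t] + 0.3 * x[t - 1] + rng.normal(scale=0.5)

def aic(p, q, maxlag=4):
    """OLS fit of ARDL(p, q) on a common sample; Gaussian AIC up to a constant."""
    yy = y[maxlag:]
    cols = [np.ones(T - maxlag)]
    cols += [y[maxlag - i:T - i] for i in range(1, p + 1)]    # lags of y
    cols += [x[maxlag - j:T - j] for j in range(0, q + 1)]    # x and its lags
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, yy, rcond=None)
    rss = float(np.sum((yy - X @ beta) ** 2))
    k = X.shape[1] + 1
    return len(yy) * np.log(rss / len(yy)) + 2 * k

best = min(((p, q) for p in range(1, 5) for q in range(0, 5)),
           key=lambda pq: aic(*pq))        # selected (p, q) orders
```

Using the same estimation sample for every candidate (trimming by the maximum lag considered) keeps the criteria comparable across orders, which is the same convention lag-selection routines follow.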
Parameter Estimation of Complex Systems from Sparse and Noisy Data
Mathematical modeling is a key component of various disciplines in science and engineering. A mathematical model that represents the important behavior of a real system can be used as a substitute for the real process in many analysis and synthesis tasks. The performance of model-based techniques, e.g., system analysis, computer simulation, controller design, sensor development, state filtering, product monitoring, and process optimization, is highly dependent on the quality of the model used. It is therefore very important to be able to develop an accurate model from the available experimental data.
Parameter estimation is usually formulated as an optimization problem in which the parameter estimate is computed by minimizing the discrepancy between the model prediction and the experimental data. If a simple model and a large amount of data are available, the estimation problem is frequently well posed, and a small error in data fitting automatically results in an accurate model. However, this is not always the case. If the model is complex and only sparse and noisy data are available, the estimation problem is often ill-conditioned, and good data fitting does not ensure accurate model predictions. Many challenges that can often be neglected for estimation involving simple models need to be carefully considered for estimation problems involving complex models.
To obtain a reliable and accurate estimate from sparse and noisy data, a set of techniques is developed by addressing the challenges encountered in the estimation of complex models, including (1) model analysis and simplification, which identifies the important sources of uncertainty and reduces the model complexity; (2) experimental design, which collects information-rich data by setting optimal experimental conditions; (3) regularization of the estimation problem, which solves the ill-conditioned large-scale optimization problem by reducing the number of parameters; (4) nonlinear estimation and filtering, which fits the data by various estimation and filtering algorithms; and (5) model verification, which applies statistical hypothesis tests to the prediction error.
The developed methods are applied to different types of models, ranging from models found in the process industries to biochemical networks, some of which are described by ordinary differential equations with dozens of state variables and more than a hundred parameters.
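The ill-conditioning described above, and the effect of regularization (item 3), can be demonstrated with a deliberately near-collinear linear problem: unregularized least squares amplifies noise along weakly identified parameter directions, while a small ridge (Tikhonov) penalty stabilizes the estimate. This is a generic textbook sketch, not the dissertation's algorithms; the dimensions, conditioning, and penalty value are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 10
# Nearly collinear parameter sensitivities => ill-conditioned estimation problem
X = rng.normal(size=(n, p)) @ np.diag(np.logspace(0, -6, p))
theta_true = np.ones(p)
y = X @ theta_true + rng.normal(scale=0.01, size=n)   # sparse, noisy data

ols = np.linalg.lstsq(X, y, rcond=None)[0]            # unregularized least squares
lam = 1e-3                                            # ridge penalty (tuning choice)
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

err_ols = np.linalg.norm(ols - theta_true)            # blows up along weak directions
err_ridge = np.linalg.norm(ridge - theta_true)        # biased but far more accurate
```

Good data fitting and accurate parameters diverge here exactly as the abstract warns: both fits reproduce y almost equally well, yet the unregularized parameter estimate is far from the truth.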
Cosmographic Hubble fits to the supernova data
The Hubble relation between distance and redshift is a purely cosmographic relation that depends only on the symmetries of an FLRW spacetime and does not intrinsically make any dynamical assumptions. This suggests that it should be possible to estimate the parameters defining the Hubble relation without making any dynamical assumptions. To test this idea, we perform a number of inter-related cosmographic fits to the legacy05 and gold06 supernova datasets. Based on this supernova data, the "preponderance of evidence" certainly suggests an accelerating universe. However, we would argue that (unless one uses additional dynamical and observational information) this conclusion is not currently supported "beyond reasonable doubt". As part of the analysis we develop two particularly transparent graphical representations of the redshift-distance relation -- representations in which acceleration versus deceleration reduces to the question of whether the relevant graph slopes up or down. Turning to the details of the cosmographic fits, three issues in particular concern us. First, the fitted value of the deceleration parameter changes significantly depending on whether one performs a chi^2 fit to the luminosity distance, the proper motion distance, or some other suitable distance surrogate. Second, the fitted value of the deceleration parameter changes significantly depending on whether one uses the traditional redshift variable z or what we shall argue is, on theoretical grounds, an improved parameterization y = z/(1+z). Third, the published estimates of systematic uncertainties are sufficiently large that they certainly impact on, and to a large extent undermine, the usual purely statistical tests of significance. We conclude that the supernova data should be treated with some caution.
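A minimal cosmographic fit of the kind the abstract describes can be sketched as follows: generate mock luminosity distances from the second-order Hubble relation d_L = (c/H0)[z + (1 - q0) z^2/2], then recover the deceleration parameter q0 by a least-squares (equal-error chi^2) fit. The H0 value, noise level, and truncation at second order are illustrative assumptions, and this toy does not reproduce the paper's treatment of distance surrogates or systematic errors.

```python
import numpy as np

rng = np.random.default_rng(4)
c_over_H0 = 4283.0                 # Hubble distance in Mpc (assumes H0 ~ 70 km/s/Mpc)
q0_true = -0.55                    # deceleration parameter for the mock data

# Mock luminosity distances from the second-order Hubble relation
z = np.linspace(0.01, 1.0, 60)
dL = c_over_H0 * (z + 0.5 * (1 - q0_true) * z ** 2)
dL += rng.normal(scale=20.0, size=z.size)            # illustrative noise, in Mpc

# Equal-error chi^2 fit of the same expansion:
# dL / (c/H0) = a1 * z + a2 * (z^2 / 2), with q0 = 1 - a2/a1
A = np.column_stack([z, z ** 2 / 2])
a1, a2 = np.linalg.lstsq(A, dL / c_over_H0, rcond=None)[0]
q0_hat = 1 - a2 / a1               # negative => mock data favour acceleration

# The improved expansion variable keeps the series argument inside [0, 1)
y = z / (1 + z)
```

Refitting the same truncated expansion in y rather than z changes the higher-order coefficients, which is one way to see why the fitted q0 is sensitive to the choice of redshift variable.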
Intersection Bounds: Estimation and Inference
We develop a practical and novel method for inference on intersection bounds, namely bounds defined by either the infimum or the supremum of a parametric or nonparametric function, or, equivalently, by the value of a linear programming problem with a potentially infinite constraint set. We show that many bound characterizations in econometrics, for instance bounds on parameters under conditional moment inequalities, can be formulated as intersection bounds. Our approach is especially convenient for models composed of a continuum of inequalities that are separable in parameters, and it also applies to models with inequalities that are non-separable in parameters. Since analog estimators of intersection bounds can be severely biased in finite samples, routinely underestimating the size of the identified set, we also offer a median-bias-corrected estimator of such bounds as a by-product of our inferential procedures. We develop theory for large-sample inference based on the strong approximation of a sequence of series or kernel-based empirical processes by a sequence of "penultimate" Gaussian processes. These penultimate processes are generally not weakly convergent, and are thus non-Donsker. Our theoretical results establish that we can nonetheless perform asymptotically valid inference based on these processes. Our construction also provides new adaptive inequality/moment selection methods. We provide conditions for the use of nonparametric kernel and series estimators, including a novel result that establishes strong approximation for any general series estimator admitting linearization, which may be of independent interest.
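The finite-sample bias of the analog estimator that motivates the paper's median-bias correction is easy to see in a toy Monte Carlo: when an intersection bound is the infimum over several moments, plugging in sample means and taking their minimum is systematically pulled below the true bound, because the minimum picks up the most negative sampling error. The setup below (ten moments, all equal to zero) is hypothetical and does not implement the paper's correction.

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.zeros(10)                  # ten moment functions, all equal to the bound 0
n, reps = 100, 2000

# Sampling distribution of each sample mean: N(mu_j, 1/n)
means = rng.normal(mu, 1 / np.sqrt(n), size=(reps, mu.size))
est = means.min(axis=1)            # plug-in ("analog") estimator of inf_j mu_j

bias = est.mean() - mu.min()       # negative: the analog estimator is biased downward
```

The bias grows with the number of moments near the binding value, which is why the paper couples its inference procedure with adaptive inequality/moment selection and a median-bias-corrected bound estimator.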