
    Multiple Outlier Detection: Hypothesis Tests versus Model Selection by Information Criteria

    The detection of multiple outliers can be interpreted as a model selection problem. The models that can be selected are the null model, which indicates an outlier-free set of observations, and a class of alternative models, which contain a set of additional bias parameters. A common way to select the right model is a statistical hypothesis test; in geodesy, data snooping is the most popular choice. Another approach arises from information theory: the Akaike information criterion (AIC) is used to select an appropriate model for a given set of observations. The AIC is based on the Kullback-Leibler divergence, which describes the discrepancy between the model candidates. Both approaches are discussed and applied to two test problems: the fitting of a straight line and a geodetic network. Some relationships between data snooping and information criteria are discussed. When the two are compared, the information criteria approach turns out to be simpler and more elegant. However, along with the AIC there are many alternative information criteria that select different outliers, and it is not clear which one is optimal.
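
    To make the information-criterion route concrete, the following sketch (hypothetical code, not taken from the paper) fits a straight line and compares the AIC of the outlier-free null model against alternative models that each add one bias parameter for a suspected observation; the candidate with the smallest AIC is selected. The simulated data, the injected outlier at index 12, and the helper name aic_ls are all assumptions made for the example.

```python
# A minimal sketch: single-outlier selection in a straight-line fit by AIC.
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=n)
y[12] += 1.0                      # inject one outlier

def aic_ls(X, y):
    """AIC for a Gaussian linear model with the noise variance concentrated out."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1            # regression coefficients + noise variance
    return len(y) * np.log(rss / len(y)) + 2 * k

X0 = np.column_stack([np.ones(n), x])
scores = {"null": aic_ls(X0, y)}
for i in range(n):                # one alternative model per candidate outlier
    d = np.zeros(n); d[i] = 1.0   # bias (dummy) parameter for observation i
    scores[f"outlier at {i}"] = aic_ls(np.column_stack([X0, d]), y)

best = min(scores, key=scores.get)
print(best, scores[best])         # expected to flag observation 12
```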

    Randomization tests for experiments embedded in complex surveys

    Embedding experiments in complex surveys has become increasingly important. For scientific questions, such embedding allows researchers to take advantage of both the internal validity of controlled experiments and the external validity of probability-based samples of a population. Within survey statistics, declining response rates have led to the development of new methods, known as adaptive and responsive survey designs, that try to increase or maintain response rates without negatively affecting survey quality. Such methodologies are assessed experimentally. Examples include a series of embedded experiments in the 2019 Triennial Community Health Survey (TCHS), conducted by the Health District of Northern Larimer County in collaboration with the Department of Statistics at Colorado State University, to determine the effects of monetary incentives, targeted mailing of reminders, and double-stuffed envelopes (including both English and Spanish versions of the survey) on response rates, cost, and representativeness of the sample. This dissertation develops the methodology and theory of randomization-based tests embedded in complex surveys, assesses the methodology via simulation, and applies the methods to data from the 2019 TCHS.

    An important consideration in experiments to increase response rates is the overall balance of the sample, because higher overall response might still underrepresent important groups. There have been advances in recent years on methods to assess the representativeness of samples, including application of the dissimilarity index (DI) to help evaluate the representativeness of a sample under the different conditions of an incentive experiment (Biemer et al. [2018]). We develop theory and methodology for design-based inference for the DI when it is used in a complex survey. Simulation studies show that the linearization method has good properties, with good confidence interval coverage even when the true DI is close to zero, even though point estimates may be biased.

    We then develop a class of randomization tests for evaluating experiments embedded in complex surveys. We consider a general parametric contrast, estimated using the design-weighted Narain-Horvitz-Thompson (NHT) approach, in either a completely randomized design or a randomized complete block design embedded in a complex survey. We derive asymptotic normal approximations for the randomization distribution of a general contrast, from which critical values can be obtained for testing the null hypothesis that the contrast is zero. The asymptotic results are conditioned on the complex sample, but we include results showing that, under mild conditions, the inference extends to the finite population. Further, we develop asymptotic power properties of the tests under moderate conditions. Through simulation, we illustrate asymptotic properties of the randomization tests and compare the normal approximations of the randomization tests with corresponding Monte Carlo tests, with a design-based test developed by van den Brakel, and with randomization tests developed by Fisher-Pitman-Welch and Neyman. The randomization approach generalizes broadly to other kinds of embedded experimental designs and null hypothesis testing problems, for very general survey designs. It is then extended from NHT estimators to generalized regression estimators that incorporate auxiliary information, and from linear contrasts to comparisons of nonlinear functions.
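
    As a rough illustration of the core idea (not the dissertation's implementation), the sketch below runs a Monte Carlo randomization test for a design-weighted treatment contrast in a completely randomized design embedded in a fixed complex sample. The Hajek-style weighted means, the simulated weights and outcomes, and all function names are assumptions made for this example.

```python
# A minimal sketch: Monte Carlo randomization test of a weighted treatment
# contrast, holding the realized complex sample (and its weights) fixed.
import numpy as np

def weighted_contrast(y, w, t):
    """Difference of Hajek-style weighted means, treatment minus control."""
    yt = np.sum(w[t == 1] * y[t == 1]) / np.sum(w[t == 1])
    yc = np.sum(w[t == 0] * y[t == 0]) / np.sum(w[t == 0])
    return yt - yc

def randomization_test(y, w, t, n_rand=5000, seed=0):
    rng = np.random.default_rng(seed)
    observed = weighted_contrast(y, w, t)
    null_draws = np.empty(n_rand)
    for b in range(n_rand):
        # Re-randomize treatment labels only; the sample stays as drawn.
        null_draws[b] = weighted_contrast(y, w, rng.permutation(t))
    # Two-sided p-value for H0: the contrast is zero.
    return observed, np.mean(np.abs(null_draws) >= abs(observed))

# Illustration with simulated data: weights w, outcome y, assignment t.
rng = np.random.default_rng(1)
n = 200
w = rng.uniform(1.0, 10.0, n)             # hypothetical design weights
t = rng.permutation(np.repeat([0, 1], n // 2))
y = 1.0 + 0.4 * t + rng.normal(size=n)    # modest treatment effect
print(randomization_test(y, w, t))
```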

    ardl: Estimating autoregressive distributed lag and equilibrium correction models

    We present a command, ardl, for the estimation of autoregressive distributed lag (ARDL) models in a time-series context. The ardl command can be used to fit an ARDL model with the optimal number of autoregressive and distributed lags selected by the Akaike or Bayesian (Schwarz) information criterion. The regression results can be displayed in the ARDL levels form or in the error-correction representation of the model. The latter separates long-run and short-run effects and is available in two different parameterizations of the long-run (cointegrating) relationship. The popular bounds-testing procedure for the existence of a long-run levels relationship is implemented as a postestimation feature. Comprehensive critical values and approximate p-values obtained from response-surface regressions facilitate statistical inference.
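
    The workflow the command automates can be mimicked outside Stata. The sketch below (an assumption-laden Python analogue, not the ardl command itself) uses statsmodels' ARDL tools on simulated data to select lag orders by AIC and fit the chosen model; the data-generating process and variable names are illustrative, and the error-correction and bounds-testing steps are only noted in a comment.

```python
# A rough Python analogue of the ardl workflow: choose ARDL lag orders by an
# information criterion, then fit the selected model.
import numpy as np
import pandas as pd
from statsmodels.tsa.ardl import ardl_select_order

rng = np.random.default_rng(0)
T = 200
x = np.cumsum(rng.normal(size=T))                      # I(1) regressor
y = np.empty(T)
y[0] = 0.0
for t in range(1, T):                                  # y adjusts toward 0.5 * x
    y[t] = 0.7 * y[t - 1] + 0.15 * x[t - 1] + rng.normal(scale=0.3)
data = pd.DataFrame({"y": y, "x": x})

# Search over autoregressive and distributed-lag orders, minimizing AIC
# (ic="bic" would mimic the Schwarz criterion option).
sel = ardl_select_order(data["y"], maxlag=4, exog=data[["x"]], maxorder=4, ic="aic")
res = sel.model.fit()
print(res.summary())
# The error-correction representation and the bounds test for a long-run levels
# relationship are available via statsmodels' UECM class, analogous to the
# command's postestimation bounds-testing feature.
```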

    Parameter Estimation of Complex Systems from Sparse and Noisy Data

    Mathematical modeling is a key component of various disciplines in science and engineering. A mathematical model that represents the important behavior of a real system can be used as a substitute for the real process in many analysis and synthesis tasks. The performance of model-based techniques, e.g., system analysis, computer simulation, controller design, sensor development, state filtering, product monitoring, and process optimization, is highly dependent on the quality of the model used. It is therefore very important to be able to develop an accurate model from the available experimental data. Parameter estimation is usually formulated as an optimization problem in which the parameter estimate is computed by minimizing the discrepancy between the model prediction and the experimental data. If a simple model and a large amount of data are available, then the estimation problem is frequently well-posed and a small error in data fitting automatically results in an accurate model. However, this is not always the case. If the model is complex and only sparse and noisy data are available, then the estimation problem is often ill-conditioned and good data fitting does not ensure accurate model predictions. Many challenges that can often be neglected for estimation involving simple models need to be carefully considered for estimation problems involving complex models. To obtain a reliable and accurate estimate from sparse and noisy data, a set of techniques is developed by addressing the challenges encountered in the estimation of complex models, including (1) model analysis and simplification, which identifies the important sources of uncertainty and reduces model complexity; (2) experimental design for collecting information-rich data by setting optimal experimental conditions; (3) regularization of the estimation problem, which solves the ill-conditioned large-scale optimization problem by reducing the number of parameters; (4) nonlinear estimation and filtering, which fits the data by various estimation and filtering algorithms; and (5) model verification by applying statistical hypothesis tests to the prediction error. The developed methods are applied to different types of models, ranging from models found in the process industries to biochemical networks, some of which are described by ordinary differential equations with dozens of state variables and more than a hundred parameters.
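
    A small sketch of item (3), regularization, under illustrative assumptions: parameters of a simple two-state ODE are estimated from a handful of noisy observations of one state by penalized least squares that shrinks the estimate toward a prior guess (Tikhonov-style). The reaction model, observation times, prior values, and penalty weight below are invented for the example, not taken from the thesis.

```python
# A minimal sketch: regularized ODE parameter estimation from sparse, noisy data.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def rhs(t, y, k1, k2):
    # Simple two-state reaction chain: A -> B -> (out)
    a, b = y
    return [-k1 * a, k1 * a - k2 * b]

def simulate(params, t_obs):
    sol = solve_ivp(rhs, (0.0, t_obs[-1]), [1.0, 0.0], args=tuple(params),
                    t_eval=t_obs, rtol=1e-8, atol=1e-10)
    return sol.y[1]                       # only state B is measured (sparse data)

true = np.array([0.8, 0.3])
t_obs = np.array([0.5, 2.0, 5.0, 9.0])    # sparse observation times
rng = np.random.default_rng(0)
y_obs = simulate(true, t_obs) + rng.normal(scale=0.02, size=t_obs.size)

prior = np.array([1.0, 0.5])              # rough prior guess for (k1, k2)
lam = 0.1                                 # regularization weight

def residuals(params):
    fit = simulate(params, t_obs) - y_obs
    penalty = np.sqrt(lam) * (params - prior)   # Tikhonov shrinkage term
    return np.concatenate([fit, penalty])

est = least_squares(residuals, prior, bounds=(0.0, np.inf))
print("estimate:", est.x, "true:", true)
```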

    Cosmographic Hubble fits to the supernova data

    The Hubble relation between distance and redshift is a purely cosmographic relation that depends only on the symmetries of a FLRW spacetime and does not intrinsically make any dynamical assumptions. This suggests that it should be possible to estimate the parameters defining the Hubble relation without making any dynamical assumptions. To test this idea, we perform a number of inter-related cosmographic fits to the legacy05 and gold06 supernova datasets. Based on these supernova data, the "preponderance of evidence" certainly suggests an accelerating universe. However, we would argue that (unless one uses additional dynamical and observational information) this conclusion is not currently supported "beyond reasonable doubt". As part of the analysis we develop two particularly transparent graphical representations of the redshift-distance relation -- representations in which acceleration versus deceleration reduces to the question of whether the relevant graph slopes up or down. Turning to the details of the cosmographic fits, three issues in particular concern us. First, the fitted value of the deceleration parameter changes significantly depending on whether one performs a chi^2 fit to the luminosity distance, the proper motion distance, or another suitable distance surrogate. Second, the fitted value of the deceleration parameter changes significantly depending on whether one uses the traditional redshift variable z or what we shall argue is, on theoretical grounds, an improved parameterization y = z/(1+z). Third, the published estimates of the systematic uncertainties are sufficiently large that they certainly impact on, and to a large extent undermine, the usual purely statistical tests of significance. We conclude that the supernova data should be treated with some caution.
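
    The kind of chi^2 cosmographic fit described above can be sketched as follows, using synthetic data rather than the legacy05/gold06 sets and the standard low-order expansion d_L(z) = (c/H0)[z + (1 - q0) z^2/2 + ...]; the error model and parameter values are assumptions for illustration only.

```python
# A sketch of a cosmographic chi^2 fit: fit a low-order distance-redshift
# polynomial and read off the deceleration parameter q0.
import numpy as np
from scipy.optimize import curve_fit

C_KM_S = 299792.458                      # speed of light [km/s]

def d_lum(z, H0, q0):
    return (C_KM_S / H0) * (z + 0.5 * (1.0 - q0) * z**2)

rng = np.random.default_rng(0)
z = np.sort(rng.uniform(0.01, 0.8, 100))
sigma = 30.0 + 100.0 * z                 # crude, distance-dependent errors [Mpc]
d_obs = d_lum(z, 70.0, -0.55) + rng.normal(scale=sigma)

# Weighted least squares == chi^2 minimization when sigma are 1-sigma errors.
popt, pcov = curve_fit(d_lum, z, d_obs, p0=[70.0, 0.0],
                       sigma=sigma, absolute_sigma=True)
print("H0 = %.1f km/s/Mpc, q0 = %.2f +/- %.2f"
      % (popt[0], popt[1], np.sqrt(pcov[1, 1])))

# The same machinery applies after the change of variable y = z / (1 + z),
# which the paper argues is a better-behaved expansion parameter at high z.
```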

    Intersection Bounds: Estimation and Inference

    We develop a practical and novel method for inference on intersection bounds, namely bounds defined by either the infimum or the supremum of a parametric or nonparametric function or, equivalently, by the value of a linear programming problem with a potentially infinite constraint set. We show that many bound characterizations in econometrics, for instance bounds on parameters under conditional moment inequalities, can be formulated as intersection bounds. Our approach is especially convenient for models comprised of a continuum of inequalities that are separable in parameters, and it also applies to models with inequalities that are non-separable in parameters. Since analog estimators of intersection bounds can be severely biased in finite samples, routinely underestimating the size of the identified set, we also offer a median-bias-corrected estimator of such bounds as a by-product of our inferential procedures. We develop theory for large-sample inference based on the strong approximation of a sequence of series or kernel-based empirical processes by a sequence of "penultimate" Gaussian processes. These penultimate processes are generally not weakly convergent, and thus non-Donsker. Our theoretical results establish that we can nonetheless perform asymptotically valid inference based on these processes. Our construction also provides new adaptive inequality/moment selection methods. We provide conditions for the use of nonparametric kernel and series estimators, including a novel result that establishes strong approximation for any general series estimator admitting linearization, which may be of independent interest.
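
    A much-simplified, finite-dimensional sketch of the flavor of the procedure (not the paper's estimator, which handles a continuum of inequalities via strong approximation): the upper intersection bound min_j E[Y_j] is estimated with a precision correction whose critical value is simulated from a multivariate normal approximation to the studentized moments. All names, levels, and data below are illustrative assumptions.

```python
# A toy illustration: precision-corrected estimate of min_j E[Y_j] over a
# handful of bounding moments, with a simulated critical value.
import numpy as np

def intersection_upper_bound(Y, level=0.5, n_sim=10_000, seed=0):
    """Y: (n, J) array; column j holds draws of Y_j. Returns the minimum of the
    precision-corrected means; level=0.5 targets median-bias correction."""
    rng = np.random.default_rng(seed)
    n, J = Y.shape
    means = Y.mean(axis=0)
    se = Y.std(axis=0, ddof=1) / np.sqrt(n)
    corr = np.corrcoef(Y, rowvar=False)
    # Critical value: the 'level' quantile of max_j Z_j for Z ~ N(0, corr).
    Z = rng.multivariate_normal(np.zeros(J), corr, size=n_sim)
    k = np.quantile(Z.max(axis=1), level)
    # The naive plug-in min of means is biased downward; + k * se corrects it.
    return np.min(means + k * se)

# Illustration: three bounding moments whose true minimum mean is 1.0.
rng = np.random.default_rng(1)
n = 500
Y = np.column_stack([1.0 + rng.normal(size=n),
                     1.0 + rng.normal(size=n),
                     1.5 + rng.normal(size=n)])
print("naive min of means:", Y.mean(axis=0).min())
print("bias-corrected bound:", intersection_upper_bound(Y))
```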