Estimating the distance distribution of subpopulations and testing observation outlyingness for a large-scale complex survey

Abstract

Many finite populations targeted by sample surveys comprise a relatively small number of homogeneous subpopulations. In ongoing survey operations, it is often of interest to assess whether a new observation belongs to one of those subpopulations or should be flagged as not belonging to any of them. Because the homogeneity of the subpopulations depends on potentially large numbers of survey variables interacting in complex ways, we define a distance measure in the space induced by the survey variables, and consider the distribution of these distances within the subpopulation as a summary of the distributional characteristics of the subpopulation. We also define a measure of the outlyingness of each individual point as the fraction of points with a less extreme distance in the subpopulation. In this thesis, we propose a sample-based estimator for the subpopulation distance distribution functions and the measure of outlyingness. We allow for a general distance measure, and consider both multivariate means and medians as centers. We describe the design-based asymptotic properties of the estimator under weak assumptions on the finite population. We investigate several approaches for design-based variance estimation, including a combination of kernel regression and replication variance estimation. The practical properties of the procedures are evaluated using the National Resources Inventory, a longitudinal complex survey.

The theoretical derivations for the sample distance distribution can be generalized to a broad class of survey estimators involving nondifferentiable functions defined at the population level. We extend the theoretical results to two classes of nondifferentiable survey estimators: statistics that are nondifferentiable functions of estimated parameters, and estimators implicitly defined through estimating equations. In both cases, a direct Taylor linearization is not applicable, but we can assume that the limiting function is differentiable and carry out an asymptotic expansion of the limit. We offer a design-based perspective on both cases in this thesis and justify the design assumptions under a probabilistic mechanism. We also propose a variance estimator using kernel smoothing and show its consistency.
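
As a rough illustration of the distance distribution and the outlyingness measure described above, the following Python sketch computes weighted sample versions of both. It is only a sketch under stated assumptions, not the estimator developed in the thesis: it assumes Euclidean distance, a design-weighted mean as the center, and a weight-normalized (Hajek-type) ratio form. The function names and the weight vector `w` are hypothetical.

```python
import numpy as np

def weighted_mean_center(X, w):
    """Design-weighted mean of the sample rows of X (assumed choice of center)."""
    return (w[:, None] * X).sum(axis=0) / w.sum()

def distance_cdf(d, w, t):
    """Weighted fraction of sampled units whose distance to the center is <= t."""
    return w[d <= t].sum() / w.sum()

def outlyingness(x_new, X, w):
    """Weighted fraction of sampled units with a less extreme distance than x_new."""
    center = weighted_mean_center(X, w)
    d = np.linalg.norm(X - center, axis=1)   # Euclidean distance (an assumption)
    d_new = np.linalg.norm(x_new - center)
    return w[d < d_new].sum() / w.sum()

# Toy subpopulation sample: three units with unequal design weights.
X = np.array([[0.0], [1.0], [4.0]])
w = np.array([1.0, 1.0, 2.0])
score = outlyingness(np.array([4.25]), X, w)
```

A value of `score` close to 1 means that nearly all of the estimated subpopulation lies closer to the center than the new observation, which is the situation in which the observation would be flagged. A multivariate median or another distance measure could be substituted for the assumed choices without changing the overall structure.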
