
    Weak consistency of the 1-nearest neighbor measure with applications to missing data

    When data is partially missing at random, imputation and importance weighting are often used to estimate moments of the unobserved population. In this paper, we study 1-nearest neighbor (1NN) importance weighting, which estimates moments by replacing missing data with the complete data that is the nearest neighbor in the non-missing covariate space. We define an empirical measure, the 1NN measure, and show that it is weakly consistent for the measure of the missing data. The main idea behind this result is that the 1NN measure is performing inverse probability weighting in the limit. We study applications to missing data and mitigating the impact of covariate shift in prediction tasks.
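    The 1NN moment estimate described above can be sketched in a few lines. This is a minimal illustration under my own naming (`one_nn_impute` is not from the paper), using plain Euclidean distance and a toy smooth outcome:

```python
import numpy as np

def one_nn_impute(x_obs, y_obs, x_miss):
    """For each unit with a missing outcome, copy the outcome of its
    1-nearest neighbor (Euclidean) among the fully observed units."""
    imputed = np.empty(len(x_miss))
    for i, x in enumerate(x_miss):
        j = int(np.argmin(np.linalg.norm(x_obs - x, axis=1)))
        imputed[i] = y_obs[j]
    return imputed

# Toy example: estimate the population mean outcome when some outcomes
# are missing at random given the covariates.
rng = np.random.default_rng(0)
x_obs = rng.uniform(size=(200, 2))
y_obs = x_obs.sum(axis=1)            # outcome varies smoothly in the covariates
x_miss = rng.uniform(size=(50, 2))   # covariates of units with missing outcomes
y_hat = one_nn_impute(x_obs, y_obs, x_miss)
print(np.concatenate([y_obs, y_hat]).mean())  # estimated first moment
```

    The weak-consistency result in the abstract says that, in the limit, this plug-in moment behaves like inverse probability weighting.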

    The random link approximation for the Euclidean traveling salesman problem

    The traveling salesman problem (TSP) consists of finding the length of the shortest closed tour visiting N ``cities''. We consider the Euclidean TSP where the cities are distributed randomly and independently in a d-dimensional unit hypercube. Working with periodic boundary conditions and inspired by a remarkable universality in the kth nearest neighbor distribution, we find for the average optimum tour length <L> = beta_E(d) N^{1-1/d} [1 + O(1/N)], with beta_E(2) = 0.7120 +- 0.0002 and beta_E(3) = 0.6979 +- 0.0002. We then derive analytical predictions for these quantities using the random link approximation, where the lengths between cities are taken as independent random variables. From the ``cavity'' equations developed by Krauth, Mezard and Parisi, we calculate the associated random link values beta_RL(d). For d = 1, 2, 3, numerical results show that the random link approximation is a good one, with a discrepancy of less than 2.1% between beta_E(d) and beta_RL(d). For large d, we argue that the approximation is exact up to O(1/d^2) and give a conjecture for beta_E(d), in terms of a power series in 1/d, specifying both leading and subleading coefficients.
    Comment: 29 pages, 6 figures; formatting and typos corrected.
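    The N^{1-1/d} scaling can be checked numerically without solving the TSP exactly. The sketch below uses a greedy nearest-neighbor tour, an upper bound on the optimum, so its prefactor exceeds beta_E; it also ignores the paper's periodic boundary conditions. It only illustrates the scaling, not the constants:

```python
import numpy as np

def nn_tour_length(points):
    """Length of the greedy nearest-neighbor closed tour: from the current
    city, always move to the nearest unvisited city, then return to start."""
    n = len(points)
    visited = np.zeros(n, dtype=bool)
    cur, total = 0, 0.0
    visited[0] = True
    for _ in range(n - 1):
        d = np.linalg.norm(points - points[cur], axis=1)
        d[visited] = np.inf              # never revisit a city
        nxt = int(np.argmin(d))
        total += d[nxt]
        visited[nxt] = True
        cur = nxt
    return total + np.linalg.norm(points[cur] - points[0])  # close the tour

rng = np.random.default_rng(1)
for n in (100, 400, 1600):
    L = np.mean([nn_tour_length(rng.uniform(size=(n, 2))) for _ in range(5)])
    print(n, L / n**0.5)   # d = 2: the ratio L / sqrt(N) should stabilize
```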

    Integration of survey data and big observational data for finite population inference using mass imputation

    Multiple data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we consider an imputation approach to combining a probability sample with big observational data. Unlike the usual imputation for missing data analysis, we create imputed values for all elements in the probability sample. Such mass imputation is attractive in the context of survey data integration (Kim and Rao, 2012). We extend mass imputation as a tool for data integration of survey data and big non-survey data. The mass imputation methods and their statistical properties are presented. The matching estimator of Rivers (2007) is also covered as a special case. Variance estimation with mass-imputed data is discussed. The simulation results demonstrate that the proposed estimators outperform existing competitors in terms of robustness and efficiency.
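    A minimal sketch of the mass-imputation idea with a linear working model (my simplification; the paper covers more general imputation models and the Rivers matching estimator): fit an outcome model on the big observational data, impute the outcome for every unit of the probability sample, and take the design-weighted mean.

```python
import numpy as np

def mass_impute_mean(x_big, y_big, x_survey, w_survey):
    """Fit a linear working model on the big observational data, impute
    y for every unit in the probability sample, and return the
    design-weighted mean (a simple mass-imputation estimator)."""
    X = np.column_stack([np.ones(len(x_big)), x_big])
    beta, *_ = np.linalg.lstsq(X, y_big, rcond=None)
    Xs = np.column_stack([np.ones(len(x_survey)), x_survey])
    y_hat = Xs @ beta
    return np.sum(w_survey * y_hat) / np.sum(w_survey)

# Hypothetical illustration: big non-survey data with outcomes, plus a
# small probability sample carrying only covariates and design weights.
rng = np.random.default_rng(2)
x_big = rng.normal(size=500)
y_big = 1.0 + 2.0 * x_big + rng.normal(scale=0.1, size=500)
x_survey = rng.normal(size=50)
w_survey = rng.uniform(1, 3, size=50)   # design weights
print(mass_impute_mean(x_big, y_big, x_survey, w_survey))
```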

    A video method for quantifying size distribution, density, and three-dimensional spatial structure of reef fish spawning aggregations

    There is a clear need to develop fisheries-independent methods to quantify individual sizes, density, and three-dimensional characteristics of reef fish spawning aggregations for use in population assessments and to provide critical baseline data on the reproductive life history of exploited populations. We designed, constructed, calibrated, and applied an underwater stereo-video system to estimate individual sizes and three-dimensional (3D) positions of Nassau grouper (Epinephelus striatus) at a spawning aggregation site located on a reef promontory on the western edge of Little Cayman Island, Cayman Islands, BWI, on 23 January 2003. The system consists of two free-running camcorders mounted on a meter-long bar and supported by a SCUBA diver. Paired video “stills” were captured, and the nose and tail of individual fish observed in the field of view of both cameras were digitized using image analysis software. Conversion of these two-dimensional screen coordinates to 3D coordinates was achieved through a matrix inversion algorithm and calibration data. Our estimate of mean total length (58.5 cm, n = 29) was in close agreement with estimated lengths from a hydroacoustic survey and from direct measures of fish size using visual census techniques. We discovered a possible bias in length measures using the video method, most likely arising from some fish orientations that were not perpendicular with respect to the optical axis of the camera system. We observed 40 individuals occupying a volume of 33.3 m3, resulting in a concentration of 1.2 individuals m–3 with a mean (SD) nearest neighbor distance of 70.0 (29.7) cm. We promote the use of roving diver stereo-videography as a method to assess the size distribution, density, and 3D spatial structure of fish spawning aggregations.
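    The density and nearest-neighbor summaries reported above are straightforward to compute once 3D positions are recovered. A small sketch with hypothetical coordinates (the stereo calibration and matrix-inversion step itself is not reproduced here):

```python
import numpy as np

def mean_nn_distance(positions):
    """Mean distance from each individual to its nearest neighbor in 3D."""
    diffs = positions[:, None, :] - positions[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(d, np.inf)     # exclude each point's zero self-distance
    return d.min(axis=1).mean()

# Hypothetical data in the style of the paper's summary: 40 fish in a
# 3 m x 3 m x 3 m (27 m^3) box.
positions = np.random.default_rng(3).uniform(0, 3, size=(40, 3))
density = len(positions) / 27.0              # individuals per m^3
print(density, mean_nn_distance(positions))  # density and mean NN distance (m)
```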

    Stabilized Nearest Neighbor Classifier and Its Statistical Properties

    The stability of statistical analysis is an important indicator of reproducibility, which is one main principle of the scientific method. It entails that similar statistical conclusions can be reached based on independent samples from the same underlying population. In this paper, we introduce a general measure of classification instability (CIS) to quantify the sampling variability of the prediction made by a classification method. Interestingly, the asymptotic CIS of any weighted nearest neighbor classifier turns out to be proportional to the Euclidean norm of its weight vector. Based on this concise form, we propose a stabilized nearest neighbor (SNN) classifier, which distinguishes itself from other nearest neighbor classifiers by taking stability into consideration. In theory, we prove that SNN attains the minimax optimal convergence rate in risk, and a sharp convergence rate in CIS. The latter rate result is established for general plug-in classifiers under a low-noise condition. Extensive simulated and real examples demonstrate that SNN achieves a considerable improvement in CIS over existing nearest neighbor classifiers, with comparable classification accuracy. We implement the algorithm in a publicly available R package snn.
    Comment: 48 pages, 11 figures. To appear in JASA--T&M.
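    The key quantitative fact in the abstract, that asymptotic CIS is proportional to the Euclidean norm of the weight vector, is easy to illustrate. The sketch below implements a binary weighted nearest neighbor vote (my own minimal version, not the snn package) and compares weight-vector norms: among weight vectors summing to one, the uniform choice has the smallest norm and hence, asymptotically, the smallest instability for a fixed k.

```python
import numpy as np

def weighted_knn_predict(x_train, y_train, x_query, weights):
    """Weighted k-NN vote: weights[i] multiplies the label of the i-th
    nearest neighbor (labels in {0, 1}); predict 1 if the weighted vote
    exceeds half the total weight."""
    k = len(weights)
    order = np.argsort(np.linalg.norm(x_train - x_query, axis=1))[:k]
    vote = np.dot(weights, y_train[order])
    return int(vote > 0.5 * weights.sum())

# Asymptotic CIS is proportional to ||w||_2, so for weights summing to 1
# the uniform vector is the most stable choice of a given length k:
w_uniform = np.full(5, 1 / 5)
w_skewed = np.array([0.6, 0.2, 0.1, 0.05, 0.05])
print(np.linalg.norm(w_uniform), np.linalg.norm(w_skewed))  # uniform is smaller
```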

    Direct Ensemble Estimation of Density Functionals

    Estimating density functionals of analog sources is an important problem in statistical signal processing and information theory. Traditionally, estimating these quantities requires either making parametric assumptions about the underlying distributions or using non-parametric density estimation followed by integration. In this paper we introduce a direct nonparametric approach which bypasses the need for density estimation by using the error rates of k-NN classifiers as data-driven basis functions that can be combined to estimate a range of density functionals. However, this method is subject to a non-trivial bias that dramatically slows the rate of convergence in higher dimensions. To overcome this limitation, we develop an ensemble method for estimating the value of the basis function which, under some minor constraints on the smoothness of the underlying distributions, achieves the parametric rate of convergence regardless of data dimension.
    Comment: 5 pages.
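    The "data-driven basis function" idea can be illustrated in its simplest case: the leave-one-out 1-NN classification error between two samples (a sketch under my own naming; the paper's ensemble bias correction is not reproduced).

```python
import numpy as np

def loo_1nn_error(x, labels):
    """Leave-one-out 1-NN classification error between two samples.
    For equal distributions the error tends to 1/2; it shrinks as the
    distributions separate, which is what lets it serve as a basis
    function for divergence-like density functionals."""
    n = len(x)
    errs = 0
    for i in range(n):
        d = np.linalg.norm(x - x[i], axis=1)
        d[i] = np.inf                       # leave the i-th point out
        errs += labels[int(np.argmin(d))] != labels[i]
    return errs / n

# Hypothetical example: two Gaussian samples with different means.
rng = np.random.default_rng(4)
p = rng.normal(0, 1, size=(100, 2))
q = rng.normal(2, 1, size=(100, 2))
x = np.vstack([p, q])
labels = np.r_[np.zeros(100), np.ones(100)]
print(loo_1nn_error(x, labels))   # well below 1/2: the distributions differ
```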