
    Statistical thinking: From Tukey to Vardi and beyond

    Data miners (minors?) and neural networkers tend to eschew modelling, misled perhaps by misinterpretation of strongly expressed views of John Tukey. I discuss Vardi's views of these issues as well as other aspects of Vardi's work in emission tomography and in sampling bias. Comment: Published at http://dx.doi.org/10.1214/074921707000000210 in the IMS Lecture Notes Monograph Series (http://www.imstat.org/publications/lecnotes.htm) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Permutation graphs, fast forward permutations, and sampling the cycle structure of a permutation

    A permutation P on {1,...,N} is a fast forward permutation if for each m the computational complexity of evaluating P^m(x) is small independently of m and x. Naor and Reingold constructed fast forward pseudorandom cycluses and involutions. By studying the evolution of permutation graphs, we prove that the number of queries needed to distinguish a random cyclus from a random permutation on {1,...,N} is Theta(N) if one does not use queries of the form P^m(x), but is only Theta(1) if one is allowed to make such queries. We construct fast forward permutations which are indistinguishable from random permutations even when queries of the form P^m(x) are allowed. This is done by introducing an efficient method to sample the cycle structure of a random permutation, which in turn solves an open problem of Naor and Reingold. Comment: Corrected a small error
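The cycle-structure sampling step can be illustrated with a classical fact: in a uniform random permutation of {1,...,N}, the cycle containing a fixed element has length uniform on {1,...,N}, and removing that cycle leaves a uniform random permutation of the remaining elements. A minimal sketch of sampling cycle lengths this way (an illustration of the general idea, not the paper's exact construction):

```python
import random

def sample_cycle_lengths(n, rng=random):
    """Sample the multiset of cycle lengths of a uniform random
    permutation of {1, ..., n} without constructing the permutation.

    Relies on: the cycle containing a fixed element has length
    uniform on {1, ..., remaining}; the rest is again uniform.
    """
    lengths = []
    remaining = n
    while remaining > 0:
        # Length of the cycle containing an arbitrary fixed element.
        c = rng.randint(1, remaining)
        lengths.append(c)
        remaining -= c
    return lengths
```

The lengths always sum to n, and the expected number of cycles produced is the harmonic number H_n ≈ ln n, consistent with the known cycle-count distribution of a uniform permutation.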

    Distinct counting with a self-learning bitmap

    Counting the number of distinct elements (cardinality) in a dataset is a fundamental problem in database management. In recent years, due to many of its modern applications, there has been significant interest in the distinct counting problem in a data stream setting, where each incoming data item can be seen only once and cannot be stored for long periods of time. Many probabilistic approaches based on either sampling or sketching have been proposed in the computer science literature that require only limited computing and memory resources. However, the performance of these methods is not scale-invariant, in the sense that their relative root mean square estimation errors (RRMSE) depend on the unknown cardinalities. This is undesirable in many applications where cardinalities can be very dynamic or inhomogeneous and many cardinalities need to be estimated. In this paper, we develop a novel approach, called the self-learning bitmap (S-bitmap), that is scale-invariant for cardinalities in a specified range. S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process for inferring the unknown cardinality, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but scale-invariant. We demonstrate that to achieve a small RRMSE value of ε or less, our approach requires significantly less memory and a similar or smaller number of operations than state-of-the-art methods for many common practical cardinality scales. Both simulation and experimental studies are reported. Comment: Journal of the American Statistical Association (accepted)
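The adaptive-sampling idea behind such a bitmap can be sketched as follows. This is a toy illustration only: the rate schedule (plain geometric decay) and the Horvitz-Thompson-style estimator below are assumptions for exposition, not the S-bitmap paper's carefully tuned update rule or estimator. Hash-derived sampling decisions make duplicates leave the state unchanged:

```python
import hashlib

class ToyAdaptiveBitmap:
    """Toy adaptive-sampling bitmap for distinct counting.

    Illustrative only: the real S-bitmap chooses its sampling-rate
    schedule and estimator to achieve scale invariance; here the rate
    simply decays geometrically at each fill event.
    """

    def __init__(self, m=1024, decay=0.99):
        self.m = m
        self.bits = [0] * m
        self.rate = 1.0          # current sampling rate
        self.decay = decay
        self.estimate = 0.0

    def _hash(self, item):
        h = hashlib.sha256(str(item).encode()).digest()
        bucket = int.from_bytes(h[:4], "big") % self.m
        u = int.from_bytes(h[4:8], "big") / 2**32  # deterministic per item
        return bucket, u

    def add(self, item):
        bucket, u = self._hash(item)
        if self.bits[bucket]:
            return            # duplicates and colliding items change nothing
        if u < self.rate:     # u is fixed per item, rate only decreases,
            filled = sum(self.bits)   # so duplicates skip consistently
            # A new distinct item fills a fresh bit with probability
            # rate * (empty fraction); credit its inverse (HT-style).
            self.estimate += 1.0 / (self.rate * (self.m - filled) / self.m)
            self.bits[bucket] = 1
            self.rate *= self.decay   # sample less as the bitmap fills
```

Because both the bucket and the sampling decision come from the item's hash, re-seeing an element never updates the bitmap or the estimate, which is the property that makes bitmap-style sketches suitable for streams with heavy duplication.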

    Chain Plot: A Tool for Exploiting Bivariate Temporal Structures

    In this paper we present a graphical tool for visualizing the cyclic behaviour of bivariate time series. We investigate its properties and link it to the asymmetry of the two variables concerned. We also suggest adding approximate confidence bounds to the points on the plot and investigate the effect of lagging on the chain plot. We conclude the paper with some standard Fourier analysis, relating and comparing it to the chain plot.

    Drift rate control of a Brownian processing system

    A system manager dynamically controls a diffusion process Z that lives in a finite interval [0,b]. Control takes the form of a negative drift rate \theta that is chosen from a fixed set A of available values. The controlled process evolves according to the differential relationship dZ=dX-\theta(Z) dt+dL-dU, where X is a (0,\sigma) Brownian motion, and L and U are increasing processes that enforce a lower reflecting barrier at Z=0 and an upper reflecting barrier at Z=b, respectively. The cumulative cost process increases according to the differential relationship d\xi =c(\theta(Z)) dt+p dU, where c(\cdot) is a nondecreasing cost of control and p>0 is a penalty rate associated with displacement at the upper boundary. The objective is to minimize long-run average cost. This problem is solved explicitly, which allows one to also solve the following, essentially equivalent formulation: minimize the long-run average cost of control subject to an upper bound constraint on the average rate at which U increases. The two special problem features that allow an explicit solution are the use of a long-run average cost criterion, as opposed to a discounted cost criterion, and the lack of state-related costs other than boundary displacement penalties. The application of this theory to power control in wireless communication is discussed. Comment: Published at http://dx.doi.org/10.1214/105051604000000855 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)
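The controlled dynamics can be simulated with an Euler scheme plus Skorokhod-style reflection at the two barriers. A minimal sketch under an assumed constant drift choice θ and quadratic control cost c(θ) = θ² (both assumptions for illustration; the paper derives the optimal state-dependent policy θ(Z)):

```python
import random

def simulate_reflected(b=1.0, theta=0.5, sigma=1.0, p=2.0,
                       dt=1e-3, steps=10_000, seed=0):
    """Euler scheme for dZ = dX - theta dt + dL - dU on [0, b].

    L and U accumulate the pushes at the lower/upper barriers; cost
    accrues as c(theta) dt + p dU with the assumed c(theta) = theta**2.
    Constant theta is for illustration; the paper optimizes theta(Z).
    """
    rng = random.Random(seed)
    z, L, U, cost = b / 2, 0.0, 0.0, 0.0
    for _ in range(steps):
        z += sigma * rng.gauss(0.0, dt ** 0.5) - theta * dt
        if z < 0:              # lower barrier: L pushes the process up
            L += -z
            z = 0.0
        elif z > b:            # upper barrier: U pushes it down, at penalty p
            dU = z - b
            U += dU
            cost += p * dU
            z = b
        cost += theta ** 2 * dt   # running cost of control, c(theta) dt
    return z, L, U, cost
```

Averaging `cost / (steps * dt)` over long runs for each candidate θ gives a crude numerical view of the long-run average cost the paper minimizes analytically.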

    ROC and the bounds on tail probabilities via theorems of Dubins and F. Riesz

    For independent X and Y in the inequality P(X \leq Y + \mu), we give sharp lower bounds for unimodal distributions having finite variance, and sharp upper bounds assuming symmetric densities bounded by a finite constant. The lower bounds depend on a result of Dubins about extreme points, and the upper bounds depend on a symmetric rearrangement theorem of F. Riesz. The inequality was motivated by medical imaging: find bounds on the area under the Receiver Operating Characteristic (ROC) curve. Comment: Published at http://dx.doi.org/10.1214/08-AAP536 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)
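The quantity being bounded is the area under the ROC curve, AUC = P(X ≤ Y). In the standard binormal special case with X ~ N(0,1) and Y ~ N(μ,1) independent, AUC = Φ(μ/√2), which is easy to verify by Monte Carlo (a sanity illustration of the target quantity, not the paper's distribution-free bounds):

```python
import math
import random

def auc_binormal(mu, n=200_000, seed=1):
    """Monte Carlo estimate of AUC = P(X <= Y), X~N(0,1), Y~N(mu,1)."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(0, 1) <= rng.gauss(mu, 1) for _ in range(n))
    return hits / n

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu = 1.0
est = auc_binormal(mu)
exact = phi(mu / math.sqrt(2))   # closed form since X - Y ~ N(-mu, 2)
```

The closed form follows because X - Y is normal with mean -μ and variance 2, so P(X - Y ≤ 0) = Φ(μ/√2); the paper's bounds replace the normality assumption with unimodality or bounded symmetric densities.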

    Spatial methods for event reconstruction in CLEAN

    In CLEAN (Cryogenic Low Energy Astrophysics with Noble gases), a proposed neutrino and dark matter detector, background discrimination is possible if one can determine the location of an ionizing radiation event with high accuracy. We simulate ionizing radiation events that produce multiple scintillation photons within a spherical detection volume filled with liquid neon. We estimate the radial location of a particular ionizing radiation event based on the observed count data corresponding to that event. The count data are collected by detectors mounted at the spherical boundary of the detection volume. We neglect absorption, but account for Rayleigh scattering. To account for wavelength-shifting of the scintillation light, we assume that photons are absorbed and re-emitted at the detectors. Here, we develop spatial maximum likelihood methods for event reconstruction, and study their performance in computer simulation experiments. We also study a method based on the centroid of the observed count data. We calibrate our estimates based on training data.
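The maximum-likelihood step can be sketched in a toy 2-D analogue: detectors on a circle, Poisson counts whose means fall off with distance from the event, and a grid search for the radius that maximizes the Poisson log-likelihood. The softened inverse-square light model and all parameters below are assumptions for illustration; the paper's simulation additionally handles Rayleigh scattering and wavelength shifting:

```python
import math

def expected_counts(r, n_det=16, total=1000.0, R=1.0):
    """Expected photon counts at n_det detectors on a circle of
    radius R, for an event at radius r on the x-axis (toy model:
    softened inverse-square falloff, normalized to `total` photons)."""
    means = []
    for k in range(n_det):
        a = 2 * math.pi * k / n_det
        d2 = (R * math.cos(a) - r) ** 2 + (R * math.sin(a)) ** 2
        means.append(1.0 / (d2 + 0.05))
    s = sum(means)
    return [total * m / s for m in means]

def mle_radius(counts, grid=None):
    """Grid-search maximum likelihood estimate of the event radius,
    using the Poisson log-likelihood sum(c*log(mu) - mu)."""
    if grid is None:
        grid = [i / 200 for i in range(190)]   # radii in [0, 0.945]
    def loglik(r):
        mus = expected_counts(r)
        return sum(c * math.log(mu) - mu for c, mu in zip(counts, mus))
    return max(grid, key=loglik)

true_r = 0.6
counts = expected_counts(true_r)   # noiseless data: expected counts
r_hat = mle_radius(counts)         # MLE recovers the true radius
```

With noiseless (expected) counts the likelihood peaks at the true radius; replacing `counts` with Poisson draws from those means reproduces the estimation-error experiments in miniature.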