
    Sampling Correctors

    In many situations, sample data is obtained from a noisy or imperfect source. To address such corruptions, this paper introduces the concept of a sampling corrector. Such algorithms use structure that the distribution is purported to have in order to make "on-the-fly" corrections to samples drawn from probability distributions; they then act as filters between the noisy data and the end user. We show connections between sampling correctors, distribution learning algorithms, and distribution property testing algorithms, and show that these connections can be used both to expand the applicability of known distribution learning and property testing algorithms and to obtain improved algorithms for those tasks. As a first step, we show how to design sampling correctors using proper learning algorithms. We then focus on whether sampling correctors can be more efficient, in terms of sample complexity, than learning algorithms for the analogous families of distributions. When correcting monotonicity, we show that this is indeed the case when the corrector is also granted query access to the cumulative distribution function. We also obtain sampling correctors for monotonicity without this stronger type of access, provided that the distribution is originally very close to monotone (namely, at distance $O(1/\log^2 n)$). In addition, we consider a restricted error model that aims to capture "missing data" corruptions. In this model, we show that distributions close to monotone have sampling correctors that are significantly more efficient than what the learning approach achieves. Finally, we consider whether sampling correctors require an additional source of independent random bits to implement the correction process.
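    The first step above (building a corrector from a proper learner) can be made concrete. Below is a minimal, hypothetical Python sketch for distributions on $\{0, \dots, n-1\}$ that are supposed to be monotone (non-increasing): it spends a learning budget of samples, fits a proper monotone hypothesis by isotonic regression, and thereafter serves fresh samples from that hypothesis. The learner, class name, and budget are our illustrative choices, not the paper's constructions.

```python
import random
from bisect import bisect_left
from itertools import accumulate

# Hypothetical illustration only: the "proper learner" here is a crude
# monotone (non-increasing) histogram fit; the paper's algorithms are
# more refined. All names below are ours, not the paper's.

def pava_nonincreasing(values):
    """Pool-adjacent-violators fit of the closest non-increasing sequence."""
    blocks = []  # [mean, width] blocks of the current fit
    for v in values:
        blocks.append([v, 1])
        # Merge while a later block has a larger mean than an earlier one.
        while len(blocks) >= 2 and blocks[-2][0] < blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w])
    return [m for m, w in blocks for _ in range(w)]

class SamplingCorrector:
    """Filter between a noisy sampler over {0, ..., n-1} and the end user."""

    def __init__(self, noisy_sampler, n, budget=10_000):
        counts = [0] * n
        for _ in range(budget):  # spend the sample budget up front
            counts[noisy_sampler()] += 1
        fitted = pava_nonincreasing([c / budget for c in counts])
        total = sum(fitted)
        self.cdf = list(accumulate(p / total for p in fitted))

    def sample(self):
        # Serve draws from the corrected (monotone) hypothesis by
        # inverse-CDF sampling. Note this consumes external random bits,
        # the resource studied in the paper's final question.
        r = random.random()
        return min(bisect_left(self.cdf, r), len(self.cdf) - 1)
```

    A user would wrap the noisy source once, e.g. `corrector = SamplingCorrector(src, n=64)`, and then call `corrector.sample()` in place of `src()`.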

    Changes in the distribution of male and female wages accounting for employment composition using bounds

    This paper examines changes in the distribution of wages using bounds to allow for the impact of non-random selection into work. We show that bounds constructed without any economic or statistical assumptions can be informative. However, since employment rates in the UK are often low, such bounds are not informative about changes in educational or gender wage differentials. Thus we explore ways to tighten these bounds using restrictions motivated by economic theory. With these assumptions we find convincing evidence of an increase in inequality within education groups, changes in the "return" to education, and increases in the relative wages of women.
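    To make the assumption-free bounds concrete: wages of non-workers are unobserved, so at any threshold their contribution to the CDF can be anywhere between 0 and 1. Below is a minimal sketch, our own illustration of worst-case bounds in the spirit of Manski, not the paper's estimator (which additionally tightens the bounds with economic restrictions); the function name and data are hypothetical.

```python
def worst_case_cdf_bounds(observed_wages, employment_rate, t):
    """Worst-case bounds on P(wage <= t) in the full population when
    wages are observed only for those in work (no assumptions used)."""
    f_employed = sum(w <= t for w in observed_wages) / len(observed_wages)
    lower = f_employed * employment_rate       # every non-worker's wage above t
    upper = lower + (1 - employment_rate)      # every non-worker's wage at or below t
    return lower, upper

# The interval width is exactly 1 - employment_rate, so a 60% employment
# rate identifies each point of the CDF only to within 0.40 -- which is
# why low employment rates make the assumption-free bounds uninformative.
lo, hi = worst_case_cdf_bounds([250, 310, 400, 520, 610], employment_rate=0.6, t=400)
```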

    On the complexity of range searching among curves

    Modern tracking technology has made the collection of large numbers of densely sampled trajectories of moving objects widely available. We consider a fundamental problem encountered when analysing such data: given $n$ polygonal curves $S$ in $\mathbb{R}^d$, preprocess $S$ into a data structure that answers queries with a query curve $q$ and radius $\rho$ for the curves of $S$ that have Fréchet distance at most $\rho$ to $q$. We initiate a comprehensive analysis of the space/query-time trade-off for this data structuring problem. Our lower bounds imply that any data structure in the pointer model that achieves $Q(n) + O(k)$ query time, where $k$ is the output size, has to use roughly $\Omega\left((n/Q(n))^2\right)$ space in the worst case, even if queries are mere points (for the discrete Fréchet distance) or line segments (for the continuous Fréchet distance). More importantly, we show that more complex queries and input curves lead to additional logarithmic factors in the lower bound. Roughly speaking, the number of logarithmic factors added is linear in the number of edges added to the query and input curve complexity; that is, the space/query-time trade-off worsens by a factor exponential in the input and query complexity. This behaviour addresses an open question in the range searching literature: whether it is possible to avoid the additional logarithmic factors in the space and query time of a multilevel partition tree. We answer this question negatively. On the positive side, we show that we can build data structures for the Fréchet distance by using semialgebraic range searching. Our solution for the discrete Fréchet distance is in line with the lower bound, as the number of levels in the data structure is $O(t)$, where $t$ denotes the maximal number of vertices of a curve. For the continuous Fréchet distance, the number of levels increases to $O(t^2)$.
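    To pin down the query semantics, here is a hypothetical Python baseline: the textbook $O(t^2)$ dynamic program for the discrete Fréchet distance, used in a linear scan over $S$. This is the trivial end of the trade-off (linear space, $Q(n) = \Theta(n t^2)$ query time with no output-sensitive term); the data structures in the paper are designed to beat it.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def discrete_frechet(P, Q):
    """Textbook O(|P| * |Q|) dynamic program for the discrete Frechet
    distance between polygonal curves given as lists of points."""
    m, n = len(P), len(Q)
    D = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                D[i][j] = d
            elif i == 0:
                D[i][j] = max(D[i][j - 1], d)
            elif j == 0:
                D[i][j] = max(D[i - 1][j], d)
            else:
                D[i][j] = max(min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]), d)
    return D[m - 1][n - 1]

def range_query(S, q, rho):
    """Linear-scan baseline: report every curve of S within discrete
    Frechet distance rho of the query curve q."""
    return [P for P in S if discrete_frechet(P, q) <= rho]
```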

    Nonparametric Bounds on the Income Distribution in the Presence of Item Nonresponse

    Item nonresponse in micro surveys can lead to biased estimates of the parameters of interest if such nonresponse is nonrandom. Selection models can be used to correct for this, but parametric and semiparametric selection models require additional assumptions. Manski has recently developed a new approach, showing that, without additional assumptions, the parameters of interest are identified up to some bounding interval. In this paper, we apply Manski's approach to estimate the distribution function and quantiles of personal income, conditional on given covariates, taking account of item nonresponse on income. Nonparametric techniques are used to estimate the bounding intervals. We consider worst-case bounds, as well as bounds which are valid under nonparametric monotonicity assumptions or under exclusion restrictions.
    Keywords: nonparametrics; bounds and identification; sample non-response
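    As an unconditional illustration of the worst-case interval (the paper estimates the conditional analogue nonparametrically, given covariates): with response rate $\pi$, the population CDF at $t$ lies between $\pi F_r(t)$ and $\pi F_r(t) + 1 - \pi$, where $F_r$ is the respondents' CDF, and inverting these bounds brackets each quantile. The sketch below is our hypothetical illustration, not the paper's estimator.

```python
import math

def worst_case_quantile_bounds(respondent_incomes, response_rate, alpha):
    """Worst-case bounds on the alpha-quantile of income under item
    nonresponse, with no assumptions on the nonrespondents' incomes."""
    ys = sorted(respondent_incomes)
    pi = response_rate

    def respondent_quantile(level):
        # empirical quantile of the respondents at the given level
        k = max(0, min(len(ys) - 1, math.ceil(level * len(ys)) - 1))
        return ys[k]

    # Inverting the upper CDF bound pi*F_r(t) + (1 - pi) yields the
    # quantile's lower bound; inverting the lower CDF bound pi*F_r(t)
    # yields its upper bound (infinite once alpha exceeds pi).
    lower = respondent_quantile((alpha - (1 - pi)) / pi) if alpha > 1 - pi else float("-inf")
    upper = respondent_quantile(alpha / pi) if alpha <= pi else float("inf")
    return lower, upper
```

    Note that for $\alpha \le 1 - \pi$ or $\alpha > \pi$ one side of the interval is unbounded, mirroring how severe nonresponse leaves extreme quantiles unidentified without further assumptions.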