Sampling Correctors
In many situations, sample data is obtained from a noisy or imperfect source.
In order to address such corruptions, this paper introduces the concept of a
sampling corrector. Such algorithms use structure that the distribution is
purported to have, in order to allow one to make "on-the-fly" corrections to
samples drawn from probability distributions. These algorithms then act as
filters between the noisy data and the end user.
We show connections between sampling correctors, distribution learning
algorithms, and distribution property testing algorithms. We show that these
connections can be utilized to expand the applicability of known distribution
learning and property testing algorithms as well as to achieve improved
algorithms for those tasks.
As a first step, we show how to design sampling correctors using proper
learning algorithms. We then focus on the question of whether algorithms for
sampling correctors can be more efficient in terms of sample complexity than
learning algorithms for the analogous families of distributions. For
correcting monotonicity, we show that this is indeed the case when the
corrector is also granted query access to the cumulative distribution function. We also
obtain sampling correctors for monotonicity without this stronger type of
access, provided that the original distribution is very close to monotone
(namely, at a distance ). In addition, we consider a restricted error model
that aims at capturing "missing data" corruptions. In this model, we show that
distributions that are close to monotone have sampling correctors that are
significantly more efficient than achievable by the learning approach.
We also consider the question of whether an additional source of independent
random bits is required by sampling correctors to implement the correction
process.
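The learn-then-correct connection described above can be illustrated with a minimal sketch. This is not the paper's actual construction; it assumes a hypothetical setting where we estimate an empirical histogram from noisy samples, project it onto the monotone (non-increasing) distributions via the pool-adjacent-violators algorithm, and then act as a filter that emits samples from the corrected distribution:

```python
import random
from collections import Counter

def project_to_monotone(probs):
    """Pool Adjacent Violators: closest non-increasing sequence in L2,
    renormalized to a probability distribution."""
    blocks = []  # each block is [sum_of_values, count]
    for p in probs:
        blocks.append([p, 1])
        # Merge while the block averages violate non-increasing order.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    total = sum(out)
    return [p / total for p in out]

def sampling_corrector(noisy_sampler, domain_size, training_samples=2000, rng=random):
    """Hypothetical corrector: learn an empirical histogram from the noisy
    source, project it to monotone, and return a sampler for the result."""
    counts = Counter(noisy_sampler() for _ in range(training_samples))
    empirical = [counts[i] / training_samples for i in range(domain_size)]
    corrected = project_to_monotone(empirical)
    population = list(range(domain_size))
    return lambda: rng.choices(population, weights=corrected)[0]
```

Note that this sketch pays the full sample cost of learning up front; the abstract's central question is precisely whether a corrector can avoid that cost.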
Changes in the distribution of male and female wages accounting for employment composition using bounds
This paper examines changes in the distribution of wages, using bounds to allow for the impact of non-random selection into work. We show that bounds constructed without any economic or statistical assumptions can be informative. However, since employment rates in the UK are often low, these bounds are not informative about changes in educational or gender wage differentials. We therefore explore ways to tighten the bounds using restrictions motivated by economic theory. Under these assumptions we find convincing evidence of an increase in inequality within education groups, changes in the "return" to education, and increases in the relative wages of women.
On the complexity of range searching among curves
Modern tracking technology has made the collection of large numbers of
densely sampled trajectories of moving objects widely available. We consider a
fundamental problem encountered when analysing such data: Given polygonal
curves in , preprocess into a data structure that answers
queries with a query curve and radius for the curves of that
have Fréchet distance at most to .
We initiate a comprehensive analysis of the space/query-time trade-off for
this data structuring problem. Our lower bounds imply that any data structure
in the pointer model that achieves query time, where is
the output size, has to use roughly space in
the worst case, even if queries are mere points (for the discrete Fréchet
distance) or line segments (for the continuous Fréchet distance). More
importantly, we show that more complex queries and input curves lead to
additional logarithmic factors in the lower bound. Roughly speaking, the number
of logarithmic factors added is linear in the number of edges added to the
query and input curve complexity. This means that the space/query-time
trade-off worsens by a factor exponential in the input and query complexity. This
behaviour resolves an open question in the range searching literature: whether
it is possible to avoid the additional logarithmic factors in the space and
query time of a multilevel partition tree. We answer this question negatively.
On the positive side, we show that we can build data structures for the Fréchet
distance by using semialgebraic range searching. Our solution for the discrete
Fréchet distance is in line with the lower bound, as the number of levels in
the data structure is , where denotes the maximal number of vertices
of a curve. For the continuous Fréchet distance, the number of levels
increases to
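For readers unfamiliar with the query predicate used above, the discrete Fréchet distance between two polygonal curves can be computed with a standard quadratic-time dynamic program (this is background, not the paper's data structure):

```python
from math import hypot

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between curves P and Q, given as lists of
    (x, y) vertices, via the classic O(|P||Q|) dynamic program."""
    n, m = len(P), len(Q)
    dist = lambda i, j: hypot(P[i][0] - Q[j][0], P[i][1] - Q[j][1])
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            c = dist(i, j)
            if i == 0 and j == 0:
                ca[i][j] = c
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], c)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], c)
            else:
                # Either curve (or both) advances; take the cheapest history.
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), c)
    return ca[n - 1][m - 1]
```

A range query as in the abstract then asks for all input curves within distance of the query curve under this measure.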
Nonparametric Bounds on the Income Distribution in the Presence of Item Nonresponse
Item nonresponse in micro surveys can lead to biased estimates of the parameters of interest if such nonresponse is nonrandom. Selection models can be used to correct for this, but parametric and semiparametric selection models require additional assumptions. Manski has recently developed a new approach, showing that, without additional assumptions, the parameters of interest are identified up to some bounding interval. In this paper, we apply Manski's approach to estimate the distribution function and quantiles of personal income, conditional on given covariates, taking account of item nonresponse on income. Nonparametric techniques are used to estimate the bounding intervals. We consider worst case bounds, as well as bounds which are valid under nonparametric assumptions on monotonicity or under exclusion restrictions.
Keywords: nonparametrics; bounds and identification; sample non-response
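The worst-case bounds idea is simple enough to state concretely. For the CDF F(y) = P(Y <= y) with some values missing, the missing observations may all lie above y (giving the lower bound) or all at or below y (giving the upper bound). A minimal sketch, not the paper's estimator and ignoring covariates:

```python
def manski_cdf_bounds(observed, n_missing, y):
    """Worst-case (Manski) bounds on F(y) = P(Y <= y) when n_missing
    sampled values are unobserved and nothing is assumed about them."""
    n = len(observed) + n_missing
    k = sum(1 for v in observed if v <= y)  # respondents at or below y
    return k / n, (k + n_missing) / n
```

The interval's width is exactly the nonresponse rate, which is why the paper turns to monotonicity assumptions and exclusion restrictions to tighten it.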