Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation
We introduce and study a new data sketch for processing massive datasets. It
addresses two common problems: 1) computing a sum given arbitrary filter
conditions and 2) identifying the frequent items, or heavy hitters, in a data
set. For the former, the sketch provides unbiased estimates with
state-of-the-art accuracy. It handles the challenging scenario in which the data
is disaggregated, so that computing the per-unit metric of interest requires an
expensive aggregation. For example, the metric of interest may be total clicks
per user while the raw data is a click stream with multiple rows per user. The
sketch is therefore suitable for use in a wide range of applications, including
computing historical click-through rates for ad prediction, reporting user
metrics from event streams, and measuring network traffic for IP flows.
We prove, and empirically show, that the sketch has good properties for both the
disaggregated subset sum estimation and frequent item problems. On i.i.d. data,
it not only picks out the frequent items but gives strongly consistent
estimates of the proportion of each frequent item. The resulting sketch
asymptotically draws a probability-proportional-to-size sample that is optimal
for estimating sums over the data. For non-i.i.d. data, we show that it
typically does much better than random sampling for the frequent item problem
and never does worse. For subset sum estimation, we show that even for
pathological sequences the variance is close to that of an optimal sampling
design. Empirically, despite the disadvantage of operating on disaggregated
data, our method matches or bests priority sampling, a state-of-the-art method
for pre-aggregated data, and performs orders of magnitude better than uniform
sampling on skewed data. We propose extensions that allow the sketch to be used
for combining multiple data sets, in distributed systems, and for time-decayed
aggregation.
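The frequent-item side of the problem can be illustrated with a minimal Space Saving-style counter, a related deterministic sketch that processes one disaggregated row at a time. This is not the paper's sketch (which is randomized and gives unbiased subset sum estimates); the stream and capacity below are invented for illustration:

```python
class SpaceSaving:
    """Minimal Space Saving sketch: tracks at most k items; a tracked
    item's count overestimates its true count by at most the count it
    inherited from the item it evicted."""

    def __init__(self, k):
        self.k = k
        self.counts = {}  # item -> estimated count

    def update(self, item, weight=1):
        if item in self.counts:
            self.counts[item] += weight
        elif len(self.counts) < self.k:
            self.counts[item] = weight
        else:
            # Evict the current minimum and inherit its count, which
            # bounds the overestimate for the newly tracked item.
            victim = min(self.counts, key=self.counts.get)
            inherited = self.counts.pop(victim)
            self.counts[item] = inherited + weight

    def estimate(self, item):
        return self.counts.get(item, 0)

# Disaggregated stream: many rows per user, one click each.
stream = ["u1"] * 50 + ["u2"] * 30 + ["u3"] * 5 + ["u4"] * 5
sketch = SpaceSaving(k=3)
for user in stream:
    sketch.update(user)
```

After the pass, the heavy hitters u1 and u2 are retained with exact counts; the tail items compete for the remaining slot.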
Precipitation and latent heating distributions from satellite passive microwave radiometry. Part I: improved method and uncertainties
A revised Bayesian algorithm for estimating surface rain rate, convective rain proportion, and latent heating profiles from satellite-borne passive microwave radiometer observations over ocean backgrounds is described. The algorithm searches a large database of cloud-radiative model simulations to find cloud profiles that are radiatively consistent with a given set of microwave radiance measurements. The properties of these radiatively consistent profiles are then composited to obtain best estimates of the observed properties. The revised algorithm is supported by an expanded and more physically consistent database of cloud-radiative model simulations. The algorithm also features a better quantification of the convective and nonconvective contributions to total rainfall, a new geographic database, and an improved representation of background radiances in rain-free regions. Bias and random error estimates are derived from applications of the algorithm to synthetic radiance data, based upon a subset of cloud-resolving model simulations, and from the Bayesian formulation itself. Synthetic rain-rate and latent heating estimates exhibit a trend of high (low) bias for low (high) retrieved values. The Bayesian estimates of random error are propagated to represent errors at coarser time and space resolutions, based upon applications of the algorithm to TRMM Microwave Imager (TMI) data. Errors in TMI instantaneous rain-rate estimates at 0.5° resolution range from approximately 50% at 1 mm h⁻¹ to 20% at 14 mm h⁻¹. Errors in collocated spaceborne radar rain-rate estimates are roughly 50%–80% of the TMI errors at this resolution. The estimated algorithm random error in TMI rain rates at monthly, 2.5° resolution is relatively small (less than 6% at 5 mm day⁻¹) in comparison with the random error resulting from infrequent satellite temporal sampling (8%–35% at the same rain rate). Percentage errors resulting from sampling decrease with increasing rain rate, and sampling errors in latent heating rates follow the same trend. Averaging over 3 months reduces sampling errors in rain rates to 6%–15% at 5 mm day⁻¹, with proportionate reductions in latent heating sampling errors.
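The search-and-composite step of such Bayesian retrieval schemes can be sketched in a few lines: weight each database profile by its radiative consistency with the observation, then form the weighted mean of the associated physical quantity. The database values, channel count, and noise level below are invented toy numbers; the real algorithm uses a far larger database and full radiance covariances:

```python
import numpy as np

# Hypothetical toy database: simulated brightness temperatures (K) for two
# channels, and the surface rain rate (mm/h) of each simulated cloud profile.
db_tb = np.array([[250.0, 230.0],
                  [260.0, 245.0],
                  [270.0, 255.0]])
db_rain = np.array([8.0, 3.0, 0.5])
sigma = 3.0  # assumed per-channel radiance noise std dev (K)

def bayesian_composite(obs_tb):
    """Weight each database profile by a Gaussian radiance-consistency
    term and return the weighted-mean rain rate (the Bayesian estimate)."""
    d2 = np.sum((db_tb - obs_tb) ** 2, axis=1) / sigma**2
    w = np.exp(-0.5 * d2)
    return np.sum(w * db_rain) / np.sum(w)

est = bayesian_composite(np.array([252.0, 232.0]))
```

An observation close to the first simulated profile yields an estimate near that profile's rain rate; ambiguous observations blend several profiles, which is the source of the retrieval's random error.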
Endogenous Sampling and Matching Method in Duration Models
Endogenous sampling with matching (also called "mixed sampling") occurs when the statistician samples from the non-right-censored subset at a predetermined proportion and matches on one or more exogenous variables when sampling from the right-censored subset. This is widely applied in the duration analysis of firm failures, loan defaults, insurer insolvencies, and so on, due to the low frequency of observing non-right-censored samples (bankrupt, defaulted, and insolvent observations in the respective examples). However, the common practice of using estimation procedures intended for random sampling or for the qualitative response model yields either an inconsistent or an inefficient estimator. This paper proposes a consistent and efficient estimator and investigates its asymptotic properties. In addition, this paper evaluates the magnitude of the asymptotic bias when the model is estimated as if the data were a random sample or an endogenous sample without matching. This paper also compares the relative efficiency of other commonly used estimators and provides a general guideline for optimally choosing among sample designs. A Monte Carlo study with a simple example shows that random sampling yields an estimator with poor finite-sample properties when the population is extremely unbalanced in terms of default and non-default cases, while endogenous sampling and mixed sampling are robust in this situation.
Keywords: Duration models; Endogenous sampling with matching; Maximum likelihood estimator; Manski-Lerman estimator; Asymptotic distribution
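One standard corrective device in this setting, and one named in the keywords, is Manski-Lerman-style weighting: each observation's likelihood contribution is weighted by the ratio of its stratum's population share to its sampling share. The toy below applies such weights to the closed-form MLE of an exponential duration model; all durations and shares are invented, and the paper's proposed estimator is more general than this sketch:

```python
import numpy as np

# Toy endogenously sampled data: failures deliberately oversampled.
# d = 1 means failure observed (uncensored); d = 0 means right-censored.
t = np.array([1.0, 0.5, 2.0, 3.0, 4.0, 5.0])  # durations
d = np.array([1,   1,   1,   0,   0,   0])

# Hypothetical design: failures are 5% of the population but 50% of
# this sample, so their likelihood contributions are down-weighted.
Q = {1: 0.05, 0: 0.95}  # population shares per stratum
H = {1: 0.50, 0: 0.50}  # sampling shares per stratum
w = np.array([Q[int(k)] / H[int(k)] for k in d])

# For an exponential hazard, the weighted log-likelihood
# sum_i w_i * (d_i * log(lam) - lam * t_i) has the closed-form maximizer:
lam_weighted = np.sum(w * d) / np.sum(w * t)

# The naive MLE ignores the sample design and overstates the hazard.
lam_naive = np.sum(d) / np.sum(t)
```

With failures oversampled tenfold, the naive estimate is badly inflated, while the weighted estimate recovers the much lower population hazard implied by the design weights.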
Subspace Evolution and Transfer (SET) for Low-Rank Matrix Completion
We describe a new algorithm, termed subspace evolution and transfer (SET),
for solving low-rank matrix completion problems. The algorithm takes as its
input a subset of entries of a low-rank matrix, and outputs one low-rank matrix
consistent with the given observations. The completion task is accomplished by
searching for a column space on the Grassmann manifold that matches the
incomplete observations. The SET algorithm consists of two parts -- subspace
evolution and subspace transfer. In the evolution part, we use a gradient
descent method on the Grassmann manifold to refine our estimate of the column
space. Since the gradient descent algorithm is not guaranteed to converge, due
to the existence of barriers along the search path, we design a new mechanism
for detecting barriers and transferring the estimated column space across the
barriers. This mechanism constitutes the core of the transfer step of the
algorithm. The SET algorithm exhibits excellent empirical performance in both
the high and the low sampling-rate regimes.
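The evolution step (gradient descent on the column space, with a retraction back to the manifold) can be sketched as follows. This is not the SET algorithm itself: the barrier detection and transfer mechanism is omitted, leaving a bare-bones Grassmann descent with backtracking on an invented toy rank-1 problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy completion problem: rank-1 matrix, ~60% of entries observed.
n, m, r = 20, 20, 1
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
mask = rng.random((n, m)) < 0.6

def fit_coeffs(U):
    """Per-column least-squares coefficients using observed rows only."""
    W = np.zeros((r, m))
    for j in range(m):
        obs = mask[:, j]
        W[:, j], *_ = np.linalg.lstsq(U[obs], M[obs, j], rcond=None)
    return W

def objective(U):
    W = fit_coeffs(U)
    return 0.5 * np.sum((mask * (U @ W - M)) ** 2), W

U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # random column space
f, W = objective(U)
f0 = f
for _ in range(100):
    G = (mask * (U @ W - M)) @ W.T   # Euclidean gradient w.r.t. U
    G -= U @ (U.T @ G)               # project onto the tangent space
    step = 1.0
    while step > 1e-12:              # backtracking: accept any decrease
        U_try, _ = np.linalg.qr(U - step * G)  # QR retraction
        f_try, W_try = objective(U_try)
        if f_try < f:
            U, f, W = U_try, f_try, W_try
            break
        step /= 2
```

The QR factorization after each step plays the role of the retraction onto the Grassmann manifold; it is at barriers of this descent, where such steps stall, that SET's transfer mechanism takes over.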
Gaussian processes with linear operator inequality constraints
This paper presents an approach for constrained Gaussian Process (GP)
regression where we assume that a set of linear transformations of the process
are bounded. It is motivated by machine learning applications for
high-consequence engineering systems, where this kind of information is often
made available from phenomenological knowledge. We consider a GP f over
functions on X ⊆ ℝⁿ taking values in ℝ, where the process Lf is still Gaussian
when L is a linear operator. Our goal is to model f under the
constraint that realizations of Lf are confined to a convex set of
functions. In particular, we require that a ≤ Lf ≤ b, given
two functions a and b where a ≤ b pointwise. This formulation provides a
consistent way of encoding multiple linear constraints, such as
shape-constraints based on e.g. boundedness, monotonicity or convexity. We
adopt the approach of using a sufficiently dense set of virtual observation
locations where the constraint is required to hold, and derive the exact
posterior for a conjugate likelihood. The results needed for stable numerical
implementation are derived, together with an efficient sampling scheme for
estimating the posterior process.
Comment: Published in JMLR: http://jmlr.org/papers/volume20/19-065/19-065.pd
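A minimal illustration of the virtual-observation idea, taking L to be the identity operator (boundedness in [0, 1]): condition an ordinary GP on the data, then keep only posterior draws that satisfy the bounds at a dense grid of virtual locations. The paper derives the exact truncated posterior and an efficient sampler; the naive rejection step, kernel, and data below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

# Noisy observations of a function assumed to lie in [0, 1].
x = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
y = np.array([0.3, 0.5, 0.7, 0.6, 0.4, 0.3])
noise = 1e-2

# Virtual observation locations where 0 <= f <= 1 is required to hold.
xv = np.linspace(0.0, 1.0, 25)

# Unconstrained GP posterior at the virtual locations.
K = rbf(x, x) + noise * np.eye(len(x))
A = rbf(xv, x) @ np.linalg.inv(K)
mu = A @ y
cov = rbf(xv, xv) - A @ rbf(xv, x).T
Lc = np.linalg.cholesky(cov + 1e-8 * np.eye(len(xv)))

# Naive rejection sampler: keep draws satisfying the bounds everywhere.
samples = []
while len(samples) < 50:
    f = mu + Lc @ rng.standard_normal(len(xv))
    if f.min() >= 0.0 and f.max() <= 1.0:
        samples.append(f)
constrained_mean = np.mean(samples, axis=0)
```

Because every retained draw respects the bounds, so does their mean; the paper's sampler achieves the same effect far more efficiently by working with the truncated Gaussian directly.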
