    Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

    We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items, or heavy hitters, in a data set. For the former, the sketch provides unbiased estimates with state-of-the-art accuracy. It handles the challenging scenario in which the data is disaggregated, so that computing the per-unit metric of interest requires an expensive aggregation. For example, the metric of interest may be total clicks per user while the raw data is a click stream with multiple rows per user. Thus the sketch is suitable for a wide range of applications, including computing historical click-through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show that the sketch has good properties for both the disaggregated subset sum estimation and frequent item problems. On i.i.d. data, it not only picks out the frequent items but also gives strongly consistent estimates of the proportion of each frequent item. The resulting sketch asymptotically draws a probability-proportional-to-size sample that is optimal for estimating sums over the data. For non-i.i.d. data, we show that it typically does much better than random sampling for the frequent item problem and never does worse. For subset sum estimation, we show that even for pathological sequences, the variance is close to that of an optimal sampling design. Empirically, despite the disadvantage of operating on disaggregated data, our method matches or bests priority sampling, a state-of-the-art method for pre-aggregated data, and performs orders of magnitude better than uniform sampling on skewed data. We propose extensions to the sketch that allow it to be used in combining multiple data sets, in distributed systems, and for time-decayed aggregation.
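
To make the subset sum problem concrete, here is a minimal sketch of priority sampling, the baseline the abstract names for pre-aggregated data. It is not the sketch the paper introduces. The function names `priority_sample` and `subset_sum`, and the representation of the data as a dictionary of per-unit weights, are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, Hashable, Optional


def priority_sample(
    weights: Dict[Hashable, float],
    k: int,
    rng: Optional[random.Random] = None,
) -> Dict[Hashable, float]:
    """Keep at most k units and return an adjusted weight for each kept unit.

    Assumes the data is already aggregated: one positive weight per unit.
    """
    if rng is None:
        rng = random.Random()
    # Priority q_i = w_i / u_i, with u_i uniform on (0, 1].
    priorities = {
        key: w / (1.0 - rng.random()) for key, w in weights.items() if w > 0
    }
    ranked = sorted(priorities, key=priorities.get, reverse=True)
    kept = ranked[:k]
    # Threshold: the (k+1)-th largest priority, or 0 if everything fits.
    tau = priorities[ranked[k]] if len(ranked) > k else 0.0
    # Each kept unit is reported with weight max(w_i, tau).
    return {key: max(weights[key], tau) for key in kept}


def subset_sum(
    sample: Dict[Hashable, float],
    predicate: Callable[[Hashable], bool],
) -> float:
    """Estimate the total weight of the units that satisfy the filter."""
    return sum(est for key, est in sample.items() if predicate(key))
```

A filter is applied only at query time, for example `subset_sum(priority_sample(clicks_per_user, k=100), lambda user: user in some_segment)`, where `clicks_per_user` and `some_segment` are hypothetical inputs. When `k` is at least the number of units, the threshold is zero and the estimate is exact.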

    Endogenous Sampling and Matching Method in Duration Models

    Endogenous sampling with matching (also called "mixed sampling") occurs when the statistician samples from the non-right-censored subset at a predetermined proportion and matches on one or more exogenous variables when sampling from the right-censored subset. This is widely applied in the duration analysis of firm failures, loan defaults, insurer insolvencies, and so on, because non-right-censored observations (bankrupt, defaulted, and insolvent cases in the respective examples) are rarely observed. However, the common practice of using estimation procedures intended for random sampling or for the qualitative response model yields an estimator that is either inconsistent or inefficient. This paper proposes a consistent and efficient estimator and investigates its asymptotic properties. In addition, it evaluates the magnitude of the asymptotic bias when the model is estimated as if the data were a random sample or an endogenous sample without matching. It also compares the relative efficiency of other commonly used estimators and provides a general guideline for choosing sample designs optimally. A Monte Carlo study with a simple example shows that random sampling yields an estimator with poor finite-sample properties when the population is extremely unbalanced in terms of default and non-default cases, while endogenous sampling and mixed sampling are robust in this situation.
    Keywords: Duration models; Endogenous sampling with matching; Maximum likelihood estimator; Manski-Lerman estimator; Asymptotic distribution

    Subspace Evolution and Transfer (SET) for Low-Rank Matrix Completion

    We describe a new algorithm, termed subspace evolution and transfer (SET), for solving low-rank matrix completion problems. The algorithm takes as its input a subset of entries of a low-rank matrix, and outputs one low-rank matrix consistent with the given observations. The completion task is accomplished by searching for a column space on the Grassmann manifold that matches the incomplete observations. The SET algorithm consists of two parts: subspace evolution and subspace transfer. In the evolution part, we use a gradient descent method on the Grassmann manifold to refine our estimate of the column space. Since the gradient descent algorithm is not guaranteed to converge, due to the existence of barriers along the search path, we design a new mechanism for detecting barriers and transferring the estimated column space across the barriers. This mechanism constitutes the core of the transfer step of the algorithm. The SET algorithm exhibits excellent empirical performance in both high and low sampling-rate regimes.
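
The idea of a column space that "matches the incomplete observations" can be illustrated in the rank-1 case, where the column space is spanned by a single vector `u`. The sketch below only scores a candidate `u` against the observed entries and fills in the matrix it implies. It does not implement the gradient descent or the barrier transfer step of SET. The names `fit_columns`, `residual`, and `complete`, and the dictionary representation of the observed entries, are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Observed entries of the matrix: (row, column) -> value.
Observed = Dict[Tuple[int, int], float]


def fit_columns(u: List[float], observed: Observed, n_cols: int) -> List[float]:
    """For a fixed direction u, find the least-squares coefficient of each column.

    Column j is modelled as c_j * u, using only the observed rows of column j.
    """
    num = [0.0] * n_cols
    den = [0.0] * n_cols
    for (i, j), x in observed.items():
        num[j] += u[i] * x
        den[j] += u[i] * u[i]
    return [n / d if d > 0 else 0.0 for n, d in zip(num, den)]


def residual(u: List[float], observed: Observed, n_cols: int) -> float:
    """Squared error on the observed entries; zero when u is consistent with them."""
    coeffs = fit_columns(u, observed, n_cols)
    return sum((x - u[i] * coeffs[j]) ** 2 for (i, j), x in observed.items())


def complete(u: List[float], observed: Observed, n_cols: int) -> List[List[float]]:
    """Fill in the full rank-1 matrix implied by u and the observations."""
    coeffs = fit_columns(u, observed, n_cols)
    return [[ui * c for c in coeffs] for ui in u]
```

A search over column spaces would try to drive `residual` to zero. Under this reading, a direction with zero residual yields, through `complete`, one low-rank matrix consistent with the observations.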

    Gaussian processes with linear operator inequality constraints

    This paper presents an approach for constrained Gaussian Process (GP) regression where we assume that a set of linear transformations of the process are bounded. It is motivated by machine learning applications for high-consequence engineering systems, where this kind of information is often made available from phenomenological knowledge. We consider a GP $f$ over functions on $\mathcal{X} \subset \mathbb{R}^{n}$ taking values in $\mathbb{R}$, where the process $\mathcal{L}f$ is still Gaussian when $\mathcal{L}$ is a linear operator. Our goal is to model $f$ under the constraint that realizations of $\mathcal{L}f$ are confined to a convex set of functions. In particular, we require that $a \leq \mathcal{L}f \leq b$, given two functions $a$ and $b$ where $a < b$ pointwise. This formulation provides a consistent way of encoding multiple linear constraints, such as shape-constraints based on e.g. boundedness, monotonicity or convexity. We adopt the approach of using a sufficiently dense set of virtual observation locations where the constraint is required to hold, and derive the exact posterior for a conjugate likelihood. The results needed for stable numerical implementation are derived, together with an efficient sampling scheme for estimating the posterior process.
    Comment: Published in JMLR: http://jmlr.org/papers/volume20/19-065/19-065.pd
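
Written out, the constraint in this abstract and its relaxation to virtual observation locations might look as follows. The symbols $x^{v}_{i}$ for the virtual locations and $S$ for their number are notation assumed here, not taken from the abstract.

```latex
% Constraint required to hold everywhere on the domain
\[
  a(x) \le \mathcal{L}f(x) \le b(x), \qquad a(x) < b(x), \qquad x \in \mathcal{X}.
\]
% Relaxation: enforce it only at a finite, sufficiently dense set of virtual locations
\[
  a(x^{v}_{i}) \le \mathcal{L}f(x^{v}_{i}) \le b(x^{v}_{i}), \qquad i = 1, \dots, S.
\]
```

For example, taking $\mathcal{L} = \partial / \partial x_{1}$, $a = 0$, and $b$ a large positive constant would encode a function that is non-decreasing in its first coordinate, one of the shape constraints the abstract mentions.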