We describe a sampling application that supports thirteen analytics data marts by providing database samples; it has been in daily operational use for over two years at one of the world’s top web portals. The main requirement for the application is the need to support ad-hoc aggregate queries against data that spans several years and cannot be stored in its entirety at the low level of detail needed by data-mining applications. Furthermore, since the query load is highly variable, we cannot choose a stratification of the data, so we are led to use uniform sampling. Since the source data cannot be stored at the 100% level, we have materialized our samples as substitutes for the source data. Another driving requirement is that the sampling process preserve the entire record of web activity for chosen users: if a chosen user appears in multiple populations (tables), she should appear in all the samples derived from those tables. This reduces the variance of aggregate metrics based on joining samples and allows us to do longitudinal studies of the chosen users over long time periods. We use Bernoulli (“coin-flip”) sampling because it uses O(1) space, whereas reservoir methods require holding a large in-memory array; the trade-off is that Bernoulli sampling produces variable-sized samples and slightly wider confidence intervals than simple random sampling. We also use cluster sampling since, in our data, a user corresponds to a cluster of records, one record per web-activity event. Our samples are sub-samples of a larger super-sample; confidence intervals for aggregate metrics under this scheme therefore require slight extensions to the theory of Horvitz-Thompson estimators, which are commonly used to construct confidence intervals for count and sum statistics. We develop this new theory and provide some validation. The theory complements the application, which is used daily by business users to make important business decisions.
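The combination of Bernoulli sampling with user-level cluster sampling described above can be sketched in a few lines. One common way to realize a per-user “coin flip” that agrees across all tables is to hash the user id and compare the result against the sampling rate; note that the hashing approach, the function names, and the rate below are illustrative assumptions, not the paper’s actual implementation. The Horvitz-Thompson estimator for a population sum simply weights each sampled value by the inverse of its inclusion probability.

```python
import hashlib

SAMPLING_RATE = 0.01  # hypothetical inclusion probability p


def include_user(user_id: str, rate: float = SAMPLING_RATE) -> bool:
    """Deterministic 'coin flip' keyed on the user id.

    Hashing the id, rather than drawing a fresh random number per
    record, ensures that every record belonging to a chosen user
    (across all tables) lands in the sample: cluster sampling with
    the user as the cluster.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    # Map the hash uniformly onto [0, 1) and compare against the rate.
    return int(digest, 16) / 16 ** len(digest) < rate


def horvitz_thompson_sum(sampled_values, rate=SAMPLING_RATE):
    """Horvitz-Thompson estimate of a population sum: each sampled
    value is weighted by 1/p, its inverse inclusion probability."""
    return sum(sampled_values) / rate
```

For example, `[v for uid, v in events if include_user(uid)]` yields a Bernoulli sample of whole user clusters, and `horvitz_thompson_sum` on the sampled values gives an unbiased estimate of the corresponding sum over the full population.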