11,788 research outputs found
Streaming Weighted Sampling over Join Queries
Join queries are a fundamental database tool, capturing a range of tasks that involve linking heterogeneous data sources. However, with massive table sizes, it is often impractical to keep these in memory, and we can only take one or few streaming passes over them. Moreover, building out the full join result (e.g., linking heterogeneous data sources along quasi-identifiers) can lead to a combinatorial explosion of results due to many-to-many links. Random sampling is a natural tool to boil this oversized result down to a representative subset with well-understood statistical properties, but turns out to be a challenging task due to the combinatorial nature of the sampling domain. Existing techniques in the literature focus solely on the setting with tabular data residing in main memory, and do not address aspects such as stream operation, weighted sampling and more general join operators that are urgently needed in a modern data processing context. The main contribution of this work is to meet these needs with more lightweight practical approaches. First, a bijection between the sampling problem and a graph problem is introduced to support weighted sampling and common join operators. Second, the sampling techniques are refined to minimise the number of streaming passes. Third, techniques are presented to deal with very large tables under limited memory. Finally, the proposed techniques are compared to existing approaches that rely on database indices and the results indicate substantial memory savings, reduced runtimes for ad-hoc queries and competitive amortised runtimes
Online Active Linear Regression via Thresholding
We consider the problem of online active learning to collect data for
regression modeling. Specifically, we consider a decision maker with a limited
experimentation budget who must efficiently learn an underlying linear
population model. Our main contribution is a novel threshold-based algorithm
for selection of most informative observations; we characterize its performance
and fundamental lower bounds. We extend the algorithm and its guarantees to
sparse linear regression in high-dimensional settings. Simulations suggest the
algorithm is remarkably robust: it provides significant benefits over passive
random sampling in real-world datasets that exhibit high nonlinearity and high
dimensionality --- significantly reducing both the mean and variance of the
squared error.Comment: Published in AAAI 201
Experiments : design, parametric and nonparametric analysis, and selection
Some general remarks for experimental designs are made. The general statistical methodology of analysis for some special designs is considered. Statistical tests for some specific designs under Normality assumption are indicated. Moreover, nonparametric statistical analyses for some special designs are given. The method of determining the number of observations needed in an experiment is considered in the Normal as well as in the nonparametric situation. Finally, the special topic of designing an experiment in order to select the best out of k(\geq 2) treatments is considered
- …