116 research outputs found
Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms
We present a technical survey of state-of-the-art approaches in data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview of lower bounding techniques.
Probabilistic Smallest Enclosing Ball in High Dimensions via Subgradient Sampling
We study a variant of the median problem for a collection of point sets in high dimensions. This generalizes the geometric median as well as the (probabilistic) smallest enclosing ball (pSEB) problems. Our main objective and motivation is to improve the previously best algorithm for the pSEB problem by reducing its exponential dependence on the dimension to linear. This is achieved via a novel combination of sampling techniques for clustering problems in metric spaces with the framework of stochastic subgradient descent. As a result, the algorithm becomes applicable to shape fitting problems in Hilbert spaces of unbounded dimension via kernel functions. We present an exemplary application by extending the support vector data description (SVDD) shape fitting method to the probabilistic case. This is done by simulating the pSEB algorithm implicitly in the feature space induced by the kernel function
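The combination of sampling with stochastic subgradient descent described in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes the objective is the average, over the input point sets, of the distance from the center to the farthest point of each set, and it uses a plain 1/sqrt(t) step size with iterate averaging. The function name `set_median_sgd` and all parameters are hypothetical.

```python
import math
import random


def set_median_sgd(point_sets, steps=5000, seed=0):
    """Approximate a center c minimizing the average over sets of
    max_{p in set} ||c - p|| by stochastic subgradient descent.

    Assumption (not from the abstract): this objective, the step size
    1/sqrt(t), and the averaging of iterates are illustrative choices.
    """
    rng = random.Random(seed)
    dim = len(point_sets[0][0])
    c = list(point_sets[0][0])  # start at an arbitrary input point
    avg = [0.0] * dim
    for t in range(1, steps + 1):
        # Sample one point set uniformly at random.
        pts = rng.choice(point_sets)
        # The farthest point of the sampled set defines a subgradient.
        far = max(pts, key=lambda p: math.dist(c, p))
        dist = math.dist(c, far)
        if dist > 0.0:
            eta = 1.0 / math.sqrt(t)
            # Unit-length subgradient (c - far) / dist; step against it.
            c = [ci - eta * (ci - pi) / dist for ci, pi in zip(c, far)]
        # Running average of the iterates.
        avg = [a + (ci - a) / t for a, ci in zip(avg, c)]
    return avg


if __name__ == "__main__":
    sets = [[(1.0, 1.0), (-1.0, -1.0)], [(1.0, -1.0), (-1.0, 1.0)]]
    print(set_median_sgd(sets))
```

With singleton sets the same loop reduces to a stochastic subgradient method for the geometric median; with a single set it targets the smallest enclosing ball center.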
Random projections for Bayesian regression
This article deals with random projections applied as a data reduction
technique for Bayesian regression analysis. We show sufficient conditions under
which the entire -dimensional distribution is approximately preserved under
random projections by reducing the number of data points from to in the case . Under mild
assumptions, we prove that evaluating a Gaussian likelihood function based on
the projected data instead of the original data yields a
-approximation in terms of the Wasserstein
distance. Our main result shows that the posterior distribution of Bayesian
linear regression is approximated up to a small error depending on only an
-fraction of its defining parameters. This holds when using
arbitrary Gaussian priors or the degenerate case of uniform distributions over
for . Our empirical evaluations involve different
simulated settings of Bayesian linear regression. Our experiments underline
that the proposed method is able to recover the regression model up to small
error while considerably reducing the total running time.
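As a rough illustration of random projections as a data reduction step for regression, the following sketch compresses the rows of a design matrix with a dense Gaussian projection and then solves ordinary least squares on the reduced data. It covers only the point-estimate side, not the posterior analysis of the article. The Gaussian sketch, the restriction to two columns, and the names `sketch_rows` and `least_squares_2d` are assumptions made for this example.

```python
import random


def sketch_rows(X, y, k, seed=0):
    """Return (SX, Sy) for a k x n Gaussian matrix S with N(0, 1/k) entries.

    S is never stored: each data row is folded into the k sketch rows as
    it is read, so the data can be processed in a single pass.
    """
    rng = random.Random(seed)
    d = len(X[0])
    scale = 1.0 / k ** 0.5
    SX = [[0.0] * d for _ in range(k)]
    Sy = [0.0] * k
    for row, target in zip(X, y):
        for r in range(k):
            s = rng.gauss(0.0, 1.0) * scale
            for j in range(d):
                SX[r][j] += s * row[j]
            Sy[r] += s * target
    return SX, Sy


def least_squares_2d(A, b):
    """Solve min ||A beta - b||_2 for a two-column A via normal equations."""
    a11 = sum(r[0] * r[0] for r in A)
    a12 = sum(r[0] * r[1] for r in A)
    a22 = sum(r[1] * r[1] for r in A)
    c1 = sum(r[0] * t for r, t in zip(A, b))
    c2 = sum(r[1] * t for r, t in zip(A, b))
    det = a11 * a22 - a12 * a12
    return ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)


if __name__ == "__main__":
    X = [[1.0, i / 100.0] for i in range(500)]      # intercept + one feature
    y = [1.0 + 2.0 * row[1] for row in X]           # noise-free line
    SX, Sy = sketch_rows(X, y, k=20)
    print(least_squares_2d(SX, Sy))                 # fitted on 20 rows, not 500
```

On noise-free data the sketched solution coincides with the full solution up to rounding, because the targets lie in the column space of the design matrix. With noise, the sketch size controls the approximation error.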
On large-scale probabilistic and statistical data analysis
In this manuscript we develop and apply modern algorithmic data reduction techniques to tackle scalability issues and enable statistical data analysis of massive data sets. Our algorithms follow a general scheme, where a reduction technique is applied to the large-scale data to obtain a small summary of sublinear size to which a classical algorithm is applied. The techniques for obtaining these summaries depend on the problem that we want to solve. The size of the summaries is usually parametrized by an approximation parameter, expressing the trade-off between efficiency and accuracy. In some cases the data can be reduced to a size that has no or only negligible dependency on the initial number of data items. However, for other problems it turns out that sublinear summaries do not exist in the worst case. In such situations, we exploit statistical or geometric relaxations to obtain useful sublinear summaries under certain mildness assumptions. We present, in particular, the data reduction methods called coresets and subspace embeddings, and several algorithmic techniques to construct these via random projections and sampling
Streaming statistical models via Merge & Reduce
Merge & Reduce is a general algorithmic scheme in the theory of data structures. Its main purpose is to transform static data structures, which support only queries, into dynamic data structures, which allow insertions of new elements, with as little overhead as possible. This can be used to turn classic offline algorithms for summarizing and analyzing data into streaming algorithms. We transfer these ideas to the setting of statistical data analysis in streaming environments. Our approach is conceptually different from previous settings where Merge & Reduce has been employed. Instead of summarizing the data, we combine the Merge & Reduce framework directly with statistical models. This enables performing computationally demanding data analysis tasks on massive data sets. The computations are divided into small tractable batches whose size is independent of the total number of observations n. The results are combined in a structured way at the cost of a bounded O(log n) factor in their memory requirements. It is only necessary, though nontrivial, to choose an appropriate statistical model and design merge and reduce operations on a casewise basis for the specific type of model. We illustrate our Merge & Reduce schemes on simulated and real-world data employing (Bayesian) linear regression models, Gaussian mixture models and generalized linear models.
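The bucket structure behind Merge & Reduce can be sketched as follows. This is the classical data-summary variant, not the model-based variant of the article: the reduce step here is plain uniform subsampling with reweighting, chosen only to keep the example short. The class `MergeReduce`, the helper `reduce_summary`, and the parameter `block_size` are hypothetical names.

```python
import random


def reduce_summary(points, size, rng):
    """Shrink a list of (item, weight) pairs to at most `size` pairs.

    Assumption: uniform sampling stands in for a real reduction. Weights
    are rescaled so that the total weight is preserved.
    """
    if len(points) <= size:
        return list(points)
    total = sum(w for _, w in points)
    sample = rng.sample(points, size)
    scale = total / sum(w for _, w in sample)
    return [(x, w * scale) for x, w in sample]


class MergeReduce:
    """Keep at most one summary per level, like digits of a binary counter."""

    def __init__(self, block_size, seed=0):
        self.block_size = block_size
        self.rng = random.Random(seed)
        self.buffer = []   # raw items of the block currently being filled
        self.levels = []   # levels[i] is a summary or None

    def insert(self, x):
        self.buffer.append((x, 1.0))
        if len(self.buffer) == self.block_size:
            self._carry(self.buffer)
            self.buffer = []

    def _carry(self, summary):
        level = 0
        while True:
            if level == len(self.levels):
                self.levels.append(None)
            if self.levels[level] is None:
                self.levels[level] = summary
                return
            # Two summaries on the same level: merge, reduce, move up.
            merged = self.levels[level] + summary
            self.levels[level] = None
            summary = reduce_summary(merged, self.block_size, self.rng)
            level += 1

    def summary(self):
        out = list(self.buffer)
        for s in self.levels:
            if s is not None:
                out.extend(s)
        return out


if __name__ == "__main__":
    mr = MergeReduce(block_size=16)
    for i in range(1000):
        mr.insert(float(i))
    print(len(mr.summary()), len(mr.levels))
```

Since each level holds at most one summary of `block_size` items and the number of levels grows logarithmically in the number of inserted blocks, the memory footprint matches the O(log n) overhead mentioned in the abstract.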
Glow Discharge Optical Emission Spectrometry (GDOES), an Effectiveness Method for Characterizing Composition of Surfaces and Coatings
Within the frame of this work, the technical procedures and real advantages of using Glow Discharge Optical Emission Spectroscopy (GDOES) for establishing depth concentration profiles of surfaces are presented. GDOES can detect low concentrations with high accuracy. It can be used for either quantitative bulk analysis (QBA) or quantitative depth profiling (QDP) in the nanometer to micron range. Non-conductive and conductive samples can be analysed.
The main applications of this spectral method are related to different technology fields, such as heat treatment processes, casting, heat and cold forming processes, thermochemical treatments, electro-chemical processes (galvanic coatings), chemical and physical vapour depositions (CVD, PVD), thermal oxidation processes and anodizing, thin-films and others.
Optimal Sketching Bounds for Sparse Linear Regression
We study oblivious sketching for -sparse linear regression under various
loss functions such as an norm, or from a broad class of hinge-like
loss functions, which includes the logistic and ReLU losses. We show that for
sparse norm regression, there is a distribution over oblivious
sketches with rows, which is tight up to a
constant factor. This extends to loss with an additional additive
term in the upper bound. This
establishes a surprising separation from the related sparse recovery problem,
which is an important special case of sparse regression. For this problem,
under the norm, we observe an upper bound of rows, showing that sparse recovery is
strictly easier to sketch than sparse regression. For sparse regression under
hinge-like loss functions including sparse logistic and sparse ReLU regression,
we give the first known sketching bounds that achieve rows showing that
rows suffice, where
is a natural complexity parameter needed to obtain relative error bounds for
these loss functions. We again show that this dimension is tight, up to lower
order terms and the dependence on . Finally, we show that similar
sketching bounds can be achieved for LASSO regression, a popular convex
relaxation of sparse regression, where one aims to minimize
over . We show that sketching
dimension suffices and that the dependence
on and is tight. Comment: AISTATS 202