Subsampling in Smoothed Range Spaces

Author: B Aronov
B Chazelle
D Haussler
J Beck
J Edmonds
J Matoušek
J Pach
N Alon
V Vapnik
Y Li
Publication date: 30/10/2015

We consider smoothed versions of geometric range spaces, so an element of the ground set (e.g. a point) can be contained in a range with a non-binary value in $[0,1]$. Similar notions have been considered for kernels; we extend them to more general types of ranges. We then consider approximations of these range spaces through $\varepsilon$-nets and $\varepsilon$-samples (aka $\varepsilon$-approximations). We characterize when size bounds for $\varepsilon$-samples on kernels can be extended to these more general smoothed range spaces. We also describe new generalizations for $\varepsilon$-nets to these range spaces and show when results from binary range spaces can carry over to these smoothed ones.

Comment: This is the full version of the paper which appeared in ALT 2015. 16 pages, 3 figures. In Algorithmic Learning Theory, pp. 224-238. Springer International Publishing, 201

arXiv.org e-Print Archive

CiteSeerX

Crossref

Visualization of Big Spatial Data using Coresets for Kernel Density Estimates

Author: Lex Alexander
Ou Yi
Phillips Jeff M.
Zheng Yan
Publication date: 13/09/2017

The size of large, geo-located datasets has reached scales where visualization of all data points is inefficient. Random sampling is a method to reduce the size of a dataset, yet it can introduce unwanted errors. We describe a method for subsampling of spatial data suitable for creating kernel density estimates from very large data and demonstrate that it results in less error than random sampling. We also introduce a method to ensure that thresholding of low values based on sampled data does not omit any regions above the desired threshold. We demonstrate the effectiveness of our approach using both artificial and real-world large geospatial datasets.
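The quantity this abstract compares, the error of a kernel density estimate built on a subsample against the estimate on the full data, can be illustrated with a minimal sketch. It uses uniform random sampling, which is the baseline the abstract mentions and not the authors' subsampling method. The Gaussian kernel, the bandwidth `sigma`, the synthetic data, and the function names `kde` and `max_kde_error` are all assumptions made for illustration. The error is measured only on a finite grid of query points, so it is a lower bound on the true worst-case error.

```python
import numpy as np


def kde(points, queries, sigma):
    """Average Gaussian kernel value of each query against all points.

    The kernel is scaled so that K(p, p) = 1, so every estimate lies in [0, 1].
    """
    # Pairwise squared distances, shape (num_queries, num_points).
    diff = queries[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2)).mean(axis=1)


def max_kde_error(points, sample, queries, sigma):
    """Largest absolute gap between the two estimates over the query set."""
    gap = kde(points, queries, sigma) - kde(sample, queries, sigma)
    return float(np.max(np.abs(gap)))


# Synthetic 2-D data standing in for a large spatial data set (assumption).
rng = np.random.default_rng(0)
points = rng.normal(size=(2000, 2))

# Query points on a regular 25 x 25 grid covering most of the data.
axis = np.linspace(-3.0, 3.0, 25)
queries = np.array([(x, y) for x in axis for y in axis])

# Baseline: a uniform random subsample of 200 points, without replacement.
idx = rng.choice(len(points), size=200, replace=False)
sample = points[idx]

err = max_kde_error(points, sample, queries, sigma=0.5)
print(f"max grid error of the random subsample: {err:.4f}")
```

Repeating the last three statements with different sample sizes gives a rough picture of how the grid error of random sampling shrinks as the subsample grows.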

arXiv.org e-Print Archive

Crossref

Doctor of Philosophy

Author: Zheng Yan
Publication venue: University of Utah
Publication date: 01/01/2017

Kernel smoothing provides a simple way of finding structure in data sets without imposing a parametric model, for example through nonparametric regression and density estimates. However, in many data-intensive applications the data set can be large, so evaluating a kernel density estimate or kernel regression directly over the data set can be prohibitively expensive. This dissertation studies how to efficiently find a smaller data set that approximates the original data set with a theoretical guarantee in the kernel smoothing setting, and how to extend this to more general smoothed range spaces. For kernel density estimates, we propose randomized and deterministic algorithms with quality guarantees that are orders of magnitude more efficient than previous algorithms, do not require knowledge of the kernel or its bandwidth parameter, and are easily parallelizable. Our algorithms are applicable to any large-scale data processing framework. We then investigate how to measure the error between two kernel density estimates, which is usually measured in either L1 or L2 error. In this dissertation, we investigate the challenges in using a stronger error, the L∞ (or worst-case) error. We present efficient solutions for how to estimate the L∞ error and how to choose the bandwidth parameter for a kernel density estimate built on a subsample of a large data set. We next extend smoothed versions of geometric range spaces from kernel range spaces to more general types of ranges, so that an element of the ground set can be contained in a range with a non-binary value in [0,1]. We investigate the approximation of these range spaces through ϵ-nets and ϵ-samples. Finally, we study coreset algorithms for kernel regression. The sizes of the coresets are independent of the size of the data set; rather, they depend only on the error guarantee and, in some cases, on the size of the domain and the amount of smoothing.
We evaluate our methods on very large time series and spatial data, demonstrate that they can be constructed extremely efficiently, and show that they allow for great computational gains.

The University of Utah: J. Willard Marriott Digital Library