3 research outputs found
Subsampling in Smoothed Range Spaces
We consider smoothed versions of geometric range spaces, so an element of the
ground set (e.g. a point) can be contained in a range with a non-binary value
in [0,1]. Similar notions have been considered for kernels; we extend them to
more general types of ranges. We then consider approximations of these range
spaces through ϵ-nets and ϵ-samples (aka ϵ-approximations). We characterize
when size bounds for ϵ-samples on kernels can be extended to these more general
smoothed range spaces. We also describe new generalizations for ϵ-nets to these
range spaces and show when results from binary range spaces can carry over to
these smoothed ones.
Comment: This is the full version of the paper which appeared in ALT 2015. 16
pages, 3 figures. In Algorithmic Learning Theory, pp. 224-238. Springer
International Publishing, 201
Visualization of Big Spatial Data using Coresets for Kernel Density Estimates
The size of large, geo-located datasets has reached scales where
visualization of all data points is inefficient. Random sampling is a method to
reduce the size of a dataset, yet it can introduce unwanted errors. We describe
a method for subsampling of spatial data suitable for creating kernel density
estimates from very large data and demonstrate that it results in less error
than random sampling. We also introduce a method to ensure that thresholding of
low values based on sampled data does not omit any regions above the desired
threshold. We demonstrate the effectiveness of our approach using both
artificial and real-world large geospatial datasets.
Doctor of Philosophy
dissertation
Kernel smoothing provides a simple way of finding structure in data sets without imposing a parametric model, for example through nonparametric regression and density estimates. However, in many data-intensive applications the data set can be very large, so evaluating a kernel density estimate or kernel regression over the data set directly can be prohibitively expensive. This dissertation studies how to efficiently find a smaller data set that approximates the original data set with a theoretical guarantee in the kernel smoothing setting, and how to extend this to more general smoothed range spaces. For kernel density estimates, we propose randomized and deterministic algorithms with quality guarantees that are orders of magnitude more efficient than previous algorithms; they do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. Our algorithms are applicable to any large-scale data processing framework. We then investigate how to measure the error between two kernel density estimates, which is usually measured in either L1 or L2 error. In this dissertation, we investigate the challenges of using a stronger error, L∞ (or worst-case) error. We present efficient solutions for estimating the L∞ error and for choosing the bandwidth parameter of a kernel density estimate built on a subsample of a large data set. We next extend smoothed versions of geometric range spaces from kernel range spaces to more general types of ranges, so that an element of the ground set can be contained in a range with a non-binary value in [0,1]. We investigate the approximation of these range spaces through ϵ-nets and ϵ-samples. Finally, we study coreset algorithms for kernel regression. The sizes of the coresets are independent of the size of the data set; they depend only on the error guarantee and, in some cases, on the size of the domain and the amount of smoothing.
We evaluate our methods on very large time series and spatial data, demonstrate that they can be constructed extremely efficiently, and show that they allow for great computational gains.
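As a rough illustration of the setting described in these abstracts, the sketch below builds a kernel density estimate on a plain random subsample and measures how far it is from the estimate on the full data in worst-case (L∞) terms. It is a minimal sketch under stated assumptions, not the algorithms from the works listed: the Gaussian kernel, the synthetic one-dimensional data, the bandwidth of 0.3, and the helper names `gaussian_kde` and `linf_error_on_grid` are all hypothetical choices made for this example. Taking the maximum over a finite grid only gives a lower bound on the true L∞ error.

```python
import numpy as np


def gaussian_kde(points, queries, bandwidth):
    """Evaluate a 1-D Gaussian kernel density estimate built on `points` at `queries`."""
    # Pairwise scaled differences, shape (len(queries), len(points)).
    diffs = (queries[:, None] - points[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Average the kernel contributions of all points for each query.
    return weights.mean(axis=1)


def linf_error_on_grid(data, sample, bandwidth, grid):
    """Largest absolute difference between the two estimates over `grid`.

    This is a lower bound on the true L-infinity error, since only
    grid points are checked.
    """
    full = gaussian_kde(data, grid, bandwidth)
    approx = gaussian_kde(sample, grid, bandwidth)
    return float(np.max(np.abs(full - approx)))


# Hypothetical synthetic data: a mixture of two Gaussians.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 5000), rng.normal(1.0, 1.0, 5000)])
grid = np.linspace(-5.0, 5.0, 201)
bandwidth = 0.3  # assumed value, chosen only for illustration

errors = {}
for size in (100, 1000):
    # Plain uniform sampling without replacement (not a coreset construction).
    sample = rng.choice(data, size=size, replace=False)
    errors[size] = linf_error_on_grid(data, sample, bandwidth, grid)
    print(f"sample size {size}: grid L-infinity error = {errors[size]:.4f}")
```

Uniform random sampling is used here only as the baseline that the abstracts compare against; a coreset construction would replace the `rng.choice` step.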