Search CORE

52 research outputs found

Doctor of Philosophy

Author: Daruki Samira
Publication venue: University of Utah
Publication date: 01/01/2018
Field of study

dissertationThe contributions of this dissertation are centered around designing new algorithms in the general area of sublinear algorithms such as streaming, core sets and sublinear verification, with a special interest in problems arising from data analysis including data summarization, clustering, matrix problems and massive graphs. In the first part, we focus on summaries and coresets, which are among the main techniques for designing sublinear algorithms for massive data sets. We initiate the study of coresets for uncertain data and study coresets for various types of range counting queries on uncertain data. We focus mainly on the indecisive model of locational uncertainty since it comes up frequently in real-world applications when multiple readings of the same object are made. In this model, each uncertain point has a probability density describing its location, defined as

k

distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by examining only this subset. For each type of query we provide coreset constructions with approximation-size trade-offs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based techniques on axis-aligned range queries. In the second part, we focus on designing sublinear-space algorithms for approximate computations on massive graphs. In particular, we consider graph MAXCUT and correlation clustering problems and develop sampling based approaches to construct truly sublinear (

o(n)

) sized coresets for graphs that have polynomial (i.e.,

n^{\delta}

for any

\delta >0

) average degree. Our technique is based on analyzing properties of random induced subprograms of the linear program formulations of the problems. We demonstrate this technique with two examples. Firstly, we present a sublinear sized core set to approximate the value of the MAX CUT in a graph to a

(1+\epsilon)

factor. To the best of our knowledge, all the known methods in this regime rely crucially on near-regularity assumptions. Secondly, we apply the same framework to construct a sublinear-sized coreset for correlation clustering. Our coreset construction also suggests 2-pass streaming algorithms for computing the MAX CUT and correlation clustering objective values which are left as future work at the time of writing this dissertation. Finally, we focus on streaming verification algorithms as another model for designing sublinear algorithms. We give the first polylog space and sublinear (in number of edges) communication protocols for any streaming verification problems in graphs. We present efficient streaming interactive proofs that can verify maximum matching exactly. Our results cover all flavors of matchings (bipartite/ nonbipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP and exact triangle counting, as well as for graph primitives such as the number of connected components, bipartiteness, minimum spanning tree and connectivity. In particular, these are the first results for weighted matchings and for metric TSP in any streaming verification model. Our streaming verifiers use only polylogarithmic space while exchanging only polylogarithmic communication with the prover in addition to the output size of the relevant solution. We also initiate a study of streaming interactive proofs (SIPs) for problems in data analysis and present efficient SIPs for some fundamental problems. We present protocols for clustering and shape fitting including minimum enclosing ball (MEB), width of a point set,

k

-centers and

k

-slab problem. We also present protocols for fundamental matrix analysis problems: We provide an improved protocol for rectangular matrix problems, which in turn can be used to verify

k

(approximate) eigenvectors of an

n \times n

integer matrix

A

. In general our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space

The University of Utah: J. Willard Marriott Digital Library

On the expected diameter, width, and complexity of a stochastic convex-hull

Author: A Jørgensen
C Li
L Huang
M Löffler
P Kamousi
PK Agarwal
PK Agarwal
S Suri
S Suri
Publication venue
Publication date: 01/05/2017
Field of study

We investigate several computational problems related to the stochastic convex hull (SCH). Given a stochastic dataset consisting of

n

points in

\mathbb{R}^d

each of which has an existence probability, a SCH refers to the convex hull of a realization of the dataset, i.e., a random sample including each point with its existence probability. We are interested in computing certain expected statistics of a SCH, including diameter, width, and combinatorial complexity. For diameter, we establish the first deterministic 1.633-approximation algorithm with a time complexity polynomial in both

n

and

d

. For width, two approximation algorithms are provided: a deterministic

O(1)

-approximation running in

O(n^{d+1} \log n)

time, and a fully polynomial-time randomized approximation scheme (FPRAS). For combinatorial complexity, we propose an exact

O(n^d)

-time algorithm. Our solutions exploit many geometric insights in Euclidean space, some of which might be of independent interest

arXiv.org e-Print Archive

Crossref

ε-Kernel Coresets for Stochastic Points

Author: Huang Lingxiao
Li Jian
Phillips Jeff Mark
Wang Haitao
Publication venue: Hosted by Utah State University Libraries
Publication date: 01/01/2016
Field of study

With the dramatic growth in the number of application domains that generate probabilistic, noisy and uncertain data, there has been an increasing interest in designing algorithms for geometric or combinatorial optimization problems over such data. In this paper, we initiate the study of constructing epsilon-kernel coresets for uncertain points. We consider uncertainty in the existential model where each point\u27s location is fixed but only occurs with a certain probability, and the locational model where each point has a probability distribution describing its location. An epsilon-kernel coreset approximates the width of a point set in any direction. We consider approximating the expected width (an ε-EXP-KERNEL), as well as the probability distribution on the width (an (ε, tau)-QUANT-KERNEL) for any direction. We show that there exists a set of O(ε^{-(d-1)/2}) deterministic points which approximate the expected width under the existential and locational models, and we provide efficient algorithms for constructing such coresets. We show, however, it is not always possible to find a subset of the original uncertain points which provides such an approximation. However, if the existential probability of each point is lower bounded by a constant, an ε-EXP-KERNEL is still possible. We also provide efficient algorithms for construct an (ε, τ)-QUANT-KERNEL coreset in nearly linear time. Our techniques utilize or connect to several important notions in probability and geometry, such as Kolmogorov distances, VC uniform convergence and Tukey depth, and may be useful in other geometric optimization problem in stochastic settings. Finally, combining with known techniques, we show a few applications to approximating the extent of uncertain functions, maintaining extent measures for stochastic moving points and some shape fitting problems under uncertainty

arXiv.org e-Print Archive

DigitalCommons@USU

Dagstuhl Research Online Publication Server

Range Queries on Uncertain Data

Author: B Chazelle
B Chazelle
B Chazelle
B Chazelle
G Frederickson
J Driscoll
J Mitchell
M Yiu
P Agarwal
Publication venue
Publication date: 09/01/2015
Field of study

Given a set

P

n

uncertain points on the real line, each represented by its one-dimensional probability density function, we consider the problem of building data structures on

P

to answer range queries of the following three types for any query interval

I

: (1) top-

1

query: find the point in

P

that lies in

I

with the highest probability, (2) top-

k

query: given any integer

k\leq n

as part of the query, return the

k

points in

P

that lie in

I

with the highest probabilities, and (3) threshold query: given any threshold

\tau

as part of the query, return all points of

P

that lie in

I

with probabilities at least

\tau

. We present data structures for these range queries with linear or nearly linear space and efficient query time.Comment: 26 pages. A preliminary version of this paper appeared in ISAAC 2014. In this full version, we also present solutions to the most general case of the problem (i.e., the histogram bounded case), which were left as open problems in the preliminary versio

arXiv.org e-Print Archive

Crossref

Approximating the Distribution of the Median and other Robust Estimators on Uncertain Data

Author: Buchin Kevin
Phillips Jeff M.
Tang Pingfan
Publication venue
Publication date: 01/01/2018
Field of study

Robust estimators, like the median of a point set, are important for data analysis in the presence of outliers. We study robust estimators for locationally uncertain points with discrete distributions. That is, each point in a data set has a discrete probability distribution describing its location. The probabilistic nature of uncertain data makes it challenging to compute such estimators, since the true value of the estimator is now described by a distribution rather than a single point. We show how to construct and estimate the distribution of the median of a point set. Building the approximate support of the distribution takes near-linear time, and assigning probability to that support takes quadratic time. We also develop a general approximation technique for distributions of robust estimators with respect to ranges with bounded VC dimension. This includes the geometric median for high dimensions and the Siegel estimator for linear regression.Comment: Full version of a paper to appear at SoCG 201

arXiv.org e-Print Archive

Pure OAI Repository

Dagstuhl Research Online Publication Server