155 research outputs found

Structure-Aware Sampling: Flexible and Accurate Summarization

Author: Cohen Edith
Cormode Graham
Duffield Nick
Publication date: 01/01/2011

In processing large quantities of data, a fundamental problem is to obtain a summary which supports approximate query answering. Random sampling yields flexible summaries which naturally support subset-sum queries with unbiased estimators and well-understood confidence bounds. Classic sample-based summaries, however, are designed for arbitrary subset queries and are oblivious to the structure in the set of keys. The particular structure, such as hierarchy, order, or product space (multi-dimensional), makes range queries much more relevant for most analysis of the data. Dedicated summarization algorithms for range-sum queries have also been extensively studied. They can outperform existing sampling schemes in terms of accuracy on range queries per summary size. Their accuracy, however, rapidly degrades when, as is often the case, the query spans multiple ranges. They are also less flexible, being targeted at range-sum queries alone, and are often quite costly to build and use. In this paper we propose and evaluate variance-optimal sampling schemes that are structure-aware. These summaries improve over the accuracy of existing structure-oblivious sampling schemes on range queries while retaining the benefits of sample-based summaries: flexible summaries, with high accuracy on both range queries and arbitrary subset queries.
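
As a rough illustration of the sample-based summaries this abstract starts from, the sketch below shows a structure-oblivious baseline, not the structure-aware scheme the paper proposes. It assumes Poisson sampling with inclusion probability proportional to weight and a Horvitz-Thompson estimator; the function names and the threshold parameter are hypothetical and do not come from the paper.

```python
import random


def poisson_sample(weights, threshold, rng):
    """Keep each key independently with probability p = min(1, w / threshold).

    Returns {key: adjusted_weight}, where adjusted_weight = w / p is the
    Horvitz-Thompson estimate of that key's weight. Keys that are not
    sampled are absent and implicitly contribute 0.
    """
    sample = {}
    for key, w in weights.items():
        p = min(1.0, w / threshold)
        if rng.random() < p:
            sample[key] = w / p
    return sample


def estimate_subset_sum(sample, predicate):
    """Unbiased estimate of the total weight of keys satisfying predicate."""
    return sum(adj for key, adj in sample.items() if predicate(key))


if __name__ == "__main__":
    # Hypothetical data: 1000 integer keys with small positive weights.
    weights = {k: 1 + (k % 7) for k in range(1000)}

    def in_range(k):
        # A range query over the key order.
        return 100 <= k < 300

    exact = sum(w for k, w in weights.items() if in_range(k))
    summary = poisson_sample(weights, threshold=10.0, rng=random.Random(1))
    print("summary size:", len(summary))
    print("exact:", exact, "estimate:", estimate_subset_sum(summary, in_range))
```

The same summary answers any subset query, not only ranges, which is the flexibility the abstract attributes to sampling. The sample here is drawn without regard to where keys fall in the order, which is the obliviousness the abstract describes.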

arXiv.org e-Print Archive

CiteSeerX

Doctor of Philosophy

Author: Jestes Jeffrey
Publication venue: University of Utah
Publication date: 01/12/2013

We are living in an age where data are being generated faster than anyone has previously imagined across a broad application domain, including customer studies, social media, sensor networks, and the sciences, among many others. In some cases, data are generated in massive quantities as terabytes or petabytes. There have been numerous emerging challenges when dealing with massive data, including: (1) the explosion in size of data; (2) data have increasingly more complex structures and rich semantics, such as representing temporal data as a piecewise linear representation; (3) uncertain data are becoming a common occurrence for numerous applications, e.g., scientific measurements or observations such as meteorological measurements; and (4) data are becoming increasingly distributed, e.g., distributed data collected and integrated from distributed locations as well as data stored in a distributed file system within a cluster. Due to the massive nature of modern data, it is oftentimes infeasible for computers to efficiently manage and query them exactly. An attractive alternative is to use data summarization techniques to construct data summaries, where even efficiently constructing data summaries is a challenging task given the enormous size of data. The data summaries we focus on in this thesis include the histogram and ranking operator. Both data summaries enable us to summarize a massive dataset to a more succinct representation which can then be used to make queries orders of magnitude more efficient while still allowing approximation guarantees on query answers. Our study has focused on the critical task of designing efficient algorithms to summarize, query, and manage massive data.

The Incremental Multiresolution Matrix Factorization Algorithm

Author: Ithapu Vamsi K.
Johnson Sterling C.
Kondor Risi
Singh Vikas
Publication date: 16/05/2017

Multiresolution analysis and matrix factorization are foundational tools in computer vision. In this work, we study the interface between these two distinct topics and obtain techniques to uncover hierarchical block structure in symmetric matrices -- an important aspect in the success of many vision problems. Our new algorithm, the incremental multiresolution matrix factorization, uncovers such structure one feature at a time, and hence scales well to large matrices. We describe how this multiscale analysis goes much farther than what a direct global factorization of the data can identify. We evaluate the efficacy of the resulting factorizations for relative leveraging within regression tasks using medical imaging data. We also use the factorization on representations learned by popular deep networks, providing evidence of their ability to infer semantic relationships even when they are not explicitly trained to do so. We show that this algorithm can be used as an exploratory tool to improve the network architecture, and within numerous other settings in vision.
Comment: Computer Vision and Pattern Recognition (CVPR) 2017, 10 pages

arXiv.org e-Print Archive

Building wavelet histograms on large data in MapReduce

Author: Abouzeid A.
Afrati F. N.
Aggarwal C. C.
Alon N.
Arlitt M.
Cao P.
Chaiken R.
Chakrabarti K.
Cohen J.
Condie T.
Cormode G.
Dean J.
Dittrich J.
Garofalakis M.
Gates A. F.
Gilbert A. C.
Guha S.
Huang Z.
Jagadish H. V.
Jiang D.
Matias Y.
Michel S.
Olston C.
Patt-Shamir B.
Pavlo A.
Poosala V.
Son M.
Srinivasan R.
Thusoo A.
Vapnik V. N.
Vernica R.
Zhao Q.
Publication venue: VLDB Endowment

Sparse recovery using sparse matrices

Author: Piotr Indyk
Radu Berinde
Publication date: 01/01/2008

We consider the approximate sparse recovery problem, where the goal is to (approximately) recover a high-dimensional vector x from its lower-dimensional sketch Ax. A popular way of performing this recovery is by finding x* such that Ax=Ax*, and ||x*||_1 is minimal. It is known that this approach "works" if A is a random *dense* matrix, chosen from a proper distribution. In this paper, we investigate this procedure for the case where A is binary and *very sparse*. We show that, both in theory and in practice, sparse matrices are essentially as "good" as the dense ones. At the same time, sparse binary matrices provide additional benefits, such as reduced encoding and decoding time.
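
To make the encoding-time remark concrete, here is a minimal sketch of computing the sketch Ax when A is binary with a fixed number d of ones per column. It covers the encoding step only; the l1-minimization recovery needs a linear-programming solver and is not shown. The construction (d random distinct rows per column) and all names are assumptions made for illustration, not the paper's exact matrices.

```python
import random


def make_sparse_binary_columns(n, m, d, seed=0):
    """Represent an m-by-n binary matrix A by its columns.

    For each of the n columns, store the d distinct row indices that
    hold a 1. (Assumed construction: rows are chosen uniformly at random.)
    """
    rng = random.Random(seed)
    return [rng.sample(range(m), d) for _ in range(n)]


def sketch(columns, m, x_sparse):
    """Compute y = A x for x given as {index: value}.

    Only the nonzero entries of x are visited, and each touches d rows,
    so the cost is O(d * nnz(x)) rather than O(m * n) for a dense A.
    """
    y = [0] * m
    for j, value in x_sparse.items():
        for row in columns[j]:
            y[row] += value
    return y


if __name__ == "__main__":
    n, m, d = 10_000, 200, 8          # hypothetical dimensions
    columns = make_sparse_binary_columns(n, m, d)
    x = {17: 5, 4242: -3, 9001: 2}    # a 3-sparse vector in n dimensions
    y = sketch(columns, m, x)
    print("sketch length:", len(y), "nonzero sketch entries:", sum(v != 0 for v in y))
```

Storing only the positions of the ones also keeps the matrix itself small: n * d integers instead of m * n entries.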

CiteSeerX