
    Multi-Resolution Hashing for Fast Pairwise Summations

    A basic computational primitive in the analysis of massive datasets is summing simple functions over a large number of objects. Modern applications pose an additional challenge in that such functions often depend on a parameter vector $y$ (query) that is unknown a priori. Given a set of points $X \subset \mathbb{R}^{d}$ and a pairwise function $w:\mathbb{R}^{d}\times \mathbb{R}^{d}\to [0,1]$, we study the problem of designing a data structure that enables sublinear-time approximation of the summation $Z_{w}(y)=\frac{1}{|X|}\sum_{x\in X}w(x,y)$ for any query $y\in \mathbb{R}^{d}$. By combining ideas from Harmonic Analysis (partitions of unity and approximation theory) with Hashing-Based Estimators [Charikar, Siminelakis FOCS'17], we provide a general framework for designing such data structures through hashing that reaches far beyond what previous techniques allowed. A key design principle is a collection of $T\geq 1$ hashing schemes with collision probabilities $p_{1},\ldots,p_{T}$ such that $\sup_{t\in [T]}\{p_{t}(x,y)\} = \Theta(\sqrt{w(x,y)})$. This leads to a data structure that approximates $Z_{w}(y)$ using a sub-linear number of samples from each hash family. Using this new framework along with Distance Sensitive Hashing [Aumuller, Christiani, Pagh, Silvestri PODS'18], we show that such a collection can be constructed and evaluated efficiently for any log-convex function $w(x,y)=e^{\phi(\langle x,y\rangle)}$ of the inner product on the unit sphere $x,y\in \mathcal{S}^{d-1}$. Our method leads to data structures with sub-linear query time that significantly improve upon random sampling and can be used for Kernel Density or Partition Function Estimation. We provide extensions of our result from the sphere to $\mathbb{R}^{d}$ and from scalar functions to vector functions. Comment: 39 pages, 3 figures
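
    At the core of the hashing-based approach is a simple estimator: hash the dataset, look up the query's bucket, sample one colliding point uniformly, and importance-weight it by the known collision probability. The sketch below illustrates only that one-level primitive, not the paper's multi-resolution construction; it assumes SimHash (random-hyperplane LSH) as the hash family and the illustrative kernel $w(x,y)=e^{\langle x,y\rangle - 1}$ on the unit sphere, and all function names are hypothetical.

```python
# Minimal sketch of a hashing-based estimator for Z_w(y) = (1/|X|) * sum_x w(x, y).
# Assumptions: SimHash as the hash family, w(x, y) = exp(<x, y> - 1); names illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n, k, m = 16, 5000, 4, 400        # dimension, dataset size, hash bits, repetitions

def unit(v):                          # project vectors (rows) onto the unit sphere S^{d-1}
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

X = unit(rng.normal(size=(n, d)))
y = unit(rng.normal(size=d))

def w(x, y):                          # a log-convex kernel of the inner product, valued in (0, 1]
    return np.exp(x @ y - 1.0)

def simhash(points, planes):          # k-bit SimHash code for each point
    bits = (points @ planes.T) >= 0
    return bits @ (1 << np.arange(planes.shape[0]))

def collision_prob(x, y, k):          # exact SimHash collision probability: (1 - theta/pi)^k
    theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
    return (1.0 - theta / np.pi) ** k

def hbe_estimate(X, y, k, m):
    """Average of m independent one-sample hashing-based estimates of Z_w(y)."""
    estimates = []
    for _ in range(m):
        planes = rng.normal(size=(k, X.shape[1]))
        codes = simhash(X, planes)
        bucket = np.flatnonzero(codes == simhash(y[None, :], planes)[0])
        if bucket.size == 0:                       # an empty bucket contributes 0
            estimates.append(0.0)
            continue
        x = X[rng.choice(bucket)]                  # uniform sample from the query's bucket
        # Importance-weight the sampled point by its known collision probability.
        estimates.append(bucket.size / len(X) * w(x, y) / collision_prob(x, y, k))
    return float(np.mean(estimates))

exact = float(np.mean([w(x, y) for x in X]))
print("exact   Z_w(y):", round(exact, 4))
print("HBE estimate  :", round(hbe_estimate(X, y, k, m), 4))
```

    Each one-sample estimate above is unbiased for $Z_{w}(y)$; the role of the multi-resolution collection described in the abstract is to control the variance of such estimates so that a sub-linear number of samples suffices for any query.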

    Methods of Uncertainty Quantification for Physical Parameters

    Uncertainty Quantification (UQ) is an umbrella term referring to a broad class of methods which typically combine computational modeling, experimental data and expert knowledge to study a physical system. A parameter, in the usual statistical sense, is said to be physical if it has a meaningful interpretation with respect to the physical system. Physical parameters can be viewed as inherent properties of a physical process and have a corresponding true value. Statistical inference for physical parameters is a challenging problem in UQ due to the inadequacy of the computer model. In this thesis, we provide a comprehensive overview of the existing relevant UQ methodology. The computer model is often time-consuming, proprietary or classified, and therefore a cheap-to-evaluate emulator is needed. When the input space is large, Gaussian process (GP) emulation may be infeasible, and the predominant local approximate GP (LA-GP) framework is too slow for prediction when MCMC is used for posterior sampling. We propose two modifications to this LA-GP framework which can be used to construct a cheap-to-evaluate emulator for the computer model, offering the user a simple and flexible trade of computation time for memory. When the field data consist of measurements across a set of experiments, it is common for a subset of the computer model inputs to represent measurements of a physical component, recorded with error. When this structure is present, we propose a new metric for identifying overfitting and a related regularization prior distribution. We show that this approach leads to improved inference for the compressibility parameters of tantalum. We propose an approximate Bayesian framework, referred to as modularization, which is shown to be useful for exploring dependencies between physical and nuisance parameters, with respect to the inadequacy of the computer model and the available prior information. We discuss a cross-validation (CV) framework, modified to account for spatial (or temporal) structure, and show that it can aid in the construction of empirical Bayes priors for the model discrepancy. This CV framework can be coupled with modularization to assess the sensitivity of physical parameters to the discrepancy-related modeling choices.
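
    The emulation step is the piece that translates most directly into code: fit a statistical surrogate to a handful of expensive simulator runs and use its cheap posterior mean for prediction. The sketch below shows only the basic GP emulation idea under simplifying assumptions (a toy one-dimensional "computer model", a squared-exponential kernel with fixed hyperparameters, noise-free runs); it is not the LA-GP construction modified in the thesis.

```python
# Minimal sketch of GP emulation: a cheap surrogate fitted to a few expensive runs.
# Assumptions: toy 1-D simulator, squared-exponential kernel, fixed hyperparameters.
import numpy as np

def computer_model(x):                        # stand-in for an expensive simulator
    return np.sin(3.0 * x) + 0.5 * x

def sq_exp_kernel(a, b, ell=0.3, var=1.0):    # squared-exponential covariance
    d2 = (a[:, None] - b[None, :]) ** 2
    return var * np.exp(-0.5 * d2 / ell**2)

# A few "expensive" design runs of the simulator.
x_train = np.linspace(0.0, 2.0, 8)
y_train = computer_model(x_train)

# GP posterior mean at new inputs (noise-free interpolation with a small jitter).
x_test = np.linspace(0.0, 2.0, 200)
K = sq_exp_kernel(x_train, x_train) + 1e-8 * np.eye(len(x_train))
K_star = sq_exp_kernel(x_test, x_train)
emulator_mean = K_star @ np.linalg.solve(K, y_train)

print("max emulation error:", float(np.max(np.abs(emulator_mean - computer_model(x_test)))))
```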

    Scalable Nearest Neighbor Search with Compact Codes

    An important characteristic of the recent decade is the dramatic growth in the use and generation of data. From collections of images, documents and videos, to genetic data, and to network traffic statistics, modern technologies and cheap storage have made it possible to accumulate huge datasets. But how can we effectively use all this data? The growing sizes of modern datasets make it crucial to develop new algorithms and tools capable of sifting through this data efficiently. A central computational primitive for analyzing large datasets is the Nearest Neighbor Search problem, in which the goal is to preprocess a set of objects so that later, given a query object, one can find the data object closest to the query. In most situations involving high-dimensional objects, exhaustive search, which compares the query with every item in the dataset, has a prohibitive cost in both runtime and memory. This thesis focuses on the design of algorithms and tools for fast and cost-efficient nearest neighbor search. The proposed techniques advocate the use of compressed and discrete codes for representing the neighborhood structure of data in a compact way. Transforming high-dimensional items, such as raw images, into similarity-preserving compact codes has both computational and storage advantages: compact codes can be stored efficiently using only a few bits per data item and, more importantly, they can be compared extremely fast using bit-wise or look-up-table operators. Motivated by this view, the present work explores two main research directions: 1) finding mappings that better preserve the given notion of similarity while keeping the codes as compressed as possible, and 2) building efficient data structures that support non-exhaustive search among the compact codes. Our large-scale experimental results, reported on various benchmarks including datasets of up to one billion items, show a boost in retrieval performance in comparison to the state-of-the-art.
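
    As a concrete illustration of why compact codes help, the sketch below maps vectors to short binary codes with random hyperplanes, scans the codes with XOR-and-popcount Hamming distances, and re-ranks a small shortlist with exact distances. This is only a toy, assumed setup (random data, a linear scan rather than the non-exhaustive index structures developed in the thesis), and the names are illustrative.

```python
# Minimal sketch of search with compact binary codes: encode, Hamming scan, re-rank.
# Assumptions: random-hyperplane (LSH-style) codes, random data, brute-force code scan.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_bits = 100_000, 64, 32

X = rng.normal(size=(n, d)).astype(np.float32)
q = rng.normal(size=d).astype(np.float32)

# Similarity-preserving binary codes via random hyperplanes.
planes = rng.normal(size=(n_bits, d)).astype(np.float32)
weights = 1 << np.arange(n_bits)                   # bit weights; fits in int64 for n_bits <= 62

def encode(v):
    bits = (v @ planes.T) >= 0                     # sign pattern against the random hyperplanes
    return bits @ weights                          # pack n_bits booleans into one integer code

codes = encode(X)                                  # one small integer per data item
q_code = encode(q[None, :])[0]

# Compare codes with XOR + popcount instead of d-dimensional float distances.
diff = (codes ^ q_code).astype(np.uint64)
hamming = np.unpackbits(diff.view(np.uint8).reshape(-1, 8), axis=1).sum(axis=1)

# Shortlist by Hamming distance, then re-rank the few survivors exactly.
candidates = np.argsort(hamming)[:10]
best = candidates[np.argmin(np.linalg.norm(X[candidates] - q, axis=1))]
print("approximate nearest neighbor index:", int(best))
```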