
    GMLS-Nets: A framework for learning from unstructured data

    Data fields sampled on irregularly spaced points arise in many applications in the sciences and engineering. For regular grids, Convolutional Neural Networks (CNNs) have been used successfully, gaining benefits from weight sharing and invariances. We generalize CNNs by introducing methods for data on unstructured point clouds based on Generalized Moving Least Squares (GMLS). GMLS is a non-parametric technique for estimating linear bounded functionals from scattered data, and has recently been used in the literature for solving partial differential equations. By parameterizing the GMLS estimator, we obtain learning methods for operators with unstructured stencils. In GMLS-Nets the necessary calculations are local, readily parallelizable, and the estimator is supported by a rigorous approximation theory. We show how the framework may be used on unstructured physical data sets to perform functional regression, to identify associated differential operators, and to regress quantities of interest. The results suggest these architectures are an attractive foundation for data-driven model development in scientific machine learning applications.
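    To make the idea concrete, here is a minimal NumPy sketch of a GMLS-style layer, assuming a 1D point cloud, a quadratic monomial basis, and Gaussian locality weights; all names and parameter choices are illustrative rather than taken from the paper. The local weighted least-squares fit makes the coefficients a linear functional of the sampled data, and a single shared weight vector applied to those coefficients plays the role of a CNN filter on an unstructured stencil.

    ```python
    # Minimal sketch of a GMLS-style layer (illustrative, not the paper's code).
    import numpy as np

    def poly_basis(dx):
        """Monomial basis up to degree 2 in 1D: [1, dx, dx^2]."""
        return np.stack([np.ones_like(dx), dx, dx**2], axis=-1)

    def gmls_coefficients(points, values, center, eps=0.1):
        """Weighted least-squares fit of a local polynomial around `center`.
        The returned coefficients are a linear functional of the data."""
        dx = points - center
        w = np.exp(-(dx / eps) ** 2)            # smooth locality weights
        P = poly_basis(dx)                      # (n_points, n_basis)
        A = P.T @ (w[:, None] * P)              # weighted normal equations
        b = P.T @ (w * values)
        return np.linalg.solve(A, b)

    def gmls_layer(points, values, theta):
        """Apply one shared weight vector `theta` to the per-point GMLS
        coefficients -- the unstructured analogue of a shared CNN filter."""
        return np.array([theta @ gmls_coefficients(points, values, x)
                         for x in points])

    # Toy forward pass: theta = [0, 0, 2] reads off twice the quadratic
    # coefficient, i.e. an estimate of the operator d^2u/dx^2.
    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 1.0, 60))      # irregularly spaced samples
    u = np.sin(2 * np.pi * x)
    d2u = gmls_layer(x, u, theta=np.array([0.0, 0.0, 2.0]))
    print(d2u[:3], -(2 * np.pi) ** 2 * u[:3])   # estimate vs. analytic values
    ```

    In a learning setting, `theta` (and possibly the weight kernel) would be trained rather than fixed, with the same parameters shared across all points.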

    Lossless Differential Compression for Synchronizing Arbitrary Single-Dimensional Strings

    Differential compression expresses a modified document as differences relative to another version of the document. A compressed string requires space proportional to the amount of change, irrespective of the original document sizes. The purpose of this study was to determine which algorithms are suitable for universal lossless differential compression for synchronizing two arbitrary documents, either locally or remotely. The two main problems in differential compression are finding the differences (differencing) and compactly communicating the differences (encoding). We discussed local differencing algorithms based on subsequence searching, hashtable lookups, suffix searching, and projection. We also discussed probabilistic remote algorithms based on both recursive comparison and characteristic polynomial interpolation of hashes computed from variable-length content-defined substrings. We described various heuristics for approximating the optimal algorithms, as arbitrarily long strings and memory limitations force discarding information. The discussion also included compact delta encoding and in-place reconstruction. We presented results from empirical testing of the discussed algorithms. The conclusions were that multiple algorithms need to be integrated into a hybrid implementation, which heuristically chooses among them based on an evaluation of the input data. Algorithms based on hashtable lookups are faster on average and require less memory, but algorithms based on suffix searching find the smallest differences. Interpolating characteristic polynomials was found to be too slow for general use. With remote hash comparison, content-defined chunks and recursive comparison can reduce protocol overhead. A differential compressor should be merged with a state-of-the-art non-differential compressor to enable more compact delta encoding. Input should be processed multiple times to allow a constant space bound without significant reduction in compression efficiency. The compression efficiency of current popular synchronizers could be improved, as our empirical testing showed that a non-differential compressor produced smaller files without having access to one of the two strings.
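    As a concrete illustration of the hashtable-lookup family mentioned above, here is a hedged Python sketch of greedy differencing with a copy/insert delta; the block size and the in-memory delta format are assumptions for illustration, not a format from the thesis.

    ```python
    # Hashtable-based greedy differencing with a COPY/INSERT delta (sketch).
    BLOCK = 8  # illustrative fixed block size for the source index

    def make_delta(src: bytes, dst: bytes):
        """Index fixed-size source blocks in a hash table, then scan the
        target, extending matches greedily and emitting COPY/INSERT ops."""
        index = {}
        for i in range(0, len(src) - BLOCK + 1):
            index.setdefault(src[i:i + BLOCK], i)   # keep first occurrence
        delta, lit, j = [], bytearray(), 0
        while j < len(dst):
            i = index.get(dst[j:j + BLOCK])
            if i is None:
                lit.append(dst[j]); j += 1          # no match: literal byte
                continue
            n = BLOCK                               # extend the match forward
            while i + n < len(src) and j + n < len(dst) and src[i + n] == dst[j + n]:
                n += 1
            if lit:
                delta.append(("INSERT", bytes(lit))); lit = bytearray()
            delta.append(("COPY", i, n)); j += n
        if lit:
            delta.append(("INSERT", bytes(lit)))
        return delta

    def apply_delta(src: bytes, delta) -> bytes:
        out = bytearray()
        for op in delta:
            if op[0] == "COPY":
                _, i, n = op; out += src[i:i + n]
            else:
                out += op[1]
        return bytes(out)

    src = b"the quick brown fox jumps over the lazy dog"
    dst = b"the quick red fox jumps over the lazy dogs"
    d = make_delta(src, dst)
    assert apply_delta(src, d) == dst
    print(d)    # [('COPY', 0, 10), ('INSERT', b'red'), ('COPY', 15, 28), ('INSERT', b's')]
    ```

    A suffix-search differencer would typically find smaller deltas than this block-hash scan at a higher memory cost, mirroring the trade-off reported in the conclusions above.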

    Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation

    This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P\ast\mathcal{N}_\sigma$, for $\mathcal{N}_\sigma\triangleq\mathcal{N}(0,\sigma^2\mathrm{I}_d)$, by $\hat{P}_n\ast\mathcal{N}_\sigma$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $\chi^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at rate $e^{O(d)}n^{-\frac{1}{2}}$, in remarkable contrast to a typical $n^{-\frac{1}{d}}$ rate for unsmoothed $\mathsf{W}_1$ (and $d\ge 3$). For the KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and $\chi^2$-divergence, the convergence rate is $e^{O(d)}n^{-1}$, but only if $P$ achieves finite input-output $\chi^2$ mutual information across the additive white Gaussian noise channel. If the latter condition is not met, the rate changes to $\omega(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while the $\chi^2$-divergence becomes infinite - a curious dichotomy. As a main application we consider estimating the differential entropy $h(P\ast\mathcal{N}_\sigma)$ in the high-dimensional regime. The distribution $P$ is unknown, but $n$ i.i.d. samples from it are available. We first show that any good estimator of $h(P\ast\mathcal{N}_\sigma)$ must have sample complexity that is exponential in $d$. Using the empirical approximation results, we then show that the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)}n^{-\frac{1}{2}}$, thus establishing the minimax rate-optimality of the plug-in. Numerical results demonstrating a significant empirical superiority of the plug-in approach over general-purpose differential entropy estimators are provided. Comment: arXiv admin note: substantial text overlap with arXiv:1810.1158
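    For intuition about the plug-in estimator analyzed above, the following hedged Python sketch computes $h(\hat{P}_n\ast\mathcal{N}_\sigma)$: the smoothed empirical measure is an $n$-component Gaussian mixture, whose differential entropy is approximated by Monte Carlo. The sample sizes and the choice of a Gaussian $P$ are illustrative assumptions, not the paper's experimental setup.

    ```python
    # Hedged sketch of the plug-in estimator h(P_hat_n * N_sigma).
    import numpy as np
    from scipy.special import logsumexp

    def plugin_entropy(samples, sigma, m=5000, rng=None):
        """Monte Carlo estimate (in nats) of the differential entropy of
        the n-component Gaussian mixture P_hat_n * N(0, sigma^2 I_d)."""
        rng = rng or np.random.default_rng()
        n, d = samples.shape
        # Draw from the mixture: pick a sample uniformly, add Gaussian noise.
        x = samples[rng.integers(0, n, m)] + sigma * rng.standard_normal((m, d))
        # Mixture log-density at each draw, via a stable log-sum-exp.
        sq = ((x[:, None, :] - samples[None, :, :]) ** 2).sum(-1)   # (m, n)
        log_g = (logsumexp(-sq / (2 * sigma**2), axis=1)
                 - 0.5 * d * np.log(2 * np.pi * sigma**2) - np.log(n))
        return -log_g.mean()                    # h(g) = -E_g[log g(X)]

    rng = np.random.default_rng(1)
    d, sigma = 2, 0.5
    samples = rng.standard_normal((500, d))     # toy case: P = N(0, I_d)
    est = plugin_entropy(samples, sigma, rng=rng)
    true = 0.5 * d * np.log(2 * np.pi * np.e * (1 + sigma**2))
    print(f"plug-in: {est:.3f}  true h(N(0,(1+sigma^2)I)): {true:.3f}")
    ```

    In the Gaussian toy case the target $h(P\ast\mathcal{N}_\sigma)$ is known in closed form, which makes the plug-in's bias and variance easy to eyeball as $n$ grows.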

    Learning robust and efficient point cloud representations

    The abstract is in the attachment.

    Efficient Layouts and Algorithms for Managing Versioned Datasets

    Version control systems were primarily designed to keep track of and provide control over changes to source code, and have since provided an excellent way to combat the problem of sharing and editing files in a collaborative setting. The recent surge in data-driven decision making has resulted in a proliferation of datasets, elevating them to the level of source code, which in turn has led data analysts to resort to version control systems for storing and managing datasets and their versions over time. Unfortunately, existing version control systems handle large datasets poorly, primarily due to the underlying assumption that the stored files are relatively small text files with localized changes. Moreover, the algorithms used by these systems tend to be fairly simple, leading to suboptimal performance when applied to large datasets. To address these shortcomings, a key requirement is a Dataset Version Control System (DVCS) that serves as a common platform enabling data analysts to efficiently store and query dataset versions, track changes to datasets, and share datasets between users with ease. Towards this goal, we address the fundamental problem of designing storage layouts for a wide range of datasets to serve as the primary building block of an efficient and scalable DVCS. The key problem in this setting is to compactly store a large number of dataset versions and efficiently retrieve any specific version (or a collection of partial versions). We initiate our study by considering storage-retrieval trade-offs for versions of unstructured datasets such as text files, blobs, etc., where the notion of a partial version is not well-defined. Next, we consider array datasets, i.e., collections of temporal snapshots (or versions) of multi-dimensional arrays, where the data is predominantly represented in single- or double-precision format. The primary challenge here is to develop efficient compression techniques for the hard-to-compress floating-point data, owing to its high degree of entropy. We observe that the underlying techniques developed for unstructured or array datasets are not well suited to more structured dataset versions -- a version in this setting is defined by a collection of records, each of which is uniquely addressable. We carefully explore the design space for building such a system and the various storage-retrieval trade-offs, and discuss how different storage layouts influence those trade-offs. Next, we formulate several problems trading off version storage and retrieval cost in various ways, and design several offline storage layout algorithms that effectively minimize storage costs while keeping retrieval costs low. In addition to version retrieval queries, our system also supports record provenance queries. Through extensive experiments on large datasets, we demonstrate that our proposed designs can operate at the scale required in most practical scenarios.
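    The storage/retrieval trade-off at the heart of such a system can be illustrated with a small hedged sketch: treat versions as nodes of a graph whose edge weights are delta sizes, store a spanning tree of deltas rooted at one materialized version to minimize storage, and measure retrieval as the cost of the resulting delta chain. The toy graph, the cost model, and all names below are assumptions for illustration, not the thesis's algorithms.

    ```python
    # Storage-minimizing delta tree vs. delta-chain retrieval cost (sketch).
    import heapq

    def storage_tree(n, delta, root=0):
        """Lazy Prim's MST over versions 0..n-1: minimizes total stored
        delta bytes; returns parent pointers defining each delta chain."""
        parent = {root: None}
        frontier = [(w, root, v) for v, w in delta[root].items()]
        heapq.heapify(frontier)
        while frontier and len(parent) < n:
            w, u, v = heapq.heappop(frontier)
            if v in parent:
                continue
            parent[v] = u
            for nxt, w2 in delta[v].items():
                if nxt not in parent:
                    heapq.heappush(frontier, (w2, v, nxt))
        return parent

    def retrieval_cost(parent, delta, v):
        """Bytes read to materialize v: walk the delta chain to the root."""
        cost = 0
        while parent[v] is not None:
            cost += delta[parent[v]][v]
            v = parent[v]
        return cost

    # Toy version graph: delta[u][v] = size of the delta from u to v.
    delta = {
        0: {1: 10, 2: 40},
        1: {0: 10, 2: 12, 3: 35},
        2: {0: 40, 1: 12, 3: 15},
        3: {1: 35, 2: 15},
    }
    parent = storage_tree(4, delta)
    print(parent)                                 # {0: None, 1: 0, 2: 1, 3: 2}
    print([retrieval_cost(parent, delta, v) for v in range(4)])  # [0, 10, 22, 37]
    ```

    Minimizing storage alone can produce long delta chains; the offline layout algorithms mentioned above can be read as ways of bounding such chains while keeping total storage low.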

    Space/time-efficient RDF stores based on circular suffix sorting

    In recent years, RDF has gained popularity as a format for the standardized publication and exchange of information in the Web of Data. In this paper we introduce RDFCSA, a data structure that self-indexes an RDF dataset in small space and supports efficient querying. RDFCSA regards the triples of the RDF store as short circular strings and applies suffix sorting on those strings, so that triple-pattern queries reduce to prefix searching on the string set. The RDF store is then represented compactly using a Compressed Suffix Array (CSA), a proven technology in text indexing that efficiently supports prefix searches. Our experimental evaluation shows that RDFCSA answers triple-pattern queries in a few microseconds per result while using less than 60% of the space required by the raw original data. We also support join queries, which provide the basis for full SPARQL query support. Even though smaller-space solutions exist, as well as faster ones, RDFCSA is shown to provide an excellent space/time trade-off, with fast and consistent query times in much less space than alternatives that compete in time. Comment: This work has been submitted to the IEEE TKDE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
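    The circular-string reduction can be illustrated with a small hedged sketch: index every rotation of each tagged triple in one sorted array, so that any triple pattern becomes a prefix search. A real RDFCSA replaces the plain sorted array with a Compressed Suffix Array; the tagging scheme and names here are illustrative assumptions.

    ```python
    # Triple patterns as prefix searches over sorted triple rotations (sketch).
    from bisect import bisect_left, bisect_right

    def build_index(triples):
        """Index all three rotations of each (s, p, o) triple, tagging
        components so a subject id never collides with a predicate id."""
        rots = []
        for s, p, o in triples:
            t = (("S", s), ("P", p), ("O", o))
            for k in range(3):                    # the three rotations
                rots.append(t[k:] + t[:k])
        rots.sort()
        return rots

    def pattern_query(rots, prefix):
        """All indexed rotations that start with `prefix` (tagged tuple)."""
        lo = bisect_left(rots, prefix)
        hi = bisect_right(rots, prefix + (("\uffff", None),))
        return rots[lo:hi]

    triples = [("alice", "knows", "bob"), ("alice", "likes", "rdf"),
               ("bob", "knows", "carol")]
    idx = build_index(triples)
    # Pattern (?s, knows, ?o): search rotations beginning with the predicate.
    print(pattern_query(idx, (("P", "knows"),)))
    # Pattern (alice, ?p, ?o): rotations beginning with the subject.
    print(pattern_query(idx, (("S", "alice"),)))
    ```

    Because every rotation is indexed, all eight triple-pattern shapes reduce to a prefix search on a single sorted structure, which is exactly the property the CSA then supports in compressed space.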