GMLS-Nets: A framework for learning from unstructured data
Data fields sampled on irregularly spaced points arise in many applications
in the sciences and engineering. For regular grids, Convolutional Neural
Networks (CNNs) have been used successfully, benefiting from weight sharing
and invariances. We generalize CNNs by introducing methods for data on
unstructured point clouds based on Generalized Moving Least Squares (GMLS).
GMLS is a non-parametric technique for estimating linear bounded functionals
from scattered data, and has recently been used in the literature for solving
partial differential equations. By parameterizing the GMLS estimator, we obtain
learning methods for operators with unstructured stencils. In GMLS-Nets the
necessary calculations are local, readily parallelizable, and the estimator is
supported by a rigorous approximation theory. We show how the framework may be
used for unstructured physical data sets to perform functional regression to
identify associated differential operators and to regress quantities of
interest. The results suggest that these architectures are an attractive
foundation for data-driven model development in scientific machine learning
applications.
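The non-parametric GMLS building block that the paper parameterizes can be illustrated with a plain (non-learned) moving least squares estimate on a scattered point cloud. This is an illustrative sketch, not the GMLS-Nets architecture itself: the quadratic basis, Gaussian weight kernel, and scale `h` are arbitrary choices here, whereas GMLS-Nets learn parameters of such an estimator.

```python
import numpy as np

def poly_basis(xy):
    """Quadratic polynomial basis in 2D: [1, x, y, x^2, x*y, y^2]."""
    x, y = xy[..., 0], xy[..., 1]
    return np.stack([np.ones_like(x), x, y, x**2, x * y, y**2], axis=-1)

def gmls_estimate(points, values, query, h=0.5):
    """Estimate f(query) from scattered samples by a weighted least-squares
    polynomial fit centered at `query` (Gaussian weights of scale h)."""
    d = points - query                           # shift so query is the origin
    w = np.exp(-np.sum(d**2, axis=1) / h**2)     # weights decay with distance
    P = poly_basis(d)
    # Weighted normal equations: (P^T W P) c = P^T W f
    c = np.linalg.solve(P.T @ (w[:, None] * P), P.T @ (w * values))
    return c[0]                                  # basis at the origin is [1, 0, ..., 0]

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(200, 2))          # unstructured point cloud
vals = np.sin(pts[:, 0]) + pts[:, 1] ** 2
est = gmls_estimate(pts, vals, np.array([0.2, -0.3]))
```

The stencil is local (only nearby points carry non-negligible weight) and each query is independent, which is the parallelizability the abstract refers to.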
Lossless Differential Compression for Synchronizing Arbitrary Single-Dimensional Strings
Differential compression expresses a modified document as a set of differences relative to another version of the document. The compressed string requires space proportional to the amount of change, irrespective of the original document sizes. The purpose of this study was to determine which algorithms are suitable for universal lossless differential compression when synchronizing two arbitrary documents, either locally or remotely.
Two main problems in differential compression are finding the differences (differencing) and compactly communicating them (encoding). We discussed local differencing algorithms based on subsequence searching, hashtable lookups, suffix searching, and projection. We also discussed probabilistic remote algorithms based on both recursive comparison and characteristic polynomial interpolation of hashes computed from variable-length, content-defined substrings. We described various heuristics for approximating the optimal algorithms, since arbitrarily long strings and memory limitations force discarding information. The discussion also covered compact delta encoding and in-place reconstruction. We presented results from empirical testing of the discussed algorithms.
The conclusions were that multiple algorithms need to be integrated into a hybrid implementation that heuristically chooses among them based on an evaluation of the input data. Algorithms based on hashtable lookups are faster on average and require less memory, but algorithms based on suffix searching find the smallest differences. Interpolating characteristic polynomials was found to be too slow for general use. With remote hash comparison, content-defined chunks and recursive comparison can reduce protocol overhead. A differential compressor should be merged with a state-of-the-art non-differential compressor to enable more compact delta encoding. Input should be processed multiple times to allow a constant space bound without significant loss of compression efficiency. The compression efficiency of currently popular synchronizers could be improved, as our empirical testing showed that a non-differential compressor produced smaller files even without access to one of the two strings.
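The content-defined chunking used by the remote hash-comparison algorithms can be sketched as follows. This is a minimal illustration, not any specific synchronizer's implementation: the rolling-hash base, window, and mask (average chunk size about 64 bytes) are arbitrary choices. Boundaries depend only on the last few bytes of content, so identical regions chunk identically regardless of byte offset, and an insertion only perturbs the chunks around it.

```python
import hashlib

def cdc_chunks(data, window=16, mask=0x3F):
    """Content-defined chunking: cut wherever a polynomial rolling hash of
    the last `window` bytes has its low 6 bits all set (~64-byte chunks)."""
    B, MOD = 263, 1 << 32
    pw = pow(B, window - 1, MOD)          # B^(window-1), to expire old bytes
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * pw) % MOD   # drop byte leaving window
        h = (h * B + byte) % MOD                    # add the incoming byte
        if i + 1 >= window and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def delta(old, new):
    """Chunks of `new` that the holder of `old` lacks, keyed by SHA-256."""
    have = {hashlib.sha256(c).digest() for c in cdc_chunks(old)}
    return [c for c in cdc_chunks(new) if hashlib.sha256(c).digest() not in have]

# Deterministic pseudo-random content with a small insertion in the middle.
old = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(200))
new = old[:3000] + b"a small edit" + old[3000:]
missing = delta(old, new)
```

Because boundaries resynchronize shortly after the edit, the transferred delta scales with the size of the change rather than the size of the file.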
Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation
This paper studies convergence of empirical measures smoothed by a Gaussian
kernel. Specifically, consider approximating $P * \mathcal{N}_\sigma$, for
$\mathcal{N}_\sigma \triangleq \mathcal{N}(0, \sigma^2 \mathrm{I}_d)$, by
$\hat{P}_n * \mathcal{N}_\sigma$, where $\hat{P}_n$ is the empirical measure,
under different statistical distances. The convergence is examined in terms of
the Wasserstein distance, total variation (TV), Kullback-Leibler (KL)
divergence, and $\chi^2$-divergence. We show that the approximation error under
the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at rate
$e^{O(d)} n^{-1/2}$, in remarkable contrast to a typical $n^{-1/d}$
rate for unsmoothed $\mathsf{W}_1$ (and $d \ge 3$). For the
KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and
$\chi^2$-divergence, the convergence rate is $e^{O(d)} n^{-1}$, but only if $P$
achieves finite input-output $\chi^2$ mutual information across the additive
white Gaussian noise channel. If the latter condition is not met, the rate
changes to $\omega(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while
the $\chi^2$-divergence becomes infinite - a curious dichotomy. As a main
application we consider estimating the differential entropy
$h(P * \mathcal{N}_\sigma)$ in the high-dimensional regime. The distribution
$P$ is unknown, but $n$ i.i.d. samples from it are available. We first show that
any good estimator of $h(P * \mathcal{N}_\sigma)$ must have sample complexity
that is exponential in $d$. Using the empirical approximation results, we then
show that the absolute-error risk of the plug-in estimator converges at the
parametric rate $e^{O(d)} n^{-1/2}$, thus establishing the minimax
rate-optimality of the plug-in. Numerical results demonstrating a significant
empirical superiority of the plug-in approach over general-purpose differential
entropy estimators are provided.
Comment: arXiv admin note: substantial text overlap with arXiv:1810.1158
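The plug-in idea can be sketched with a small Monte Carlo computation: smooth the empirical measure with Gaussian noise, which turns it into a Gaussian mixture, and estimate that mixture's differential entropy. This is an illustrative sketch only; the dimension, noise level, and sample sizes below are arbitrary, and the paper's analysis concerns the estimator's risk, not this particular numerical scheme.

```python
import numpy as np

def plugin_entropy(samples, sigma, m=5000, seed=0):
    """Monte Carlo estimate of the differential entropy of the smoothed
    empirical measure: a Gaussian mixture with one component
    N(X_i, sigma^2 I) per sample X_i."""
    rng = np.random.default_rng(seed)
    n, d = samples.shape
    # Draw m points from the mixture: pick a center, add Gaussian noise.
    y = samples[rng.integers(n, size=m)] + sigma * rng.standard_normal((m, d))
    # log q(y) = logsumexp_i log N(y; X_i, sigma^2 I) - log n
    sq = ((y[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)   # (m, n)
    log_comp = -sq / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    a = log_comp.max(axis=1, keepdims=True)                          # stable logsumexp
    log_q = a[:, 0] + np.log(np.exp(log_comp - a).sum(axis=1)) - np.log(n)
    return -log_q.mean()                       # h ~ -E[log q(Y)]

# Sanity check: if all samples sit at 0, the smoothed measure is exactly
# N(0, 1), whose differential entropy is 0.5 * log(2*pi*e) ~ 1.419.
est = plugin_entropy(np.zeros((10, 1)), sigma=1.0)
```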
Learning robust and efficient point cloud representations
The abstract is in the attachment.
EFFICIENT LAYOUTS AND ALGORITHMS FOR MANAGING VERSIONED DATASETS
Version control systems were primarily designed to track and control changes to source code, and have since provided an excellent way to share and edit files in a collaborative setting. The recent surge in data-driven decision making has resulted in a proliferation of datasets, elevating them to the level of source code, which in turn has led data analysts to resort to version control systems for storing and managing datasets and their versions over time. Unfortunately, existing version control systems handle large datasets poorly, primarily due to the underlying assumption that the stored files are relatively small text files with localized changes. Moreover, the algorithms used by these systems tend to be fairly simple, leading to suboptimal performance when applied to large datasets. To address these shortcomings, a key requirement is a Dataset Version Control System (DVCS) that serves as a common platform enabling data analysts to efficiently store and query dataset versions, track changes to datasets, and share datasets between users with ease.
Towards this goal, we address the fundamental problem of designing storage layouts for a wide range of datasets to serve as the primary building block of an efficient and scalable DVCS. The key problem in this setting is to compactly store a large number of dataset versions and to efficiently retrieve any specific version (or a collection of partial versions). We initiate our study by considering storage-retrieval trade-offs for versions of unstructured datasets such as text files, blobs, etc., where the notion of a partial version is not well defined. Next, we consider array datasets, i.e., collections of temporal snapshots (or versions) of multi-dimensional arrays, where the data is predominantly represented in single- or double-precision format. The primary challenge here is to develop efficient compression techniques for this hard-to-compress floating-point data, which has a high degree of entropy. We observe that the techniques developed for unstructured or array datasets are not well suited to more structured dataset versions -- a version in this setting is defined by a collection of records, each of which is uniquely addressable. We carefully explore the design space for building such a system and the various storage-retrieval trade-offs, and discuss how different storage layouts influence those trade-offs. Next, we formulate several problems trading off version storage and retrieval costs in various ways, and design several offline storage layout algorithms that effectively minimize storage costs while keeping retrieval costs low. In addition to version retrieval queries, our system also supports record provenance queries. Through extensive experiments on large datasets, we demonstrate that our proposed designs can operate at the scale required in most practical scenarios.
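The storage-retrieval trade-off at the heart of the design can be sketched with a toy delta store: one version is materialized, every other version is stored as a delta against its parent, and retrieving a version means walking the delta chain back to a full copy. This is a minimal illustration, not the thesis's layout algorithms; the `VersionStore` class and its delta encoding are hypothetical stand-ins.

```python
from difflib import SequenceMatcher

def make_delta(base, target):
    """Encode `target` as copy-ranges from `base` plus inserted material,
    so delta size is roughly proportional to the amount of change."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=base, b=target).get_opcodes():
        ops.append(("copy", i1, i2) if tag == "equal" else ("ins", target[j1:j2]))
    return ops

def apply_delta(base, ops):
    out = []
    for op in ops:
        out.extend(base[op[1]:op[2]] if op[0] == "copy" else op[1])
    return out

class VersionStore:
    """Materialize the root version; store other versions as parent deltas.
    Storage cost stays low, but retrieval cost grows with chain length --
    the trade-off the offline layout algorithms optimize."""
    def __init__(self, root):
        self.full = {0: root}            # vid -> materialized version
        self.deltas = {}                 # vid -> (parent vid, delta ops)

    def commit(self, parent, version):
        vid = len(self.full) + len(self.deltas)
        self.deltas[vid] = (parent, make_delta(self.get(parent), version))
        return vid

    def get(self, vid):                  # walk the chain back to a full copy
        if vid in self.full:
            return self.full[vid]
        parent, ops = self.deltas[vid]
        return apply_delta(self.get(parent), ops)

store = VersionStore([f"record {i}" for i in range(100)])
v1 = store.commit(0, [f"record {i}" for i in range(100) if i != 40])  # delete one
v2 = store.commit(v1, store.get(v1) + ["record 100"])                 # append one
```

Materializing additional versions along popular chains shortens retrieval paths at the cost of extra storage, which is exactly the knob the layout algorithms turn.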
Space/time-efficient RDF stores based on circular suffix sorting
In recent years, RDF has gained popularity as a format for the standardized
publication and exchange of information in the Web of Data. In this paper we
introduce RDFCSA, a data structure that is able to self-index an RDF dataset in
small space and supports efficient querying. RDFCSA regards the triples of the
RDF store as short circular strings and applies suffix sorting on those
strings, so that triple-pattern queries reduce to prefix searching on the
string set. The RDF store is then represented compactly using a Compressed
Suffix Array (CSA), a proven technology in text indexing that efficiently
supports prefix searches. Our experimental evaluation shows that RDFCSA is able
to answer triple-pattern queries in a few microseconds per result while using
less than 60% of the space required by the raw original data. We also support
join queries, which provide the basis for full SPARQL query support. Even
though smaller-space solutions exist, as well as faster ones, RDFCSA is shown
to provide an excellent space/time tradeoff, with fast and consistent query
times within much less space than alternatives that compete in time.
Comment: This work has been submitted to the IEEE TKDE for possible
publication. Copyright may be transferred without notice, after which this
version may no longer be accessible.
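The circular-string reduction can be sketched without the compressed machinery: index every triple as all three rotations of a tagged string, and a triple pattern whose bound terms are cyclically contiguous becomes a prefix search on the sorted rotation list. A plain sorted list with binary search stands in here for the Compressed Suffix Array that RDFCSA actually uses; the tagging scheme is an illustrative choice.

```python
from bisect import bisect_left, bisect_right

def build_index(triples):
    """Store all three rotations of each (s, p, o) triple, sorted, so that
    patterns like (s,p,?), (?,p,o), (s,?,o), or any single bound term
    reduce to a prefix search."""
    rows = []
    for t in triples:
        tagged = [f"{x}:{v}" for x, v in zip("spo", t)]
        for k in range(3):
            rows.append((" ".join(tagged[k:] + tagged[:k]) + " ", t))
    rows.sort()
    return rows

def query(index, *terms):
    """Bound terms in cyclic order, e.g. query(idx, "s:alice", "p:knows")."""
    prefix = " ".join(terms) + " "          # trailing space: whole terms only
    keys = [k for k, _ in index]
    lo, hi = bisect_left(keys, prefix), bisect_right(keys, prefix + "\xff")
    return sorted({t for _, t in index[lo:hi]})

idx = build_index([("alice", "knows", "bob"),
                   ("bob", "knows", "carol"),
                   ("alice", "likes", "rdf")])
```

Replacing the sorted list with a CSA keeps exactly this query reduction while representing the string set in compressed space.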