49,511 research outputs found
Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist
Apache Spark is a popular system aimed at the analysis of large data sets,
but recent studies have shown that certain computations---in particular, many
linear algebra computations that are the basis for solving common machine
learning problems---are significantly slower in Spark than when done using
libraries written in a high-performance computing framework such as the
Message-Passing Interface (MPI).
To remedy this, we introduce Alchemist, a system designed to call MPI-based
libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear
algebra, machine learning, and related computations, while still retaining the
benefits of working within the Spark environment. We discuss the motivation
behind the development of Alchemist, and we provide a brief overview of its
design and implementation.
We also compare the performances of pure Spark implementations with those of
Spark implementations that leverage MPI-based codes via Alchemist. To do so, we
use data science case studies: a large-scale application of the conjugate
gradient method to solve very large linear systems arising in a speech
classification problem, where we see an improvement of an order of magnitude;
and the truncated singular value decomposition (SVD) of a 400GB
three-dimensional ocean temperature data set, where we see a speedup of up to
7.9x. We also illustrate that the truncated SVD computation is easily scalable
to terabyte-sized data by applying it to data sets of sizes up to 17.6TB.Comment: Accepted for publication in Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, London, UK,
201
Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization
Today's HPC applications are producing extremely large amounts of data, such
that data storage and analysis are becoming more challenging for scientific
research. In this work, we design a new error-controlled lossy compression
algorithm for large-scale scientific data. Our key contribution is
significantly improving the prediction hitting rate (or prediction accuracy)
for each data point based on its nearby data values along multiple dimensions.
We derive a series of multilayer prediction formulas and their unified formula
in the context of data compression. One serious challenge is that the data
prediction has to be performed based on the preceding decompressed values
during the compression in order to guarantee the error bounds, which may
degrade the prediction accuracy in turn. We explore the best layer for the
prediction by considering the impact of compression errors on the prediction
accuracy. Moreover, we propose an adaptive error-controlled quantization
encoder, which can further improve the prediction hitting rate considerably.
The data size can be reduced significantly after performing the variable-length
encoding because of the uneven distribution produced by our quantization
encoder. We evaluate the new compressor on production scientific data sets and
compare it with many other state-of-the-art compressors: GZIP, FPZIP, ZFP,
SZ-1.1, and ISABELA. Experiments show that our compressor is the best in class,
especially with regard to compression factors (or bit-rates) and compression
errors (including RMSE, NRMSE, and PSNR). Our solution is better than the
second-best solution by more than a 2x increase in the compression factor and
3.8x reduction in the normalized root mean squared error on average, with
reasonable error bounds and user-desired bit-rates.Comment: Accepted by IPDPS'17, 11 pages, 10 figures, double colum
Modelling Spatial Regimes in Farms Technologies
We exploit the information derived from geographical coordinates to
endogenously identify spatial regimes in technologies that are the result of a
variety of complex, dynamic interactions among site-specific environmental
variables and farmer decision making about technology, which are often not
observed at the farm level. Controlling for unobserved heterogeneity is a
fundamental challenge in empirical research, as failing to do so can produce
model misspecification and preclude causal inference. In this article, we adopt
a two-step procedure to deal with unobserved spatial heterogeneity, while
accounting for spatial dependence in a cross-sectional setting. The first step
of the procedure takes explicitly unobserved spatial heterogeneity into account
to endogenously identify subsets of farms that follow a similar local
production econometric model, i.e. spatial production regimes. The second step
consists in the specification of a spatial autoregressive model with
autoregressive disturbances and spatial regimes. The method is applied to two
regional samples of olive growing farms in Italy. The main finding is that the
identification of spatial regimes can help drawing a more detailed picture of
the production environment and provide more accurate information to guide
extension services and policy makers
Learning Laplacian Matrix in Smooth Graph Signal Representations
The construction of a meaningful graph plays a crucial role in the success of
many graph-based representations and algorithms for handling structured data,
especially in the emerging field of graph signal processing. However, a
meaningful graph is not always readily available from the data, nor easy to
define depending on the application domain. In particular, it is often
desirable in graph signal processing applications that a graph is chosen such
that the data admit certain regularity or smoothness on the graph. In this
paper, we address the problem of learning graph Laplacians, which is equivalent
to learning graph topologies, such that the input data form graph signals with
smooth variations on the resulting topology. To this end, we adopt a factor
analysis model for the graph signals and impose a Gaussian probabilistic prior
on the latent variables that control these signals. We show that the Gaussian
prior leads to an efficient representation that favors the smoothness property
of the graph signals. We then propose an algorithm for learning graphs that
enforces such property and is based on minimizing the variations of the signals
on the learned graph. Experiments on both synthetic and real world data
demonstrate that the proposed graph learning framework can efficiently infer
meaningful graph topologies from signal observations under the smoothness
prior
- …