numpywren: serverless linear algebra
Linear algebra operations are widely used in scientific computing and machine
learning applications. However, it is challenging for scientists and data
analysts to run linear algebra at scales beyond a single machine. Traditional
approaches either require access to supercomputing clusters, or impose
configuration and cluster management challenges. In this paper we show how the
disaggregation of storage and compute resources in so-called "serverless"
environments, combined with compute-intensive workload characteristics, can be
exploited to achieve elastic scalability and ease of management.
We present numpywren, a system for linear algebra built on a serverless
architecture. We also introduce LAmbdaPACK, a domain-specific language designed
to implement highly parallel linear algebra algorithms in a serverless setting.
We show that, for certain linear algebra algorithms such as matrix multiply,
singular value decomposition, and Cholesky decomposition, numpywren's
performance (completion time) is within 33% of ScaLAPACK, and its compute
efficiency (total CPU-hours) is up to 240% better due to elasticity, while
providing an easier to use interface and better fault tolerance. At the same
time, we show that the inability of serverless runtimes to exploit locality
across the cores in a machine fundamentally limits their network efficiency,
which limits performance on other algorithms such as QR factorization. This
highlights how cloud providers could better support these types of computations
through small changes in their infrastructure.
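As an illustration of why Cholesky decomposition maps well onto many small, independent tasks, a minimal sketch of a tiled (blocked) Cholesky factorization in NumPy follows. It is not LAmbdaPACK code and does not use numpywren's interface; the function name `blocked_cholesky` and the `block` parameter are hypothetical, and everything runs in one process. In a serverless setting, each tile operation in the inner loops could in principle be dispatched as a separate stateless task that reads and writes tiles in shared storage.

```python
import numpy as np


def blocked_cholesky(a, block=2):
    """Right-looking tiled Cholesky; returns lower-triangular L with L @ L.T == a.

    Hypothetical helper for illustration only (not part of numpywren).
    """
    a = np.array(a, dtype=float)  # work on a copy
    n = a.shape[0]
    for k in range(0, n, block):
        ke = min(k + block, n)
        # 1. Factor the diagonal tile.
        a[k:ke, k:ke] = np.linalg.cholesky(a[k:ke, k:ke])
        l_kk = a[k:ke, k:ke]
        # 2. Panel solve: L_ik = A_ik @ inv(L_kk).T (tiles are independent).
        for i in range(ke, n, block):
            ie = min(i + block, n)
            a[i:ie, k:ke] = np.linalg.solve(l_kk, a[i:ie, k:ke].T).T
        # 3. Trailing update: A_ij -= L_ik @ L_jk.T for j <= i (tiles are independent).
        for i in range(ke, n, block):
            ie = min(i + block, n)
            for j in range(ke, i + 1, block):
                je = min(j + block, n)
                a[i:ie, j:je] -= a[i:ie, k:ke] @ a[j:je, k:ke].T
    # Tiles above the diagonal were never updated; discard them.
    return np.tril(a)
```

The panel solves in step 2 and the tile updates in step 3 have no dependencies among themselves within a given `k`, which is the parallelism a task-based runtime can exploit.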
Spatio-Temporal Surrogates for Interaction of a Jet with High Explosives: Part I -- Analysis with a Small Sample Size
Computer simulations, especially of complex phenomena, can be expensive,
requiring high-performance computing resources. Often, to understand a
phenomenon, multiple simulations are run, each with a different set of
simulation input parameters. These data are then used to create an interpolant,
or surrogate, relating the simulation outputs to the corresponding inputs. When
the inputs and outputs are scalars, a simple machine learning model can
suffice. However, when the simulation outputs are vector valued, available at
locations in two or three spatial dimensions, often with a temporal component,
creating a surrogate is more challenging. In this report, we use a
two-dimensional problem of a jet interacting with high explosives to understand
how we can build high-quality surrogates. The characteristics of our data set
are unique: the vector-valued outputs from each simulation are available at
over two million spatial locations; each simulation is run for a relatively
small number of time steps; the size of the computational domain varies with
each simulation; and resource constraints limit the number of simulations we
can run. We show how we analyze these extremely large data sets, set the
parameters for the algorithms used in the analysis, and use simple ways to
improve the accuracy of the spatio-temporal surrogates without substantially
increasing the number of simulations required.
Using reconfigurable computing technology to accelerate matrix decomposition and applications
Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve the dense or sparse linear system of equations in bioinformatics, power system and computer vision. Matrix decompositions are computationally expensive and their sequential implementations often fail to meet the requirements of many time-sensitive applications.
The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions:
• We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
• We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrary sized matrices.
• We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each.
• We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
• By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
• We present a configurable architecture to accelerate Homotopy l1-minimization, in which the modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application and dimension-dependent speedups over an optimized software implementation that range from 1.5× to 43.6× in terms of computation time.
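For readers unfamiliar with the Givens rotation mentioned in the QRD contribution above, here is a minimal software sketch in NumPy. It shows the arithmetic only and says nothing about the dissertation's hardware design; the function name `givens_qr` is hypothetical. Each rotation touches only two rows, which is one reason such schemes are often considered for pipelined implementations.

```python
import numpy as np


def givens_qr(a):
    """QR factorization by Givens rotations: returns (q, r) with q @ r == a.

    Hypothetical helper for illustration only.
    """
    r = np.array(a, dtype=float)  # work on a copy
    m, n = r.shape
    q = np.eye(m)
    for j in range(n):
        # Zero the entries below the diagonal of column j, bottom to top.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1, j], r[i, j]
            h = np.hypot(x, y)
            if h == 0.0:
                continue  # both entries already zero
            c, s = x / h, y / h
            g = np.array([[c, s], [-s, c]])  # maps (x, y) to (h, 0)
            r[[i - 1, i], :] = g @ r[[i - 1, i], :]
            q[:, [i - 1, i]] = q[:, [i - 1, i]] @ g.T
    return q, r
```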
Randomized Methods for Computing Low-Rank Approximations of Matrices
Randomized sampling techniques have recently proved capable of efficiently solving many standard problems in linear algebra, and enabling computations at scales far larger than what was previously possible. The new algorithms are designed from the bottom up to perform well in modern computing environments where the expense of communication is the primary constraint. In extreme cases, the algorithms can even be made to work in a streaming environment where the matrix is not stored at all, and each element can be seen only once. The dissertation describes a set of randomized techniques for rapidly constructing a low-rank approximation to a matrix. The algorithms are presented in a modular framework that first computes an approximation to the range of the matrix via randomized sampling. Secondly, the matrix is projected to the approximate range, and a factorization (SVD, QR, LU, etc.) of the resulting low-rank matrix is computed via variations of classical deterministic methods. Theoretical performance bounds are provided. Particular attention is given to very large scale computations where the matrix does not fit in RAM on a single workstation. Algorithms are developed for the case where the original matrix must be stored out-of-core but where the factors of the approximation fit in RAM. Numerical examples are provided that perform Principal Component Analysis of a data set that is so large that less than one hundredth of it can fit in the RAM of a standard laptop computer. Furthermore, the dissertation presents a parallelized randomized scheme for computing a reduced rank Singular Value Decomposition. By parallelizing and distributing both the randomized sampling stage and the processing of the factors in the approximate factorization, the method requires an amount of memory per node which is independent of both dimensions of the input matrix.
Numerical experiments are performed on Hadoop clusters of computers in Amazon's Elastic Compute Cloud with up to 64 total cores. Finally, we directly compare the performance and accuracy of the randomized algorithm with the classical Lanczos method on extremely large, sparse matrices and substantiate the claim that randomized methods are superior in this environment.
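The two-stage framework described in this abstract (sample the range, then project and factor) can be sketched in a few lines of NumPy. This is a minimal in-memory illustration, not the dissertation's out-of-core or distributed implementation; the function name `randomized_svd` and the `oversample` and `seed` parameters are assumptions made for the example.

```python
import numpy as np


def randomized_svd(a, rank, oversample=10, seed=0):
    """Approximate rank-`rank` SVD of `a` via randomized range sampling.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    m, n = a.shape
    k = min(rank + oversample, m, n)
    # Stage 1: approximate the range of `a` with a Gaussian test matrix.
    omega = rng.standard_normal((n, k))
    q, _ = np.linalg.qr(a @ omega)  # orthonormal basis, shape (m, k)
    # Stage 2: project onto that basis and factor the small matrix.
    b = q.T @ a  # shape (k, n)
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :rank], s[:rank], vt[:rank]
```

Only the products `a @ omega` and `q.T @ a` touch the full matrix; the deterministic factorization runs on the small `k`-by-`n` matrix `b`.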
FlashX: Massive Data Analysis Using Fast I/O
With the explosion of data and the increasing complexity of data analysis, large-scale
data analysis imposes significant challenges in systems design. While current
research focuses on scaling out to large clusters, these scale-out solutions introduce
a significant amount of overhead. This thesis is motivated by the advance of new
I/O technologies such as flash memory. Instead of scaling out, we explore efficient
system designs in a single commodity machine with non-uniform memory architecture
(NUMA) and scale to large datasets by utilizing commodity solid-state drives
(SSDs). This thesis explores the impact of the new I/O technologies on large-scale
data analysis. Instead of implementing individual data analysis algorithms for SSDs,
we develop a data analysis ecosystem called FlashX to target a large range of data
analysis tasks. FlashX includes three subsystems: SAFS, FlashGraph and FlashMatrix.
SAFS is a user-space filesystem optimized for a large SSD array to deliver
maximal I/O throughput from SSDs. FlashGraph is a general-purpose graph analysis
framework that processes graphs in a semi-external memory fashion, i.e., keeping
vertex state in memory and edges on SSDs, and scales to graphs with billions of
vertices by utilizing SSDs through SAFS. FlashMatrix is a matrix-oriented programming
framework that supports both sparse matrices and dense matrices for general
data analysis. Similar to FlashGraph, it scales matrix operations beyond memory
capacity by utilizing SSDs. We demonstrate that with the current I/O technologies
FlashGraph and FlashMatrix, running in (semi-)external memory, match or even exceed
the performance of state-of-the-art in-memory data analysis frameworks while scaling
to massive datasets for a large variety of data analysis tasks.