2,261 research outputs found
Automatically Harnessing Sparse Acceleration
Sparse linear algebra is central to many scientific programs, yet compilers
fail to optimize it well. High-performance libraries are available, but
adoption costs are significant. Moreover, libraries tie programs into
vendor-specific software and hardware ecosystems, creating non-portable code.
In this paper, we develop a new approach based on our specification Language
for implementers of Linear Algebra Computations (LiLAC). Rather than requiring
the application developer to (re)write every program for a given library, the
burden is shifted to a one-off description by the library implementer. The
LiLAC-enabled compiler uses this to insert appropriate library routines without
source code changes.
LiLAC provides automatic data marshaling, maintaining state between calls and
minimizing data transfers. Appropriate places for library insertion are
detected in compiler intermediate representation, independent of source
languages.
We evaluated on large-scale scientific applications written in FORTRAN;
standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across
heterogeneous platforms, applications and data sets we show speedups of
1.1x to over 10x without user intervention.
Comment: Accepted to CC 202
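The abstract does not show LiLAC's specification syntax or the loops it matches, so the following is only a sketch of the kind of hand-written sparse kernel that a library-inserting compiler might target: a sparse matrix-vector product. The CSR storage layout, the function name `csr_spmv`, and the sample matrix are all assumptions made for illustration, not taken from the paper.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (illustrative sketch)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Non-zeros of row i live in values[row_ptr[i]:row_ptr[i + 1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 3x3 example matrix:
# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]
y = csr_spmv(row_ptr, col_idx, values, x)
```

A loop nest of this shape is what a vendor sparse library would typically replace with a single tuned routine call.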
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs, whose programming is conventionally an RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve FPGA
programmability, they still leave programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels.
SOLUTIONS FOR OPTIMIZING THE DATA PARALLEL PREFIX SUM ALGORITHM USING THE COMPUTE UNIFIED DEVICE ARCHITECTURE
In this paper, we analyze solutions for optimizing the data parallel prefix sum function using the Compute Unified Device Architecture (CUDA) that provides a viable solution for accelerating a broad class of applications. The parallel prefix sum function is an essential building block for many data mining algorithms, and therefore its optimization facilitates the whole data mining process. Finally, we benchmark and evaluate the performance of the optimized parallel prefix sum building block in CUDA.
Keywords: CUDA, threads, GPGPU, parallel prefix sum, parallel processing, task synchronization, warp
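As a rough illustration of the building block discussed above, here is a sequential Python simulation of a work-efficient exclusive prefix sum using the up-sweep/down-sweep (Blelloch) pattern. This is an assumption about the general technique, not the paper's CUDA implementation; on a GPU, each inner `for` loop would be executed by parallel threads with a synchronization barrier between strides. The name `exclusive_scan` is hypothetical.

```python
def exclusive_scan(data):
    """Exclusive prefix sum via up-sweep/down-sweep, simulated sequentially."""
    n = len(data)
    # Pad to the next power of two so the implicit binary tree is complete.
    size = 1
    while size < n:
        size *= 2
    buf = list(data) + [0] * (size - n)

    # Up-sweep (reduce): build partial sums up the tree.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            buf[i] += buf[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    buf[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = buf[i - stride]
            buf[i - stride] = buf[i]
            buf[i] += left
        stride //= 2

    return buf[:n]


scanned = exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])
```

Each element of the result holds the sum of all input elements strictly before it, which is the form most often used for stream compaction and index computation.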
An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems
In many scientific applications the solution of non-linear differential
equations is obtained through the set-up and solution of a number of
successive eigenproblems. These eigenproblems can be regarded as a sequence
whenever the solution of one problem fosters the initialization of the next. In
addition, in some eigenproblem sequences there is a connection between the
solutions of adjacent eigenproblems. Whenever it is possible to unravel the
existence of such a connection, the eigenproblem sequence is said to be
correlated. When faced with a sequence of correlated eigenproblems, the current
strategy amounts to solving each eigenproblem in isolation. We propose an
alternative approach which exploits such correlation through the use of an
eigensolver based on subspace iteration and accelerated with Chebyshev
polynomials (ChFSI). The resulting eigensolver is optimized by minimizing the
number of matrix-vector multiplications and parallelized using the Elemental
library framework. Numerical results show that ChFSI achieves excellent
scalability and is competitive with current dense linear algebra parallel
eigensolvers.
Comment: 23 Pages, 6 figures. First revision of an invited submission to
special issue of Concurrency and Computation: Practice and Experienc
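To give some intuition for why Chebyshev polynomials can accelerate subspace iteration, the sketch below evaluates a Chebyshev polynomial of the first kind on a single eigenvalue using the standard three-term recurrence. It assumes the unwanted part of the spectrum lies in an interval [a, b] that is mapped affinely onto [-1, 1]; the function name, interval, and degree are illustrative choices, not parameters from the paper, and a real solver would apply the same recurrence to blocks of vectors rather than scalars.

```python
def chebyshev_filter_gain(lam, a, b, degree):
    """Evaluate T_degree at eigenvalue lam after mapping [a, b] onto [-1, 1]."""
    t = (2.0 * lam - a - b) / (b - a)
    if degree == 0:
        return 1.0
    # Three-term recurrence: T_{k+1}(t) = 2 t T_k(t) - T_{k-1}(t).
    prev, curr = 1.0, t
    for _ in range(degree - 1):
        prev, curr = curr, 2.0 * t * curr - prev
    return curr


# Hypothetical setup: unwanted eigenvalues in [1, 3], wanted ones below 1.
inside = chebyshev_filter_gain(2.0, 1.0, 3.0, 10)   # eigenvalue inside the interval
outside = chebyshev_filter_gain(0.0, 1.0, 3.0, 10)  # eigenvalue outside the interval
```

Components inside the damped interval stay bounded by 1 in magnitude, while components outside it are amplified rapidly with the degree, so the filtered subspace tilts toward the wanted eigenvectors.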
The Caltech CSN project collects sensor data from thousands of personal devices for realtime response to dangerous earthquakes
The proliferation of smartphones and other powerful sensor-equipped consumer devices enables a new class of Web application: community sense and response (CSR) systems, distinguished from standard Web applications by their use of community-owned commercial sensor hardware. Just as social networks connect and share human-generated content, CSR systems gather, share, and act on sensory data from users' Internet-enabled devices. Here, we discuss the Caltech Community Seismic Network (CSN) as a prototypical CSR system harnessing accelerometers in smartphones and consumer electronics, including the systems and algorithmic challenges of designing, building, and evaluating a scalable network for real-time awareness of dangerous earthquakes.