Efficient and portable multi-tasking for heterogeneous systems
Modern computing systems adopt heterogeneous designs that combine multiple, diverse
architectures in a single system. These designs offer the potential for high performance
at reduced power, but they require advanced resource management and workload scheduling
across the available processors.
Programming frameworks such as OpenCL and CUDA enable resource management and workload
scheduling on heterogeneous systems. These frameworks assign full control of resource
allocation and scheduling to the application. This design serves dedicated application
systems well but introduces significant challenges for multi-tasking environments, where
multiple users and applications compete for access to system resources.
This thesis considers these challenges and presents three major contributions that
enable efficient multi-tasking on heterogeneous systems. The presented contributions
are compatible with existing systems, remain portable across vendors and do not require
application changes or recompilation.
The first contribution of this thesis is an optimization technique that reduces host-device
communication overhead for OpenCL applications. It requires no modification or recompilation
of the application source code and is portable across platforms. This work improves efficiency
and performance for the diverse application workloads found on multi-tasking systems.
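One generic way to cut host-device communication without touching application code is to interpose on the transfer calls and drop writes that would copy bytes the device already holds. The Python sketch below models that idea only; it is not the thesis's technique. `device_write` stands in for an intercepted transfer call such as `clEnqueueWriteBuffer`, and `TransferCache`, `write`, and `invalidate` are hypothetical names.

```python
import hashlib


class TransferCache:
    """Hypothetical interposition layer that skips host-to-device writes
    whose payload matches what the device buffer already holds."""

    def __init__(self, device_write):
        # device_write(buffer_id, data) stands in for the real transfer call;
        # it is an assumption of this sketch, not a real OpenCL binding.
        self._device_write = device_write
        self._last_digest = {}  # buffer_id -> digest of the last data sent
        self.skipped = 0

    def write(self, buffer_id, data: bytes) -> bool:
        digest = hashlib.sha256(data).digest()
        if self._last_digest.get(buffer_id) == digest:
            self.skipped += 1
            return False  # device copy is already up to date
        self._device_write(buffer_id, data)
        self._last_digest[buffer_id] = digest
        return True

    def invalidate(self, buffer_id):
        # Call this when a kernel may have written the buffer on the device;
        # otherwise the cached digest no longer describes device contents.
        self._last_digest.pop(buffer_id, None)


# Usage: record transfers in a list instead of talking to a device.
sent = []
cache = TransferCache(lambda b, d: sent.append((b, d)))
cache.write("A", b"\x01\x02")  # transferred
cache.write("A", b"\x01\x02")  # skipped: identical payload
cache.write("A", b"\x03")      # transferred
print(len(sent), cache.skipped)  # 2 1
```

Hashing reads the whole buffer on the host, so this only pays off when the avoided transfer costs more than the hash.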
The second contribution is the design and implementation of a secure, user-space
virtualization layer that integrates a system's accelerator resources with the standard
multi-tasking and user-space virtualization facilities of the commodity Linux OS.
It enables fine-grained sharing of mixed-vendor accelerator resources, targets the
heterogeneous systems found in data center nodes, and requires no modification to the OS,
OpenCL, or the application.
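To make fine-grained sharing of mixed-vendor accelerators concrete, the Python sketch below models a user-space broker that accepts work from several clients and maps each request to whichever device has the fewest outstanding requests, regardless of vendor. It is an illustration under stated assumptions: `Device`, `Broker`, the least-loaded policy, and the device names are hypothetical, and the security and isolation properties of the actual layer are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str        # hypothetical identifier, e.g. "a0"
    vendor: str
    queued: int = 0  # outstanding requests currently mapped to this device


@dataclass
class Broker:
    """Hypothetical broker: clients submit work without choosing a device;
    each request goes to the least-loaded device in the pool."""
    devices: list
    log: list = field(default_factory=list)

    def submit(self, client: str, kernel: str) -> Device:
        # min() returns the first device among equally loaded ones.
        dev = min(self.devices, key=lambda d: d.queued)
        dev.queued += 1
        self.log.append((client, kernel, dev.name))
        return dev

    def complete(self, dev: Device) -> None:
        dev.queued -= 1


# Usage: two devices from different vendors shared by three clients.
pool = [Device("a0", "vendorA"), Device("b0", "vendorB")]
broker = Broker(pool)
d1 = broker.submit("alice", "spmv")  # a0 (both idle, first wins)
d2 = broker.submit("bob", "gemm")    # b0
broker.complete(d1)
d3 = broker.submit("carol", "fft")   # a0 again (0 vs 1 outstanding)
print([entry[2] for entry in broker.log])  # ['a0', 'b0', 'a0']
```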
Lastly, the third contribution is a technique and software infrastructure that enable
resource-sharing control on accelerators while supporting software-managed scheduling.
The infrastructure remains transparent to existing systems and applications and requires
no modification or recompilation. It enforces the fair accelerator sharing that multi-tasking
requires.
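Fair sharing can be illustrated with a dispatcher that always grants the next time slice to the application with the least accumulated accelerator time, similar in spirit to the virtual-runtime idea behind Linux's CFS. This Python sketch assumes work can be dispatched in fixed slices (for example, at kernel-launch boundaries); `fair_schedule` and its parameters are hypothetical and do not describe the thesis's infrastructure.

```python
import heapq


def fair_schedule(tasks, slice_ms, total_ms):
    """Hypothetical fair-share dispatcher.

    tasks: dict mapping application name -> remaining work in ms.
    Each step runs the application with the least accumulated time
    for at most one slice (slice_ms must be positive).
    Returns the accelerator time each application received.
    """
    used = {name: 0 for name in tasks}
    remaining = dict(tasks)
    heap = [(0, name) for name in tasks]  # (accumulated time, name)
    heapq.heapify(heap)
    clock = 0
    while heap and clock < total_ms:
        _, name = heapq.heappop(heap)
        run = min(slice_ms, remaining[name], total_ms - clock)
        used[name] += run
        remaining[name] -= run
        clock += run
        if remaining[name] > 0:  # unfinished work rejoins the queue
            heapq.heappush(heap, (used[name], name))
    return used


# Usage: the short job C finishes; A and B split the rest evenly.
print(fair_schedule({"A": 100, "B": 100, "C": 10}, slice_ms=5, total_ms=90))
# {'A': 40, 'B': 40, 'C': 10}
```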
Automatically Harnessing Sparse Acceleration
Sparse linear algebra is central to many scientific programs, yet compilers
fail to optimize it well. High-performance libraries are available, but
adoption costs are significant. Moreover, libraries tie programs into
vendor-specific software and hardware ecosystems, creating non-portable code.
In this paper, we develop a new approach based on our specification Language
for implementers of Linear Algebra Computations (LiLAC). Rather than requiring
application developers to (re)write every program for a given library, LiLAC
shifts the burden to a one-off description written by the library implementer.
The LiLAC-enabled compiler uses this description to insert appropriate library
routines without source code changes.
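Sparse matrix-vector multiplication (SpMV) over the compressed sparse row (CSR) format is a typical computation of the kind LiLAC targets. The Python sketch below only illustrates the division of labor: `spmv_csr_naive` is the kind of loop nest an application might contain, `spmv_csr_library` stands in for a vendor routine, and the `BINDINGS` table loosely plays the role of the implementer's one-off description. All of these names are hypothetical; actual LiLAC descriptions are written in their own specification language and matched against compiler IR, which this sketch does not model.

```python
def spmv_csr_naive(row_ptr, col_idx, vals, x):
    """Loop nest as it might appear in application code:
    y = A @ x with A stored in CSR form (row_ptr, col_idx, vals)."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def spmv_csr_library(row_ptr, col_idx, vals, x):
    # Stand-in for a vendor SpMV routine; plain Python keeps it self-contained.
    return [sum((vals[k] * x[col_idx[k]]
                 for k in range(row_ptr[i], row_ptr[i + 1])), 0.0)
            for i in range(len(row_ptr) - 1)]


# Hypothetical "description": written once by the library implementer,
# mapping a recognized computation to the routine that replaces it.
BINDINGS = {"spmv_csr": spmv_csr_library}


def dispatch(pattern, *args):
    # What a compiler-inserted call amounts to once the loop is recognized.
    return BINDINGS[pattern](*args)


# Usage: A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form.
row_ptr, col_idx = [0, 2, 3, 5], [0, 2, 1, 0, 2]
vals, x = [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0]
print(spmv_csr_naive(row_ptr, col_idx, vals, x))        # [3.0, 3.0, 9.0]
print(dispatch("spmv_csr", row_ptr, col_idx, vals, x))  # [3.0, 3.0, 9.0]
```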
LiLAC provides automatic data marshaling, maintaining state between calls and
minimizing data transfers. Appropriate places for library insertion are
detected in the compiler's intermediate representation, independently of the
source language.
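Keeping marshaling state between calls matters most in iterative solvers that multiply by the same matrix many times. This Python sketch, with hypothetical names (`ResidentMatrix`, `update_value`, `matrix_uploads`) and a simulated device copy, uploads the matrix once and re-uploads it only after the host copy changes. It illustrates the general caching idea, not LiLAC's generated marshaling code.

```python
class ResidentMatrix:
    """Hypothetical marshaling state kept between library calls: the CSR
    matrix is copied to a simulated device once and reused until the host
    copy changes; only the input vector is consumed fresh on every call."""

    def __init__(self, row_ptr, col_idx, vals):
        self.host = (list(row_ptr), list(col_idx), list(vals))
        self.device = None  # simulated device-resident copy
        self.dirty = True   # host copy is newer than the device copy
        self.matrix_uploads = 0

    def update_value(self, k, v):
        # Assumption: host-side writes go through this hook. A real system
        # would need some way to detect such writes itself.
        self.host[2][k] = v
        self.dirty = True

    def spmv(self, x):
        if self.dirty:  # "transfer" the matrix only when it changed
            self.device = tuple(list(a) for a in self.host)
            self.matrix_uploads += 1
            self.dirty = False
        row_ptr, col_idx, vals = self.device
        return [sum((vals[k] * x[col_idx[k]]
                     for k in range(row_ptr[i], row_ptr[i + 1])), 0.0)
                for i in range(len(row_ptr) - 1)]


# Usage: ten solver-style calls cost one matrix upload.
A = ResidentMatrix([0, 2, 3, 5], [0, 2, 1, 0, 2], [1.0, 2.0, 3.0, 4.0, 5.0])
x = [1.0, 1.0, 1.0]
for _ in range(10):
    y = A.spmv(x)
print(y, A.matrix_uploads)  # [3.0, 3.0, 9.0] 1
A.update_value(1, 0.0)      # host-side change forces one re-upload
print(A.spmv(x), A.matrix_uploads)  # [1.0, 3.0, 9.0] 2
```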
We evaluated LiLAC on large-scale scientific applications written in FORTRAN;
standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across
heterogeneous platforms, applications, and data sets, we show speedups from
1.1× to over 10× without user intervention.
Comment: Accepted to CC 202