Stripe: Tensor Compilation via the Nested Polyhedral Model
Hardware architectures and machine learning (ML) libraries evolve rapidly.
Traditional compilers often fail to generate high-performance code across the
spectrum of new hardware offerings. To mitigate this, engineers develop hand-tuned
kernels for each ML library update and hardware upgrade. Unfortunately, this
approach requires excessive engineering effort to scale or maintain with any
degree of state-of-the-art performance. Here we present a Nested Polyhedral
Model for representing highly parallelizable computations with limited
dependencies between iterations. This model provides an underlying framework
for an intermediate representation (IR) called Stripe, amenable to standard
compiler techniques while naturally modeling key aspects of modern ML
computing. Stripe represents parallelism, efficient memory layout, and multiple
compute units at a level of abstraction amenable to automatic optimization. We
describe how Stripe enables a compiler for ML in the style of LLVM that allows
independent development of algorithms, optimizations, and hardware
accelerators. We also discuss the design exploration advantages of Stripe over
kernel libraries and schedule-based or schedule-space-based code generation.
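A minimal conceptual sketch in Python (not Stripe's actual IR syntax; the Block and tile names are illustrative) of the nested polyhedral idea: a computation is a nest of blocks of parallel index ranges, and a tiling transform simply adds another nesting level that can be mapped to a compute unit or memory level.

```python
# Conceptual sketch of a nested-block representation in the spirit of the
# nested polyhedral model; names (Block, tile) are illustrative, not Stripe's API.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Block:
    name: str
    idxs: Dict[str, int]                              # index name -> range; iterations are independent
    stmts: List[str] = field(default_factory=list)    # innermost statements
    inner: List["Block"] = field(default_factory=list)

def tile(block: Block, idx: str, size: int) -> Block:
    """Split one index into an outer loop over tiles plus an inner nested block."""
    outer_range = (block.idxs[idx] + size - 1) // size
    inner = Block(block.name + "_inner",
                  {**block.idxs, idx: size},
                  block.stmts, block.inner)
    return Block(block.name + "_outer", {idx: outer_range}, inner=[inner])

matmul = Block("matmul", {"i": 1024, "j": 1024, "k": 1024},
               stmts=["C[i, j] += A[i, k] * B[k, j]"])
tiled = tile(matmul, "i", 32)   # one extra nesting level, e.g. mapping tiles to compute units
```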
FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs
In recent years, there has been a surge of machine learning applications in
industry. Many of them are based on popular AI frameworks like Tensorflow,
Torch, Caffe, or MxNet, and are empowered by accelerator platforms such as
GPUs. One important challenge of running Tensorflow computations on GPUs is the
fine-granularity problem: the FLOPs of individual ops are far from enough
to fully exploit the computing power of underlying accelerators. The XLA
framework provides a solid foundation to explore this problem further. In this
paper, we propose FusionStitching, a novel, comprehensive Op fusion and code
generation system to stitch computations into large GPU kernels. Experimental
results on four public models and two of our large in-house applications show
another 55% (geometric mean) reduction of GPU kernel launches, compared to the
XLA fusion baseline. This improves the E2E performance of both of our
latency-critical in-house applications by up to 20%.
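FusionStitching itself is not a public API; as a hedged illustration of the XLA fusion baseline it builds on, the sketch below uses TensorFlow's standard jit_compile flag so XLA fuses a chain of elementwise ops into far fewer GPU kernel launches.

```python
# Sketch of the XLA fusion baseline (standard TensorFlow API, not FusionStitching):
# jit_compile=True hands the traced computation to XLA, which fuses the elementwise
# ops below into a small number of kernels instead of one launch per op.
import tensorflow as tf

@tf.function(jit_compile=True)
def fused_bias_gelu(x, b):
    y = x + b                                                 # elementwise add
    return 0.5 * y * (1.0 + tf.math.erf(y / tf.sqrt(2.0)))    # GELU, also elementwise

x = tf.random.normal([1024, 1024])
b = tf.random.normal([1024])
out = fused_bias_gelu(x, b)
```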
Relay: A High-Level Compiler for Deep Learning
Frameworks for writing, compiling, and optimizing deep learning (DL) models
have recently enabled progress in areas like computer vision and natural
language processing. Extending these frameworks to accommodate the rapidly
diversifying landscape of DL models and hardware platforms presents challenging
tradeoffs between expressivity, composability, and portability. We present
Relay, a new compiler framework for DL. Relay's functional, statically typed
intermediate representation (IR) unifies and generalizes existing DL IRs to
express state-of-the-art models. The introduction of Relay's expressive IR
requires careful design of domain-specific optimizations, addressed via Relay's
extension mechanisms. Using these extension mechanisms, Relay supports a
unified compiler that can target a variety of hardware platforms. Our
evaluation demonstrates Relay's competitive performance for a broad class of
models and devices (CPUs, GPUs, and emerging accelerators). Relay's design
demonstrates how a unified IR can provide expressivity, composability, and
portability without compromising performance.
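As a concrete illustration of this flow using the public tvm.relay Python API (the layer shapes and the llvm target here are placeholders), a small model can be expressed in Relay's typed IR and compiled for a chosen backend:

```python
# Build a tiny model in Relay's typed IR and compile it with TVM
# (standard tvm.relay API; shapes and the "llvm" target are illustrative).
import tvm
from tvm import relay

x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
w = relay.var("w", shape=(16, 3, 3, 3), dtype="float32")
y = relay.nn.relu(relay.nn.conv2d(x, w, padding=(1, 1)))
func = relay.Function([x, w], y)

mod = tvm.IRModule.from_expr(func)
mod = relay.transform.InferType()(mod)           # static type/shape inference
with tvm.transform.PassContext(opt_level=3):     # graph-level optimizations
    lib = relay.build(mod, target="llvm")        # swap the target for GPUs or accelerators
```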
A Hardware-Software Blueprint for Flexible Deep Learning Specialization
Specialized Deep Learning (DL) acceleration stacks, designed for a specific
set of frameworks, model architectures, operators, and data types, offer the
allure of high performance while sacrificing flexibility. Changes in
algorithms, models, operators, or numerical systems threaten the viability of
specialized hardware accelerators. We propose VTA, a programmable deep learning
architecture template designed to be extensible in the face of evolving
workloads. VTA achieves this flexibility via a parametrizable architecture,
two-level ISA, and a JIT compiler. The two-level ISA is based on (1) a task-ISA
that explicitly orchestrates concurrent compute and memory tasks and (2) a
microcode-ISA which implements a wide variety of operators with single-cycle
tensor-tensor operations. Next, we propose a runtime system equipped with a JIT
compiler for flexible code-generation and heterogeneous execution that enables
effective use of the VTA architecture. VTA is integrated into and open-sourced as part of
Apache TVM, a state-of-the-art deep learning compilation stack that provides
flexibility for diverse models and divergent hardware backends. We propose a
flow that performs design space exploration to generate a customized hardware
architecture and software operator library that can be leveraged by mainstream
learning frameworks. We demonstrate our approach by deploying optimized deep
learning models used for object classification and style transfer on edge-class
FPGAs.
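VTA's actual instruction encoding is not reproduced here; the hypothetical Python sketch below only illustrates the two-level idea: coarse task-ISA instructions carry dependence flags so loads, GEMMs, and stores can overlap, while each GEMM task references a microcode routine for its tensor-tensor operation.

```python
# Hypothetical illustration of a two-level instruction stream in the spirit of
# VTA's task-ISA: each task names a queue, carries dependence flags toward its
# neighbor queues, and GEMM tasks reference a micro-op routine (microcode-ISA).
from dataclasses import dataclass

@dataclass
class Task:
    op: str              # "LOAD", "GEMM", or "STORE"
    queue: str           # "load", "compute", or "store"
    pop_prev: bool       # wait for a token from the upstream queue
    push_next: bool      # signal a token to the downstream queue
    micro_op: str = ""   # microcode routine used by GEMM tasks

program = [
    Task("LOAD", "load", pop_prev=False, push_next=True),     # fetch tile 0
    Task("LOAD", "load", pop_prev=False, push_next=True),     # fetch tile 1
    Task("GEMM", "compute", pop_prev=True, push_next=True,
         micro_op="dense_16x16"),                             # compute on tile 0
    Task("GEMM", "compute", pop_prev=True, push_next=True,
         micro_op="dense_16x16"),                             # compute on tile 1 (overlaps next load/store)
    Task("STORE", "store", pop_prev=True, push_next=False),   # write results back
]
```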
FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads
Performance optimization is the art of continuously seeking a harmonious
mapping between the application domain and hardware. Recent years have
witnessed a surge of deep learning (DL) applications in industry. Conventional
wisdom for optimizing such workloads mainly focuses on compute-intensive ops
(GEMM, Convolution, etc.). Yet we show in this work that the performance of
memory intensive computations is vital to E2E performance in practical DL
models. We propose \emph{FusionStitching}, an optimization framework capable of
fusing memory-intensive \emph{elementwise}, \emph{reduction} and fine-grained
\emph{GEMM/Batched-GEMM} ops, with or without data dependences, into large
computation units, then mapping and transforming them into efficient GPU
kernels. We formulate the fusion plan optimization as an integer linear
programming (ILP) problem, and propose a set of empirical heuristics to reduce
the combinatorial search space. In order to map optimized fusion plans to
hardware, we propose a technique to effectively compose various groups of
computations into a single GPU kernel, by fully leveraging on-chip resources
like scratchpads or registers. Experimental results on six benchmarks and four
industry-scale practical models are encouraging. Overall,
\emph{FusionStitching} can reach up to 5.7x speedup compared to Tensorflow
baseline, and achieves 1.25x to 1.85x performance speedups compared to the current
state of the art, with 1.4x on average (geometric mean).
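The paper's ILP formulation is not reproduced here; the sketch below is a hypothetical, brute-force rendering of the same idea: enumerate which edges of a small op chain to fuse, reject plans whose intermediates exceed an assumed on-chip budget, and pick the plan with the lowest estimated global-memory traffic.

```python
# Hypothetical fusion-plan search (illustrative cost model, not FusionStitching's ILP):
# fusing an edge keeps the intermediate tensor on chip instead of round-tripping it
# through global memory, subject to an assumed shared-memory budget per fused kernel.
from itertools import product

ops = ["add", "mul", "reduce_sum", "broadcast", "sub"]    # a memory-intensive chain
intermediate_bytes = [4 * 1024 * 1024] * (len(ops) - 1)   # bytes of each intermediate tensor
SHARED_MEM_BUDGET = 6 * 1024 * 1024                       # assumed on-chip budget per fused kernel

def group_sizes(fuse_edges):
    """Total intermediate bytes each fused group must keep on chip."""
    sizes, cur = [], 0
    for b, fused in zip(intermediate_bytes, fuse_edges):
        if fused:
            cur += b
        else:
            sizes.append(cur)
            cur = 0
    sizes.append(cur)
    return sizes

def feasible(fuse_edges):
    return all(s <= SHARED_MEM_BUDGET for s in group_sizes(fuse_edges))

def traffic(fuse_edges):
    """Unfused edges pay a global-memory write plus a read for the intermediate."""
    return sum(2 * b for b, fused in zip(intermediate_bytes, fuse_edges) if not fused)

plans = [p for p in product([False, True], repeat=len(ops) - 1) if feasible(p)]
best = min(plans, key=traffic)
print("fuse edges:", best, "estimated traffic (bytes):", traffic(best))
```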
DNNVM : End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators
The convolutional neural network (CNN) has become a state-of-the-art method
for several artificial intelligence domains in recent years. The increasingly
complex CNN models are both computation-bound and I/O-bound. FPGA-based
accelerators driven by custom instruction set architecture (ISA) achieve a
balance between generality and efficiency, but much remains to be
optimized. We propose the full-stack compiler DNNVM, which is an integration of
optimizers for graphs, loops and data layouts, and an assembler, a runtime
supporter and a validation environment. The DNNVM works in the context of deep
learning frameworks and transforms CNN models into a directed acyclic graph called
XGraph. Based on XGraph, we transform the optimization challenges for both the
data layout and pipeline into graph-level problems. DNNVM enumerates all
potentially profitable fusion opportunities by a heuristic subgraph isomorphism
algorithm to leverage pipeline and data layout optimizations, and searches for
the best choice of execution strategies of the whole computing graph. On the
Xilinx ZU2 @330 MHz and ZU9 @330 MHz, we achieve state-of-the-art performance on
our benchmarks even with naïve implementations without optimizations, and the
throughput is further improved by up to 1.26x by leveraging heterogeneous
optimizations in DNNVM. Finally, with ZU9 @330 MHz, we achieve state-of-the-art
performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an
energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38
TOPs/s for ResNet50 and 1.41 TOPs/s for GoogleNet.
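The heuristic subgraph isomorphism algorithm is not detailed in the abstract; as a hedged illustration of enumerating fusion opportunities, the sketch below scans a tiny op DAG for a hypothetical conv→relu→pool template.

```python
# Hypothetical illustration of enumerating fusion candidates in an op DAG
# (the pattern and graph encoding are made up, not DNNVM's XGraph format).
graph = {                       # node -> (op type, list of input nodes)
    "c1": ("conv", ["in"]),
    "r1": ("relu", ["c1"]),
    "p1": ("pool", ["r1"]),
    "c2": ("conv", ["p1"]),
    "r2": ("relu", ["c2"]),
}

def match_chain(graph, pattern):
    """Yield node chains whose op types match the pattern along single-consumer edges."""
    for start in graph:
        chain, node = [start], start
        while len(chain) < len(pattern):
            succs = [n for n, (_, ins) in graph.items() if ins == [node]]
            if len(succs) != 1:
                break
            node = succs[0]
            chain.append(node)
        if [graph[n][0] for n in chain] == list(pattern):
            yield chain

print(list(match_chain(graph, ("conv", "relu", "pool"))))   # [['c1', 'r1', 'p1']]
```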
Optimizing Data-Intensive Computations in Existing Libraries with Split Annotations
Data movement between main memory and the CPU is a major bottleneck in
parallel data-intensive applications. In response, researchers have proposed
using compilers and intermediate representations (IRs) that apply optimizations
such as loop fusion under existing high-level APIs such as NumPy and
TensorFlow. Even though these techniques generally do not require changes to
user applications, they require intrusive changes to the library itself: often,
library developers must rewrite each function using a new IR. In this paper, we
propose a new technique called split annotations (SAs) that enables key data
movement optimizations over unmodified library functions. SAs only require
developers to annotate functions and implement an API that specifies how to
partition data in the library. The annotation and API describe how to enable
cross-function data pipelining and parallelization, while respecting each
function's correctness constraints. We implement a parallel runtime for SAs in
a system called Mozart. We show that Mozart can accelerate workloads in
libraries such as Intel MKL and Pandas by up to 15x, with no library
modifications. Mozart also provides performance gains competitive with
solutions that require rewriting libraries, and can sometimes outperform these
systems by up to 2x by leveraging existing hand-optimized code.
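Mozart's actual annotation API is not reproduced here; the sketch below is a hypothetical Python rendering of the split-annotation idea: a decorator records how to partition inputs and merge outputs, and a small runner pipelines unmodified NumPy-backed functions one chunk at a time so intermediates stay cache-resident.

```python
# Hypothetical sketch of split annotations (illustrative names, not Mozart's API):
# the annotation states how to partition inputs and merge outputs; the runner then
# pipelines several annotated functions over one chunk at a time.
import numpy as np

def split_annotation(splitter, merger):
    def wrap(fn):
        fn._split, fn._merge = splitter, merger
        return fn
    return wrap

def split_rows(x, n):            # partition along the first axis
    return np.array_split(x, n)

# Elementwise functions are safe to execute per partition.
@split_annotation(split_rows, np.concatenate)
def log_scale(x):
    return np.log1p(x) * 2.0

@split_annotation(split_rows, np.concatenate)
def clip(x):
    return np.clip(x, 0.0, 1.0)

def run_pipeline(funcs, data, n_chunks=8):
    """Apply every function to one chunk before moving on, keeping chunks in cache."""
    outs = []
    for chunk in funcs[0]._split(data, n_chunks):
        for fn in funcs:
            chunk = fn(chunk)
        outs.append(chunk)
    return funcs[-1]._merge(outs)

result = run_pipeline([log_scale, clip], np.random.rand(1_000_000))
```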
Automatic Full Compilation of Julia Programs and ML Models to Cloud TPUs
Google's Cloud TPUs are a promising new hardware architecture for machine
learning workloads. They have powered many of Google's milestone machine
learning achievements in recent years. Google has now made TPUs available for
general use on their cloud platform and as of very recently has opened them up
further to allow use by non-TensorFlow frontends. We describe a method and
implementation for offloading suitable sections of Julia programs to TPUs via
this new API and the Google XLA compiler. Our method is able to completely fuse
the forward pass of a VGG19 model expressed as a Julia program into a single
TPU executable to be offloaded to the device. Our method composes well with
existing compiler-based automatic differentiation techniques on Julia code, and
we are thus able to also automatically obtain the VGG19 backwards pass and
similarly offload it to the TPU. Targeting TPUs using our compiler, we are able
to evaluate the VGG19 forward pass on a batch of 100 images in 0.23s, which
compares favorably to the 52.4s required for the original model on the CPU. Our
implementation is less than 1000 lines of Julia, with no TPU specific changes
made to the core Julia compiler or any other Julia packages.
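The implementation described above is in Julia; as an analogous, clearly substituted Python illustration of handing an entire forward pass to XLA for execution on an accelerator, JAX's jit provides the same offload flow.

```python
# Not the paper's Julia implementation: a JAX sketch of the same XLA-offload idea,
# where jit traces the whole function and XLA compiles it into one executable
# for the available backend (TPU, GPU, or CPU).
import jax
import jax.numpy as jnp

@jax.jit
def forward(params, x):
    # toy two-layer MLP standing in for a full model's forward pass
    h = jnp.tanh(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

key = jax.random.PRNGKey(0)
params = {
    "w1": jax.random.normal(key, (784, 256)),
    "b1": jnp.zeros(256),
    "w2": jax.random.normal(key, (256, 10)),
    "b2": jnp.zeros(10),
}
x = jax.random.normal(key, (100, 784))   # batch of 100, echoing the VGG19 evaluation setup
logits = forward(params, x)              # compiled once, then executed on the device
```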
Project Beehive: A Hardware/Software Co-designed Stack for Runtime and Architectural Research
The end of Dennard scaling combined with stagnation in architectural and
compiler optimizations makes it challenging to achieve significant performance
deltas. Solutions based solely in hardware or software are no longer sufficient
to maintain the pace of improvements seen during the past few decades. In
hardware, the end of single-core scaling resulted in the proliferation of
multi-core system architectures; however, this has forced complex parallel
programming techniques into the mainstream. To further exploit physical
resources, systems are becoming increasingly heterogeneous with specialized
computing elements and accelerators. Programming across a range of disparate
architectures requires a new level of abstraction that programming languages
will have to adapt to. In software, emerging complex applications, from domains
such as Big Data and computer vision, run on multi-layered software stacks
targeting hardware with a variety of constraints and resources. Hence,
optimizing for the power-performance (and resiliency) space requires
experimentation platforms that offer quick and easy prototyping of
hardware/software co-designed techniques. To that end, we present Project
Beehive: A Hardware/Software co-designed stack for runtime and architectural
research. Project Beehive utilizes various state-of-the-art software and
hardware components along with novel and extensible co-design techniques. The
objective of Project Beehive is to provide a modern platform for
experimentation on emerging applications, programming languages, compilers,
runtimes, and low-power heterogeneous many-core architectures in a full-system
co-designed manner.
DLVM: A modern compiler infrastructure for deep learning systems
Deep learning software demands reliability and performance. However, many of
the existing deep learning frameworks are software libraries that act as an
unsafe DSL in Python and a computation graph interpreter. We present DLVM, a
design and implementation of a compiler infrastructure with a linear algebra
intermediate representation, algorithmic differentiation by adjoint code
generation, domain-specific optimizations and a code generator targeting GPU
via LLVM. Designed as a modern compiler infrastructure inspired by LLVM, DLVM
is more modular and more generic than existing deep learning compiler
frameworks, and supports tensor DSLs with high expressivity. With our
prototypical staged DSL embedded in Swift, we argue that the DLVM system
enables a form of modular, safe, and performant frameworks for deep learning.
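DLVM itself is implemented in Swift over LLVM; as a language-neutral, hedged illustration of adjoint code generation, the sketch below records a tape during forward execution of a tiny expression and emits its adjoint statements in reverse order.

```python
# Minimal reverse-mode AD sketch (illustrative only, not DLVM's IR): forward
# execution records a tape of primitive ops; the adjoint pass walks it backwards,
# accumulating gradients, mirroring adjoint code generation.
import math

def forward(x, y):
    tape = []
    a = x * y;        tape.append(("mul", "a", ("x", "y")))
    b = math.sin(a);  tape.append(("sin", "b", ("a",)))
    vals = {"x": x, "y": y, "a": a, "b": b}
    return vals, tape

def adjoint(vals, tape, out="b"):
    grads = {name: 0.0 for name in vals}
    grads[out] = 1.0
    for op, res, args in reversed(tape):          # emit adjoints in reverse order
        if op == "mul":
            u, v = args
            grads[u] += grads[res] * vals[v]
            grads[v] += grads[res] * vals[u]
        elif op == "sin":
            (u,) = args
            grads[u] += grads[res] * math.cos(vals[u])
    return grads

vals, tape = forward(1.5, 2.0)
print(adjoint(vals, tape))   # d(sin(x*y))/dx = y*cos(x*y), d/dy = x*cos(x*y)
```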