1,960 research outputs found
A Multilevel Introspective Dynamic Optimization System For Holistic Power-Aware Computing
Power consumption is rapidly becoming the dominant limiting factor for
further improvements in computer design. Curiously, this applies both
at the "high end" of workstations and servers and the "low end" of
handheld devices and embedded computers. At the high-end, the
challenge lies in dealing with exponentially growing power
densities. At the low-end, there is a demand to make mobile devices
more powerful and longer lasting, but battery technology is not
improving at the same
rate that power consumption is rising. Traditional power-management
research is fragmented; techniques are being developed at specific
levels, without fully exploring their synergy with other levels.
Most software techniques target either operating systems or
compilers but do not explore the interaction between the two
layers. These techniques also have not fully explored the potential
of virtual machines for power management.
In contrast, we are developing
a system that integrates information from multiple levels of software
and hardware, connecting these levels through a communication
channel. At the heart of this
system are a virtual machine that compiles and dynamically profiles
code, and an optimizer that reoptimizes
all code, including that of applications and the virtual machine itself.
We believe this introspective, holistic approach
enables more informed power-management decisions
PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning
With the emergence of a spectrum of high-end mobile devices, many
applications that formerly required desktop-level computation capability are
being transferred to these devices. However, executing the inference of Deep
Neural Networks (DNNs) is still challenging considering high computation and
storage demands, specifically, if real-time performance with high accuracy is
needed. Weight pruning of DNNs is proposed, but existing schemes represent two
extremes in the design space: non-structured pruning is fine-grained, accurate,
but not hardware friendly; structured pruning is coarse-grained,
hardware-efficient, but with higher accuracy loss. In this paper, we introduce
a new dimension, fine-grained pruning patterns inside the coarse-grained
structures, revealing a previously unknown point in design space. With the
higher accuracy enabled by fine-grained pruning patterns, the unique insight is
to use the compiler to re-gain and guarantee high hardware efficiency. In other
words, our method achieves the best of both worlds, and is desirable across
theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an
end-to-end framework to efficiently execute DNN on mobile devices with the help
of a novel model compression technique (pattern-based pruning based on extended
ADMM solution framework) and a set of thorough architecture-aware compiler- and
code generation-based optimizations (filter kernel reordering, compressed
weight storage, register load redundancy elimination, and parameter
auto-tuning). Evaluation results demonstrate that PatDNN outperforms three
state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba
Mobile Neural Network with speedup up to 44.5x, 11.4x, and 7.1x, respectively,
with no accuracy compromise. Real-time inference of representative large-scale
DNNs (e.g., VGG-16, ResNet-50) can be achieved using mobile devices.Comment: To be published in the Proceedings of Twenty-Fifth International
Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS 20
Energy Transparency for Deeply Embedded Programs
Energy transparency is a concept that makes a program's energy consumption
visible, from hardware up to software, through the different system layers.
Such transparency can enable energy optimizations at each layer and between
layers, and help both programmers and operating systems make energy-aware
decisions. In this paper, we focus on deeply embedded devices, typically used
for Internet of Things (IoT) applications, and demonstrate how to enable energy
transparency through existing Static Resource Analysis (SRA) techniques and a
new target-agnostic profiling technique, without hardware energy measurements.
Our novel mapping technique enables software energy consumption estimations at
a higher level than the Instruction Set Architecture (ISA), namely the LLVM
Intermediate Representation (IR) level, and therefore introduces energy
transparency directly to the LLVM optimizer. We apply our energy estimation
techniques to a comprehensive set of benchmarks, including single- and also
multi-threaded embedded programs from two commonly used concurrency patterns,
task farms and pipelines. Using SRA, our LLVM IR results demonstrate a high
accuracy with a deviation in the range of 1% from the ISA SRA. Our profiling
technique captures the actual energy consumption at the LLVM IR level with an
average error of 3%.Comment: 33 pages, 7 figures. arXiv admin note: substantial text overlap with
arXiv:1510.0709
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse due to multiple
proprietary vendor implementations with different characteristics, and, thus,
required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications when compiled using pocl were faster or close to
as fast as the best proprietary OpenCL implementation for the platform at hand.Comment: This article was published in 2015; it is now openly accessible via
arxi
Secure Compilation (Dagstuhl Seminar 18201)
Secure compilation is an emerging field that puts together advances in
security, programming languages, verification, systems, and hardware
architectures in order to devise secure compilation chains that
eliminate many of today\u27s vulnerabilities.
Secure compilation aims to protect a source language\u27s abstractions in
compiled code, even against low-level attacks.
For a concrete example, all modern languages provide a notion of
structured control flow and an invoked procedure is expected to return
to the right place.
However, today\u27s compilation chains (compilers, linkers, loaders,
runtime systems, hardware) cannot efficiently enforce this
abstraction: linked low-level code can call and return to arbitrary
instructions or smash the stack, blatantly violating the high-level
abstraction.
The emerging secure compilation community aims to address such
problems by devising formal security criteria, efficient enforcement
mechanisms, and effective proof techniques.
This seminar strived to take a broad and inclusive view of secure
compilation and to provide a forum for discussion on the topic. The
goal was to identify interesting research directions and open
challenges by bringing together people working on building secure
compilation chains, on developing proof techniques and verification
tools, and on designing security mechanisms
- …