Compiling Deep Learning Models for Custom Hardware Accelerators
Convolutional neural networks (CNNs) are the core of most state-of-the-art
deep learning algorithms specialized for object detection and classification.
CNNs are both computationally complex and embarrassingly parallel, two
properties that leave room for potential software and hardware optimizations
for embedded systems. Given a programmable hardware accelerator with a CNN
oriented custom instruction set, the compiler's task is to exploit the
hardware's full potential while abiding by the hardware constraints and
maintaining generality to run different CNN models with varying workload
properties. Snowflake is an efficient and scalable hardware accelerator
implemented on programmable logic devices. It implements a control pipeline for
a custom instruction set. The goal of this paper is to present Snowflake's
compiler that generates machine level instructions from Torch7 model
description files. The main software design points explored in this work are:
model structure parsing, CNN workload breakdown, loop rearrangement for memory
bandwidth optimizations, and memory access balancing. The performance achieved
by compiler-generated instructions matches that of hand-optimized code for
convolution layers. Generated instructions also efficiently execute AlexNet and
ResNet18 inference on Snowflake. Snowflake with processing units was
synthesized on Xilinx's Zynq XC7Z045 FPGA. At MHz, AlexNet achieved in
frames/s and GB/s of off-chip memory bandwidth, and
frames/s and GB/s for ResNet18. Total on-chip power is W.
Comment: 8 pages
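The CNN workload breakdown the compiler performs can be sketched as follows. This is an illustrative model only, not Snowflake's actual compiler: the buffer size, word width, and function names are invented assumptions.

```python
# Illustrative sketch (not Snowflake's compiler): split a convolution layer's
# output channels into tiles whose weights fit an assumed on-chip weight
# buffer, so each tile runs without refetching weights from off-chip memory.

def tile_output_channels(out_channels, in_channels, k, buffer_bytes, word_bytes=2):
    """Largest group of output channels whose k x k weights fit the buffer."""
    weights_per_oc = in_channels * k * k * word_bytes
    tile = max(1, buffer_bytes // weights_per_oc)
    return min(tile, out_channels)

def breakdown(out_channels, in_channels, k, buffer_bytes):
    """Split one layer into (start_channel, channel_count) work items."""
    tile = tile_output_channels(out_channels, in_channels, k, buffer_bytes)
    return [(start, min(tile, out_channels - start))
            for start in range(0, out_channels, tile)]

# Example: an AlexNet-like layer with 256 output and 192 input channels.
work_items = breakdown(256, 192, 3, buffer_bytes=128 * 1024)
```

Each work item then maps to a run of accelerator instructions; rearranging the loop over tiles is where the memory-bandwidth optimizations mentioned above would apply.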
Design space exploration tools for the ByoRISC configurable processor family
In this paper, the ByoRISC (Build your own RISC) configurable
application-specific instruction-set processor (ASIP) family is presented.
ByoRISCs, as vendor-independent cores, provide extensive architectural
parameters over a baseline processor, which can be customized by
application-specific hardware extensions (ASHEs). Such extensions realize
multi-input multi-output (MIMO) custom instructions with local state and
load/store accesses to the data memory. ByoRISCs incorporate a true multi-port
register file, zero-overhead custom instruction decoding, and scalable data
forwarding mechanisms. Given these design decisions, ByoRISCs provide a unique
combination of features that allow their use as architectural testbeds and the
seamless and rapid development of new high-performance ASIPs.
The performance characteristics of ByoRISCs, implemented as
vendor-independent cores, have been evaluated for both ASIC and FPGA
implementations, and it is proved that they provide a viable solution in
FPGA-based system-on-a-chip design. A case study of an image processing
pipeline is also presented to highlight the process of utilizing a ByoRISC
custom processor. A peak performance speedup of up to 8.5 can be
observed, whereas an average performance speedup of 4.4 on Xilinx
Virtex-4 targets is achieved. In addition, ByoRISC outperforms an experimental
VLIW architecture named VEX even in its 16-wide configuration for a number of
data-intensive application kernels.
Comment: 12 pages, 14 figures, 7 tables. Unpublished paper on ByoRISC, an
extensible RISC with MIMO CIs that can outperform most mid-range VLIWs.
Unfortunately Prof. Jorg Henkel destroyed the potential of this submission by
using immoral tactics (neglecting his conflict of interest, changing
reviewers accepting the paper, and requesting impossible additions for the
average lifetime of an Earthling)
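The speedups reported above follow the usual custom-instruction arithmetic. A back-of-envelope model, with made-up cycle counts (not taken from the paper):

```python
# Back-of-envelope model: estimate the kernel speedup from collapsing a
# cluster of baseline instructions into one MIMO custom instruction (ASHE),
# Amdahl-style. The cycle numbers below are hypothetical.

def ci_speedup(base_cycles, replaced_cycles, ci_cycles):
    """Speedup when `replaced_cycles` of the baseline kernel are replaced
    by a custom instruction costing `ci_cycles` in total."""
    return base_cycles / (base_cycles - replaced_cycles + ci_cycles)

# Example: a 1000-cycle kernel where a custom instruction absorbs 800 cycles
# of software work and re-executes it in 100 cycles.
s = ci_speedup(1000, 800, 100)
```

This also shows why MIMO custom instructions with local state matter: the more of the kernel a single instruction can absorb, the larger `replaced_cycles` becomes relative to `ci_cycles`.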
Accelerating the Development of Software-Defined Network Optimization Applications Using SOL
Software-defined networking (SDN) can enable diverse network management
applications such as traffic engineering, service chaining, network function
outsourcing, and topology reconfiguration. Realizing the benefits of SDN for
these applications, however, entails addressing complex network optimizations
that are central to these problems. Unfortunately, such optimization problems
require significant manual effort and expertise to express, and non-trivial
computation and/or carefully crafted heuristics to solve. Our vision is to
simplify the deployment of SDN applications using general high-level
abstractions for capturing optimization requirements from which we can
efficiently generate optimal solutions. To this end, we present SOL, a
framework that demonstrates that it is indeed possible to simultaneously
achieve generality and efficiency. The insight underlying SOL is that SDN
applications can be recast within a unifying path-based optimization
abstraction, from which it efficiently generates near-optimal solutions, and
device configurations to implement those solutions. We illustrate the
generality of SOL by prototyping diverse and new applications. We show that SOL
simplifies the development of SDN-based network optimization applications and
provides comparable or better scalability than custom optimization solutions
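The path-based modeling style SOL builds on can be illustrated with a toy example. To be clear, this is not SOL's API: SOL solves an optimization over candidate paths, while the sketch below merely enumerates candidate paths per flow and assigns them greedily, with all names invented.

```python
# Toy illustration of a path-centric abstraction: each flow is expressed
# over candidate paths, and a greedy pass picks the first path with spare
# capacity. SOL itself generates near-optimal solutions; this only shows
# the modeling style.

def simple_paths(graph, src, dst, limit=3):
    """Enumerate up to `limit` simple paths from src to dst with DFS."""
    paths, stack = [], [(src, [src])]
    while stack and len(paths) < limit:
        node, path = stack.pop()
        if node == dst:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return paths

def place_flows(graph, capacity, flows, limit=3):
    """Route each (src, dst, demand) on the first candidate path with room."""
    load = {edge: 0 for edge in capacity}
    routing = {}
    for idx, (src, dst, demand) in enumerate(flows):
        for path in simple_paths(graph, src, dst, limit):
            edges = list(zip(path, path[1:]))
            if all(load[e] + demand <= capacity[e] for e in edges):
                for e in edges:
                    load[e] += demand
                routing[idx] = path
                break
    return routing, load

graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D']}
capacity = {('A', 'B'): 10, ('A', 'C'): 10, ('B', 'D'): 10, ('C', 'D'): 10}
routing, load = place_flows(graph, capacity, [('A', 'D', 8), ('A', 'D', 8)])
```

Traffic engineering, service chaining and the other applications above differ mainly in which paths are admissible and what objective is optimized over them.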
Generating and evaluating application-specific hardware extensions
Modern platform-based design involves the application-specific extension of
embedded processors to fit customer requirements. To accomplish this task, the
possibilities offered by recent custom/extensible processors for tuning their
instruction set and microarchitecture to the applications of interest have to
be exploited. A significant factor often determining the success of this
process is the automation available in application analysis and custom
instruction generation.
In this paper we present YARDstick, a design automation tool for custom
processor development flows that focuses on generating and evaluating
application-specific hardware extensions. YARDstick is a building block for
ASIP development, integrating application analysis, custom instruction
generation and selection with user-defined compiler intermediate
representations. In a YARDstick-enabled environment, practical issues in
traditional ASIP design are confronted efficiently; the exploration
infrastructure is liberated from compiler and simulator idiosyncrasies, since
the ASIP designer is empowered with the freedom of specifying the target
architectures of choice and adding new implementations of analyses and custom
instruction generation/selection methods. To illustrate the capabilities of the
YARDstick approach, we present interesting exploration scenarios: quantifying
the effect of machine-dependent compiler optimizations and the selection of the
target architecture in terms of operation set and memory model on custom
instruction generation/selection under different input/output constraints.
Comment: 11 pages, 15 figures, 5 tables. An unpublished journal paper
presenting the YARDstick custom instruction generation environment
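One step a tool of this kind performs is checking candidate instruction groups against register-file port budgets, which is what the input/output constraints above refer to. A minimal sketch, with a DAG encoding invented for illustration (not YARDstick's representation):

```python
# Hypothetical sketch: decide whether a candidate group of dataflow ops
# respects the input/output port constraints of a custom instruction.

def io_counts(dag, subset):
    """Count values entering the subset from outside, and results needed
    outside it (used by other ops or live out of the basic block)."""
    subset = set(subset)
    ins = {src for op in subset for src in dag[op]['in'] if src not in subset}
    outs = {op for op in subset
            if dag[op].get('live_out')
            or any(op in dag[user]['in'] for user in dag if user not in subset)}
    return len(ins), len(outs)

def feasible(dag, subset, max_in, max_out):
    """True if the candidate fits the custom-instruction port budget."""
    n_in, n_out = io_counts(dag, subset)
    return n_in <= max_in and n_out <= max_out

# Example block: a = x * y; b = a + z, with b live out of the block.
dag = {'a': {'in': ['x', 'y']},
       'b': {'in': ['a', 'z'], 'live_out': True}}
```

Sweeping `max_in`/`max_out` over this check is one way to reproduce the kind of constraint-sensitivity exploration the abstract describes.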
Mask Editor : an Image Annotation Tool for Image Segmentation Tasks
Deep convolutional neural networks (DCNNs) are the state-of-the-art method for
image segmentation, one of the key challenging computer vision tasks. However,
DCNNs require many training images with corresponding image masks to produce
good segmentation results. Image annotation software that is easy to
use and allows fast image mask generation is in great demand. To the best of
our knowledge, all existing image annotation tools support only drawing
bounding polygons, bounding boxes, or bounding ellipses to mark target objects.
These tools are inefficient when targeting objects that have
irregular shapes (e.g., defects in fabric images or tire images). In this paper
we design an easy-to-use image annotation software called Mask Editor for image
mask generation. Mask Editor allows drawing any bounding curve to mark objects
and improves efficiency to mark objects with irregular shapes. Mask Editor also
supports drawing bounding polygons, drawing bounding boxes, drawing bounding
ellipses, painting, erasing, super-pixel-marking, image cropping, multi-class
masks, mask loading, and mask modification
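Turning a drawn bounding curve into a binary mask is the core of such a tool. A minimal sketch, not Mask Editor's implementation, using even-odd scanline filling of a closed curve given as vertices:

```python
# Sketch: rasterize a closed curve (list of (x, y) vertices) into a binary
# mask with even-odd scanline filling. Pure Python for clarity.

def curve_to_mask(points, width, height):
    mask = [[0] * width for _ in range(height)]
    n = len(points)
    for y in range(height):
        xs = []
        for i in range(n):
            (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
            # Edge crosses this scanline (half-open test avoids double counts).
            if (y0 <= y < y1) or (y1 <= y < y0):
                xs.append(x0 + (y - y0) * (x1 - x0) / (y1 - y0))
        xs.sort()
        # Fill between alternating pairs of crossings.
        for left, right in zip(xs[::2], xs[1::2]):
            for x in range(max(0, int(left)), min(width, int(right) + 1)):
                mask[y][x] = 1
    return mask

# Example: a square drawn in an 8x8 image.
mask = curve_to_mask([(1, 1), (6, 1), (6, 6), (1, 6)], 8, 8)
```

A freehand curve from the UI would simply arrive as a dense vertex list, so the same routine covers irregular shapes like the fabric or tire defects mentioned above.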
Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays
FPGA vendors have recently started focusing on OpenCL for FPGAs because of
its ability to leverage the parallelism inherent to heterogeneous computing
platforms. OpenCL allows programs running on a host computer to launch
accelerator kernels which can be compiled at run-time for a specific
architecture, thus enabling portability. However, the prohibitive compilation
times (specifically the FPGA place and route times) are a major stumbling block
when using OpenCL tools from FPGA vendors. The long compilation times mean that
the tools cannot effectively use just-in-time (JIT) compilation or runtime
performance scaling. Coarse-grained overlays represent a possible solution by
virtue of their coarse granularity and fast compilation. In this paper, we
present a methodology for run-time compilation of OpenCL kernels to a DSP block
based coarse-grained overlay, rather than directly to the fine-grained FPGA
fabric. The proposed methodology allows JIT compilation and on-demand
resource-aware kernel replication to better utilize available overlay
resources, raising the abstraction level while reducing compile times
significantly. We further demonstrate that this approach can even be used for
run-time compilation of OpenCL kernels on the ARM processor of the embedded
heterogeneous Zynq device.
Comment: Presented at 3rd International Workshop on Overlay Architectures for
FPGAs (OLAF 2017) arXiv:1704.0880
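The resource-aware replication decision can be reduced to simple integer arithmetic. An illustrative sketch, not the paper's algorithm, with invented resource names and numbers:

```python
# Sketch: pick a kernel replication factor by dividing the overlay's
# available resources by the kernel's per-copy needs, resource by resource;
# the scarcest resource limits the count.

def replication_factor(kernel_needs, overlay_avail):
    """Number of kernel copies that fit in the overlay."""
    return min(overlay_avail[r] // kernel_needs[r] for r in kernel_needs)

# Example: each copy needs 6 DSP-based functional units and 200 routing
# nodes; the overlay offers 40 and 1000, so routing is the bottleneck.
copies = replication_factor({'fu': 6, 'route': 200}, {'fu': 40, 'route': 1000})
```

Because this check is trivial compared to FPGA place-and-route, it can run at JIT-compile time, which is what makes on-demand replication practical on the overlay.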
Aashiyana: Design and Evaluation of a Smart Demand-Response System for Highly-stressed Grids
This paper targets the unexplored problem of demand response within the
context of power-grids that are allowed to regularly enforce blackouts as a
means to balance supply with demand: highly-stressed grids. Currently these
utilities use a cyclic and binary (power/no-power) schedule over consumer
groups, leading to significant wastage of capacity and long hours without power.
We present here a novel building DLC system, Aashiyana, that can enforce
several user-defined low-power states. We evaluate distributed and centralized
load-shedding schemes using Aashiyana that can, compared to the current
load-shedding strategy, reduce the number of homes with no power by 80% with
only a minor change in the fraction of homes with full power
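The contrast with binary shedding can be made concrete with a small greedy sketch. This is a conceptual illustration, not Aashiyana's implementation; the states, demands and supply cap are invented:

```python
# Conceptual sketch: instead of cutting whole consumer groups to zero, step
# individual homes down through low-power states until aggregate demand
# fits the available supply.

def schedule_states(demands, states, supply):
    """Assign each home a state (fraction of full demand; `states` sorted
    high to low) by greedily stepping down the largest current draw."""
    level = [0] * len(demands)          # index into `states` per home

    def draw(i):
        return demands[i] * states[level[i]]

    while sum(draw(i) for i in range(len(demands))) > supply:
        can_drop = [i for i in range(len(demands)) if level[i] < len(states) - 1]
        if not can_drop:
            break                        # everyone already at the lowest state
        worst = max(can_drop, key=draw)
        level[worst] += 1
    return [states[l] for l in level]

# Example: three homes drawing 4, 3 and 3 kW against a 7 kW supply cap.
alloc = schedule_states([4, 3, 3], [1.0, 0.5, 0.0], supply=7)
```

In this toy case every home keeps at least half power, whereas a binary schedule meeting the same cap would have to black out at least one home entirely.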
DNNVM : End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators
The convolutional neural network (CNN) has become a state-of-the-art method
for several artificial intelligence domains in recent years. The increasingly
complex CNN models are both computation-bound and I/O-bound. FPGA-based
accelerators driven by custom instruction set architecture (ISA) achieve a
balance between generality and efficiency, but much about them remains to be
optimized. We propose the full-stack compiler DNNVM, which integrates
optimizers for graphs, loops and data layouts with an assembler, a runtime
supporter and a validation environment. DNNVM works in the context of deep
learning frameworks and transforms CNN models into the directed acyclic graph:
XGraph. Based on XGraph, we transform the optimization challenges for both the
data layout and pipeline into graph-level problems. DNNVM enumerates all
potentially profitable fusion opportunities by a heuristic subgraph isomorphism
algorithm to leverage pipeline and data layout optimizations, and searches for
the best choice of execution strategies of the whole computing graph. On the
Xilinx ZU2 @330 MHz and ZU9 @330 MHz, we achieve state-of-the-art-equivalent
performance on our benchmarks even with naïve, unoptimized implementations,
and the throughput is further improved by up to 1.26x by leveraging heterogeneous
optimizations in DNNVM. Finally, with ZU9 @330 MHz, we achieve state-of-the-art
performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an
energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38
TOPs/s for ResNet50 and 1.41 TOPs/s for GoogleNet.
Comment: 18 pages, 9 figures, 5 tables
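The fusion-enumeration idea can be sketched on a toy DAG. This is a hedged illustration of the general technique, not DNNVM's heuristic subgraph isomorphism algorithm; the graph encoding and names are invented:

```python
# Sketch: scan a DAG of ops for chains matching a fusion template, accepting
# a chain only when each intermediate result has a single consumer, so
# fusing it into one pipelined kernel is safe.

def find_fusions(graph, pattern):
    """graph: node -> (op_type, [input_nodes]); pattern: tuple of op types."""
    consumers = {}
    for node, (_, inputs) in graph.items():
        for src in inputs:
            consumers.setdefault(src, []).append(node)
    fusions = []
    for node, (op, _) in graph.items():
        if op != pattern[0]:
            continue
        chain, cur, ok = [node], node, True
        for want in pattern[1:]:
            users = consumers.get(cur, [])
            if len(users) == 1 and graph[users[0]][0] == want:
                cur = users[0]
                chain.append(cur)
            else:
                ok = False
                break
        if ok:
            fusions.append(tuple(chain))
    return fusions

# Example XGraph-like DAG: conv -> relu -> pool.
g = {'c1': ('conv', []), 'r1': ('relu', ['c1']), 'p1': ('pool', ['r1'])}
matches = find_fusions(g, ('conv', 'relu'))
```

Enumerating all such matches and costing each candidate is what turns the pipeline and data-layout choices into the graph-level search the abstract describes.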
Simulink model code generation for motor control applications
This article focuses on embedded software development using code generation: C code is generated from a Simulink model. The article describes the hardware interface of a Simulink model for AC motor control code generation. Results are presented on a platform with an ARM Cortex-M4 microcontroller, an inverter and a permanent magnet synchronous machine. Measurements of the speed control loop on a real machine are presented, and the utilization of the microcontroller is discussed in the conclusion
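The speed control loop measured above is typically a discrete-time PI controller. A conceptual sketch in Python of what the generated C code implements; the gains, limits and anti-windup scheme are illustrative choices, not taken from the article:

```python
# Sketch: discrete PI controller with output clamping and simple anti-windup,
# the usual core of a generated motor speed-control loop.

class PIController:
    def __init__(self, kp, ki, dt, out_min, out_max):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0

    def step(self, setpoint, measured):
        error = setpoint - measured
        unclamped = self.kp * error + self.integral + self.ki * self.dt * error
        out = min(max(unclamped, self.out_min), self.out_max)
        if out == unclamped:             # integrate only when not saturated
            self.integral += self.ki * self.dt * error
        return out

# Example: speed loop at a 1 kHz control rate, torque command limited to +/-1.
pi = PIController(kp=2.0, ki=10.0, dt=0.001, out_min=-1.0, out_max=1.0)
first = pi.step(setpoint=1.0, measured=0.0)    # large error -> output saturates
```

In the generated C code this logic would run in a fixed-rate interrupt, which is where the microcontroller utilization figures the article discusses come from.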
Effective Extensible Programming: Unleashing Julia on GPUs
GPUs and other accelerators are popular devices for accelerating
compute-intensive, parallelizable applications. However, programming these
devices is a difficult task. Writing efficient device code is challenging, and
is typically done in a low-level programming language. High-level languages are
rarely supported, or do not integrate with the rest of the high-level language
ecosystem. To overcome this, we propose compiler infrastructure to efficiently
add support for new hardware or environments to an existing programming
language.
We evaluate our approach by adding support for NVIDIA GPUs to the Julia
programming language. By integrating with the existing compiler, we
significantly lower the cost to implement and maintain the new compiler, and
facilitate reuse of existing application code. Moreover, use of the high-level
Julia programming language enables new and dynamic approaches for GPU
programming. This greatly improves programmer productivity, while maintaining
application performance similar to that of the official NVIDIA CUDA toolkit