1,294 research outputs found
Automatic Design of Application Specific Instruction Set Extensions Through Dataflow Graph Exploration
General-purpose processors are often incapable of achieving the challenging cost, performance, and power demands of high-performance applications. To meet these demands, most systems employ a number of hardware accelerators to off-load the computationally demanding portions of the application. As an alternative to this strategy, we examine customizing the computation capabilities of a processor for a particular application. The processor is extended with hardware in the form of a set of custom function units and instruction set extensions. To effectively identify opportunities for creating custom hardware, a dataflow graph design space exploration engine heuristically identifies candidate computation subgraphs without artificially constraining their size or shape. The engine combines estimates of performance gain, cost, and inherent limitations of the processor to grow candidate graphs in profitable directions while pruning unprofitable paths. This paper describes the dataflow graph exploration engine and evaluates its effectiveness across a set of embedded applications.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/44572/1/10766_2004_Article_476941.pd
Domain-Specific Computing Architectures and Paradigms
We live in an exciting era where artificial intelligence (AI) is fundamentally shifting the dynamics of industries and businesses around the world. AI algorithms such as deep learning (DL) have drastically advanced the state-of-the-art cognition and learning capabilities. However, the power of modern AI algorithms can only be enabled if the underlying domain-specific computing hardware can deliver orders of magnitude more performance and energy efficiency. This work focuses on this goal and explores three parts of the domain-specific computing acceleration problem; encapsulating specialized hardware and software architectures and paradigms that support the ever-growing processing demand of modern AI applications from the edge to the cloud.
This first part of this work investigates the optimizations of a sparse spatio-temporal (ST) cognitive system-on-a-chip (SoC). This design extracts ST features from videos and leverages sparse inference and kernel compression to efficiently perform action classification and motion tracking.
The second part of this work explores the significance of dataflows and reduction mechanisms for sparse deep neural network (DNN) acceleration. This design features a dynamic, look-ahead index matching unit in hardware to efficiently discover fine-grained parallelism, achieving high energy efficiency and low control complexity for a wide variety of DNN layers.
Lastly, this work expands the scope to real-time machine learning (RTML) acceleration. A new high-level architecture modeling framework is proposed. Specifically, this framework consists of a set of high-performance RTML-specific architecture design templates, and a Python-based high-level modeling and compiler tool chain for efficient cross-stack architecture design and exploration.PHDElectrical and Computer EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/162870/1/lchingen_1.pd
Methodology for complex dataflow application development
This thesis addresses problems inherent to the development of complex applications for reconfig- urable systems. Many projects fail to complete or take much longer than originally estimated by relying on traditional iterative software development processes typically used with conventional computers. Even though designer productivity can be increased by abstract programming and execution models, e.g., dataflow, development methodologies considering the specific properties of reconfigurable systems do not exist.
The first contribution of this thesis is a design methodology to facilitate systematic develop- ment of complex applications using reconfigurable hardware in the context of High-Performance Computing (HPC). The proposed methodology is built upon a careful analysis of the original application, a software model of the intended hardware system, an analytical prediction of performance and on-chip area usage, and an iterative architectural refinement to resolve identi- fied bottlenecks before writing a single line of code targeting the reconfigurable hardware. It is successfully validated using two real applications and both achieve state-of-the-art performance.
The second contribution extends this methodology to provide portability between devices in two steps. First, additional tool support for contemporary multi-die Field-Programmable Gate Arrays (FPGAs) is developed. An algorithm to automatically map logical memories to hetero- geneous physical memories with special attention to die boundaries is proposed. As a result, only the proposed algorithm managed to successfully place and route all designs used in the evaluation while the second-best algorithm failed on one third of all large applications. Second, best practices for performance portability between different FPGA devices are collected and evaluated on a financial use case, showing efficient resource usage on five different platforms.
The third contribution applies the extended methodology to a real, highly demanding emerging application from the radiotherapy domain. A Monte-Carlo based simulation of dose accumu- lation in human tissue is accelerated using the proposed methodology to meet the real time requirements of adaptive radiotherapy.Open Acces
Type-driven automated program transformations and cost modelling for optimising streaming programs on FPGAs
In this paper we present a novel approach to program optimisation based on compiler-based type-driven program transformations and a fast and accurate cost/performance model for the target architecture. We target streaming programs for the problem domain of scientific computing, such as numerical weather prediction. We present our theoretical framework for type-driven program transformation, our target high-level language and intermediate representation languages and the cost model and demonstrate the effectiveness of our approach by comparison with a commercial toolchain
GPU-based Architecture Modeling and Instruction Set Extension for Signal Processing Applications
The modeling of embedded systems attempts to estimate the performance and costs prior to the implementation. The early stage predictions for performance and power dissipation reduces the more costly late stage design modifications. Workload modeling is an approach where an abstract application is evaluated against an abstract architecture. The challenge in modeling is the balance between fidelity and simplicity, where fidelity refers to the correctness of the predictions and the simplicity relates to the simulation time of the model and its ease of comprehension for the developer. A model named GSLA for performance and power modeling is presented, which extends existing architecture modeling by including GPUs as parallel processing elements. The performance model showed an average fidelity of 93% and the power model demonstrated an average fidelity of 84% between the models and several application measurements. The GSLA model is very simple: only 2 parameters that can be obtained by automated scripts.
Besides the modeling, this thesis addresses lower level signal processing system improvements by proposing Instruction Set Architecture (ISA) extensions for RISC-V processors. A vehicle classifier neural network model was used as a case study, in which the benefit of Bit Manipulation Instructions (BMI) is shown. The result is a new PopCount instruction extension that is verified in ETISS simulator. The PopCount extension of RISC-V ISA showed a performance improvement of more than double for the vehicle classifier application. In addition, the design flow for adding a new instruction extension for a re-configurable platform is presented.
The GPU modeling and the RISC-V ISA extension added new features to the state of the art. They improve the modeling features as well as reduce the execution costs in signal processing platforms
Automated generation of an efficient MPEG-4 Reconfigurable Video Coding decoder implementation
International audienceThis paper proposes an automatic design flow from user-friendly design to efficient implementation of video processing systems. This design flow starts with the use of coarse-grain dataflow representations based on the CAL language, which is a complete language for dataflow programming of embedded systems. Our approach integrates previously developed techniques for detecting synchronous dataflow (SDF) regions within larger CAL networks, and exploiting the static structure of such regions using analysis tools in The Dataflow interchange format Package (TDP). Using a new XML format that we have developed to exchange dataflow information between different dataflow tools, we explore systematic implementation of signal processing systems using CAL, SDF-like region detection, TDP-based static scheduling, and CAL-to-C (CAL2C) translation. Our approach, which is a novel integration of three complementary dataflow tools -- the CAL parser, TDP, and CAL2C -- is demonstrated on an MPEG Reconfigurable Video Coding (RVC) decoder
Customizing the Computation Capabilities of Microprocessors.
Designers of microprocessor-based systems must constantly improve
performance and increase computational efficiency in their designs to
create value. To this end, it is increasingly common to see
computation accelerators in general-purpose processor
designs. Computation accelerators collapse portions of an
application's dataflow graph, reducing the critical path of
computations, easing the burden on processor resources, and reducing
energy consumption in systems. There are many problems associated with
adding accelerators to microprocessors, though. Design of
accelerators, architectural integration, and software support all
present major challenges.
This dissertation tackles these challenges in the context of
accelerators targeting acyclic and cyclic patterns of
computation. First, a technique to identify critical computation
subgraphs within an application set is presented. This technique is
hardware-cognizant and effectively generates a set of instruction set
extensions given a domain of target applications. Next, several
general-purpose accelerator structures are quantitatively designed
using critical subgraph analysis for a broad application set.
The next challenge is architectural integration of
accelerators. Traditionally, software invokes accelerators by
statically encoding new instructions into the application binary. This
is incredibly costly, though, requiring many portions of hardware and
software to be redesigned. This dissertation develops strategies to
utilize accelerators, without changing the instruction set. In the
proposed approach, the microarchitecture translates applications at
run-time, replacing computation subgraphs with microcode to utilize
accelerators. We explore the tradeoffs in performing difficult aspects
of the translation at compile-time, while retaining run-time
replacement. This culminates in a simple microarchitectural interface
that supports a plug-and-play model for integrating accelerators into
a pre-designed microprocessor.
Software support is the last challenge in dealing with computation
accelerators. The primary issue is difficulty in generating
high-quality code utilizing accelerators. Hand-written assembly code
is standard in industry, and if compiler support does exist, simple
greedy algorithms are common. In this work, we investigate more
thorough techniques for compiling for computation accelerators. Where
greedy heuristics only explore one possible solution, the techniques
in this dissertation explore the entire design space, when
possible. Intelligent pruning methods ensure that compilation is both
tractable and scalable.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/57633/2/ntclark_1.pd
- …