Search CORE

249 research outputs found

Automatic Design of Efficient Application-centric Architectures.

Author: Fan Kevin C.
Publication venue
Publication date
Field of study

As the market for embedded devices continues to grow, the demand for high performance, low cost, and low power computation grows as well. Many embedded applications perform computationally intensive tasks such as processing streaming video or audio, wireless communication, or speech recognition and must be implemented within tight power budgets. Typically, general purpose processors are not able to meet these performance and power requirements. Custom hardware in the form of loop accelerators are often used to execute the compute-intensive portions of these applications because they can achieve significantly higher levels of performance and power efficiency. Automated hardware synthesis from high level specifications is a key technology used in designing these accelerators, because the resulting hardware is correct by construction, easing verification and greatly decreasing time-to-market in the quickly evolving embedded domain. In this dissertation, a compiler-directed approach is used to design a loop accelerator from a C specification and a throughput requirement. The compiler analyzes the loop and generates a virtual architecture containing sufficient resources to sustain the required throughput. Next, a software pipelining scheduler maps the operations in the loop to the virtual architecture. Finally, the accelerator datapath is derived from the resulting schedule. In this dissertation, synthesis of different types of loop accelerators is investigated. First, the system for synthesizing single loop accelerators is detailed. In particular, a scheduler is presented that is aware of the effects of its decisions on the resulting hardware, and attempts to minimize hardware cost. Second, synthesis of multifunction loop accelerators, or accelerators capable of executing multiple loops, is presented. Such accelerators exploit coarse-grained hardware sharing across loops in order to reduce overall cost. Finally, synthesis of post-programmable accelerators is presented, allowing changes to be made to the software after an accelerator has been created. The tradeoffs between the flexibility, cost, and energy efficiency of these different types of accelerators are investigated. Automatically synthesized loop accelerators are capable of achieving order-of-magnitude gains in performance, area efficiency, and power efficiency over processors, and programmable accelerators allow software changes while maintaining highly efficient levels of computation.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61644/1/fank_1.pd

Deep Blue Documents at the University of Michigan

Streamroller : A Unified Compilation and Synthesis System for Streaming Applications.

Author: Kudlur Manjunath V.
Publication venue
Publication date
Field of study

The growing complexity of applications has increased the need for higher processing power. In the embedded domain, the convergence of audio, video, and networking on a handheld device has prompted the need for low cost, low power,and high performance implementations of these applications in the form of custom hardware. In a more mainstream domain like gaming consoles, the move towards more realism in physics simulations and graphics has forced the industry towards multicore systems. Many of the applications in these domains are streaming in nature. The key challenge is to get efficient implementations of custom hardware from these applications and map these applications efficiently onto multicore architectures. This dissertation presents a unified methodology, referred to as Streamroller, that can be applied for the problem of scheduling stream programs to multicore architectures and to the problem of automatic synthesis of custom hardware for stream applications. Firstly, a method called stream-graph modulo scheduling is presented, which maps stream programs effectively onto a multicore architecture. Many aspects of a real system, like limited memory and explicit DMAs are modeled in the scheduler. The scheduler is evaluated for a set of stream programs on IBM's Cell processor. Secondly, an automated high-level synthesis system for creating custom hardware for stream applications is presented. The template for the custom hardware is a pipeline of accelerators. The synthesis involves designing loop accelerators for individual kernels, instantiating buffers to store data passed between kernels, and linking these building blocks to form a pipeline. A unique aspect of this system is the use of multifunction accelerators, which improves cost by efficiently sharing hardware between multiple kernels. Finally, a method to improve the integer linear program formulations used in the schedulers that exploits symmetry in the solution space is presented. Symmetry-breaking constraints are added to the formulation, and the performance of the solver is evaluated.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61662/1/kvman_1.pd

Deep Blue Documents at the University of Michigan

Synthesis of Multimode digital signal processing systems

Author: Andriamisaina Caaliph
Casseau Emmanuel
Coussy Philippe
Publication venue: 'Informa UK Limited'
Publication date: 01/01/2007
Field of study

International audienceIn this paper, we propose a design methodology for implementing a multimode (or multi-configuration) and multi-throughput system into a single hardware architecture. The inputs of the design flow are the data flow graphs (DFGs), representing the different modes (i.e. the different applications to be implemented), with their respective throughput constraints. While traditional approaches merge DFGs together before the synthesis process, we propose to use ad-hoc scheduling and binding steps during the synthesis of each DFG. The scheduling, which assigns operations to specific time steps, maximizes the similarity between the control steps and thus decreases the controller complexity. The binding process, which assigns operations to specific functional units and data to specific storage elements, maximizes the similarity between datapaths and thus minimizes steering logic and register overhead. First results show the interest of the proposed synthesis flow

HAL-CentraleSupelec

Crossref

INRIA a CCSD electronic archive server

HAL-Université de Bretagne Occidentale

HAL-Rennes 1

An Updated Survey of Efficient Hardware Architectures for Accelerating Deep Convolutional Neural Networks

Author: Bussolino Beatrice
Capra Maurizio
Marchisio Alberto
Martina Maurizio
Masera Guido
Shafique Muhammad
Publication venue: 'MDPI AG'
Publication date
Field of study

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Dadu-RBD: Robot Rigid Body Dynamics Accelerator with Multifunctional Pipelines

Author: Chen Xiaoming
Han Yinhe
Yang Yuxin
Publication venue
Publication date: 28/09/2023
Field of study

Rigid body dynamics is a key technology in the robotics field. In trajectory optimization and model predictive control algorithms, there are usually a large number of rigid body dynamics computing tasks. Using CPUs to process these tasks consumes a lot of time, which will affect the real-time performance of robots. To this end, we propose a multifunctional robot rigid body dynamics accelerator, named RBDCore, to address the performance bottleneck. By analyzing different functions commonly used in robot dynamics calculations, we summarize their reuse relationship and optimize them according to the hardware. Based on this, RBDCore can fully reuse common hardware modules when processing different computing tasks. By dynamically switching the dataflow path, RBDCore can accelerate various dynamics functions without reconfiguring the hardware. We design Structure-Adaptive Pipelines for RBDCore, which can greatly improve the throughput of the accelerator. Robots with different structures and parameters can be optimized specifically. Compared with the state-of-the-art CPU, GPU dynamics libraries and FPGA accelerator, RBDCore can significantly improve the performance

arXiv.org e-Print Archive

Performance Modeling of Virtualized Custom Logic Computations

Author: Chamberlain Roger D
Hall Michael J
Publication venue: Washington University Open Scholarship
Publication date: 01/01/2014
Field of study

Virtualization of custom logic computations (i.e., by sharing a fixed function across distinct data streams) provides a means of reusing hardware resources, particularly when resources are limited. This is common practice in traditional processors where more than one user can share processor resources. In this paper, we virtualize a custom logic block using C-slow techniques to support fine-grain context-switching. We then develop and present an analytic model for several performance measures (throughput, latency, input queue occupancy) for both fine-grained and coarse-grained context switching (to a secondary memory). Next, we calibrate the analytic performance model with empirical measurements. We then validate the model via discrete-event simulation and use the model to predict the performance and develop optimal schedules for virtualized logic computations. We present results for a Taylor series expansion of a cosine function with added feedback and an AES encryption cipher

Crossref

Washington University St. Louis: Open Scholarship

Digital signal processor fundamentals and system design

Author: Angoletta M E
Publication venue: CERN
Publication date: 01/01/2008
Field of study

Digital Signal Processors (DSPs) have been used in accelerator systems for more than fifteen years and have largely contributed to the evolution towards digital technology of many accelerator systems, such as machine protection, diagnostics and control of beams, power supply and motors. This paper aims at familiarising the reader with DSP fundamentals, namely DSP characteristics and processing development. Several DSP examples are given, in particular on Texas Instruments DSPs, as they are used in the DSP laboratory companion of the lectures this paper is based upon. The typical system design flow is described; common difficulties, problems and choices faced by DSP developers are outlined; and hints are given on the best solution

CERN Document Server

Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs

Author: Hartwig Anzt
Jack Dongarra
Jakub Kurzak
Mark Gates
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

Low power processor architecture and multicore approach for embedded systems

Author: Otani Sugako
大谷寿賀子
Publication venue
Publication date: 28/09/2015
Field of study

13301甲第4319号博士（工学）金沢大学博士論文本文Full 以下に掲載：1.IEICE Transactions Vol. E98-C(7) pp.544-549 2015. IEICE. 共著者： S. Otani, H. Kondo. /2.Reuse 許可エビデンス送

Kanazawa University Repository for Academic Resources