Search CORE

397 research outputs found

Dataflow Computing with Polymorphic Registers

Author: Ciobanu Catalin
Gaydadjiev Georgi N.
Pilato Christian
Sciuto Donatella
Publication venue
Publication date: 01/01/2013
Field of study

Heterogeneous systems are becoming increasingly popular for data processing. They improve performance of simple kernels applied to large amounts of data. However, sequential data loads may have negative impact. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high speed, parallel access to performance-critical data. Furthermore, by PRF customization, specific data path features are exposed to the programmer in a very convenient way. PRFs allow additional control over the registers dimensions, and the number of elements which can be simultaneously accessed by computational units. This paper shows how PRFs can be integrated in dataflow computational platforms. In particular, starting from an annotated source code, we present a compiler-based methodology that automatically generates the customized PRFs and the enhanced computational kernels that efficiently exploit them

Archivio istituzionale della ricerca - Politecnico di Milano

Chalmers Research

Chalmers Publication Library

The Case for Polymorphic Registers in Dataflow Computing

Author: Ciobanu Cătălin Bogdan
Gaydadjiev Georgi
Pilato Christian
Sciuto Donatella
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018
Field of study

Heterogeneous systems are becoming increasingly popular, delivering high performance through hardware specialization. However, sequential data accesses may have a negative impact on performance. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. This article shows how PRFs can be integrated into dataflow computational platforms. Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to efficiently exploit them. We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. We improve the throughput up to 56.17X and show that the PRF-augmented system outperforms the GPU for 9×9 or larger mask sizes, even in bandwidth-constrained systems

Archivio istituzionale della ricerca - Politecnico di Milano

UvA-DARE

GPGPU microbenchmarking for irregular application optimization

Author: Winans-Pruitt Dalton R.
Publication venue: Scholars Junction
Publication date: 09/08/2022
Field of study

Irregular applications, such as unstructured mesh operations, do not easily map onto the typical GPU programming paradigms endorsed by GPU manufacturers, which mostly focus on maximizing concurrency for latency hiding. In this work, we show how alternative techniques focused on latency amortization can be used to control overall latency while requiring less concurrency. We used a custom-built microbenchmarking framework to test several GPU kernels and show how the GPU behaves under relevant workloads. We demonstrate that coalescing is not required for efficacious performance; an uncoalesced access pattern can achieve high bandwidth - even over 80% of the theoretical global memory bandwidth in certain circumstances. We also make other further observations on specific relevant behaviors of GPUs. We hope that this study opens the door for further investigation into techniques that can exploit latency amortization when latency hiding does not achieve sufficient performance

Scholars Junction - Mississippi State University Institutional Repository

The Case for Polymorphic Registers in Dataflow Computing

Author: Ciobanu C.B.
Gaydadjiev G.
Pilato C.
Sciuto D.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/12/2018
Field of study

International Migration, Integration and Social Cohesion online publications

EIE: Efficient Inference Engine on Compressed Deep Neural Network

Author: Dally William J.
Han Song
Horowitz Mark A.
Liu Xingyu
Mao Huizi
Pedram Ardavan
Pu Jing
Publication venue
Publication date: 03/05/2016
Field of study

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving; Exploiting sparsity saves 10x; Weight sharing gives 8x; Skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102GOPS/s working directly on a compressed network, corresponding to 3TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.Comment: External Links: TheNextPlatform: http://goo.gl/f7qX0L ; O'Reilly: https://goo.gl/Id1HNT ; Hacker News: https://goo.gl/KM72SV ; Embedded-vision: http://goo.gl/joQNg8 ; Talk at NVIDIA GTC'16: http://goo.gl/6wJYvn ; Talk at Embedded Vision Summit: https://goo.gl/7abFNe ; Talk at Stanford University: https://goo.gl/6lwuer. Published as a conference paper in ISCA 201

arXiv.org e-Print Archive

Crossref

Non-power-of-Two FFTs: Exploring the Flexibility of the Montium TP

Author: Hauck S.A.
Smit Gerardus Johannes Maria
van de Burgwal M.D.
Wolkotte P.T.
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2009
Field of study

Coarse-grain reconfigurable architectures, like the Montium TP, have proven to be a very successful approach for low-power and high-performance computation of regular digital signal processing algorithms. This paper presents the implementation of a class of non-power-of-two FFTs to discover the limitations and Flexibility of the Montium TP for less regular algorithms. A non-power-of-two FFT is less regular compared to a traditional power-of-two FFT. The results of the implementation show the processing time, accuracy, energy consumption and Flexibility of the implementation

CiteSeerX

Crossref

Directory of Open Access Journals

University of Twente Research Information

Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design

Author: Ahmadi Agreen
Austin Todd
Behroozi Armand
Dreslinski Ronald
Kaszyk Kuba
Li Lu
Mahlke Scott
May Kyle
Morton John Magnus
Mudge Trevor
Nguyen Brandon
O'Boyle Michael F P
Sun Jiawen
Talati Nishil
Vasiladiotis Christos
Verma Tarunesh
Yang Yichen
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/04/2021
Field of study

Edinburgh Research Explorer