Search CORE

1,792 research outputs found

Design of a parallel vector access unit for SDRAM memory systems

Author: Carter John B.
Mathew Binu K.
Publication venue: University of Utah
Publication date: 01/01/1999
Field of study

Journal ArticleParallel Vector Access is a technique that exploits the regularity of vector or stream accesses to perform them efficiently in parallel on a multi-bank memory system. The performance of applications that have vector accesses may be improved using a memory controller that performs scatter/gather operations so that only the vector or stream elements that are accessed by the application are transmitted across the system bus. These scatter/gather operations can be speeded up by broadcasting vector operations to all banks of memory in parallel, each of which implements an algorithm to determine which elements of the requested vector they contain. This thesis presents the mathematical foundations behind one such algorithm for controller are investigated. The the performance of such a memory controller on vector kernels is studied by gate level simulation and the results analyzed. Because of the parallel approach, the PVA is able to load elements up to 32.8 times faster than a conventional memory system and 3.3 times faster than a pipelined vector unit, without hurting normal cache line fill performance

The University of Utah: J. Willard Marriott Digital Library

Command vector memory systems: high performance at low cost

Author: Corbal San Adrián Jesús
Espasa Sans Roger
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1998
Field of study

The focus of this paper is on designing both a low cost and high performance, high bandwidth vector memory system that takes advantage of modern commodity SDRAM memory chips. To successfully extract the full bandwidth from SDRAM parts, we propose a new memory system organization based on sending commands to the memory system as opposed to sending individual addresses. A command specifies, in a few bytes, a request for multiple independent memory words. A command is similar to a burst found in DRAM memories, but does not require the memory words to be consecutive. The command is sent to all sections of the memory array simultaneously, thus not requiring a crossbar in the proper sense. Our simulations show that this command based memory system can improve performance over a traditional SDRAM-based memory system by factors that range between 1.15 up to 1.54. Moreover, in many cases, the command memory system outperforms even the best SRAM memory system under consideration. Overall the command based memory system achieves similar or better results than a 10 ns SRAM memory system (a) using fewer banks and (b) using memory devices that are between 15 to 60 times cheaper.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

Memory controller for vector processor

Author: Ayguadé Parra Eduard
Cristal Kestelman Adrián
Hussain Tassadaq
Palomar Oscar
Unsal Osman Sabri
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018
Field of study

To manage power and memory wall affects, the HPC industry supports FPGA reconfigurable accelerators and vector processing cores for data-intensive scientific applications. FPGA based vector accelerators are used to increase the performance of high-performance application kernels. Adding more vector lanes does not affect the performance, if the processor/memory performance gap dominates. In addition if on/off-chip communication time becomes more critical than computation time, causes performance degradation. The system generates multiple delays due to application’s irregular data arrangement and complex scheduling scheme. Therefore, just like generic scalar processors, all sets of vector machine – vector supercomputers to vector microprocessors – are required to have data management and access units that improve the on/off-chip bandwidth and hide main memory latency. In this work, we propose an Advanced Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized on-chip memory, a memory manager in hardware, and multiple DRAM controllers. We implemented and validated the proposed system on an Altera DE4 FPGA board. The PVMC is also integrated with ARM Cortex-A9 processor on Xilinx Zynq All-Programmable System on Chip architecture. We compare the performance of a system with vector and scalar processors without PVMC. When compared with a baseline vector system, the results show that the PVMC system transfers data sets up to 1.40x to 2.12x faster, achieves between 2.01x to 4.53x of speedup for 10 applications and consumes 2.56 to 4.04 times less energy.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Digital.CSIC

AMC: Advanced Multi-accelerator Controller

Author: Ayguadé Parra Eduard
Gursal Shakaib A.
Haider Amna
Hussain Tassadaq
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015
Field of study

The rapid advancement, use of diverse architectural features and introduction of High Level Synthesis (HLS) tools in FPGA technology have enhanced the capacity of data-level parallelism on a chip. A generic FPGA based HLS multi-accelerator system requires a microprocessor (master core) that manages memory and schedules accelerators. In a real environment, such HLS multi-accelerator systems do not give a perfect performance due to memory bandwidth issues. Thus, a system demands a memory manager and a scheduler that improves performance by managing and scheduling the multi-accelerator’s memory access patterns efficiently. In this article, we propose the integration of an intelligent memory system and efficient scheduler in the HLS-based multi-accelerator environment called Advanced Multi-accelerator Controller (AMC). The AMC system is evaluated with memory intensive accelerators, High Performance Computing (HPC) applications and implemented and tested on a Xilinx Virtex-5 ML505 evaluation FPGA board. The performance of the system is compared against the microprocessor-based systems that have been integrated with the operating system. Results show that the AMC based HLS multi-accelerator system achieves 10.4x and 7x of speedup compared to the MicroBlaze and Intel Core based HLS multi-accelerator systems.Peer ReviewedPostprint (author’s final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Recommended from our members

The VolumePro Real-Time Ray-Casting System

Author: Hardenbergh Jan
Knittel Jim
Lauer Hugh
Pfister Hanspeter
Seiler Larry
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/06/2010
Field of study

This paper describes VolumePro, the world’s first single-chip realtime volume rendering system for consumer PCs. VolumePro implements ray-casting with parallel slice-by-slice processing. Our discussion of the architecture focuses mainly on the rendering pipeline and the memory organization. VolumePro has hardware for gradient estimation, classification, and per-sample Phong illumination. The system does not perform any pre-processing and makes parameter adjustments and changes to the volume data immediately visible. We describe several advanced features of VolumePro, such as gradient magnitude modulation of opacity and illumination, supersampling, cropping and cut planes. The system renders 500 million interpolated, Phong illuminated, composited samples per second. This is sufficient to render volumes with up to 16 million voxels (e.g., 2563) at 30 frames per second.Engineering and Applied Science

Harvard University - DASH

OPTIMIZED ARCHITECTURE DESIGN AND IMPLEMENTATION OF OBJECT TRACKING ALGORITHM ON FPGA

Author: ABDELFATAH EL-KHATIB LINA NOAMAN
Publication venue
Publication date: 01/01/2013
Field of study

FPGA based Object tracking implementation is one of the most recent video surveillance applications in embedded systems. In general, FPGA implementation is more efficient than general purpose computers in attaining high throughput due to its parallelism and execution speed. The system need to be designed on a standard frame rate in such a way to achieve optimal performance in real time environment. Optimal design of a system is dependent on minimizing the cost, area (device utility) and power while achieving the required speed. Past research work that investigated object tracking systems' implementation on FPGA achieved a significantly high throughput but have shown high device utilization. This research work aims at optimizing the device utilization under real time constraints. The Adaptive Hybrid Difference algorithm (AHD), which is used to detect the moving objects, was chosen to be implemented on FPGA due to its computation ability and efficiency with regard to hardware implementation. AHD can work at various lighting conditions automatically by determining the adaptive threshold in every period of time

UTPedia

FPGA Implementation of Convolutional Neural Networks with Fixed-Point Calculations

Author: Kalinin Alexandr A.
Kustov Alexander G.
Ruhlov Vladimir S.
Solovyev Roman A.
Telpukhov Dmitry V.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 29/08/2018
Field of study

Neural network-based methods for image processing are becoming widely used in practical applications. Modern neural networks are computationally expensive and require specialized hardware, such as graphics processing units. Since such hardware is not always available in real life applications, there is a compelling need for the design of neural networks for mobile devices. Mobile neural networks typically have reduced number of parameters and require a relatively small number of arithmetic operations. However, they usually still are executed at the software level and use floating-point calculations. The use of mobile networks without further optimization may not provide sufficient performance when high processing speed is required, for example, in real-time video processing (30 frames per second). In this study, we suggest optimizations to speed up computations in order to efficiently use already trained neural networks on a mobile device. Specifically, we propose an approach for speeding up neural networks by moving computation from software to hardware and by using fixed-point calculations instead of floating-point. We propose a number of methods for neural network architecture design to improve the performance with fixed-point calculations. We also show an example of how existing datasets can be modified and adapted for the recognition task in hand. Finally, we present the design and the implementation of a floating-point gate array-based device to solve the practical problem of real-time handwritten digit classification from mobile camera video feed

arXiv.org e-Print Archive

Crossref

A Memory Controller for FPGA Applications

Author: Hunter Bryan Jacob
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/08/2012
Field of study

As designers and researchers strive to achieve higher performance, field-programmable gate arrays (FPGAs) become an increasingly attractive solution. As coprocessors, FPGAs can provide application specific acceleration that cannot be matched by modern processors. Most of these applications will make use of large data sets, so achieving acceleration will require a capable interface to this data. The research in this thesis describes the design of a memory controller that is both efficient and flexible for FPGA applications requiring floating point operations. In particular, the benefits of certain design choices are explored, including: scalability, memory caching, and configurable precision. Results are given to prove the controller\u27s effectiveness and to compare various design trade-offs

University of Tennessee, Knoxville: Trace

The development of a node for a hardware reconfigurable parallel processor

Author: Van Schaik Carl Frans
Publication venue: Department of Electrical Engineering
Publication date: 01/01/2002
Field of study

This dissertation concerns the design and implementation of a node for a hardware reconfigurable parallel processor. The hardware that was developed allows for the further development of a parallel processor with configurable hardware acceleration. Each node in the system has a standard microprocessor and reconfigurable logic device and has high speed communications channels for inter-node communication. The design of the node provided high-speed serial communications channels allowing the implementation of various network topographies. The node also provided a PCI master interface to provide an external interface and communicate with local nodes on the bus. A high speed RlSC processor provided communication and system control functions and the reconfigurable logic device provided communication interfaces and data processing functions. The node was designed and implemented as a PCI card that interfaced a standard PCI bus. VHDL designs for logic devices that provided system support were developed, VHDL designs for the reconfigurable logic FPGA and software including drivers and system software were written for the node. The 64-bit version Linux operating system was then ported to the processor providing a UNIX environment for the system. The node functioned as specified and parallel and hardware accelerated processing was demonstrated. The hardware acceleration was shown to provide substantial performance benefits for the system

Cape Town University OpenUCT