689 research outputs found

FASTCUDA: Open Source FPGA Accelerator & Hardware-Software Codesign Toolset for CUDA Kernels

Author: de la Torre E.
Lavagno L.
Lazarescu M.
Mavroidis I.
Papaefstathiou I.
Papaefstathiou Ioannis (http://users.isc.tuc.gr/~ipapaefstathiou)
Schafer F.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2012

Using FPGAs as hardware accelerators that communicate with a central CPU is becoming common practice in the embedded design world, but there is not yet a standard methodology and toolset to facilitate this path. On the other hand, languages such as CUDA and OpenCL provide standard development environments for Graphics Processing Unit (GPU) programming. FASTCUDA is a platform that provides the necessary software toolset, hardware architecture, and design methodology to efficiently adapt the CUDA approach into a new FPGA design flow. With FASTCUDA, the CUDA kernels of a CUDA-based application are partitioned into two groups with minimal user intervention: those that are compiled and executed in parallel software, and those that are synthesized and implemented in hardware. A modern low-power FPGA can provide the processing power (via numerous embedded micro-CPUs) and the logic capacity for both the software and hardware implementations of the CUDA kernels. This paper describes the system requirements and the architectural decisions behind the FASTCUDA approach.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Institutional Repository of the Technical University of Crete

FPGA Based Embedded Multiprocessor Architecture

Author: Bhoyar C.N.
Jadhav Sumedh. S.
Publication venue: Institute for Project Management Pvt. Ltd
Publication date: 31/08/2020

Multiprocessors are a typical subject within the field of computer architecture. A new methodology based on practical sessions with real devices and designs is proposed. Embedded multiprocessor design presents challenges and opportunities that stem from coarse task granularity and the large number of inputs and outputs for each task. We have therefore designed a new architecture called embedded concurrent computing (ECC), which is implemented on an FPGA chip using VHDL. The design methodology is expected to allow scalable embedded multiprocessors for system expansion. In recent decades, two forces have driven the increase in processor performance: advances in very large-scale integration (VLSI) technology and microarchitectural enhancements. Therefore, we aim to design the full architecture of a realistic embedded processor that performs arithmetic, logical, shifting, and branching operations. The embedded system will be synthesized and evaluated in the Xilinx environment. Processor performance is expected to improve through clock speed increases and the exploitation of instruction-level parallelism. The embedded multiprocessor will be designed in the Xilinx or ModelSim environment.

Interscience Research Network

A Fuzzy Logic Reconfiguration Engine for Symmetric Chip Multiprocessors

Author: McDonald-Maier Klaus D
Qadri Muhammad Yasir
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 15/04/2010

Recent developments in reconfigurable multiprocessor systems on chip (MPSoC) have offered system designers a great deal of flexibility to exploit task concurrency with higher throughput and lower energy consumption. This paper presents a novel fuzzy logic reconfiguration engine (FLRE) for coarse-grain MPSoC reconfiguration that helps identify an optimum balance between the power and performance of the system. The FLRE is composed of two abstraction layers. The system selects an optimal configuration of Level 1 / Level 2 cache size and associativity, processor operating frequency and voltage, and the number of cores, based on miss rate, energy, and throughput information of the system at both the core and SoC level. An 8-core symmetric chip multiprocessor has been used to evaluate the proposed scheme. The results show an overall decrease in energy consumption with a decrease in throughput of not more than 30%.

University of Essex Research Repository

Crossref

Automatic parallelization of sequential specifications for symmetric MPSoCs

Author: A. Tumeo
D. Sciuto
F. Ferrandi
G. Palermo
L. Fossati
M. Lattuada
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2007

Archivio istituzionale della ricerca - Politecnico di Milano

Implementation of Asymmetric Multiprocessing Support in a Real-Time Operating System

Author: Elamin Elsheikh Islam Abdalla
Publication venue: Whiting & Birch, Ltd.
Publication date: 01/04/2016

The semiconductor industry can no longer afford to rely on decreasing the size of the die and increasing the frequency of operation to achieve higher performance. An alternative that has been proven to increase performance is multiprocessing. Multiprocessing refers to the concept of running more than one application or task on more than one central processor. Multi-core processors are the main engine of multiprocessing. In asymmetric multiprocessing, each core in a multi-core system is independent and has its own code that determines its execution. These cores must be able to communicate and synchronize access to resources.

UTPedia

Design and resource management of reconfigurable multiprocessors for data-parallel applications

Author: Wang Xiaofang
Publication venue: Digital Commons @ NJIT
Publication date: 31/01/2006

FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors in accelerating a wide range of computation-intensive applications. Most often they are Application Specific Programmable Circuits (ASPCs), which are developer programmable instead of user programmable. The major disadvantages of ASPCs are minimal programmability and significant time and energy overheads caused by required hardware reconfiguration when the problem size exceeds the available reconfigurable resources; these problems are expected to become more serious with increases in the FPGA chip size. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems. This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance, low-cost computing. It also proposes a relevant resource management framework that deals with performance, power consumption and energy issues. These semi-customized systems significantly reduce run-time device reconfiguration by employing user-programmable processing elements that are reusable for different tasks in large, complex applications. For the sake of illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine.
In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages to our data-parallel applications when compared to ASPCs; they are simpler and more scalable, and have less verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are shown for the two MPoPCs. The performance is evaluated with large sparse real-world matrices primarily from power engineering. We expect even further performance gains on MPoPCs in the near future by employing ever-improving FPGAs. The innovative nature of this work has the potential to guide research in this emerging field of high-performance, low-cost reconfigurable computing. The largest advantage of reconfigurable logic lies in its large degree of hardware customization and reconfiguration, which allows reusing the resources to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, like HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performance-energy objectives are proposed. A system-level energy model for HERA, which is based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton's method is proposed and employed to verify the methodology.
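
For readers unfamiliar with one of the benchmark algorithms named in the abstract above, LU factorization can be sketched as follows. This is a minimal sequential Doolittle variant without pivoting, written only for illustration; it is not the parallel MPoPC solution from the thesis, and the function name `lu_decompose` and the sample matrix are assumptions introduced here.

```python
def lu_decompose(a):
    """Doolittle LU factorization of a square matrix (no pivoting).

    Returns (lower, upper) with a unit diagonal in `lower`, so that
    lower @ upper reproduces `a`. Assumes every pivot upper[i][i] is nonzero.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Row i of U: subtract the contributions of the rows already computed.
        for j in range(i, n):
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        # Column i of L: same subtraction, then divide by the pivot.
        for j in range(i + 1, n):
            partial = sum(lower[j][k] * upper[k][i] for k in range(i))
            lower[j][i] = (a[j][i] - partial) / upper[i][i]
    return lower, upper


# Hypothetical 2x2 input, chosen only to keep the arithmetic easy to follow.
lower, upper = lu_decompose([[4.0, 3.0], [6.0, 3.0]])
```

The inner sums for different `j` are independent of one another, which is the kind of data parallelism a SIMD or MIMD machine could exploit; sparse real-world matrices would additionally need pivoting and a sparse storage format, both omitted here.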

Digital Commons @ New Jersey Institute of Technology (NJIT)

Scalable Interactive Volume Rendering Using Off-the-shelf Components

Author: Breen David
Heirich Alan
Lombeyda Santiago
Moll Laurent
Shand Mark
Publication venue: California Institute of Technology Library
Publication date: 01/01/2001

This paper describes an application of a second generation implementation of the Sepia architecture (Sepia-2) to interactive volumetric visualization of large rectilinear scalar fields. By employing pipelined associative blending operators in a sort-last configuration, a demonstration system with 8 rendering computers sustains 24 to 28 frames per second while interactively rendering large data volumes (1024x256x256 voxels, and 512x512x512 voxels). We believe interactive performance at these frame rates and data sizes is unprecedented. We also believe these results can be extended to other types of structured and unstructured grids and a variety of GL rendering techniques including surface rendering and shadow mapping. We show how to extend our single-stage crossbar demonstration system to multi-stage networks in order to support much larger data sizes and higher image resolutions. This requires solving a dynamic mapping problem for a class of blending operators that includes Porter-Duff compositing operators.
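
The abstract above relies on blending operators being associative so that partial images can be combined in a pipeline. A minimal sketch of why that works for the Porter-Duff "over" operator, assuming premultiplied-alpha (r, g, b, a) pixels; the function names and sample values are hypothetical and are not taken from the Sepia-2 system.

```python
from functools import reduce


def over(front, back):
    """Porter-Duff 'over' for premultiplied-alpha (r, g, b, a) tuples."""
    k = 1.0 - front[3]  # fraction of the back fragment that shows through
    return tuple(f + k * b for f, b in zip(front, back))


def composite(fragments):
    """Blend fragments that are already sorted front to back."""
    return reduce(over, fragments)


# Hypothetical fragments for one pixel from three renderers, front to back.
a = (0.5, 0.0, 0.0, 0.5)
b = (0.0, 0.25, 0.0, 0.25)
c = (0.0, 0.0, 1.0, 1.0)

# Grouping does not change the result, so neighbouring stages of a
# pipeline can pre-combine their partial images in any bracketing.
left = over(over(a, b), c)
right = over(a, over(b, c))
```

Associativity allows regrouping but not reordering: the fragments must still be presented in depth order, which is why sort-last systems with view-dependent ordering need the mapping step the abstract mentions.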

CiteSeerX

Caltech Authors

Partitioning and Mapping for the hArtes European Project

Author: A. Tumeo
D. Sciuto
F. Ferrandi
G. Palermo
L. Fossati
M. Lattuada
Publication venue
Publication date: 01/01/2007

Archivio istituzionale della ricerca - Politecnico di Milano