689 research outputs found

Parametric data-parallel architectures for TLM acceleration

Author: James Flint (1251738)
Vassilios Chouliaras (1251600)
Yibin Li (1526803)
Publication date: 01/01/2004

We discuss the architecture and microarchitecture of a scalable, parametric vector accelerator for the TLM algorithm. Architecture-level experimentation demonstrates an order-of-magnitude complexity reduction for vector lengths of 16 32-bit single-precision elements. We envisage the proposed architecture replicated in a SoC environment, thus forming a multiprocessor system capable of tapping parallelism at the thread level as well as the data level

Loughborough University Institutional Repository

VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

Author: David Stevens (4254274)
Vassilios Chouliaras (1251600)
Vincent Dwyer (1251447)
Publication date: 01/01/2016

We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads, along with extensive customization. It allows the instantiation of tightly coupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable Gate Arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation, VThreads achieves up to a x7.2 performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores, VThreads demonstrates a post-route (statistical) power reduction of between 65% and 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area makes the processor an attractive proposition for low-power, deeply embedded applications requiring minimum OS support

Loughborough University Institutional Repository

Reconfiguration for Fault Tolerance and Performance Analysis

Author: Kollmeier Harold Henry
Publication venue: ScholarlyCommons
Publication date: 01/11/1987

Architecture reconfiguration, the ability of a system to alter the active interconnection among modules, has a history of different purposes and strategies. Its purposes develop from the relatively simple desire to formalize procedures that all processes have in common to reconfiguration for the improvement of fault-tolerance, to reconfiguration for performance enhancement, either through the simple maximizing of system use or by sophisticated notions of wedding topology to the specific needs of a given process. Strategies range from straightforward redundancy by means of an identical backup system to intricate structures employing multistage interconnection networks. The present discussion surveys the more important contributions to developments in reconfigurable architecture. The strategy here is in a sense to approach the field from an historical perspective, with the goal of developing a more coherent theory of reconfiguration. First, the Turing and von Neumann machines are discussed from the perspective of system reconfiguration, and it is seen that this early important theoretical work contains little that anticipates reconfiguration. Then some early developments in reconfiguration are analyzed, including the work of Estrin and associates on the fixed plus variable restructurable computer system, the attempt to theorize about configurable computers by Miller and Cocke, and the work of Reddi and Feustel on their restructurable computer system. The discussion then focuses on the most sustained systems for fault tolerance and performance enhancement that have been proposed. An attempt will be made to define fault tolerance and to investigate some of the strategies used to achieve it. By investigating four different systems, the Tandem computer, the C.vmp system, the Extra Stage Cube, and the Gamma network, the move from dynamic redundancy to reconfiguration is observed. Then reconfiguration for performance enhancement is discussed.
A survey of some proposals is attempted, then the discussion focuses on the most sustained systems that have been proposed: PASM, the DC architecture, the Star local network, and the NYU Ultracomputer. The discussion is organized around a comparison of control, scheduling, communication, and network topology. Finally, comparisons are drawn between fault tolerance and performance enhancement, in order to clarify the notion of reconfiguration and to reveal the common ground of fault tolerance and performance enhancement as well as the areas in which they diverge. An attempt is made in the conclusion to derive from this survey and analysis some observations on the nature of reconfiguration, as well as some remarks on necessary further areas of research

CiteSeerX

ScholarlyCommons@Penn

Design of testbed and emulation tools

Author: Flynn M. J.
Lundstrom S. F.

The research summarized was concerned with the design of testbed and emulation tools suitable to assist in projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software-based set of hierarchical tools was chosen to provide maximum flexibility, to ease the move to new computers as technology improves, and to take advantage of the inherent reliability and availability of commercially available computing systems

NASA Technical Reports Server

Parallel machine architecture and compiler design facilities

Author: Kuck David J.
Padua David
Sameh Ahmed
Veidenbaum Alex
Yew Pen-Chung

The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of the Delta project (whose objective is to provide a facility for rapid prototyping of parallelized compilers that can target different machine architectures) is summarized. Included are surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role

NASA Technical Reports Server

Support for Programming Models in Network-on-Chip-based Many-core Systems

Author: Rasmussen Morten Sleth
Publication venue: Technical University of Denmark
Publication date: 01/09/2010

Online Research Database In Technology

Dynamic Scheduling, Allocation, and Compaction Scheme for Real-Time Tasks on FPGAs

Author: Tatineni Shobharani
Publication venue: LSU Digital Commons
Publication date: 01/01/2001

Run-time reconfiguration (RTR) is a method of computing on reconfigurable logic, typically FPGAs, changing hardware configurations from phase to phase of a computation at run-time. Recent research has expanded from a focus on a single application at a time to encompass a view of the reconfigurable logic as a resource shared among multiple applications or users. In real-time system design, task deadlines play an important role. Real-time multi-tasking systems not only need to support sharing of the resources in space, but also need to guarantee execution of the tasks. At the operating system level, sharing logic gates, wires, and I/O pins among multiple tasks needs to be managed. From the high level standpoint, access to the resources needs to be scheduled according to task deadlines. This thesis describes a task allocator for scheduling, placing, and compacting tasks on a shared FPGA under real-time constraints. Our consideration of task deadlines is novel in the setting of handling multiple simultaneous tasks in RTR. Software simulations have been conducted to evaluate the performance of the proposed scheme. The results indicate significant improvement by decreasing the number of tasks rejected

Louisiana State University

Heterogeneous vs Homogeneous MPSoC Approaches for a Mobile LTE Modem

Author: Benoit Pascal
Jalier Camille
Jerraya A.A.
Lattard Didier
Sassatelli Gilles
Torres Lionel
Publication venue: HAL CCSD
Publication date: 01/01/2010

Applications like 4G baseband modem require single-chip implementation to meet the integration and power consumption requirements. These applications demand a high computing performance with real-time constraints, low-power consumption and low cost. With the rapid evolution of telecom standards and the increasing demand for multi-standard products, the need for flexible baseband solutions is growing. The concept of Multi-Processor System-on-Chip (MPSoC) is well adapted to enable hardware reuse between products and between multiple wireless standards in the same device. Heterogeneous architectures are well known solutions but they have limited flexibility. Based on the experience of two heterogeneous Software Defined Radio (SDR) telecom chipsets, this paper presents the homoGENEous Processor arraY (GENEPY) platform for 4G applications. This platform is built with SMEP units interconnected with a Network-on-Chip. The SMEP, implemented in 65nm low-power CMOS, can perform 3.2 GMAC/s with 77 GBits/s internal bandwidth at 400MHz. Two implementations of homogeneous GENEPY are compared to the heterogeneous MAGALI platform in terms of silicon area, performance and power consumption. Results show that a homogeneous approach can be more efficient and flexible than a heterogeneous approach in the context of 4G Mobile Terminals

CiteSeerX

Crossref

HAL-CEA

Flexible and Distributed Real-Time Control on a 4G Telecom MPSoC

Author: Benoit Pascal
Jalier Camille
Lattard Didier
Sassatelli Gilles
Torres Lionel
Publication venue: HAL CCSD
Publication date: 07/07/2010

Applications like 4G baseband modem require single-chip implementation to meet the integration and power consumption requirements. These applications demand a high computing performance with real-time constraints, low-power consumption and low cost. With the rapid evolution of telecom standards and the increasing demand for multi-standard products, the need for flexible baseband solutions is growing. The concept of Multi-Processor System-on-Chip (MPSoC) is well adapted to enable hardware reuse between products and between multiple wireless standards in the same device. Based on the experience of two heterogeneous Software Defined Radio (SDR) telecom chipsets, this paper presents a distributed control architecture for the homoGENEous Processor arraY (GENEPY) platform for 4G applications. This MPSoC platform is built with telecom baseband processors interconnected with a Network-on-Chip. The control is performed by a MIPS processor embedded in each baseband processor. This control processor can locally reconfigure and schedule the applications with real-time telecom constraints

HAL-CEA

Definition of a Method for the Formulation of Problems to be Solved with High Performance Computing

Author: Peruri Ramya
Publication venue: DigitalCommons@Kennesaw State University
Publication date: 02/08/2016

The computational power made available by current technology has been continuously increasing; however, today's problems are larger and more complex and demand even more computational power. Interest in computational problems has also been increasing and is an important research area in computer science. These complex problems are solved with computational models that use an underlying mathematical model, rely on computer resources and simulation, and are run with High Performance Computing. For such computations, parallel computing has been employed to achieve high performance. This thesis identifies families of problems that can best be solved using modelling and implementation techniques of parallel computing such as message passing and shared memory. A few case studies are considered to show when the shared memory model is suitable and when the message passing model would be suitable. The models of parallel computing are implemented and evaluated using some algorithms and simulations. This thesis mainly focuses on showing the more suitable model of computing for the various scenarios in attaining High Performance Computing

DigitalCommons@Kennesaw State University