Search CORE

4,996 research outputs found

Compiler-managed memory system for software-exposed architectures

Author: Barua Rajeev K. (Rajeev Kumar)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2000
Field of study

Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2000.Includes bibliographical references (p. 155-161).Microprocessors must exploit both instruction-level parallelism (ILP) and memory parallelism for high performance. Sophisticated techniques for ILP have boosted the ability of modern-day microprocessors to exploit ILP when available. Unfortunately, improvements in memory parallelism in microprocessors have lagged behind. This thesis explains why memory parallelism is hard to exploit in microprocessors and advocate bank-exposed architectures as an effective way to exploit more memory parallelism. Bank exposed architectures are a kind of software-exposed architecture: one in which the low level details of the hardware are visible to the software. In a bank-exposed architecture, the memory banks are visible to the software, enabling the compiler to exploit a high degree of memory parallelism in addition to ILP. Bank-exposed architectures can be employed by general-purpose processors, and by embedded chips, such as those used for digital-signal processing. This thesis presents Maps, an enabling compiler technology for bank-exposed architectures. Maps solves the problem of bank-disambiguation, i.e., how to distribute data in sequential programs among several banks to best exploit memory parallelism, while retaining the ability to disambiguate each data reference to a particular bank. Two methods for bank disambiguation are presented: equivalence-class unification and modulo unrolling. Taking a sequential program as input, a bank-disambiguation method produces two outputs: first, a distribution of each program object among the memory banks; and second, a bank number for every reference that can be proven to access a single, known bank for that data distribution. Finally, the thesis shows why non-disambiguated accesses are sometimes desirable. Dependences between disambiguated and non-disambiguated accesses are enforced through explicit synchronization and software serial ordering. The MIT Raw machine is an example of a software-exposed architecture. Raw exposes its ILP, memory and communication mechanisms. The Maps system has been implemented in the Raw compiler. Results on Raw using sequential codes demonstrate that using bank disambiguation in addition to ILP improves performance by a factor of 3 to 5 over using ILP alone.by Rajeev Barua.Ph.D

DSpace@MIT

Distributed data cache designs for clustered VLIW processors

Author: Gibert Codina Enric
González Colás Antonio María
Sánchez Jesús
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2005
Field of study

Wire delays are a major concern for current and forthcoming processors. One approach to deal with this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the L1 data cache typically remains centralized in What we call partially distributed architectures. However, as technology evolves, the relative latency of such a centralized cache will increase, leading to an important impact on performance. In this paper, we propose partitioning the L1 data cache among clusters for clustered VLIW processors. We refer to this kind of design as fully distributed processors. In particular; we propose and evaluate three different configurations: a snoop-based cache coherence scheme, a word-interleaved cache, and flexible LO-buffers managed by the compiler. For each alternative, instruction scheduling techniques targeted to cyclic code are developed. Results for the Mediabench suite'show that the performance of such fully distributed architectures is always better than the performance of a partially distributed one with the same amount of resources. In addition, the key aspects of each fully distributed configuration are explored.Peer ReviewedPostprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

An occam Style Communications System for UNIX Networks

Author: Vella Kevin J.
Publication venue: University of Kent, Computing Laboratory
Publication date: 01/12/1995
Field of study

This document describes the design of a communications system which provides occam style communications primitives under a Unix environment, using TCP/IP protocols, and any number of other protocols deemed suitable as underlying transport layers. The system will integrate with a low overhead scheduler/kernel without incurring significant costs to the execution of processes within the run time environment. A survey of relevant occam and occam3 features and related research is followed by a look at the Unix and TCP/IP facilities which determine our working constraints, and a description of the T9000 transputer's Virtual Channel Processor, which was instrumental in our formulation. Drawing from the information presented here, a design for the communications system is subsequently proposed. Finally, a preliminary investigation of methods for lightweight access control to shared resources in an environment which does not provide support for critical sections, semaphores, or busy waiting, is made. This is presented with relevance to mutual exclusion problems which arise within the proposed design. Future directions for the evolution of this project are discussed in conclusion

Kent Academic Repository

A Language and Hardware Independent Approach to Quantum-Classical Computing

Author: Chen Mengsu
Dumitrescu Eugene F.
Feng Wu-chun
Humble Travis S.
Liakh Dmitry
McCaskey Alexander J.
Publication venue
Publication date: 01/01/2018
Field of study

Heterogeneous high-performance computing (HPC) systems offer novel architectures which accelerate specific workloads through judicious use of specialized coprocessors. A promising architectural approach for future scientific computations is provided by heterogeneous HPC systems integrating quantum processing units (QPUs). To this end, we present XACC (eXtreme-scale ACCelerator) --- a programming model and software framework that enables quantum acceleration within standard or HPC software workflows. XACC follows a coprocessor machine model that is independent of the underlying quantum computing hardware, thereby enabling quantum programs to be defined and executed on a variety of QPUs types through a unified application programming interface. Moreover, XACC defines a polymorphic low-level intermediate representation, and an extensible compiler frontend that enables language independent quantum programming, thus promoting integration and interoperability across the quantum programming landscape. In this work we define the software architecture enabling our hardware and language independent approach, and demonstrate its usefulness across a range of quantum computing models through illustrative examples involving the compilation and execution of gate and annealing-based quantum programs

arXiv.org e-Print Archive

Directory of Open Access Journals

PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

Author: Ahmed Fasih
Andreas Klöckner
Bell
Bryan Catanzaro
Buck
Chandler
Dalcín
Eich
Feldman
Flanagan
Frigo
Group
Hestenes
Hesthaven
Kennedy
Klöckner
Lam
Langtangen
Lindholm
McCarthy
McCool
Nicolas Pinto
Oliphant
Owens
Paul Ivanov
Pinto
Pinto
Prud’homme
Reynders
Seiler
Stein
Valiant
van Hateren
Veldhuizen
Wang
Whaley
Yunsup Lee
Publication venue: 'Elsevier BV'
Publication date: 29/03/2011
Field of study

High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie

arXiv.org e-Print Archive

Crossref

Coarse-grained reconfigurable array architectures

Author: A Lambrechts
B Bougard
B Bougard
B Mei
B Mei
B Mei
B Sutter De
G Venkataramani
H Park
H Park
J Lee
JMP Cardoso
JW Waerdt van de
K Berkel van
K Bondalapati
K Sankaralingam
KE Coons
LH Lee
M Ahn
M Gebhart
M Schlansker
M Taylor
M Woh
MD Galanis
MH Lee
S Friedman
SA Mahlke
T Oh
Y Kim
Y Kim
Y Kim
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010
Field of study

Coarse-Grained Reconﬁgurable Array (CGRA) architectures accelerate the same inner loops that beneﬁt from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efﬁciently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on ﬂexibility, performance, and power-efﬁciency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual ﬁne-tuning of source code

Crossref

Ghent University Academic Bibliography