
153 research outputs found

A comparison of data prefetching on an access decoupled and superscalar machine

Author: Jones G.P.
Topham N.P.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1997

In this paper we investigate the behavior of data prefetching on an access decoupled machine and a superscalar machine. We assess whether there are benefits to using the decoupling paradigm, given that an out-of-order (o-o-o) superscalar architecture could in principle prefetch to the same degree as an access decoupled machine. We have found that for large issue widths the access decoupled machine can hide memory latency more effectively than a single instruction window o-o-o superscalar architecture. Our findings also demonstrate that an access decoupled machine offers the benefit of reducing the complexity of window issue logic.

1 Introduction

The future of high performance microprocessor design is to provide improved performance by extracting higher degrees of instruction level parallelism. In superscalar architectures parallelism is exploited by reordering instructions within an instruction window and issuing multiple independent instructions per cycle. However as processor speeds increa..

CiteSeerX

Crossref

Edinburgh Research Explorer

Limits of a decoupled out-of-order superscalar architecture

Author: Jones Graham P.
Publication venue: The University of Edinburgh
Publication date: 01/01/1999

Edinburgh Research Archive

A case for merging the ILP and DLP paradigms

Author: Espasa Sans Roger
Quintana Rodríguez Francisca
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1998

The goal of this paper is to show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single architecture to execute vectorizable code at a performance level that cannot be achieved using either paradigm on its own. We will show that the combination of the two techniques yields very high performance at low cost and low complexity. We will show that this architecture can reach a performance equivalent to that of a superscalar processor sustaining 10 instructions per cycle. We will see that the machine exploiting both types of parallelism improves upon the ILP-only machine by factors of 1.5-1.8. We also present a study of the scalability of both paradigms and show that, when we increase resources to reach a 16-issue machine, the advantage of the ILP+DLP machine over the ILP-only machine increases to factors of 2.0-3.45. While the peak IPC achieved by the ILP machine is 4, the ILP+DLP machine exceeds 10 instructions per cycle.

Peer Reviewed. Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC

Cost-effective compiler directed memory prefetching and bypassing

Author: Ayguadé Parra Eduard
Baer Jean-Loup
Ortega Fernández Daniel
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2002

Ever-increasing memory latencies and deeper pipelines push memory farther from the processor. Prefetching techniques aim to bridge these two gaps by fetching data in advance into both the L1 cache and the register file. Our main contribution in this paper is a hybrid approach to the prefetching problem that combines software and hardware prefetching in a cost-effective way, needing very little hardware support and minimally impacting the design of the processor pipeline. The prefetcher is built on top of a static memory instruction bypassing mechanism, which is in charge of bringing prefetched values into the register file. In this paper we also present a thorough analysis of the limits of both prefetching and memory instruction bypassing. We also compare our prefetching technique with a prior speculative proposal that attacked the same problem, and we show that, at much lower cost, our hybrid solution is better than a realistic implementation of speculative prefetching and bypassing. On average, our hybrid implementation achieves a 13% speed-up over a version with software prefetching in a subset of numerical applications, and an average of 43% over a version with no software prefetching (achieving up to 102% for specific benchmarks).

Peer Reviewed. Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC

Instruction fetch architectures and code layout optimizations

Author: Larriba Pey Josep
Ramírez Bellido Alejandro
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2001

The design of higher performance processors has been following two major trends: increasing the pipeline depth to allow faster clock rates, and widening the pipeline to allow parallel execution of more instructions. Designing a higher performance processor implies balancing all the pipeline stages to ensure that overall performance is not dominated by any of them. This means that a faster execution engine also requires a faster fetch engine, to ensure that it is possible to read and decode enough instructions to keep the pipeline full and the functional units busy. This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors to the more aggressive wide-issue superscalars. We describe the different fetch engines proposed in the literature, the performance issues involved, and some of the proposed improvements. We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described. Overall, we show how instruction fetch has evolved from fetching one instruction every few cycles, to fetching one instruction per cycle, to fetching a full basic block per cycle, to several basic blocks per cycle: the evolution of the mechanism surrounding the instruction cache, and the different compiler optimizations used to better employ these mechanisms.

Peer Reviewed. Postprint (published version)

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Quantifying the benefits of SPECint distant parallelism in simultaneous multithreading architectures

Author: Ayguadé Parra Eduard
Krishnan Venkata
Martel Pérez Iván
Ortega Fernández Daniel
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1999

We exploit the existence of distant parallelism that future compilers could detect, and characterise its performance under simultaneous multithreading architectures. By distant parallelism we mean parallelism that cannot be captured by the processor instruction window and that can produce threads suitable for parallel execution in a multithreaded processor. We show that distant parallelism can make wider-issue processors feasible by providing more instructions from the distant threads, thus better exploiting the processor's resources when speeding up single integer applications. We also investigate the necessity of out-of-order processors in the presence of multiple threads of the same program. It is important to note that the benefits described are totally orthogonal to any other architectural techniques targeting a single thread.

Peer Reviewed. Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC

Recommended from our members

Energy-aware embedded media processing: customizable memory subsystems and energy management policies

Author: Ramachandran Anand
Publication date: 01/01/2004

The design of energy-efficient data memory architectures for embedded system platforms has received considerable attention in recent years. In this dissertation we propose a special-purpose data memory subsystem, called Xtream-Fit, targeted to streaming media applications executing on both generic uniprocessor embedded platforms and powerful SMT-based multi-threading platforms. We empirically demonstrate that Xtream-Fit achieves high energy-delay efficiency across a wide range of media devices, from systems running a single media application to systems concurrently executing multiple media applications under synchronization constraints. Xtream-Fit’s energy efficiency is predicated on a novel task-based execution model that exposes and enhances opportunities for efficient prefetching, and on aggressive dynamic energy conservation techniques targeting on-chip and off-chip memory components. A key novelty of Xtream-Fit is that it exposes a single customization parameter, thus enabling a very simple yet effective design space exploration methodology to find the best memory configuration for the target application(s). Extensive experimental results show that Xtream-Fit reduces energy-delay product substantially, by 32% to 69%, as compared to ‘standard’ general-purpose memory subsystems enhanced with state-of-the-art cache decay and SDRAM power mode control policies.

Electrical and Computer Engineering

Texas ScholarWorks

Fred: an architecture for a self-timed decoupled computer

Author: Brunvand Erik
Richardson William F.
Publication venue: University of Utah
Publication date: 01/01/1995

Journal Article

Decoupled computer architectures provide an effective means of exploiting instruction level parallelism. Self-timed micropipeline systems are inherently decoupled due to the elastic nature of the basic FIFO structure, and may be ideally suited for constructing decoupled computer architectures. Fred is a self-timed, decoupled, pipelined computer architecture based on micropipelines. We present the architecture of Fred, with specific details on a micropipelined implementation that includes support for multiple functional units and out-of-order instruction completion due to the self-timed decoupling.

The University of Utah: J. Willard Marriott Digital Library