82 research outputs found

Dynamic Simultaneous Multithreaded Architecture

Author: Lee B.
Ortiz-Arroyo Daniel
Publication venue: International Society of Computers and Their Applications
Publication date: 01/01/2003

Trace-level reuse

Author: González Colás Antonio María
Molina Carlos
Tubella Murgadas Jordi
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/1999

Trace-level reuse is based on the observation that some traces (dynamic sequences of instructions) are frequently repeated during the execution of a program, and in many cases, the instructions that make up such traces have the same source operand values. The execution of such traces will produce the same outcome, and thus their execution can be skipped if the processor records the outcome of previous executions. This paper presents an analysis of the performance potential of trace-level reuse and discusses a preliminary realistic implementation. Like instruction-level reuse, trace-level reuse can improve performance by decreasing resource contention and the latency of some instructions. However, we show that trace-level reuse is more effective than instruction-level reuse because the former can avoid fetching the instructions of reused traces. This has two important benefits: it reduces the fetch bandwidth requirements, and it increases the effective instruction window size since these instructions do not occupy window entries. Moreover, trace-level reuse can compute the result of a chain of dependent instructions all at once, which may allow the processor to avoid the serialization caused by data dependences and thus to potentially exceed the dataflow limit. Peer Reviewed. Postprint (published version).

CiteSeerX

UPCommons. Portal del coneixement obert de la UPC
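
The trace-level reuse idea in the abstract above can be illustrated with a small software analogy. This is only a sketch under stated assumptions, not the hardware design evaluated in the paper: the class `TraceReuseTable`, the function `sample_trace`, and the choice to key entries on the trace's start address plus its live-in values are all hypothetical names and simplifications introduced here.

```python
from typing import Callable, Dict, Tuple

# Hypothetical key: (start PC of the trace, values of its live-in operands).
ReuseKey = Tuple[int, Tuple[int, ...]]
# Hypothetical entry: (live-out values produced by the trace, PC that follows it).
ReuseEntry = Tuple[Tuple[int, ...], int]


class TraceReuseTable:
    """Toy model of a reuse table that memoizes whole traces."""

    def __init__(self) -> None:
        self.table: Dict[ReuseKey, ReuseEntry] = {}
        self.hits = 0
        self.misses = 0

    def run(
        self,
        pc: int,
        live_in: Tuple[int, ...],
        execute: Callable[[Tuple[int, ...]], ReuseEntry],
    ) -> ReuseEntry:
        key = (pc, live_in)
        entry = self.table.get(key)
        if entry is not None:
            # Same trace, same source operand values: skip execution entirely.
            self.hits += 1
            return entry
        # First time with these inputs: execute and record the outcome.
        self.misses += 1
        entry = execute(live_in)
        self.table[key] = entry
        return entry


def sample_trace(live_in: Tuple[int, ...]) -> ReuseEntry:
    """A chain of dependent operations standing in for a dynamic trace."""
    a, b = live_in
    t = a + b
    t = t * 2
    t = t - a
    return (t,), 0x40  # one live-out value and an assumed next PC


table = TraceReuseTable()
first = table.run(0x10, (3, 4), sample_trace)   # executed and recorded
second = table.run(0x10, (3, 4), sample_trace)  # reused, no execution
third = table.run(0x10, (5, 4), sample_trace)   # different inputs, executed
```

On a hit, the whole dependent chain inside `sample_trace` is replaced by a single lookup, which loosely mirrors the abstract's point that a reused trace needs neither fetching nor serialized execution.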

Survey on Combinatorial Register Allocation and Instruction Scheduling

Author: Lozano Roberto Castañeda
Schulte Christian
Publication date: 01/01/2018

Register allocation (mapping variables to processor registers or memory) and instruction scheduling (reordering instructions to increase instruction-level parallelism) are essential tasks for generating efficient assembly code in a compiler. In the last three decades, combinatorial optimization has emerged as an alternative to traditional, heuristic algorithms for these two tasks. Combinatorial optimization approaches can deliver optimal solutions according to a model, can precisely capture trade-offs between conflicting decisions, and are more flexible, at the expense of increased compilation time. This paper provides an exhaustive literature review and a classification of combinatorial optimization approaches to register allocation and instruction scheduling, with a focus on the techniques that are most applied in this context: integer programming, constraint programming, partitioned Boolean quadratic programming, and enumeration. Researchers in compilers and combinatorial optimization can benefit from identifying developments, trends, and challenges in the area; compiler practitioners may discern opportunities and grasp the potential benefit of applying combinatorial optimization.

arXiv.org e-Print Archive

Publikationer från KTH

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Quantifying the benefits of SPECint distant parallelism in simultaneous multithreading architectures

Author: Ayguadé Parra Eduard
Krishnan Venkata
Martel Pérez Iván
Ortega Fernández Daniel
Valero Cortés Mateo
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/1999

We exploit the existence of distant parallelism that future compilers could detect and characterise its performance under simultaneous multithreading architectures. By distant parallelism we mean parallelism that cannot be captured by the processor instruction window and that can produce threads suitable for parallel execution in a multithreaded processor. We show that distant parallelism can make wider-issue processors feasible by providing more instructions from the distant threads, thus better exploiting the processor's resources when speeding up single integer applications. We also investigate the necessity of out-of-order processors in the presence of multiple threads of the same program. It is important to note that the benefits described are totally orthogonal to any other architectural techniques targeting a single thread. Peer Reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC

Dynamic Data Dependence Tracking and Its Application to Branch Prediction

Author: Albonesi David H.
Chen Lei
Dropsho Steven
Publication date: 29/11/2006

To continue to improve processor performance, microarchitects seek to increase the effective instruction-level parallelism (ILP) that can be exploited in applications. A fundamental limit to improving ILP is data dependences among instructions. If data dependence information is available at run-time, it can be used in many ways to improve ILP. Prior published examples include decoupled branch execution architectures and critical instruction detection. In this paper, we describe an efficient hardware mechanism to dynamically track the data dependence chains of the instructions in the pipeline. This information is available on a cycle-by-cycle basis to the microengine for optimizing its performance. We then use this design in a new value-based branch prediction design using Available Register Value Information (ARVI). By using data dependence information, the ARVI branch predictor achieves better prediction accuracy than a comparably sized hybrid branch predictor. With ARVI used as the second-level branch predictor, the improved prediction accuracy results in a 12.6% performance improvement on average across the SPEC95 integer benchmark suite.

Infoscience - École polytechnique fédérale de Lausanne

Toward a Core Design to Distribute an Execution on a Many-Core Processor

Author: A Cristal
A Nicolau
B Goossens
GS Tjaden
M Sharafeddine
RM Tomasulo
Publication venue: Springer Science and Business Media LLC
Publication date: 31/08/2015

International audience. This paper presents a parallel execution model and a many-core processor design to run C programs in parallel. The model automatically builds parallel sections of machine instructions from the run trace. It parallelizes instruction fetch, renaming, execution, and retirement. Predictor-based fetch is replaced by a fetch-decode-and-partly-execute stage able to compute most of the control instructions in order. Tomasulo's register renaming is extended to memory with a technique to match consumer/producer pairs. The Reorder Buffer is adapted to allow parallel retirement. The model is presented on a sum reduction example, which is also used to give a short analytical evaluation of the model's performance potential.

Crossref

HAL Descartes

Hal-Diderot

Aide de Camp: Asymmetric Dual Core Design for Power and Energy Reduction; CU-CS-964-03

Author: Ghiasi Soraya
Grunwald Dirk C
Publication venue: CU Scholar
Publication date: 01/05/2003

CU Scholar Institutional Repository

Hardware-only stream prediction + cache prefetching + dynamic access ordering

Author: Mckee Sally A.
Zhang Chengqiang
Publication venue: University of Utah
Publication date: 01/01/1999

Journal article. The speed gap between processors and memory systems is becoming the performance bottleneck for many applications, and computations with strided access patterns are among those that suffer most. The vectors used in such applications lack temporal and often spatial locality, and are usually too large to cache. In spite of their poor cache behavior, these access patterns have the advantage of being predictable, which can be exploited to improve the efficiency of the memory subsystem. As a promising technique to relieve the memory system bottleneck, prefetching has been studied in its various forms, as has dynamic memory scheduling. This study builds on these results, combining a stride-based reference prediction table, a mechanism that prefetches L2 cache lines, and a memory controller that dynamically schedules accesses to a Direct Rambus memory subsystem. We find that such a system delivers impressive speedups for scientific applications with regular access patterns (reducing execution time by almost a factor of two) without negatively affecting the performance of non-streaming programs.

The University of Utah: J. Willard Marriott Digital Library
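
The stride-based reference prediction table mentioned in the abstract above can be sketched in a few lines. This is a deliberately simplified software model, not the mechanism evaluated in the study: it covers only stride detection, omits the L2 prefetching and Direct Rambus scheduling parts, and the names `RPTEntry`, `ReferencePredictionTable`, and the "same non-zero stride seen twice" confirmation rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RPTEntry:
    """Per-instruction state: last address seen and last observed stride."""
    last_addr: int
    stride: int = 0
    confirmed: bool = False


class ReferencePredictionTable:
    """Toy stride predictor indexed by the PC of the load instruction."""

    def __init__(self) -> None:
        self.entries: Dict[int, RPTEntry] = {}

    def access(self, pc: int, addr: int) -> Optional[int]:
        """Record an access; return a predicted next address or None."""
        entry = self.entries.get(pc)
        if entry is None:
            # First access by this instruction: nothing to predict yet.
            self.entries[pc] = RPTEntry(last_addr=addr)
            return None
        stride = addr - entry.last_addr
        # Assumed rule: predict only once the same non-zero stride repeats.
        entry.confirmed = stride == entry.stride and stride != 0
        entry.stride = stride
        entry.last_addr = addr
        return addr + stride if entry.confirmed else None


rpt = ReferencePredictionTable()
# One load instruction (hypothetical PC 0x100) walking memory in 64-byte steps.
predictions: List[Optional[int]] = [
    rpt.access(0x100, addr) for addr in (1000, 1064, 1128, 1192)
]
```

In this model the first two accesses only train the entry; from the third access onward the table returns the next expected address, which a prefetcher could then request ahead of the demand access.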

Data speculative multithreaded architecture

Author: González Colás Antonio María
Marcuello Pedro
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/1998

We present a novel processor microarchitecture that relieves three of the most important bottlenecks of superscalar processors: the serialization imposed by true dependences, the relatively small window size, and the instruction fetch bandwidth. The new architecture simultaneously executes multiple threads of control obtained from a single program by means of control speculation techniques that require neither compiler/user support nor any special feature in the instruction set architecture. The multiple simultaneous threads execute different iterations of the same loop, which require the same fetch bandwidth as a single thread since they share the same code. Inter-thread dependences, as well as the values that flow through them, are speculated by means of data prediction techniques. The preliminary evaluation results show a significant speed-up when compared with a superscalar processor. In fact, the new processor architecture can achieve an IPC (instructions per cycle) rate even larger than the peak fetch bandwidth. Peer Reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC

The dynamic simultaneous multithreaded processor

Author: Ortiz-Arroyo Daniel
Publication venue: Oregon State University

This dissertation investigates diverse techniques to support multithreading in modern high performance processors. The mechanisms studied expand the architecture of a high performance superscalar processor to efficiently control the interaction between software-controlled and hardware-controlled multithreading. Additionally, dynamic speculative mechanisms are proposed to exploit thread-level parallelism (TLP) and instruction-level parallelism (ILP) on a Simultaneous Multithreading (SMT) architecture. First, the hybrid multithreaded execution model is discussed. This model combines software-controlled multithreading with hardware support for efficient context switching and thread scheduling. A thread scheduling technique called set scheduling is introduced and its impact on the overall performance is described. An analytical model of the hybrid multithreaded execution is developed and validated by simulation. Through stochastic simulation, we find that the application of the hybrid multithreaded execution model results in higher processor utilization than traditional software-controlled multithreading. Next, in the main part of this dissertation, a new architecture is proposed: the Dynamic Simultaneous Multithreading (DSMT) processor. In this architecture, multiple threads are identified and created speculatively at runtime without compiler help. Subsequently, an SMT processor core executes those threads. The performance of a DSMT processor was evaluated with a new execution-driven simulator developed specifically for the purpose. Our experimental results based on simulation show that the DSMT architecture has very good potential to improve an SMT processor's performance when there is only a single task available for execution.

ScholarsArchive@OSU