Search CORE

671 research outputs found

Reducing branch delay to zero in pipelined processors

Author: González Colás Antonio María
Llaberia Griñó José M.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1993
Field of study

A mechanism to reduce the cost of branches in pipelined processors is described and evaluated. It is based on the use of multiple prefetch, early computation of the target address, delayed branch, and parallel execution of branches. The implementation of this mechanism using a branch target instruction memory is described. An analytical model of the performance of this implementation makes it possible to measure the efficiency of the mechanism with a very low computational cost. The model is used to determine the size of cache lines that maximizes the processor performance, to compare the performance of the mechanism with that of other schemes, and to analyze the performance of the mechanism with two alternative cache organizations.Peer ReviewedPostprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Recommended from our members

The use of Petri nets for modeling pipelined processors

Author: Razouk Rami R.
Publication venue: eScholarship, University of California
Publication date: 01/01/1987
Field of study

This paper discusses the use of Petri Nets for modeling and analyzing pipelined processors. Petri Nets are particularly well-suited to modeling the synchronization, buffering, resource contention and delicate timing so common in pipelined processors. Tools for simulating, animating and analyzing the behavior of the models are described. The usefulness of the tools and the analysis methods they support in evaluating the performance and analyzing the detailed timing of pipelined microprocessors is illustrated through an example

eScholarship - University of California

On the Determinism of Multi-core Processors

Author: Baufreton Philippe
Monsuez Bruno
Paun Vladimir-Alexandru
Publication venue: OASIcs - OpenAccess Series in Informatics. 1st French Singaporean Workshop on Formal Methods and Applications (FSFMA 2013)
Publication date: 01/01/2013
Field of study

Hard real time systems are evolving in order to respond to the increasing demand in complex functionalities while taking advantage of newer hardware. Software development for safety critical systems has to comply with strict requirements that will facilitate the certification process. During this process, each part of the system is evaluated, requiring a certain level of assurance in order to provide confidence in the product. In particular there must be a level of confidence that the system behaves deterministically that may be based on functionality, resources and time. The success of system verification depends greatly on the capacity to determine its exact behavior. Nonetheless, hardware evolved in order to maximize the average computation power throughput with little to no regard to the deterministic aspect. Therefore modern architectural features of processors, like pipelines, cache memories and co-processors, make it hard to verify that all the needed properties are respected. The multi-core is furthermore difficult to analyze as the architecture employs mechanisms that compromise strong spatial and temporal partitioning when using shared resources without rigorous access control like shared caches or shared input/outputs. In this paper we identify and analyze the main sources of nondeterminism of the multi-cores with regard to the timing estimation. Precise determination of the worst case execution time is a challenging task even in single-core architectures. The problems are accentuated in the multi-core context mainly due to the resource sharing that can lead to highly complex interactions or to nondeterminism. Most of the units that generate behaviors that are hard to take into account can be deactivated, but it is not always easy to predict the impact on the performance. Nevertheless some of the features cannot be disabled (such as the out of order execution or some nondeterministic crossbar access policies) which leads to the invalidation of the respective platform for applications with high criticality level. We will address the problematic units, propose configuration or architecture guidelines and estimate their impact on the performance and determinism of the system

Dagstuhl Research Online Publication Server

Integrating a floating point unit into the AT&T Hobbit(tm) microprocessor

Author: Holler Paul T.
Publication venue: Lehigh Preserve
Publication date
Field of study

Lehigh University: Lehigh Preserve

The "MIND" Scalable PIM Architecture

Author: Brodowicz Maciej
Sterling Thomas
Publication venue
Publication date: 01/01/2005
Field of study

MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture

Caltech Authors

Scientific Computing on the Itanium® Processor

Author
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2002
Field of study

Crossref

Adaptive runtime-assisted block prefetching on chip-multiprocessors

Author: Carpenter Paul M.
García Flores Víctor
Navarro Nacho
Ramirez Alex
Rico Carro Alejandro
Villavieja Prados Carlos
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016
Field of study

Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software controlled prefetching we find a wide variety of schemes, including runtime-directed prefetching and more specifically runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speed up of 32 and 10 % on average compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction of up to 18 and 3 % on average in energy-to-solution.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC