34 research outputs found

Memory dependence prediction using store sets

Author: George Z. Chrysos
Hesson J.
Joel S. Emer
Lipasti M.
Steely S.
Publication venue: 'Association for Computing Machinery (ACM)'

SAMIE-LSQ: set-associative multiple-instruction entry load/store queue

Author: Abella Ferrer Jaume
González Colás Antonio María
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006

The load/store queue (LSQ) is one of the most complex parts of contemporary processors. Its latency is critical for processor performance and it is usually one of the processor hotspots. This paper presents a highly banked, set-associative, multiple-instruction entry LSQ (SAMIE-LSQ) that achieves high performance with small energy requirements. The SAMIE-LSQ classifies the memory instructions (loads and stores) based on the address to be accessed, and groups those instructions accessing the same cache line in the same entry. Our approach relies on the fact that many in-flight memory instructions access the same cache lines. Each SAMIE-LSQ entry has space for several memory instructions accessing the same cache line. This arrangement has a number of advantages. First, it significantly reduces the address comparison activity needed for memory disambiguation since there are fewer addresses to be compared. It also reduces the activity in the data TLB, the cache tag and cache data arrays. This is achieved by caching the cache line location and address translation in the corresponding SAMIE-LSQ entry once the access of one of the instructions in an entry is performed, so instructions that share an entry can reuse the translation, avoid the tag check and get the data directly from the concrete cache way without checking the others. Besides, the delay of the proposed scheme is lower than that required by a conventional LSQ. We show that the SAMIE-LSQ saves 82% dynamic energy for the load/store queue, 42% for the L1 data cache and 73% for the data TLB, with a negligible impact on performance (0.6%). Peer reviewed. Postprint (published version).
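The grouping idea in this abstract can be illustrated with a minimal sketch. It is not the paper's design: the line size, set count, associativity, slots per entry, and every name below are hypothetical values chosen only to show how in-flight memory instructions that touch the same cache line could share a single set-associative entry.

```python
from dataclasses import dataclass, field

# All sizes are illustrative assumptions, not figures from the paper.
LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 4      # assumed number of sets
WAYS = 2          # assumed entries per set
SLOTS = 4         # assumed memory instructions per entry


@dataclass
class Entry:
    line: int                                # cache line number shared by all ops
    ops: list = field(default_factory=list)  # ids of instructions in this entry


class GroupedLSQ:
    """Toy queue that groups memory operations by cache line."""

    def __init__(self):
        self.sets = [[] for _ in range(NUM_SETS)]

    def insert(self, op_id, address):
        line = address // LINE_BYTES
        bucket = self.sets[line % NUM_SETS]
        # Reuse an existing entry for the same cache line if it has room.
        for entry in bucket:
            if entry.line == line and len(entry.ops) < SLOTS:
                entry.ops.append(op_id)
                return True
        # Otherwise allocate a new entry if the set has a free way.
        if len(bucket) < WAYS:
            bucket.append(Entry(line, [op_id]))
            return True
        return False  # set is full; a real design would need a fallback path

    def entry_count(self):
        return sum(len(bucket) for bucket in self.sets)


lsq = GroupedLSQ()
lsq.insert(0, 0x1000)  # line 64
lsq.insert(1, 0x1008)  # same line 64, shares the entry
lsq.insert(2, 0x1040)  # line 65, new entry in another set
print(lsq.entry_count())
```

In this toy model a disambiguation check would compare one line number per entry rather than one address per instruction, which is the kind of reduction in comparison activity the abstract describes.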

UPCommons. Portal del coneixement obert de la UPC

Reducción de consumo en la caché de datos de nivel 1 utilizando un predictor de forwarding

Author: Apolloni Rubén
Carazo Minguela Pablo
Castro Fernando
Chaver Daniel
Piñuel Luis
Tirado Francisco
Publication venue: E.U. de Informática (UPM)
Publication date: 01/01/2010

In most current processor designs, access to the level-1 data cache (L1D) has become one of the most power-hungry components, owing to its growing size and high access frequency. To reduce this consumption, we propose a simple filtering technique. Our idea is based on a highly accurate forwarding predictor that determines whether a load instruction will obtain its data via forwarding through the LSQ, in which case the L1D access is avoided, or whether it must fetch the data from the data cache. Our simulation results show that we can save 35% of L1D energy consumption on average, with negligible performance degradation.
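As a rough illustration of the filtering idea in this abstract, the sketch below uses a table of 2-bit saturating counters indexed by load PC. The abstract does not describe the predictor's organization, so the table size, indexing, threshold, and all names are assumptions made for the example.

```python
# Hypothetical organization: the abstract does not specify the predictor.
TABLE_ENTRIES = 1024  # assumed number of counters
THRESHOLD = 2         # predict forwarding when the 2-bit counter is 2 or 3


class ForwardingPredictor:
    """Toy PC-indexed predictor of store-to-load forwarding."""

    def __init__(self):
        self.counters = [0] * TABLE_ENTRIES

    def _index(self, pc):
        # Assumes 4-byte aligned instructions; drop the low two bits.
        return (pc >> 2) % TABLE_ENTRIES

    def predict_forwarding(self, pc):
        # True means: expect the data from the LSQ and skip the L1D access.
        return self.counters[self._index(pc)] >= THRESHOLD

    def update(self, pc, forwarded):
        # Train with the actual outcome once the load has resolved.
        i = self._index(pc)
        if forwarded:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = ForwardingPredictor()
predictor.update(0x400, forwarded=True)
predictor.update(0x400, forwarded=True)
print(predictor.predict_forwarding(0x400))
```

A misprediction in the "will forward" direction would still require a fallback access to the data cache, so in a sketch like this the threshold trades energy savings against recovery cost.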

Archivo Digital UPM

Memory disambiguation hardware: a review

Author: Castro Fernando
Chaver Daniel
Piñuel Luis
Prieto Manuel
Tirado Fernández Francisco
Publication date: 01/10/2008

One of the main challenges of modern processor designs is the implementation of scalable and efficient mechanisms to detect memory access order violations as a result of out-of-order execution. Conventional structures performing this task are complex, inefficient and power-hungry. This fact has generated a large body of work on optimizing address-based memory disambiguation logic, namely the load-store queue. In this paper we review the most significant proposals in this research field, focusing on our own contributions. Facultad de Informática.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Servicio de Difusión de la Creación Intelectual

Bloom filtering cache misses for accurate data speculation and prefetching

Author: Jared Stark
Jih-Kwon Peir
Konrad Lai
Shih-Chang Lai
Shih-Lien Lu
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2003

Crossref

Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization

Author: Roth Amir
Publication venue: ScholarlyCommons
Publication date: 01/01/2004

A high-bandwidth, low-latency load-store unit is a critical component of a dynamically scheduled processor. Unfortunately, it is also one of the most complex and non-scalable components. Recently, several researchers have proposed techniques that simplify the core load-store unit and improve its scalability in exchange for the in-order pre-retirement re-execution of some subset of the loads in the program. We call such techniques load/store optimizations. One recent optimization attacks load queue (LQ) scalability by replacing the expensive associative search that is used to enforce intra- and inter-thread ordering with load re-execution. A second attacks store queue (SQ) scalability by speculatively filtering some load accesses and some store entries from it. The speculatively accessed, speculatively populated SQ can be made smaller and faster, but load re-execution is required to verify the speculation. A third uses a hardware table to identify redundant loads and skip their execution altogether. Redundant load elimination is highly accurate but not 100%, so re-execution is needed to flag false eliminations. Unfortunately, the inherent benefits of load/store optimizations are mitigated by re-execution itself. Re-execution contends for cache bandwidth with store retirement, and serializes load re-execution with subsequent store retirement. If a particular technique requires a sufficient number of load re-executions, the cost of these re-executions will outweigh the benefits of the technique entirely and may even produce drastic slowdowns. This is the case for the SQ technique. Store Vulnerability Window (SVW) is a new mechanism that reduces the re-execution requirements of a given load/store optimization significantly, by an average of 85% across the three load/store optimizations we study.
This reduction relieves cache port contention and removes many of the dynamic serialization events that contribute the bulk of re-execution’s cost, and allows these techniques to perform up to their full potential. For the scalable SQ optimization, this means the chance to perform at all. Without SVW, this technique posts significant slowdowns. SVW is a simple scheme based on monotonic store sequence numbering and a novel application of Bloom filtering. The cost of an effective SVW implementation is a 1KB buffer and a 2B field per LQ entry.
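The combination of monotonic store sequence numbers and a Bloom-filter-style table can be sketched as follows. This is one possible reading of the abstract, not the paper's implementation: the table size, hash function, and names are assumptions. Each retired store stamps a hashed table slot with its sequence number, and a load is re-executed only if a store newer than the load's recorded sequence number may have written its address.

```python
TABLE_SIZE = 256  # assumed number of filter entries


class StoreVulnerabilityFilter:
    """Toy re-execution filter keyed by store sequence numbers (SSNs)."""

    def __init__(self):
        self.next_ssn = 1                         # SSNs increase monotonically
        self.last_store_ssn = [0] * TABLE_SIZE    # 0 means no store seen yet

    def _index(self, address):
        # Assumed hash: 8-byte granularity, then modulo the table size.
        return (address >> 3) % TABLE_SIZE

    def retire_store(self, address):
        ssn = self.next_ssn
        self.next_ssn += 1
        self.last_store_ssn[self._index(address)] = ssn
        return ssn

    def latest_retired_ssn(self):
        return self.next_ssn - 1

    def must_reexecute(self, address, load_ssn):
        # load_ssn: SSN of the youngest store the load is known to be safe from.
        # Aliasing in the table can only cause extra re-executions, never
        # a missed one, which is the usual Bloom filter guarantee.
        return self.last_store_ssn[self._index(address)] > load_ssn


svw = StoreVulnerabilityFilter()
svw.retire_store(0x100)
snapshot = svw.latest_retired_ssn()   # a load executes here and records this SSN
svw.retire_store(0x200)               # a later store retires
print(svw.must_reexecute(0x200, snapshot), svw.must_reexecute(0x100, snapshot))
```

Loads whose address maps to a slot untouched since their snapshot skip re-execution, which corresponds to the filtering effect the abstract reports.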

ScholarlyCommons@Penn