54 research outputs found

    The Effect of Code Expanding Optimizations on Instruction Cache Design

    Get PDF
    Coordinated Science Laboratory (formerly known as the Control Systems Laboratory); National Science Foundation / MIP-8809478; NCR; AMD 29K Advanced Processor Development Division; National Aeronautics and Space Administration / NASA NAG 1-613; N00014-91-J-128

    Improving on-chip data cache using instruction register information

    Get PDF
    by Lau Siu Chung. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 71-74). Contents:
    Abstract; Acknowledgment; List of Figures
    Chapter 1 Introduction: 1.1 Hiding memory latency; 1.2 Organization of dissertation
    Chapter 2 Related Work: 2.1 Hardware controlled cache prefetching; 2.2 Software assisted cache prefetching
    Chapter 3 Data Prefetching: 3.1 Data reference patterns; 3.2 Embedded hints for next data references; 3.3 Instruction Opcode and Addressing Mode Prefetching scheme (3.3.1 Basic IAP scheme; 3.3.2 Enhanced IAP scheme; 3.3.3 Combined IAP scheme); 3.4 Summary
    Chapter 4 Performance Evaluation: 4.1 Evaluation methodology (4.1.1 Trace-driven simulation; 4.1.2 Caching models; 4.1.3 Benchmarks and metrics); 4.2 General results (4.2.1 Varying cache size; 4.2.2 Varying cache block size; 4.2.3 Varying associativity); 4.3 Other performance metrics (4.3.1 Accuracy of prefetch; 4.3.2 Partial hit delay; 4.3.3 Bus usage problem); 4.4 Zero time prefetch; 4.5 Summary
    Chapter 5 Conclusion: 5.1 Summary of our research; 5.2 Future work
    Bibliography
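    The chapter headings above only name the thesis's Instruction Opcode and Addressing Mode Prefetching (IAP) scheme without describing it. As a rough illustration of what an opcode/addressing-mode-driven prefetch step could look like, here is a minimal Python sketch; the instruction fields, the "base+offset" mode, and the cache interface are illustrative assumptions, not the thesis's actual design.

    from collections import namedtuple

    # Hypothetical decoded-instruction record; field names are assumptions.
    Instr = namedtuple("Instr", "opcode mode base_reg offset")

    class DataCache:
        def __init__(self, block_size=32):
            self.block_size = block_size
            self.lines = set()                        # resident block numbers

        def prefetch(self, addr):
            self.lines.add(addr // self.block_size)   # bring the block in early

        def access(self, addr):
            hit = addr // self.block_size in self.lines
            self.lines.add(addr // self.block_size)
            return hit

    def iap_prefetch(instr, regs, dcache):
        """On seeing a load in the instruction register, use its opcode and
        addressing mode to form the likely effective address at decode time
        and prefetch it before the execute stage issues the demand access."""
        if instr.opcode != "load":
            return
        if instr.mode == "base+offset":               # address computable at decode
            dcache.prefetch(regs[instr.base_reg] + instr.offset)

    regs = {"r1": 0x1000}
    dcache = DataCache()
    iap_prefetch(Instr("load", "base+offset", "r1", 0x40), regs, dcache)
    print(dcache.access(0x1040))                      # True: the block was prefetched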

    Affordable kilo-instruction processors

    Get PDF
    Several factors explain the stagnation in the development of the traditional high-performance single-thread processor. On the one hand, aggressive techniques such as superpipelining or out-of-order execution have a considerable impact on power consumption and design complexity. On the other hand, the increase in processor frequencies has led to a large disparity between processor speed and main memory access time. Although cache memories considerably reduce the number of accesses to main memory, the remaining accesses introduce latencies large enough to considerably decrease performance. Conventional techniques such as out-of-order execution, while effective in hiding L2 cache access latencies, cannot hide latencies this large. Queues of hundreds of entries and thousands of registers would be necessary to prevent execution from stalling whenever main memory is accessed. Unfortunately, current technology cannot efficiently implement such structures monolithically: access latency, power consumption, and area would all increase considerably.
    In this thesis we study techniques that allow the processor to continue processing instructions while main memory accesses are outstanding. For such a processor to be implementable, it must be based on structures of conventional size and feature simple control logic; the challenge lies in reconciling a distributed processor model with simple control. We approached the design by analyzing the behavior of a processor with infinite resources, and observed that execution follows a very interesting pattern based on execution locality. In numerical codes, over 70% of all instructions do not depend on main memory accesses. This is important because it shows that there is always a large portion of instructions that can be executed shortly after decode, and it motivates a new kind of processor with two execution units. The first unit, the Cache Processor, processes memory-independent instructions at high speed. The second unit, the Memory Processor, processes instructions that depend on main memory accesses, but with much more relaxed scheduling logic, which allows it to keep thousands of instructions in flight. This proposal, named the Decoupled KILO-Instruction Processor (D-KIP), has two main advantages: it allows a kilo-instruction processor to be built from conventional structures, and it simplifies the design because the interaction between the two execution units is minimal.
    This thesis presents two implementations of such decoupled processors: the original D-KIP and the Flexible Heterogeneous MultiCore (FMC). Their performance is analyzed and compared to other techniques that increase memory-level parallelism, such as prefetching and runahead execution. The evaluation shows that the FMC processor performs at the level of a conventional processor with a window of around 1500 in-flight instructions. The integration of the FMC into multicore/multiprogrammed environments is then studied. The thesis concludes with the proposal of a two-level load/store queue (LSQ) for this kind of processor.
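    The split between the Cache Processor and the Memory Processor hinges on classifying each dynamic instruction by whether it depends, directly or transitively, on a main-memory access. A minimal Python sketch of that classification follows; the instruction tuples and the miss oracle are illustrative assumptions, not the thesis's actual interfaces.

    def dispatch(instrs, misses):
        """instrs: (dest_reg, src_regs) tuples in program order.
        misses: indices of instructions whose loads miss to main memory."""
        tainted = set()                # registers whose value awaits main memory
        cache_proc, mem_proc = [], []
        for i, (dest, srcs) in enumerate(instrs):
            depends = i in misses or any(s in tainted for s in srcs)
            if depends:
                tainted.add(dest)      # taint propagates through the dataflow
                mem_proc.append(i)     # relaxed, large-window unit
            else:
                tainted.discard(dest)  # dest is rewritten with a ready value
                cache_proc.append(i)   # fast, conventional-sized unit
        return cache_proc, mem_proc

    # i0: load r1 (misses to memory); i1: r2 depends on r1; i2: independent
    prog = [("r1", ()), ("r2", ("r1",)), ("r3", ())]
    print(dispatch(prog, misses={0}))  # -> ([2], [0, 1])

    The taint set is what keeps the control simple: no global dependence analysis is needed, only a per-register flag updated in program order as instructions are decoded.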

    Tolerating Radiation-Induced Transient Faults in Modern Processors

    Full text link

    Improving cache locality for thread-level speculation

    Full text link

    Minimizing Single-Usage Cache Pollution for Effective Cache Hierarchy Management

    Get PDF
    Efficient cache hierarchy management is of paramount importance when designing high-performance processors. Upon a miss, the conventional operation of a cache hierarchy is to retrieve the missing block from higher levels and store it into all hierarchy levels. It is, however, difficult to assert that storing the block into intermediate levels will really be useful. In the literature this phenomenon, referred to as cache pollution, is often associated with prefetching techniques: a prefetched block may evict data that is more likely to be reused in the near future. Cache pollution can cause severe performance degradation. This paper addresses this phenomenon in the highest level of the cache hierarchy. Unlike past studies that treat polluting cache blocks as blocks that are never accessed (i.e. pollution caused only by prefetching), our proposal attempts to eliminate cache pollution that is inherent to the application. Our observations reveal that cache blocks that are accessed only once - single-usage blocks - are quite significant at runtime, especially in the highest level of the cache hierarchy. In addition, most single-usage cache blocks are data that can be prefetched. We show that a simple prediction mechanism is sufficient to uncover most single-usage blocks. For a two-level cache hierarchy, these blocks are sent directly from main memory to the L1 cache. Bypassing the L2 cache in this way makes better use of the memory hierarchy and allows hard-to-prefetch memory references to remain in this cache level. Our experimental results show that minimizing single-usage cache pollution in the L2 cache leads to a significant decrease in its miss rate, and therefore to noticeable performance gains.
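    As a rough illustration of the bypassing idea described above, the following Python sketch uses one simple heuristic: a block whose previous residency saw exactly one access is predicted single-usage and, on its next miss, is filled from memory directly into L1, bypassing L2. The LRU levels, the capacities, and the heuristic itself are assumptions for illustration, not the paper's exact mechanism.

    class Level:
        def __init__(self, capacity):
            self.capacity = capacity
            self.blocks = []                 # LRU order: front = oldest

        def access(self, block):
            if block in self.blocks:
                self.blocks.remove(block)
                self.blocks.append(block)    # move to the MRU position
                return True
            return False

        def fill(self, block):
            if len(self.blocks) >= self.capacity:
                self.blocks.pop(0)           # evict the LRU victim
            self.blocks.append(block)

    class BypassingHierarchy:
        def __init__(self):
            self.l1, self.l2 = Level(4), Level(16)
            self.uses = {}                   # block -> accesses since last fill

        def access(self, block):
            if self.l1.access(block):
                self.uses[block] += 1
                return "L1 hit"
            if self.l2.access(block):
                self.uses[block] += 1
                self.l1.fill(block)
                return "L2 hit"
            # miss to main memory: predict from the previous residency
            single = self.uses.get(block, 0) == 1
            self.uses[block] = 1
            self.l1.fill(block)              # always fill L1
            if not single:
                self.l2.fill(block)          # fill L2 only if reuse is expected
            return "miss (L2 bypassed)" if single else "miss"

    h = BypassingHierarchy()
    for b in range(30):
        h.access(b)                          # streaming pass: each block touched once
    print(h.access(0))                       # refetched later: "miss (L2 bypassed)"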