7 research outputs found

    Limits of a decoupled out-of-order superscalar architecture

    Get PDF

    Software and hardware methods for memory access latency reduction on ILP processors

    Get PDF
    While microprocessors have doubled their speed every 18 months, performance improvement of memory systems has continued to lag behind. to address the speed gap between CPU and memory, a standard multi-level caching organization has been built for fast data accesses before the data have to be accessed in DRAM core. The existence of these caches in a computer system, such as L1, L2, L3, and DRAM row buffers, does not mean that data locality will be automatically exploited. The effective use of the memory hierarchy mainly depends on how data are allocated and how memory accesses are scheduled. In this dissertation, we propose several novel software and hardware techniques to effectively exploit the data locality and to significantly reduce memory access latency.;We first presented a case study at the application level that reconstructs memory-intensive programs by utilizing program-specific knowledge. The problem of bit-reversals, a set of data reordering operations extensively used in scientific computing program such as FFT, and an application with a special data access pattern that can cause severe cache conflicts, is identified in this study. We have proposed several software methods, including padding and blocking, to restructure the program to reduce those conflicts. Our methods outperform existing ones on both uniprocessor and multiprocessor systems.;The access latency to DRAM core has become increasingly long relative to CPU speed, causing memory accesses to be an execution bottleneck. In order to reduce the frequency of DRAM core accesses to effectively shorten the overall memory access latency, we have conducted three studies at this level of memory hierarchy. First, motivated by our evaluation of DRAM row buffer\u27s performance roles and our findings of the reasons of its access conflicts, we propose a simple and effective memory interleaving scheme to reduce or even eliminate row buffer conflicts. Second, we propose a fine-grain priority scheduling scheme to reorder the sequence of data accesses on multi-channel memory systems, effectively exploiting the available bus bandwidth and access concurrency. In the final part of the dissertation, we first evaluate the design of cached DRAM and its organization alternatives associated with ILP processors. We then propose a new memory hierarchy integration that uses cached DRAM to construct a very large off-chip cache. We show that this structure outperforms a standard memory system with an off-level L3 cache for memory-intensive applications.;Memory access latency has become a major performance bottleneck for memory-intensive applications. as long as DRAM technology remains its most cost-effective position for making main memory, the memory performance problem will continue to exist. The studies conducted in this dissertation attempt to address this important issue. Our proposed software and hardware schemes are effective and applicable, which can be directly used in real-world memory system designs and implementations. Our studies also provide guidance for application programmers to understand memory performance implications, and for system architects to optimize memory hierarchies

    Obtaining performance and programmability using reconfigurable hardware for media processing

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2002.Includes bibliographical references (p. 127-132).An imperative requirement in the design of a reconfigurable computing system or in the development of a new application on such a system is performance gains. However, such developments suffer from long-and-difficult programming process, hard-to-predict performance gains, and limited scope of applications. To address these problems, we need to understand reconfigurable hardware's capabilities and limitations, its performance advantages and disadvantages, re-think reconfigurable system architectures, and develop new tools to explore its utility. We begin by examining performance contributors at the system level. We identify those from general-purpose and those from dedicated components. We propose an architecture by integrating reconfigurable hardware within the general-purpose framework. This is to avoid and minimize dedicated hardware and organization for programmability. We analyze reconfigurable logic architectures and their performance limitations. This analysis leads to a theory that reconfigurable logic can never be clocked faster than a fixed-logic design based on the same fabrication technology. Though highly unpredictable, we can obtain a quick upper bound estimate on the clock speed based on a few parameters. We also analyze microprocessor architectures and establish an analytical performance model. We use this model to estimate performance bounds using very little information on task properties. These bounds help us to detect potential memory-bound tasks. For a compute-bound task, we compare its performance upper bound with the upper bound on reconfigurable clock speed to further rule out unlikely speedup candidates.(cont.) These performance estimates require very few parameters, and can be quickly obtained without writing software or hardware codes. They can be integrated with design tools as front end tools to explore speedup opportunities without costly trials. We believe this will broaden the applicability of reconfigurable computing.by Ling-Pei Kung.Ph.D

    Microarchitectural Techniques to Exploit Repetitive Computations and Values

    Get PDF
    La dependencia de datos es una de las principales razones que limitan el rendimiento de los procesadores actuales. Algunos estudios han demostrado, que las aplicaciones no pueden alcanzar m谩s de una decena de instrucciones por ciclo en un procesador ideal, con la simple limitaci贸n de las dependencias de datos. Esto sugiere que, desarrollar t茅cnicas que eviten la serializaci贸n causada por ellas, son importantes para acelerar el paralelismo a nivel de instrucci贸n y ser谩 crucial en los microprocesadores del futuro.Adem谩s, la innovaci贸n y las mejoras tecnol贸gicas en el dise帽o de los procesadores de los 煤ltimos diez a帽os han sobrepasado los avances en el dise帽o del sistema de memoria. Por lo tanto, la cada vez mas grande diferencia de velocidades de procesador y memoria, ha motivado que, los actuales procesadores de alto rendimiento se centren en las organizaciones cache para tolerar las altas latencias de memoria. Las memorias cache solventan en parte esta diferencia de velocidades, pero a cambio introducen un aumento de 谩rea del procesador, un incremento del consumo energ茅tico y una mayor demanda de ancho de banda de memoria, de manera que pueden llegar a limitar el rendimiento del procesador.En esta tesis se proponen diversas t茅cnicas microarquitect贸nicas que pueden aplicarse en diversas partes del procesador, tanto para mejorar el sistema de memoria, como para acelerar la ejecuci贸n de instrucciones. Algunas de ellas intentan suavizar la diferencia de velocidades entre el procesador y el sistema de memoria, mientras que otras intentan aliviar la serializaci贸n causada por las dependencias de datos. La idea fundamental, tras todas las t茅cnicas propuestas, consiste en aprovechar el alto porcentaje de repetici贸n de los programas convencionales.Las instrucciones ejecutadas por los programas de hoy en d铆a, tienden a ser repetitivas, en el sentido que, muchos de los datos consumidos y producidos por ellas son frecuentemente los mismos. Esta tesis denomina la repetici贸n de cualquier valor fuente y destino como Repetici贸n de Valores, mientras que la repetici贸n de valores fuente y operaci贸n de la instrucci贸n se distingue como Repetici贸n de Computaciones. De manera particular, las t茅cnicas propuestas para mejorar el sistema de memoria se basan en explotar la repetici贸n de valores producida por las instrucciones de almacenamiento, mientras que las t茅cnicas propuestas para acelerar la ejecuci贸n de instrucciones, aprovechan la repetici贸n de computaciones producida por todas las instrucciones.Data dependences are some of the most important hurdles that limit the performance of current microprocessors. Some studies have shown that some applications cannot achieve more than a few tens of instructions per cycle in an ideal processor with the sole limitation of data dependences. This suggests that techniques for avoiding the serialization caused by them are important for boosting the instruction-level parallelism and will be crucial for future microprocessors. Moreover, innovation and technological improvements in processor design have outpaced advances in memory design in the last ten years. Therefore, the increasing gap between processor and memory speeds has motivated that current high performance processors focus on cache memory organizations to tolerate growing memory latencies. Caches attempt to bridge this gap but do so at the expense of large amounts of die area, increment of the energy consumption and higher demand of memory bandwidth that can be progressively a greater limit to high performance.We propose several microarchitectural techniques that can be applied to various parts of current microprocessor designs to improve the memory system and to boost the execution of instructions. Some techniques attempt to ease the gap between processor and memory speeds, while the others attempt to alleviate the serialization caused by data dependences. The underlying aim behind all the proposed microarchitectural techniques is to exploit the repetitive behaviour in conventional programs. Instructions executed by real-world programs tend to be repetitious, in the sense that most of the data consumed and produced by several dynamic instructions are often the same. We refer to the repetition of any source or result value as Value Repetition and the repetition of source values and operation as Computation Repetition. In particular, the techniques proposed for improving the memory system are based on exploiting the value repetition produced by store instructions, while the techniques proposed for boosting the execution of instructions are based on exploiting the computation repetition produced by all the instructions

    Methoden zur applikationsspezifischen Effizienzsteigerung adaptiver Prozessorplattformen

    Get PDF
    General-Purpose Prozessoren sind f眉r den durchschnittlichen Anwendungsfall optimiert, wodurch vorhandene Ressourcen nicht effizient genutzt werden. In der vorliegenden Arbeit wird untersucht, in wie weit es m枚glich ist, einen General-Purpose Prozessor an einzelne Anwendungen anzupassen und so die Effizienz zu steigern. Die Adaption kann zur Laufzeit durch das Prozessor- oder Laufzeitsystem anhand der jeweiligen Systemparameter erfolgen, um eine Effizienzsteigerung zu erzielen
    corecore