200 research outputs found

    A comparison of cache hierarchies for SMT processors

    Get PDF
    In the multithread and multicore era, programs are forced to share part of the processor structures. On one hand, the state of the art in multithreading describes how efficiently manage and distribute inner resources such as reorder buffer or issue windows. On the other hand, there is a substantial body of works focused on outer resources, mainly on how to effectively share last level caches in multicores. Between these ends, first and second level caches have remained apart even if they are shared in most commercial multithreaded processors. This work analyzes multiprogrammed workloads as the worst-case scenario for cache sharing among threads. In order to obtain representative results, we present a sampling-based methodology that for multiple metrics such as STP, ANTT, IPC throughput, or fairness, reduces simulation time up to 4 orders of magnitude when running 8-thread workloads with an error lower than 3% and a confidence level of 97%. With the above mentioned methodology, we compare several state-of-the-art cache hierarchies, and observe that Light NUCA provides performance benefits in SMT processors regardless the organization of the last level cache. Most importantly, Light NUCA gains are consistent across the entire number of simulated threads, from one to eight.Peer ReviewedPostprint (author's final draft

    Power Management for Deep Submicron Microprocessors

    Get PDF
    As VLSI technology scales, the enhanced performance of smaller transistors comes at the expense of increased power consumption. In addition to the dynamic power consumed by the circuits there is a tremendous increase in the leakage power consumption which is further exacerbated by the increasing operating temperatures. The total power consumption of modern processors is distributed between the processor core, memory and interconnects. In this research two novel power management techniques are presented targeting the functional units and the global interconnects. First, since most leakage control schemes for processor functional units are based on circuit level techniques, such schemes inherently lack information about the operational profile of higher-level components of the system. This is a barrier to the pivotal task of predicting standby time. Without this prediction, it is extremely difficult to assess the value of any leakage control scheme. Consequently, a methodology that can predict the standby time is highly beneficial in bridging the gap between the information available at the application level and the circuit implementations. In this work, a novel Dynamic Sleep Signal Generator (DSSG) is presented. It utilizes the usage traces extracted from cycle accurate simulations of benchmark programs to predict the long standby periods associated with the various functional units. The DSSG bases its decisions on the current and previous standby state of the functional units to accurately predict the length of the next standby period. The DSSG presents an alternative to Static Sleep Signal Generation (SSSG) based on static counters that trigger the generation of the sleep signal when the functional units idle for a prespecified number of cycles. The test results of the DSSG are obtained by the use of a modified RISC superscalar processor, implemented by SimpleScalar, the most widely accepted open source vehicle for architectural analysis. In addition, the results are further verified by a Simultaneous Multithreading simulator implemented by SMTSIM. Leakage saving results shows an increase of up to 146% in leakage savings using the DSSG versus the SSSG, with an accuracy of 60-80% for predicting long standby periods. Second, chip designers in their effort to achieve timing closure, have focused on achieving the lowest possible interconnect delay through buffer insertion and routing techniques. This approach, though, taxes the power budget of modern ICs, especially those intended for wireless applications. Also, in order to achieve more functionality, die sizes are constantly increasing. This trend is leading to an increase in the average global interconnect length which, in turn, requires more buffers to achieve timing closure. Unconstrained buffering is bound to adversely affect the overall chip performance, if the power consumption is added as a major performance metric. In fact, the number of global interconnect buffers is expected to reach hundreds of thousands to achieve an appropriate timing closure. To mitigate the impact of the power consumed by the interconnect buffers, a power-efficient multi-pin routing technique is proposed in this research. The problem is based on a graph representation of the routing possibilities, including buffer insertion and identifying the least power path between the interconnect source and set of sinks. The novel multi-pin routing technique is tested by applying it to the ISPD and IBM benchmarks to verify the accuracy, complexity, and solution quality. Results obtained indicate that an average power savings as high as 32% for the 130-nm technology is achieved with no impact on the maximum chip frequency

    Improving processor efficiency by exploiting common-case behaviors of memory instructions

    Get PDF
    Processor efficiency can be described with the help of a number of  desirable effects or metrics, for example, performance, power, area, design complexity and access latency. These metrics serve as valuable tools used in designing new processors and they also act as  effective standards for comparing current processors. Various factors impact the efficiency of modern out-of-order processors and one important factor is the manner in which instructions are processed through the processor pipeline. In this dissertation research, we study the impact of load and store instructions (collectively known as memory instructions) on processor efficiency,  and show how to improve efficiency by exploiting common-case or  predictable patterns in the behavior of memory instructions. The memory behavior patterns that we focus on in our research are the predictability of memory dependences, the predictability in data forwarding patterns,   predictability in instruction criticality and conservativeness in resource allocation and deallocation policies. We first design a scalable  and high-performance memory dependence predictor and then apply accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multi-threaded processor. We then use predictable data forwarding patterns to eliminate power-hungry  hardware in the processor with no loss in performance.  We then move to  studying instruction criticality to improve  processor efficiency. We study the behavior of critical load instructions  and propose applications that can be optimized using  predictable, load-criticality  information. Finally, we explore conventional techniques for allocation and deallocation  of critical structures that process memory instructions and propose new techniques to optimize the same.  Our new designs have the potential to reduce  the power and the area required by processors significantly without losing  performance, which lead to efficient designs of processors.Ph.D.Committee Chair: Loh, Gabriel H.; Committee Member: Clark, Nathan; Committee Member: Jaleel, Aamer; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin S.; Committee Member: Prvulovic, Milo

    Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors

    Full text link
    Los procesadores superescalares actuales utilizan un reorder buffer (ROB) para contabilizar las instrucciones en vuelo. El ROB se implementa como una cola FIFO first in first out en la que las instrucciones se insertan en orden de programa después de ser decodificadas, y de la que se extraen también en orden de programa en la etapa commit. El uso de esta estructura proporciona un soporte simple para la especulación, las excepciones precisas y la reclamación de registros. Sin embargo, el hecho de retirar instrucciones en orden puede degradar las prestaciones si una operación de alta latencia está bloqueando la cabecera del ROB. Varias propuestas se han publicado atacando este problema. La mayoría utiliza retirada de instrucciones fuera de orden de forma especulativa, requiriendo almacenar puntos de recuperación (checkpoints) para restaurar un estado válido del procesador ante un fallo de especulación. Normalmente, los checkpoints necesitan implementarse con estructuras hardware costosas, y además requieren un crecimiento de otras estructuras del procesador, lo cual a su vez puede impactar en el tiempo de ciclo de reloj. Este problema afecta a muchos tipos de procesadores actuales, independientemente del número de hilos hardware (threads) y del número de núcleos de cómputo (cores) que incluyan. Esta tesis abarca el estudio de la retirada no especulativa de instrucciones fuera de orden en procesadores superescalares, multithread y multicore.Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8535Palanci

    Power efficient approaches to redundant multithreading

    Get PDF
    Journal ArticleNoise and radiation-induced soft errors (transient faults) in computer systems have increased significantly over the last few years and are expected to increase even more as we move toward smaller transistor sizes and lower supply voltages. Fault detection and recovery can be achieved through redundancy. The emergence of chip multiprocessors (CMPs) makes it possible to execute redundant threads on a chip and provide relatively low-cost reliability. State-of-the-art implementations execute two copies of the same program as two threads (redundant multithreading), either on the same or on separate processor cores in a CMP, and periodically check results. Although this solution has favorable performance and reliability properties, every redundant instruction flows through a high-frequency complex out-of-order pipeline, thereby incurring a high power consumption penalty. This paper proposes mechanisms that attempt to provide reliability at a modest power and complexity cost. When executing a redundant thread, the trailing thread benefits from the information produced by the leading thread. We take advantage of this property and comprehensively study different strategies to reduce the power overhead of the trailing core in a CMP. These strategies include dynamic frequency scaling, in-order execution, and parallelization of the trailing thread

    Hybrid Designs for Caches and Cores.

    Full text link
    Processor power constraints have come to the forefront over the last decade, heralded by the stagnation of clock frequency scaling. High-performance core and cache designs often utilize power-hungry techniques to increase parallelism. Conversely, the most energy-efficient designs opt for a serial execution to avoid unnecessary overheads. While both of these extremes constitute one-size-fits-all approaches, a judicious mix of parallel and serial execution has the potential to achieve the best of both high-performing and energy-efficient designs. This dissertation examines such hybrid designs for cores and caches. Firstly, we introduce a novel, hybrid out-of-order/in-order core microarchitecture. Instructions that are steered towards in-order execution skip register allocation, reordering and dynamic scheduling. At the same time, these instructions can interleave on an instruction-by-instruction basis with instructions that continue to benefit from these conventional out-of-order mechanisms. Secondly, this dissertation revisits a hybrid technique introduced for L1 caches, way-prediction, in the context of last-level caches that are larger, have higher associativity, and experience less locality.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113484/1/sleimanf_1.pd

    Microarchitecture Choices and Tradeoffs for Maximizing Processing Efficiency.

    Full text link
    This thesis is concerned with hardware approaches for maximizing the number of independent instructions in the execution core and thereby maximizing the processing efficiency for a given amount of compute bandwidth. Compute bandwidth is the number of parallel execution units multiplied by the pipelining of those units in the processor. Keeping those computing elements busy is key to maximize processing efficiency and therefore power efficiency. While some applications have many independent instructions that can be issued in parallel without inefficiencies due to branch behavior, cache behavior, or instruction dependencies, most applications have limited parallelism and plenty of stalling conditions. This thesis presents two approaches to this problem, which in combination greatly increases the efficiency of the processor utilization of resources. The first approach addresses the problem of small basic blocks that arise when code has frequent branches. We introduce algorithms and mechanisms to predict multiple branches simultaneously and to fetch multiple non-continuous basic blocks every cycle along a predicted branch path. This makes what was previously an inherently serial process into a parallelized instruction fetch approach. For integer applications, the result is an increase in useful instruction fetch capacity of 40% when two basic blocks are fetched per cycle and 63% for three blocks per cycle. For floating point benchmarks, the associated improvement is 27% and 59%. The second approach addresses increasing the number of independent instructions to the execution core through simultaneous multi-threading (SMT). We compare to another multithreading approach, Switch-on-Event multithreading, and show that SMT is far superior. Intel Pentium 4 SMT microarchitecture algorithms are analyzed, and we look at the impact of SMT on power efficiency of the Pentium 4 Processor. A new metric, the SMT Energy Benefit is defined. Not only do we show that the SMT Energy Benefit for a given workload with SMT can be quite significant, we also generalize the results and build a model for what other future processors’ SMT Energy Benefit would be. We conclude that while SMT will continue to be an energy-efficient feature, as processors get more energy-efficient in general the relative SMT Energy Benefit may be reduced.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61740/1/dtmarr_1.pd

    Improving cache locality for thread-level speculation

    Full text link
    corecore