
    Hardware schemes for early register release

    Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since the processor's potential to exploit instruction-level parallelism is closely tied to the size and number of ports of the register file. In conventional register renaming schemes, a register is released conservatively, only after the instruction that redefines the same logical register has committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early-release hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.
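
    As a rough illustration of the idea, the sketch below frees a physical register once its last renamed consumer has read it and the redefining instruction has been renamed, rather than waiting for that instruction to commit. This is a minimal model under invented names: it ignores misprediction recovery, which is precisely the hard part the paper's two hardware schemes must handle.

```python
# Minimal sketch of a counter-based early-release policy (illustrative;
# the paper's two hardware schemes differ in detail and must also
# support recovery from mispredictions, omitted here).

class PhysReg:
    def __init__(self):
        self.pending_reads = 0   # consumers renamed but not yet issued
        self.redefined = False   # a later instr renamed the same logical reg

class EarlyReleaseRenamer:
    def __init__(self, num_phys):
        self.free_list = list(range(num_phys))
        self.regs = [PhysReg() for _ in range(num_phys)]
        self.map = {}            # logical reg -> physical reg

    def rename(self, dst, srcs):
        # Sources must have been defined earlier (i.e., already mapped).
        for s in srcs:
            self.regs[self.map[s]].pending_reads += 1
        old = self.map.get(dst)
        if old is not None:
            self.regs[old].redefined = True   # old value dead after last read
            self._try_release(old)
        p = self.free_list.pop()
        self.regs[p] = PhysReg()
        self.map[dst] = p
        return p

    def read_done(self, p):
        self.regs[p].pending_reads -= 1
        self._try_release(p)

    def _try_release(self, p):
        r = self.regs[p]
        # Conventional renaming waits for the redefining instruction to
        # *commit*; here the register is freed as soon as no further use
        # of its value is possible.
        if r.redefined and r.pending_reads == 0 and p not in self.free_list:
            self.free_list.append(p)
```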

    Improving latency tolerance of multithreading through decoupling

    The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the ability of this organization to make efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are directly affected by this problem. This article presents and evaluates a novel processor microarchitecture that combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical-path delays, than a centralized out-of-order design, and it is better suited to future growth in issue width and clock speed. We investigate how the two techniques complement each other: since decoupling hides memory latency very efficiently, the large amount of parallelism exploited by multithreading may instead be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance; results suggest that most of the latency-hiding effectiveness of SMT architectures comes from dynamic scheduling. Decoupling, on the other hand, is very effective at hiding memory latency: an increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent, whereas the nondecoupled multithreaded processor loses about 23 percent.
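
    The following toy model illustrates why decoupling hides memory latency: an in-order Access unit slips ahead of the Execute unit through a buffering queue. All names and parameters are illustrative stand-ins, not the paper's simulated machine.

```python
# Toy model of access/execute decoupling for a single thread: the
# in-order Access unit runs ahead, issuing loads whose results are
# buffered in a FIFO that the in-order Execute unit drains.

from collections import deque

MEM_LATENCY = 32              # cycles; the paper varies this from 1 to 32

def run(num_loads, queue_size=64):
    load_q = deque()          # entries are cycles at which data arrives
    cycle = executed = issued = 0
    while executed < num_loads:
        cycle += 1
        # Access unit: issue the next load if the queue has room.
        if issued < num_loads and len(load_q) < queue_size:
            load_q.append(cycle + MEM_LATENCY)
            issued += 1
        # Execute unit: consume the oldest load once its data is back.
        if load_q and load_q[0] <= cycle:
            load_q.popleft()
            executed += 1
    return cycle

# With a queue deeper than the memory latency, the 32-cycle latency is
# almost fully hidden: ~1 load is consumed per cycle in steady state.
print(run(1000))              # ~1032 cycles for 1000 loads
```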

    Probability-Based Memory Access Controller (PMAC) for Energy Reduction in High Performance Processors

    The increasing transistor density due to Moore's law scaling continues to drive the improvement in processor core performance with each process generation. The additional transistors are used to widen the pipeline and to increase the size of the out-of-order instruction scheduling window, register files, queues, and other pipeline data structures, in order to extract high levels of instruction-level parallelism and improve single-threaded performance. Such dynamically scheduled superscalar processor cores speculatively fetch and execute instructions far ahead in a program, along the path predicted by their branch predictors. On a branch misprediction, the architectural state can be restored at the cost of a high latency penalty, but the speculative memory requests sent by data memory access instructions on the mispredicted path cannot be revoked. Such requests alter the data arrangement across the memory hierarchy and waste memory transactions, bandwidth, and energy. Even with low branch misprediction rates, these cores spend significant time on mispredicted paths. In this thesis, we propose a probability-based memory access controller to curb the data memory requests sent along mispredicted paths and achieve energy and memory bandwidth savings with minimal impact on performance. It computes the path probability of instructions and throttles memory access instructions with a low probability of execution; a fixed or dynamically varying probability value serves as the threshold for controlling speculative requests sent to the memory hierarchy. In a single-core system, the proposed design with a dynamic threshold eliminates up to 51% of wrong-path memory accesses and up to 31% of wrong-path execution, while achieving power savings of up to 9.5% and up to a 6.3% improvement in IPC/Watt.
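
    A heavily simplified sketch of the gating idea follows: each speculative memory request carries an estimated path probability, computed here as the product of confidence values of the unresolved older branches it depends on, and requests below a threshold are withheld. The adapt() policy shown (tighten when wrong-path traffic is high) is an invented stand-in, not the thesis's exact dynamic-threshold algorithm.

```python
# Sketch of probability-gated speculative memory requests.

class PMAC:
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    @staticmethod
    def path_probability(branch_confidences):
        """branch_confidences: P(correct) for each unresolved older branch."""
        p = 1.0
        for c in branch_confidences:
            p *= c
        return p

    def allow(self, branch_confidences):
        # Send the request to memory only if its path is likely enough.
        return self.path_probability(branch_confidences) >= self.threshold

    def adapt(self, wrong_path_fraction, step=0.05):
        # Tighten the threshold when much of the issued traffic turns
        # out to be wrong-path; relax it otherwise.
        if wrong_path_fraction > 0.10:
            self.threshold = min(0.95, self.threshold + step)
        else:
            self.threshold = max(0.05, self.threshold - step)

pmac = PMAC()
print(pmac.allow([0.95, 0.90, 0.85]))   # p ~ 0.73 -> issued
print(pmac.allow([0.80, 0.70, 0.60]))   # p ~ 0.34 -> throttled
```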

    Improving processor efficiency by exploiting common-case behaviors of memory instructions

    Processor efficiency can be described with the help of a number of desirable metrics, for example performance, power, area, design complexity, and access latency. These metrics serve as valuable tools in designing new processors and also act as effective standards for comparing current ones. Various factors impact the efficiency of modern out-of-order processors, and one important factor is the manner in which instructions are processed through the pipeline. In this dissertation, we study the impact of load and store instructions (collectively known as memory instructions) on processor efficiency, and show how to improve efficiency by exploiting common-case, predictable patterns in their behavior. The behavior patterns we focus on are the predictability of memory dependences, predictability in data forwarding patterns, predictability in instruction criticality, and conservativeness in resource allocation and deallocation policies. We first design a scalable, high-performance memory dependence predictor and then apply accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multithreaded processor. We then use predictable data forwarding patterns to eliminate power-hungry hardware in the processor with no loss in performance. Next, we study instruction criticality: we characterize the behavior of critical load instructions and propose applications that can be optimized using predictable load-criticality information. Finally, we examine conventional techniques for allocating and deallocating the critical structures that process memory instructions, and propose new techniques to optimize them. Our designs have the potential to significantly reduce processor power and area without losing performance, leading to more efficient processor designs.
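
    For context, the sketch below implements the classic store-set idea that scalable memory dependence predictors build on: a load and store that have violated a dependence in the past are placed in the same set, and a load then waits on the last fetched store of its set. This is the well-known Chrysos and Emer scheme, offered as background, not the dissertation's own design.

```python
# Store-set memory dependence predictor sketch: the SSIT maps PCs to
# store-set ids, and the LFST tracks the last fetched store of each set.

class StoreSetPredictor:
    def __init__(self):
        self.ssit = {}      # instruction PC -> store set id
        self.lfst = {}      # store set id -> seq. number of last store
        self.next_id = 0

    def on_violation(self, load_pc, store_pc):
        # The load issued before an older store to the same address:
        # put both in one store set so the load waits next time.
        sid = self.ssit.get(store_pc, self.ssit.get(load_pc))
        if sid is None:
            sid, self.next_id = self.next_id, self.next_id + 1
        self.ssit[load_pc] = self.ssit[store_pc] = sid

    def fetch_store(self, pc, seq_num):
        sid = self.ssit.get(pc)
        if sid is not None:
            self.lfst[sid] = seq_num

    def fetch_load(self, pc):
        # Returns the store this load must wait for, or None.
        sid = self.ssit.get(pc)
        return self.lfst.get(sid) if sid is not None else None
```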

    An evaluation of multiple branch predictor and trace cache advanced fetch unit designs for dynamically scheduled superscalar processors

    Semiconductor feature size continues to decrease, permitting superscalar microprocessors to keep increasing the number of functional units available for execution. As the instruction issue width grows beyond the five-instruction average basic block size of integer programs, more than one basic block must be issued per cycle to continue to increase instructions-per-cycle (IPC) performance. Researchers have created methods of fetching instructions beyond the first taken branch to overcome the bottleneck imposed by conventional single-branch predictors. We compare the performance of the multiple branch prediction (MBP) and trace cache (TC) fetch unit optimization methods. Multiple branch predictor fetch units issue multiple basic blocks per cycle using a branch address cache and a multiple branch predictor. A trace cache uses the runtime instruction stream to create fixed-length instruction traces that encapsulate multiple basic blocks. The comparison is performed using a SPARC v8 timing-based simulator: we simulate both advanced fetch methods, execute benchmarks from the SPEC CPU2000 suite, and perform a detailed analysis of both microarchitectures. We find that both fetch unit designs provide competitive IPC performance. As issue width is increased from an eight-way to a sixteen-way superscalar, IPC improves, implying that these fetch unit designs are able to take advantage of the wider issue widths. The MBP can use a smaller L1 instruction cache than the TC and still achieve similar IPC. The pre-arranged instructions provided by the TC allow its pipeline to be shortened relative to the MBP's, and the shorter pipeline significantly improves IPC. Prior trace cache research used two or more instruction cache ports to improve the chances of fetching a full basic block per cycle, at the expense of cache line realignment complexity; our results show good performance with a single instruction cache port. We study approximately equal-cost implementations of the MBP and TC. Of the six benchmarks studied, the TC outperforms the MBP on four.
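
    The sketch below illustrates the trace cache side of the comparison: traces are indexed by their starting PC together with the predicted directions of the embedded branches, so one hit delivers several basic blocks in a single fetch cycle. Capacities and field names are illustrative, not the thesis's configuration.

```python
# Toy trace cache: lines keyed by (start PC, branch directions).

class TraceCache:
    def __init__(self, max_insns=16, max_branches=3):
        self.lines = {}                  # (start_pc, dirs) -> insns
        self.max_insns = max_insns
        self.max_branches = max_branches

    def fill(self, start_pc, branch_dirs, insns):
        # Built from the retired instruction stream at trace completion.
        key = (start_pc, tuple(branch_dirs[:self.max_branches]))
        self.lines[key] = tuple(insns[:self.max_insns])

    def fetch(self, start_pc, predicted_dirs):
        # Hit only if a stored trace matches the multiple-branch
        # predictor's directions for this cycle; a miss returns None.
        return self.lines.get(
            (start_pc, tuple(predicted_dirs[:self.max_branches])))
```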

    Weak-lensing shear measurement with machine learning: teaching artificial neural networks about feature noise

    Cosmic shear is a primary cosmological probe for several present and upcoming surveys investigating dark matter and dark energy, such as Euclid or WFIRST. The probe requires an extremely accurate measurement of the shapes of millions of galaxies based on imaging data. Crucially, the shear measurement must address and compensate for a range of interwoven nuisance effects related to the instrument optics and detector, noise, unknown galaxy morphologies, colors, blending of sources, and selection effects. This paper explores the use of supervised machine learning (ML) as a tool to solve this inverse problem. We present a simple architecture that learns to regress shear point estimates and weights via shallow artificial neural networks. The networks are trained on simulations of the forward observing process, and take combinations of moments of the galaxy images as inputs. A challenging peculiarity of this ML application is the combination of the noisiness of the input features and the requirements on the accuracy of the inverse regression. To address this issue, the proposed training algorithm minimizes bias over multiple realizations of individual source galaxies, reducing the sensitivity to properties of the overall sample of source galaxies. Importantly, an observational selection function of these source galaxies can be straightforwardly taken into account via the weights. We first introduce key aspects of our approach using toy-model simulations, and then demonstrate its potential on images mimicking Euclid data. Finally, we analyze images from the GREAT3 challenge, obtaining competitively low shear biases despite the use of a simple training set. We conclude that the further development of ML approaches is of high interest to meet the stringent requirements on the shear measurement in current and future surveys. A demonstration implementation of our technique is publicly available at https://astro.uni-bonn.de/~mtewes/ml-shear-meas.
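
    The core training trick, minimizing bias across noise realizations rather than per-image error, can be sketched in a few lines. The numpy stand-in below uses a linear model in place of the paper's shallow networks, and all shapes, noise levels, and names are illustrative assumptions.

```python
# Bias-minimizing regression sketch: predictions are averaged over many
# realizations of each true-shear case before being compared to the
# truth, so the model learns to be accurate in the mean, not per image.

import numpy as np

rng = np.random.default_rng(0)
n_cases, n_reas, n_feat = 50, 200, 4

g_true = rng.uniform(-0.1, 0.1, n_cases)                   # true shears
# Features stand in for image moments: noisy functions of the shear.
x = g_true[:, None, None] + 0.2 * rng.normal(size=(n_cases, n_reas, n_feat))

w = np.zeros(n_feat)                                       # linear "network"
for step in range(2000):
    pred = x @ w                                           # (cases, reas)
    bias = pred.mean(axis=1) - g_true                      # per-case bias
    grad = 2 * (bias[:, None, None] * x).mean(axis=(0, 1)) # d<bias^2>/dw
    w -= 0.5 * grad
print("rms bias:", np.sqrt((bias ** 2).mean()))
```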

    Compiler-Assisted Multiple Instruction Rollback Recovery Using a Read Buffer

    Multiple instruction rollback (MIR) is a technique for rapid recovery from transient processor failures that has been implemented in hardware by researchers and in mainframe computers. Hardware-based MIR designs eliminate rollback data hazards by providing data redundancy implemented in hardware. Compiler-based MIR designs have also been developed that remove rollback data hazards directly through data-flow manipulations, eliminating the need for most data redundancy hardware. This work addresses compiler-assisted techniques for achieving multiple instruction rollback recovery. It is observed that some data hazards resulting from instruction rollback can be resolved more efficiently by providing hardware redundancy, while others are resolved more efficiently with compiler transformations. A compiler-assisted multiple instruction rollback scheme is developed that combines hardware-implemented data redundancy with compiler-driven hazard-removal transformations. Experimental performance evaluations indicate improved efficiency over previous hardware-based and compiler-based schemes. Various enhancements to the compiler transformations and to the data redundancy hardware developed for the compiler-assisted MIR scheme are described and evaluated. The final topic is the application of compiler-assisted MIR techniques to aid in exception repair and branch repair in a speculative execution architecture.
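
    One compiler-side hazard-removal idea can be illustrated on toy three-address code: if a register is redefined within N instructions of an earlier use or definition, rolling back up to N instructions would re-read a clobbered value, so the later definition is renamed. The function below is a simplified stand-in for this kind of transformation, not the thesis's exact algorithm.

```python
# Toy rollback-hazard removal by renaming redefinitions within a
# rollback window of n instructions.

def remove_rollback_hazards(code, n):
    """code: list of (dst, src1, src2) register triples."""
    fresh = ("t%d" % i for i in range(len(code)))  # new register names
    out = list(code)
    for i, (dst, _, _) in enumerate(out):
        # Hazard: dst also appears within the previous n instructions,
        # so rolling back could re-read a value this write clobbers.
        if any(dst in instr for instr in out[max(0, i - n):i]):
            new = next(fresh)
            out[i] = (new,) + out[i][1:]
            for j in range(i + 1, len(out)):       # patch later readers
                d, s1, s2 = out[j]
                out[j] = (d, new if s1 == dst else s1,
                             new if s2 == dst else s2)
                if d == dst:                       # redefined again: stop
                    break
    return out

# r1 is redefined one instruction after it is read: a hazard for n >= 1.
print(remove_rollback_hazards([("r1", "r2", "r3"),
                               ("r1", "r1", "r4"),
                               ("r5", "r1", "r1")], n=2))
# -> [('r1','r2','r3'), ('t0','r1','r4'), ('r5','t0','t0')]
```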

    Factorization and Resummation for Groomed Multi-Prong Jet Shapes

    Observables which distinguish boosted topologies from QCD jets are playing an increasingly important role at the Large Hadron Collider (LHC). These observables are often used in conjunction with jet grooming algorithms, which reduce contamination from both theoretical and experimental sources. In this paper we derive factorization formulae for groomed multi-prong substructure observables, focusing in particular on the groomed D_2 observable, which is used to identify boosted hadronic decays of electroweak bosons at the LHC. Our factorization formulae allow systematically improvable calculations of the perturbative D_2 distribution and the resummation of logarithmically enhanced terms in all regions of phase space using renormalization group evolution. They include a novel factorization for the production of a soft subjet in the presence of a grooming algorithm, in which clustering effects enter directly into the hard matching. We use these factorization formulae to draw robust conclusions of experimental relevance regarding the universality of the D_2 distribution in both e+e- and pp collisions. In particular, we show that the only process dependence is carried by the relative quark vs. gluon jet fraction in the sample, no non-global logarithms from event-wide correlations are present in the distribution, hadronization corrections are controlled by the perturbative mass of the jet, and all global color correlations are completely removed by grooming, making groomed D_2 a theoretically clean QCD observable even in the LHC environment. We compute all ingredients to one-loop accuracy, and present numerical results at next-to-leading logarithmic accuracy for e+e- collisions, comparing with parton shower Monte Carlo simulations. Results for pp collisions, as relevant for phenomenology at the LHC, are presented in a companion paper.
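
    Schematically, such factorization formulae take the standard soft-collinear effective theory form sketched below, with each function evaluated at its natural scale and large logarithms resummed by renormalization group evolution. This is a generic schematic only; the paper's actual formulae contain additional functions (for example, for the soft subjet and collinear-soft modes) and a more intricate convolution structure.

```latex
% Schematic factorized cross section: hard (H), collinear jet (J_i),
% and soft (S) functions, each obeying its own renormalization group
% equation whose solution resums the logarithms of D_2.
\[
  \frac{d\sigma}{dD_2} \;\sim\; H(Q,\mu)\,\Big[\prod_i J_i(\mu)\Big]
  \otimes S(D_2,\mu),
  \qquad
  \mu\,\frac{dF}{d\mu} = \gamma_F\, F
  \quad \text{for } F \in \{H, J_i, S\}.
\]
```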