1,383 research outputs found

    Hardware schemes for early register release

    Get PDF
    Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is quite related to the size and number of ports of the register file. In conventional register renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.Peer ReviewedPostprint (published version

    Machine Learning for Microprocessor Performance Bug Localization

    Full text link
    The validation process for microprocessors is a very complex task that consumes substantial engineering time during the design process. Bugs that degrade overall system performance, without affecting its functional correctness, are particularly difficult to debug given the lack of a golden reference for bug-free performance. This work introduces two automated performance bug localization methodologies based on machine learning that aims to aid the debugging process. Our results show that, the evaluated microprocessor core performance bugs whose average IPC impact is greater than 1%, our best-performing technique is able to localize the exact microarchitectural unit of the bug ∼\sim77\% of the time, while achieving a top-3 unit accuracy (out of 11 possible locations) of over 90% for bugs with the same average IPC impact. The proposed system in our simulation setup requires only a few seconds to perform a bug location inference, which leads to a reduced debugging time.Comment: 12 pages, 6 figure

    Limits of a decoupled out-of-order superscalar architecture

    Get PDF

    New Algorithms for High-Throughput Decoding with Low-Density Parity-Check Codes using Fixed-Point SIMD Processors

    Get PDF
    Most digital signal processors contain one or more functional units with a single-instruction, multiple-data architecture that supports saturating fixed-point arithmetic with two or more options for the arithmetic precision. The processors designed for the highest performance contain many such functional units connected through an on-chip network. The selection of the arithmetic precision provides a trade-off between the task-level throughput and the quality of the output of many signal-processing algorithms, and utilization of the interconnection network during execution of the algorithm introduces a latency that can also limit the algorithm\u27s throughput. In this dissertation, we consider the turbo-decoding message-passing algorithm for iterative decoding of low-density parity-check codes and investigate its performance in parallel execution on a processor of interconnected functional units employing fast, low-precision fixed-point arithmetic. It is shown that the frequent occurrence of saturation when 8-bit signed arithmetic is used severely degrades the performance of the algorithm compared with decoding using higher-precision arithmetic. A technique of limiting the magnitude of certain intermediate variables of the algorithm, the extrinsic values, is proposed and shown to eliminate most occurrences of saturation, resulting in performance with 8-bit decoding nearly equal to that achieved with higher-precision decoding. We show that the interconnection latency can have a significant detrimental effect of the throughput of the turbo-decoding message-passing algorithm, which is illustrated for a type of high-performance digital signal processor known as a stream processor. Two alternatives to the standard schedule of message-passing and parity-check operations are proposed for the algorithm. Both alternatives markedly reduce the interconnection latency, and both result in substantially greater throughput than the standard schedule with no increase in the probability of error

    Efficient design space exploration of embedded microprocessors

    Get PDF
    • …
    corecore