503 research outputs found

    Leveraging register windows to reduce physical registers to the bare minimum

    Get PDF
    Register window is an architectural technique that reduces memory operations required to save and restore registers across procedure calls. Its effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution because it requires registers for the in-flight instructions, in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper, we propose a software/hardware early register release technique that leverage register windows to drastically reduce the register requirements, and hence, reduce the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.Peer ReviewedPostprint (published version

    Early register release for out-of-order processors with register windows

    Get PDF
    Register windows is an architectural technique that reduces memory operations required to save and restore registers across procedure calls. Its effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution because it requires registers for the in-flight instructions, in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper we propose two early register release techniques that leverages register windows to drastically reduce the register requirements, and hence reduce the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the same number as logical registers plus one (minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.Peer ReviewedPostprint (published version

    Predicated execution and register windows for out-of-order processors

    Get PDF
    ISA extensions are a very powerful approach to implement new hardware techniques that require or benefit from compiler support: decisions made at compile time can be complemented at runtime, achieving a synergistic effect between the compiler and the processor. This thesis is focused on two ISA extensions: predicate execution and register windows. Predicate execution is exploited by the if-conversion compiler technique. If-conversion removes control dependences by transforming them to data dependences, which helps to exploit ILP beyond a single basic-block. Register windows help to reduce the amount of loads and stores required to save and restore registers across procedure calls by storing multiple contexts into a large architectural register file.In-order processors specially benefit from using both ISA extensions to overcome the limitations that control dependences and memory hierarchy impose on static scheduling. Predicate execution allows to move control dependence instructions past branches. Register windows reduce the amount of memory operations across procedure calls. Although if-conversion and register windows techniques have not been exclusively developed for in-order processors, their use for out-of-order processors has been studied very little. In this thesis we show that the uses of if-conversion and register windows introduce new performance opportunities and new challenges to face in out-of-order processors.The use of if-conversion in out-of-order processors helps to eliminate hard-to-predict branches, alleviating the severe performance penalties caused by branch mispredictions. However, the removal of some conditional branches by if-conversion may adversely affect the predictability of the remaining branches, because it may reduce the amount of correlation information available to the branch predictor. Moreover, predicate execution in out-of-order processors has to deal with two performance issues. First, multiple definitions of the same logical register can be merged into a single control flow, where each definition is guarded with a different predicate. Second, instructions whose guarding predicate evaluates to false consume unnecessary resources. This thesis proposes a branch prediction scheme based on predicate prediction that solves the three problems mentioned above. This scheme, which is built on top of a predicated ISA that implement a compare-and-branch model such as the one considered in this thesis, has two advantages: First, the branch accuracy is improved because the correlation information is not lost after if-conversion and the mechanism we propose permits using the computed value of the branch predicate when available, achieving 100% of accuracy. Second it avoids the predicate out-of-order execution problems.Regarding register windows, we propose a mechanism that reduces physical register requirements of an out-of-order processor to the bare minimum with almost no performance loss. The mechanism is based on identifying which architectural registers are in use by current in-flight instructions. The registers which are not in use, i.e. there is no in-flight instruction that references them, can be early released.In this thesis we propose a very efficient and low-cost hardware implementation of predicate execution and register windows that provide important benefits to out-of-order processors

    Towards Certification-aware Fault Injection Methodologies Using Virtual Prototypes

    Get PDF
    © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Safety-critical applications are required today to meet more and more stringent standards than ever. In the need of reducing the costs associated with the certification step, early robustness evaluation can provide valuable information, as long as it is fast and accurate enough. Microarchitectural simulators have been employed for testing reliability properties in several domains in the past, but their use in the process of robustness verification of safety critical systems has not been validated yet, as opposed to RTL or gate-level simulations. In the present work, we propose a methodology to improve the accuracy of faultinjection results when targeting robustness verification, by using microarchitectural simulators and virtual prototypes for an early estimation of deviations with respect to the certification standards.The research leading to these results has received funding from the Ministry of Science and Technology of Spain under contract TIN2012-34557 and HiPEAC. Likewise, Jaume Abella is partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.Espinosa García, J.; Andrés Martínez, DD.; Ruiz García, JC.; Hernández Luz, C.; Abella, J. (2015). Towards Certification-aware Fault Injection Methodologies Using Virtual Prototypes. IEEE Conference Publications. http://hdl.handle.net/10251/65831

    A Low-Power DSP Architecture for a Fully Implantable Cochlear Implant System-on-a-Chip.

    Full text link
    The National Science Foundation Wireless Integrated Microsystems (WIMS) Engineering Research Center at the University of Michigan developed Systems-on-a-Chip to achieve biomedical implant and environmental monitoring functionality in low-milliwatt power consumption and 1-2 cm3 volume. The focus of this work is implantable electronics for cochlear implants (CIs), surgically implanted devices that utilize existing nerve connections between the brain and inner-ear in cases where degradation of the sensory hair cells in the cochlea has occurred. In the absence of functioning hair cells, a CI processes sound information and stimulates the nderlying nerve cells with currents from implanted electrodes, enabling the patient to understand speech. As the brain of the WIMS CI, the WIMS microcontroller unit (MCU) delivers the communication, signal processing, and storage capabilities required to satisfy the aggressive goals set forth. The 16-bit MCU implements a custom instruction set architecture focusing on power-efficient execution by providing separate data and address register windows, multi-word arithmetic, eight addressing modes, and interrupt and subroutine support. Along with 32KB of on-chip SRAM, a low-power 512-byte scratchpad memory is utilized by the WIMS custom compiler to obtain an average of 18% energy savings across benchmarks. A synthesizable dynamic frequency scaling circuit allows the chip to select a precision on-chip LC or ring oscillator, and perform clock scaling to minimize power dissipation; it provides glitch-free, software-controlled frequency shifting in 100ns, and dissipates only 480ÎĽW. A highly flexible and expandable 16-channel Continuous Interleaved Sampling Digital Signal Processor (DSP) is included as an MCU peripheral component. Modes are included to process data, stimulate through electrodes, and allow experimental stimulation or processing. The entire WIMS MCU occupies 9.18mm2 and consumes only 1.79mW from 1.2V in DSP mode. This is the lowest reported consumption for a cochlear DSP. Design methodologies were analyzed and a new top-down design flow is presented that encourages hardware and software co-design as well as cross-domain verification early in the design process. An O(n) technique for energy-per-instruction estimations both pre- and post-silicon is presented that achieves less than 4% error across benchmarks. This dissertation advances low-power system design while providing an improvement in hearing recovery devices.Ph.D.Electrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/91488/1/emarsman_1.pd

    The Performance Cost of Security

    Get PDF
    Historically, performance has been the most important feature when optimizing computer hardware. Modern processors are so highly optimized that every cycle of computation time matters. However, this practice of optimizing for performance at all costs has been called into question by new microarchitectural attacks, e.g. Meltdown and Spectre. Microarchitectural attacks exploit the effects of microarchitectural components or optimizations in order to leak data to an attacker. These attacks have caused processor manufacturers to introduce performance impacting mitigations in both software and silicon. To investigate the performance impact of the various mitigations, a test suite of forty-seven different tests was created. This suite was run on a series of virtual machines that tested both Ubuntu 16 and Ubuntu 18. These tests investigated the performance change across version updates and the performance impact of CPU core number vs. default microarchitectural mitigations. The testing proved that the performance impact of the microarchitectural mitigations is non-trivial, as the percent difference in performance can be as high as 200%

    High Level Synthesis and Evaluation of an Automotive RADAR Signal Processing algorithm for FPGAs

    Get PDF
    High Level Synthesis (HLS) is a technology used to design and develop hardware (HW) using high-level languages such as C/C++. An HLS model of an automotive RADAR signal processing algorithm has been developed for the purpose of comparison between the HLS model and the existing HDL model. Register Transfer Level (RTL) programming is a technology used to design and develop hardware at the register transfer level (or low level) using Hardware description languages such as Verilog and VHDL. FPGA development usually requires the knowledge of RTL technologies. HLS gives software (SW) developers the ability to design and implement their designs on an FPGA without requiring the knowledge of RTL technologies and HDL. Even though HLS is currently gaining popularity, the applications used to evaluate HLS tend to remain small. We synthesize an automotive RADAR signal processing system using HLS-based design methodology, which has mid to high complexity, and compare our synthesis results to that of the RTL-based design. We used many techniques used to make the high-level program model ready for synthesis while optimizing for both speed and resource usage using Xilinx Vivado HLS Computer-Aided Design (CAD) tool. We achieved a speed up of 2X compared to the RTL-based design while reducing the design time from approximately 16 weeks to 6 weeks. The FPGA resource utilization increased but it was still under 5% of the total resources available on the FPGA

    Accelerated coplanar facet radio synthesis imaging

    Get PDF
    Imaging in radio astronomy entails the Fourier inversion of the relation between the sampled spatial coherence of an electromagnetic field and the intensity of its emitting source. This inversion is normally computed by performing a convolutional resampling step and applying the Inverse Fast Fourier Transform, because this leads to computational savings. Unfortunately, the resulting planar approximation of the sky is only valid over small regions. When imaging over wider fields of view, and in particular using telescope arrays with long non-East-West components, significant distortions are introduced in the computed image. We propose a coplanar faceting algorithm, where the sky is split up into many smaller images. Each of these narrow-field images are further corrected using a phase-correcting tech- nique known as w-projection. This eliminates the projection error along the edges of the facets and ensures approximate coplanarity. The combination of faceting and w-projection approaches alleviates the memory constraints of previous w-projection implementations. We compared the scaling performance of both single and double precision resampled images in both an optimized multi-threaded CPU implementation and a GPU implementation that uses a memory-access- limiting work distribution strategy. We found that such a w-faceting approach scales slightly better than a traditional w-projection approach on GPUs. We also found that double precision resampling on GPUs is about 71% slower than its single precision counterpart, making double precision resampling on GPUs less power efficient than CPU-based double precision resampling. Lastly, we have seen that employing only single precision in the resampling summations produces significant error in continuum images for a MeerKAT-sized array over long observations, especially when employing the large convolution filters necessary to create large images
    • …
    corecore