140 research outputs found

    Within-Die Delay Variation Measurement And Analysis For Emerging Technologies Using An Embedded Test Structure

    Get PDF
    Both random and systematic within-die process variations (PV) are growing more severe with shrinking geometries and increasing die size. Escalation in the variations in delay and power with reductions in feature size places higher demands on the accuracy of variation models. Their availability can be used to improve yield, and the corresponding profitability and product quality of the fabricated integrated circuits (ICs). Sources of within-die variations include optical source limitations, and layout-based systematic effects (pitch, line-width variability, and microscopic etch loading). Unfortunately, accurate models of within-die PVs are becoming more difficult to derive because of their increasingly sensitivity to design-context. Embedded test structures (ETS) continue to play an important role in the development of models of PVs and as a mechanism to improve correlations between hardware and models. Variations in path delays are increasing with scaling, and are increasingly affected by neighborhood\u27 interactions. In order to fully characterize within-die variations, delays must be measured in the context of actual core-logic macros. Doing so requires the use of an embedded test structure, as opposed to traditional scribe line test structures such as ring oscillators (RO). Accurate measurements of within-die variations can be used, e.g., to better tune models to actual hardware (model-to-hardware correlations). In this research project, I propose an embedded test structure called REBEL (Regional dELay BEhavior) that is designed to measure path delays in a minimally invasive fashion; and its architecture measures the path delays more accurately. Design for manufacture-ability (DFM) analysis is done on the on 90 nm ASIC chips and 28nm Zynq 7000 series FPGA boards. I present ASIC results on within-die path delay variations in a floating-point unit (FPU) fabricated in IBM\u27s 90 nm technology, with 5 pipeline stages, used as a test vehicle in chip experiments carried out at nine different temperature/voltage (TV) corners. Also experimental data has been analyzed for path delay variations in short vs long paths. FPGA results on within-die variation and die-to-die variations on Advanced Encryption System (AES) using single pipelined stage are also presented. Other analysis that have been performed on the calibrated path delays are Flip Flop propagation delays for both rising and falling edge (tpHL and tpLH), uncertainty analysis, path distribution analysis, short versus long path variations and mid-length path within-die variation. I also analyze the impact on delay when the chips are subjected to industrial-level temperature and voltage variations. From the experimental results, it has been established that the proposed REBEL provides capabilities similar to an off-chip logic analyzer, i.e., it is able to capture the temporal behavior of the signal over time, including any static and dynamic hazards that may occur on the tested path. The ASIC results further show that path delays are correlated to the launch-capture (LC) interval used to time them. Therefore, calibration as proposed in this work must be carried out in order to obtain an accurate analysis of within-die variations. Results on ASIC chips show that short paths can vary up to 35% on average, while long paths vary up to 20% at nominal temperature and voltage. A similar trend occurs for within-die variations of mid-length paths where magnitudes reduced to 20% and 5%, respectively. The magnitude of delay variations in both these analyses increase as temperature and voltage are changed to increase performance. The high level of within-die delay variations are undesirable from a design perspective, but they represent a rich source of entropy for applications that make use of \u27secrets\u27 such as authentication, hardware metering and encryption. Physical unclonable functions (PUFs) are a class of primitives that leverage within-die-variations as a means of generating random bit strings for these types of applications, including hardware security and trust. Zynq FPGAs Die-to-Die and within-die variation study shows that on average there is 5% of within-Die variation and the range of die-to-Die variation can go upto 3ns. The die-to-Die variations can be explored in much further detail to study the variations spatial dependance. Additionally, I also carried out research in the area data mining to cater for big data by focusing the work on decision tree classification (DTC) to speed-up the classification step in hardware implementation. For this purpose, I devised a pipelined architecture for the implementation of axis parallel binary decision tree classification for meeting up with the requirements of execution time and minimal resource usage in terms of area. The motivation for this work is that analyzing larger data-sets have created abundant opportunities for algorithmic and architectural developments, and data-mining innovations, thus creating a great demand for faster execution of these algorithms, leading towards improving execution time and resource utilization. Decision trees (DT) have since been implemented in software programs. Though, the software implementation of DTC is highly accurate, the execution times and the resource utilization still require improvement to meet the computational demands in the ever growing industry. On the other hand, hardware implementation of DT has not been thoroughly investigated or reported in detail. Therefore, I propose a hardware acceleration of pipelined architecture that incorporates the parallel approach in acquiring the data by having parallel engines working on different partitions of data independently. Also, each engine is processing the data in a pipelined fashion to utilize the resources more efficiently and reduce the time for processing all the data records/tuples. Experimental results show that our proposed hardware acceleration of classification algorithms has increased throughput, by reducing the number of clock cycles required to process the data and generate the results, and it requires minimal resources hence it is area efficient. This architecture also enables algorithms to scale with increasingly large and complex data sets. We developed the DTC algorithm in detail and explored techniques for adapting it to a hardware implementation successfully. This system is 3.5 times faster than the existing hardware implementation of classification.\u2

    Analysis of performance variation in 16nm FinFET FPGA devices

    Get PDF

    Asynchronous techniques for new generation variation-tolerant FPGA

    Get PDF
    PhD ThesisThis thesis presents a practical scenario for asynchronous logic implementation that would benefit the modern Field-Programmable Gate Arrays (FPGAs) technology in improving reliability. A method based on Asynchronously-Assisted Logic (AAL) blocks is proposed here in order to provide the right degree of variation tolerance, preserve as much of the traditional FPGAs structure as possible, and make use of asynchrony only when necessary or beneficial for functionality. The newly proposed AAL introduces extra underlying hard-blocks that support asynchronous interaction only when needed and at minimum overhead. This has the potential to avoid the obstacles to the progress of asynchronous designs, particularly in terms of area and power overheads. The proposed approach provides a solution that is complementary to existing variation tolerance techniques such as the late-binding technique, but improves the reliability of the system as well as reducing the designā€™s margin headroom when implemented on programmable logic devices (PLDs) or FPGAs. The proposed method suggests the deployment of configurable AAL blocks to reinforce only the variation-critical paths (VCPs) with the help of variation maps, rather than re-mapping and re-routing. The layout level results for this method's worst case increase in the CLBā€™s overall size only of 6.3%. The proposed strategy retains the structure of the global interconnect resources that occupy the lionā€™s share of the modern FPGAā€™s soft fabric, and yet permits the dual-rail iv completion-detection (DR-CD) protocol without the need to globally double the interconnect resources. Simulation results of global and interconnect voltage variations demonstrate the robustness of the method

    A self-timed multipurpose delay sensor for field programmable gate arrays (FPGAs)

    Get PDF
    This paper presents a novel self-timed multi-purpose sensor especially conceived for Field Programmable Gate Arrays (FPGAs). The aim of the sensor is to measure performance variations during the life-cycle of the device, such as process variability, critical path timing and temperature variations. The proposed topology, through the use of both combinational and sequential FPGA elements, amplifies the time of a signal traversing a delay chain to produce a pulse whose width is the sensorā€™s measurement. The sensor is fully self-timed, avoiding the need for clock distribution networks and eliminating the limitations imposed by the system clock. One single off- or on-chip time-to-digital converter is able to perform digitization of several sensors in a single operation. These features allow for a simplified approach for designers wanting to intertwine a multi-purpose sensor network with their application logic. Employed as a temperature sensor, it has been measured to have an error of Ā±0.67 Ā°C, over the range of 20ā€“100 Ā°C, employing 20 logic elements with a 2-point calibration

    Delay Measurements and Self Characterisation on FPGAs

    No full text
    This thesis examines new timing measurement methods for self delay characterisation of Field-Programmable Gate Arrays (FPGAs) components and delay measurement of complex circuits on FPGAs. Two novel measurement techniques based on analysis of a circuit's output failure rate and transition probability is proposed for accurate, precise and efficient measurement of propagation delays. The transition probability based method is especially attractive, since it requires no modifications in the circuit-under-test and requires little hardware resources, making it an ideal method for physical delay analysis of FPGA circuits. The relentless advancements in process technology has led to smaller and denser transistors in integrated circuits. While FPGA users benefit from this in terms of increased hardware resources for more complex designs, the actual productivity with FPGA in terms of timing performance (operating frequency, latency and throughput) has lagged behind the potential improvements from the improved technology due to delay variability in FPGA components and the inaccuracy of timing models used in FPGA timing analysis. The ability to measure delay of any arbitrary circuit on FPGA offers many opportunities for on-chip characterisation and physical timing analysis, allowing delay variability to be accurately tracked and variation-aware optimisations to be developed, reducing the productivity gap observed in today's FPGA designs. The measurement techniques are developed into complete self measurement and characterisation platforms in this thesis, demonstrating their practical uses in actual FPGA hardware for cross-chip delay characterisation and accurate delay measurement of both complex combinatorial and sequential circuits, further reinforcing their positions in solving the delay variability problem in FPGAs

    A chip multiprocessor for a large-scale neural simulator

    Get PDF

    Predicting power scalability in a reconfigurable platform

    Get PDF
    This thesis focuses on the evolution of digital hardware systems. A reconfigurable platform is proposed and analysed based on thin-body, fully-depleted silicon-on-insulator Schottky-barrier transistors with metal gates and silicide source/drain (TBFDSBSOI). These offer the potential for simplified processing that will allow them to reach ultimate nanoscale gate dimensions. Technology CAD was used to show that the threshold voltage in TBFDSBSOI devices will be controllable by gate potentials that scale down with the channel dimensions while remaining within appropriate gate reliability limits. SPICE simulations determined that the magnitude of the threshold shift predicted by TCAD software would be sufficient to control the logic configuration of a simple, regular array of these TBFDSBSOI transistors as well as to constrain its overall subthreshold power growth. Using these devices, a reconfigurable platform is proposed based on a regular 6-input, 6-output NOR LUT block in which the logic and configuration functions of the array are mapped onto separate gates of the double-gate device. A new analytic model of the relationship between power (P), area (A) and performance (T) has been developed based on a simple VLSI complexity metric of the form ATĻƒ = constant. As Ļƒ defines the performance ā€œreturnā€ gained as a result of an increase in area, it also represents a bound on the architectural options available in power-scalable digital systems. This analytic model was used to determine that simple computing functions mapped to the reconfigurable platform will exhibit continuous power-area-performance scaling behavior. A number of simple arithmetic circuits were mapped to the array and their delay and subthreshold leakage analysed over a representative range of supply and threshold voltages, thus determining a worse-case range for the device/circuit-level parameters of the model. Finally, an architectural simulation was built in VHDL-AMS. The frequency scaling described by Ļƒ, combined with the device/circuit-level parameters predicts the overall power and performance scaling of parallel architectures mapped to the array

    A Hilbert-Curve Based Delay Fault Characerization Framework for Fpgas

    Get PDF
    Master'sMASTER OF ENGINEERIN

    Timing speculation and adaptive reliable overclocking techniques for aggressive computer systems

    Get PDF
    Computers have changed our lives beyond our own imagination in the past several decades. The continued and progressive advancements in VLSI technology and numerous micro-architectural innovations have played a key role in the design of spectacular low-cost high performance computing systems that have become omnipresent in today\u27s technology driven world. Performance and dependability have become key concerns as these ubiquitous computing machines continue to drive our everyday life. Every application has unique demands, as they run in diverse operating environments. Dependable, aggressive and adaptive systems improve efficiency in terms of speed, reliability and energy consumption. Traditional computing systems run at a fixed clock frequency, which is determined by taking into account the worst-case timing paths, operating conditions, and process variations. Timing speculation based reliable overclocking advocates going beyond worst-case limits to achieve best performance while not avoiding, but detecting and correcting a modest number of timing errors. The success of this design methodology relies on the fact that timing critical paths are rarely exercised in a design, and typical execution happens much faster than the timing requirements dictated by worst-case design methodology. Better-than-worst-case design methodology is advocated by several recent research pursuits, which exploit dependability techniques to enhance computer system performance. In this dissertation, we address different aspects of timing speculation based adaptive reliable overclocking schemes, and evaluate their role in the design of low-cost, high performance, energy efficient and dependable systems. We visualize various control knobs in the design that can be favorably controlled to ensure different design targets. As part of this research, we extend the SPRIT3E, or Superscalar PeRformance Improvement Through Tolerating Timing Errors, framework, and characterize the extent of application dependent performance acceleration achievable in superscalar processors by scrutinizing the various parameters that impact the operation beyond worst-case limits. We study the limitations imposed by short-path constraints on our technique, and present ways to exploit them to maximize performance gains. We analyze the sensitivity of our technique\u27s adaptiveness by exploring the necessary hardware requirements for dynamic overclocking schemes. Experimental analysis based on SPEC2000 benchmarks running on a SimpleScalar Alpha processor simulator, augmented with error rate data obtained from hardware simulations of a superscalar processor, are presented. Even though reliable overclocking guarantees functional correctness, it leads to higher power consumption. As a consequence, reliable overclocking without considering on-chip temperatures will bring down the lifetime reliability of the chip. In this thesis, we analyze how reliable overclocking impacts the on-chip temperature of a microprocessor and evaluate the effects of overheating, due to such reliable dynamic frequency tuning mechanisms, on the lifetime reliability of these systems. We then evaluate the effect of performing thermal throttling, a technique that clamps the on-chip temperature below a predefined value, on system performance and reliability. Our study shows that a reliably overclocked system with dynamic thermal management achieves 25% performance improvement, while lasting for 14 years when being operated within 353K. Over the past five decades, technology scaling, as predicted by Moore\u27s law, has been the bedrock of semiconductor technology evolution. The continued downscaling of CMOS technology to deep sub-micron gate lengths has been the primary reason for its dominance in today\u27s omnipresent silicon microchips. Even as the transition to the next technology node is indispensable, the initial cost and time associated in doing so presents a non-level playing field for the competitors in the semiconductor business. As part of this thesis, we evaluate the capability of speculative reliable overclocking mechanisms to maximize performance at a given technology level. We evaluate its competitiveness when compared to technology scaling, in terms of performance, power consumption, energy and energy delay product. We present a comprehensive comparison for integer and floating point SPEC2000 benchmarks running on a simulated Alpha processor at three different technology nodes in normal and enhanced modes. Our results suggest that adopting reliable overclocking strategies will help skip a technology node altogether, or be competitive in the market, while porting to the next technology node. Reliability has become a serious concern as systems embrace nanometer technologies. In this dissertation, we propose a novel fault tolerant aggressive system that combines soft error protection and timing error tolerance. We replicate both the pipeline registers and the pipeline stage combinational logic. The replicated logic receives its inputs from the primary pipeline registers while writing its output to the replicated pipeline registers. The organization of redundancy in the proposed Conjoined Pipeline system supports overclocking, provides concurrent error detection and recovery capability for soft errors, intermittent faults and timing errors, and flags permanent silicon defects. The fast recovery process requires no checkpointing and takes three cycles. Back annotated post-layout gate-level timing simulations, using 45nm technology, of a conjoined two-stage arithmetic pipeline and a conjoined five-stage DLX pipeline processor, with forwarding logic, show that our approach, even under a severe fault injection campaign, achieves near 100% fault coverage and an average performance improvement of about 20%, when dynamically overclocked
    • ā€¦
    corecore