239 research outputs found

    Fault- and Yield-Aware On-Chip Memory Design and Management

    Ever-decreasing device sizes cause more frequent hard faults, which are becoming a serious burden on processor design and yield management. This problem is particularly pronounced in on-chip memory, which consumes up to 70% of a processor's total chip area. Traditional circuit-level techniques, such as redundancy and error-correcting codes, become less effective in error-prevalent environments because of their large area overhead. In this work, we propose an architectural solution for building reliable on-chip memory in future processor environments. Our approach has two parts: a design framework and architectural techniques for on-chip memory structures. The design framework provides important architectural evaluation metrics such as yield, area, and performance, based on low-level defect and process-variation parameters. Processor architects can use it to quickly evaluate their designs in terms of yield, area, and performance. With the framework, we develop architectural yield-enhancement solutions for on-chip memory structures including the L1 cache, L2 cache, and directory memory. Our proposed solutions greatly improve yield with negligible area and performance overhead. Furthermore, we develop a decoupled yield model of compute cores and L2 caches in CMPs, which shows that there will be many more L2 caches than compute cores on a chip. We propose efficient utilization techniques for the excess caches. Evaluation results show that excess caches significantly improve the overall performance of CMPs.

    New hardware support transactional memory and parallel debugging in multicore processors

    This thesis contributes to the area of hardware support for parallel programming by introducing new hardware elements in multicore processors, with the aim of improving the performance of, and optimizing, new tools, abstractions, and applications related to parallel programming, such as transactional memory and data race detectors. Specifically, we configure a hardware transactional memory system that uses signatures as part of its hardware support, and we develop a new hardware filter for reducing the signature size. We also develop the first hardware asymmetric data race detector (which is also able to tolerate data races), likewise based on hardware signatures. Finally, we propose a new hardware signature module that solves some of the problems we found in the previous tools, which stem from the lack of flexibility in hardware signatures.

    A low-cost high-speed twin-prefetching DSP-based shared-memory system for real-time image processing applications

    This dissertation introduces, investigates, and evaluates a low-cost high-speed twin-prefetching DSP-based bus-interconnected shared-memory system for real-time image processing applications. The proposed architecture can effectively support 32 DSPs, in contrast to the maximum of 4 DSPs supported by existing DSP-based bus-interconnected systems. This significant enhancement is achieved by introducing two small programmable fast memories (Twins) between the processor and the shared bus interconnect. While one memory is transferring data from/to the shared memory, the other is supplying the core processor with data. Eliminating the traditional direct link between the shared bus and the processor data bus makes a wider shared bus feasible, i.e., the shared bus width becomes independent of the data bus width of the processors. The fast prefetching memories and the wider shared bus provide additional bus bandwidth to the system, which eliminates large memory latencies; such latencies are the major performance drawback of shared-memory multiprocessors. Furthermore, in contrast to existing DSP-based uniprocessor or multiprocessor systems, the proposed architecture does not require all data to be placed in expensive fast on-chip or off-chip memory in order to reach or maintain peak performance. It can also maintain peak performance regardless of whether the processed image is small or large. The performance of the proposed architecture has been extensively investigated by executing computationally intensive applications such as real-time high-resolution image processing. The effect of a wide variety of hardware design parameters on performance has been examined. More specifically, tables and graphs comprehensively analyze the performance of 1, 2, 4, 8, 16, 32 and 64 DSP-based systems, for a wide variety of shared data interconnect widths such as 32, 64, 128, 256 and 512. In addition, the effect of the wide variance of temporal and spatial locality (present in different applications) on the multiprocessor's execution time is investigated and analyzed. Finally, the prefetching cache size was varied from a few kilobytes to 4 Mbytes and the corresponding effect on the execution time was investigated. Our performance analysis has clearly shown that the execution time converges to a shallow minimum, i.e., it is not sensitive to the size of the prefetching cache. The significance of this observation is that near-optimum performance can be achieved with a small (16 to 300 Kbytes) amount of prefetching cache.
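The alternation between the two Twin memories described in this abstract resembles classic double (ping-pong) buffering. The sketch below is only an illustration of that general idea under stated assumptions, not the dissertation's design: `process_with_twin_buffers`, `block_size`, and `compute` are hypothetical names, and the two steps that hardware would overlap run sequentially here.

```python
def process_with_twin_buffers(shared_memory, block_size, compute):
    """Process shared_memory block by block using two alternating buffers.

    Hypothetical software model: one buffer feeds `compute` while the
    other is refilled with the next block from shared memory.
    """
    blocks = [shared_memory[i:i + block_size]
              for i in range(0, len(shared_memory), block_size)]
    if not blocks:
        return []

    twins = [None, None]             # the two small fast memories
    active = 0                       # index of the buffer feeding the processor
    twins[active] = list(blocks[0])  # initial fill before computation starts

    results = []
    for k in range(len(blocks)):
        spare = 1 - active
        # In hardware this transfer would overlap with the compute step below;
        # in this sequential model the two steps simply run one after another.
        if k + 1 < len(blocks):
            twins[spare] = list(blocks[k + 1])
        results.append(compute(twins[active]))
        active = spare               # swap roles for the next block
    return results
```

For example, `process_with_twin_buffers(list(range(10)), 4, sum)` processes the blocks `[0, 1, 2, 3]`, `[4, 5, 6, 7]`, and `[8, 9]` in turn. The processor only ever reads from a buffer that was filled earlier, which is why the buffer size matters less than keeping the next block ready in time.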

    Managing Soft-errors in Transactional Systems

    Multicore architectures are becoming increasingly prone to soft errors, i.e., transient faults caused by external physical phenomena such as electric noise and cosmic particle strikes. With increasing core counts, the soft-error rate is growing due to the accelerating transistor density on chips. The impact of these errors on business-critical applications that are being deployed on multicore hardware can be significant. We present an active replication-based approach that fully masks such errors for transactional applications. We partition computational cores, fully replicate objects across partitions, and concurrently execute transactional requests on all partitions, thereby enabling completely local object accesses. Transactional requests are globally ordered and delivered across partitions using optimistic atomic broadcast. Hardware message passing, an important emerging trend in multicore architectures, is exploited to mitigate communication costs. We report preliminary results obtained with an implementation of our approach on 36-core Tilera TILE-Gx hardware with an on-chip scalable mesh network.
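The core idea of active replication, that every partition applies the same globally ordered requests to its own full copy of the objects, can be sketched as follows. This is a minimal illustration and not the paper's implementation: the `Replica` class, the `(key, delta)` request format, and the majority vote used to pick the reply are all assumptions introduced here, and the optimistic atomic broadcast is replaced by a plain pre-ordered list.

```python
from collections import Counter


class Replica:
    """Hypothetical partition holding a full local copy of all objects."""

    def __init__(self):
        self.store = {}

    def apply(self, request):
        # A request is assumed to be (key, delta): add delta to a counter.
        key, delta = request
        self.store[key] = self.store.get(key, 0) + delta
        return self.store[key]


def execute_replicated(replicas, ordered_requests):
    """Apply each request to every replica in the same order.

    The reply is chosen by majority vote (an assumption of this sketch),
    so a single replica with a corrupted value does not affect the output.
    """
    outputs = []
    for request in ordered_requests:
        results = [replica.apply(request) for replica in replicas]
        value, votes = Counter(results).most_common(1)[0]
        if votes <= len(replicas) // 2:
            raise RuntimeError("no majority among replica results")
        outputs.append(value)
    return outputs
```

With three replicas, corrupting one replica's copy of an object between requests (a stand-in for a transient fault) still yields the correct reply, because the other two replicas agree. All object accesses stay local to a replica; only the request ordering needs communication.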