29 research outputs found

    Near-optimal precharging in high-performance nanoscale CMOS caches

    Get PDF
    High-performance caches statically pull up the bit-lines in all cache subarrays to optimize cache access latency. Unfortunately, such architecture results in a significant waste of energy in nanoscale CMOS implementations due to high leakage and bitline discharge in the unaccessed subarrays. Recent research advocates bitline isolation to control precharging of individual subarrays using bitline precharge devices. In this paper, we carefully evaluate the energy and performance trade-offs of bitline isolation, and propose a technique to exploit nearly its full potential to eliminate discharge and reduce overall energy in level-one caches. Cycle-accurate and circuit simulation results of a wide-issue superscalar processor indicate that: 1) in future CMOS technologies (e.g., 70 nm and beyond), cache architectures that exploit bitline isolation can eliminate up to 90% of the bitline discharge; 2) on- demand precharging (i.e., decoding the address and subsequently precharging the accessed subarrays) is not viable in level-one caches because precharging increases the cache access latency; and 3) our proposal for gated precharging to exploit subarray reference locality and precharging only the recently accessed subarrays eliminates nearly all of bitline discharge in nanoscale CMOS caches with only a 1% of performance degradatio

    Profiling I/O interrupts in modern architectures

    Get PDF
    Journal ArticleAs applications grow increasingly communication-oriented, interrupt performance quickly becomes a crucial component of high performance I/O system design. At the same time, accurately measuring interrupt handler performance is difficult with the traditional simulation, instrumentation, or statistical sampling approaches. One o f the most important components o f interrupt performance is cache behavior. This paper presents a portable method for measuring the cache effects o f I/O interrupt handling using native hardware performance counters. To provide a portability stress test, the method is demonstrated on two commercial platforms with different architectures, the SGI Origin 200 and the Sun LJltra-1. This case study uses the methodology to measure the overhead of the two most common forms o f interrupt traffic: disk and network interrupts. The study demonstrates that the method works well and is reasonably robust. In addition, the results show that disk interrupts behave similar on both platforms, while differences in OS organization cause network interrupts to behave very differently. Furthermore, network interrupts exhibit significantly larger cache footprints.

    Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling

    Get PDF
    Journal ArticleAs clock frequency increases and feature size decreases, clock distribution and wire delays present a growing challenge to the designers of singly-clocked, globally synchronous systems. We describe an alternative approach, which we call a Multiple Clock Domain (MCD) processor in which the chip is divided into several (coarse-grained) clock domains, within which independent voltage and frequency scaling can be performed. Boundaries between domains are chosen to exploit existing queues, thereby minimizing inter-domain synchronization costs. We propose four clock domains, corresponding to the front end (including LI instruction cache), integer units, floating point units, and load-store units (including Ll data cache and L2 cache). We evaluate this design using a simulation infrastructure based on SimpleScalar and Wattch. In an attempt to quantify potential energy savings independent of any particular on-line control strategy, we use of-line analysis of traces from a single-speed run of each of our benchmark applications to identify profitable reconfiguration points for a subsequent dynamic scaling run. Dynamic runs incorporate a detailed model of inter-domain synchronization delays, with latencies for intra-domain scaling similar to the whole-chip scaling latencies of Intel XScale and Transmeta LongRun technologies. Using applications from the MediaBench, Olden, and SPEC2000 benchmark suites, we obtain an average energy-delay product improvement of 20% with MCD compared to a modest 3% savings from voltage scaling a single clock and voltage system

    Assessment of data rates on the internal and external CPU interfaces and its applications for Wireless Network-on-Chip development

    Get PDF
    Nowadays central processing units (CPUs) are the major part of the personal computers, and usually their progress defines personal computers (PCs) progress. However, modern CPU architecture has a set of limitations mentioned in this thesis. As a result, new CPU architectures are now under development. Most prospective solution in this field are based on a proposed concept of Wireless Networks-on-Chip (WNoCs), where part of wired connections is changed into wireless links. However in order to design and develop this kind of system, information about data rates on the internal and external CPU interfaces of modern CPUs is needed. Main goals set in the beginning of working on this thesis were to get this data rates assessment and give an assessment of suitable wireless technologies for milticore CPUs with different number of cores. In this thesis CPU evolution is described and peculiarities of modern CPU architectures are mentioned. Besides state-of-the-art overview for Wireless Networks-on-Chip is provided. Moreover, full methodology of measuring intra-CPU counters and getting data rates on cache bus between second and third level caches and third level cache and random access memory (RAM) controller bus are provided. Dependencies of data rates on interfaces of interest on the number of active CPU cores and CPU clock frequency are studied and provided in a form of plots. Also differences in the traffic for different types of CPU load are provided as bar diagrams. For testing we used several real-life tasks that are typical for CPUs and artificial tests which are represented as programs written in C programming language. In addition, extrapolation model for CPUs with bigger amount of cores is provided and assumption about suitable wireless technologies for different number of CPU cores is made

    Exploiting choice in resizable cache design to optimize deep-submicron processor energy-delay

    Get PDF
    Cache memories account for a significant fraction of a chip's overall energy dissipation. Recent research advocates using "resizable" caches to exploit cache requirement variability in applications to reduce cache size and eliminate energy dissipation in the cache's unused sections with minimal impact on performance. Current proposals for resizable caches fundamentally vary in two design aspects: (1) cache organization, where one organization, referred to as selective-ways, varies the cache's set- associativity, while the other, referred to as selective-sets, varies the number of cache sets, and (2) resizing strategy, where one proposal statically sets the cache size prior to an application's execution, while the other allows for dynamic resizing both within and across applications. In this paper, we compare and contrast, for the first time, the proposed design choices for resizable caches, and evaluate the effectiveness of cache resizings in reducing the overall energy-delay in deep-submicron processors. In addition, we propose a hybrid selective-sets-and-ways cache organization that always offers equal or better resizing granularity than both of previously proposed organizations. We also investigate the energy savings from resizing d-cache and i-cache together to characterize the interaction between d- cache and i-cache resizing

    Hardware-only stream prediction + cache prefetching + dynamic access ordering

    Get PDF
    Journal ArticleThe speed gap between processors and memory system is becoming the performance bottleneck for many applications, and computations with strided access patterns are among those that suffer most. The vectors used in such applications lack temporal and often spatial locality, and are usually too large to cache. In spite of their poor cache behavior, these access patterns have the advantage of being, predictable, which can be exploited to improve the efficiency of the memory subsystem. As a promising technique to relieve memory system bottleneck, prefetching has been studied in its various forms, and so is dynamic memory scheduling. This study builds on these results, combining a stride-based reference prediction table, a mechanism that prefetches L2 cache lines, and a memory controller that dynamically schedules accesses to a Direct Rambus memory subsystem. We find that such a system delivers impressive speedups for scientific applications with regular access patterns (reducing execution time by almost a factor of two) without negatively affecting the performance of non-streaming programs

    Implicit transactional memory in chip multiprocessors

    Get PDF
    Chip Multiprocessors (CMPs) are an efficient way of designing and use the huge amount of transistors on a chip. Different cores on a chip can compose a shared memory system with a very low-latency interconnect at a very low cost. Unfortunately, consistency models and synchronization styles of popular programming models for multiprocessors impose severe performance losses. Known architectural approaches to combat these losses are too complex, too specialized, or not transparent to the software. In this article, we introduce “implicit transactional memory” as a generalized architectural concept to remove such performance losses. We show how the concept of implicit transactions can be implemented at a low complexity by leveraging the multi-checkpoint mechanism of the Kilo-Instruction Processor. By relying on a general speculation substrate, it supports even the strictest consistency model – sequential consistency – potentially as effectively as weaker models and it allows multiple threads to speculatively execute critical sections, beyond barriers and event synchronizations.Postprint (published version

    Tao--an architecturally balanced reconfigurable hardware processor

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997.Includes bibliographical references (p. 107-109).This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.by Andrew S. Huang.M.Eng

    Affordable techniques for dependable microprocessor design

    Get PDF
    As high computing power is available at an affordable cost, we rely on microprocessor-based systems for much greater variety of applications. This dependence indicates that a processor failure could have more diverse impacts on our daily lives. Therefore, dependability is becoming an increasingly important quality measure of microprocessors.;Temporary hardware malfunctions caused by unstable environmental conditions can lead the processor to an incorrect state. This is referred to as a transient error or soft error. Studies have shown that soft errors are the major source of system failures. This dissertation characterizes the soft error behavior on microprocessors and presents new microarchitectural approaches that can realize high dependability with low overhead.;Our fault injection studies using RISC processors have demonstrated that different functional blocks of the processor have distinct susceptibilities to soft errors. The error susceptibility information must be reflected in devising fault tolerance schemes for cost-sensitive applications. Considering the common use of on-chip caches in modern processors, we investigated area-efficient protection schemes for memory arrays. The idea of caching redundant information was exploited to optimize resource utilization for increased dependability. We also developed a mechanism to verify the integrity of data transfer from lower level memories to the primary caches. The results of this study show that by exploiting bus idle cycles and the information redundancy, an almost complete check for the initial memory data transfer is possible without incurring a performance penalty.;For protecting the processor\u27s control logic, which usually remains unprotected, we propose a low-cost reliability enhancement strategy. We classified control logic signals into static and dynamic control depending on their changeability, and applied various techniques including commit-time checking, signature caching, component-level duplication, and control flow monitoring. Our schemes can achieve more than 99% coverage with a very small hardware addition. Finally, a virtual duplex architecture for superscalar processors is presented. In this system-level approach, the processor pipeline is backed up by a partially replicated pipeline. The replication-based checker minimizes the design and verification overheads. For a large-scale superscalar processor, the proposed architecture can bring 61.4% reduction in die area while sustaining the maximum performance
    corecore