192 research outputs found

    Impact of parameter variations on circuits and microarchitecture

    Get PDF
    Parameter variations, which are increasing along with advances in process technologies, affect both timing and power. Variability must be considered at both the circuit and microarchitectural design levels to keep pace with performance scaling and to keep power consumption within reasonable limits. This article presents an overview of the main sources of variability and surveys variation-tolerant circuit and microarchitectural approaches.Peer ReviewedPostprint (published version

    Exploiting Application Behaviors for Resilient Static Random Access Memory Arrays in the Near-Threshold Computing Regime

    Get PDF
    Near-Threshold Computing embodies an intriguing choice for mobile processors due to the promise of superior energy efficiency, extending the battery life of these devices while reducing the peak power draw. However, process, voltage, and temperature variations cause a significantly high failure rate of Level One cache cells in the near-threshold regime a stark contrast to designs in the super-threshold regime, where fault sites are rare. This thesis work shows that faulty cells in the near-threshold regime are highly clustered in certain regions of the cache. In addition, popular mobile benchmarks are studied to investigate the impact of run-time workloads on timing faults manifestation. A technique to mitigate the run-time faults is proposed. This scheme maps frequently used data to healthy cache regions by exploiting the application cache behaviors. The results show up to 78% gain in performance over two other state-of-the-art techniques

    How Multithreading Addresses the Memory Wall

    Get PDF
    The memory wall is the predicted situation where improvements to processor speed will be masked by the much slower improvement in dynamic random access (DRAM) memory speed. Since the prediction was made in 1995, considerable progress has been made in addressing the memory wall. There have been advances in DRAM organization, improved approaches to memory hierarchy have been proposed, integrating DRAM onto the processor chip has been investigated and alternative approaches to organizing the instruction stream have been researched. All of these approaches contribute to reducing the predicted memory wall effect; some can potentially be combined. This paper reviews several approaches with a view to assessing the most promising option. Given the growing CPU-DRAM speed gap, any strategy which finds alternative work while waiting for DRAM is likely to be a win

    A Shared memory multiprocessor system architecture utilizing a uniform

    Get PDF
    Due to VLSI lithography problems and the limitation of additional architectural enhancements uniprocessor systems are nearing the end of their life cycle. Therefore, it is believed that Symmetric Multiprocessing (SMP) systems will be the next mainstream computer. These systems allow multiple processors, accessing the same memory image, to cooperate on a number of computational tasks as a single entity. While multiprocessor systems can offer a substantial performance increase compared to uniprocessor systems, major design considerations must be addressed to achieve desired system efficiency levels. Managing cache coherence is a significant problem in multiprocessor systems. Current implementations cope with this problem by utilizing a cache coherence protocol. This protocol puts a large amount of overhead on the system bus to ensure proper program execution, effectively decreasing overall system performance. This thesis approaches the cache coherence problem from a new angle. Instead of utilizing a cache coherence protocol, a new memory system is proposed which eliminates the need for a cache coherence protocol, by utilizing a shared level 2 data-only cache. This new architecture allows for better utilization of the system and improved performance and scalability. A data rate analysis is conducted to demonstrate the potential performance increase from the proposed architecture over conventional approaches. The data rate model clearly shows an increase in system performance and utilization when using the architecture proposed in this thesis

    A coprocessor design for the architectural support of non-numeric operations

    Get PDF
    Computer Science is concerned with the electronic manipulation of information. Continually increasing amounts of computer time are being expended on information that is not numeric. This is represented in part by modem computing requirements such as the block moves associated with context switching and virtual memory management, peripheral device communication, compilers, editors, word processors, databases, and text retrieval. This dissertation examines the traditional support of non-numeric information from a software, firmware, and hardware perspective and presents a coprocessor design to improve the performance of a set of non-numeric operations. Simple micro-coding of operations can provide a degree of performance improvement through parallel execution of instructions and control store access speeds. New special purpose parallel hardware algorithms can yield complexity improvements. This dissertation presents a parallel hardware regular expression searching algorithm which requires linear time and quadratic space compared to software uniprocessor algorithms which require exponential time and space. A very large scale integration (VLSD implementation of a version of this algorithm was designed, fabricated, and tested. The hardware. searching algorithm is then combined with other special purpose hardware to implement a set of operations. Simulation is then used to quantify the performance improvement of the operations when compared to software solutions. A coprocessor approach allows the optional addition of hardware to accelerate a set of operations. This is appropriate from a complex instruction set computer (CISC) perspective since hardware acceleration is being utilized. It is also appropriate from a reduced instruction set computer (RISC) perspective since the operations are distributed away from the central processing unit (CPU)

    Doctor of Philosophy

    Get PDF
    dissertationWith the explosion of chip transistor counts, the semiconductor industry has struggled with ways to continue scaling computing performance in line with historical trends. In recent years, the de facto solution to utilize excess transistors has been to increase the size of the on-chip data cache, allowing fast access to an increased portion of main memory. These large caches allowed the continued scaling of single thread performance, which had not yet reached the limit of instruction level parallelism (ILP). As we approach the potential limits of parallelism within a single threaded application, new approaches such as chip multiprocessors (CMP) have become popular for scaling performance utilizing thread level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where single threaded performance and multithreaded performance have often been ignored by computer architects. We propose that novel hardware and OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intense workloads and that this OS contribution grows to a significant fraction of total instructions when executing several common applications found in the datacenter. We demonstrate that architectural improvements have had little to no effect on the performance of the OS over the last 15 years, leaving ample room for improvements. We specifically consider three potential solutions to improve OS execution on modern processors. First, we consider the potential of a separate operating system processor (OSP) operating concurrently with general purpose processors (GPP) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors. Second, we consider the potential of segregating existing caching structures to decrease cache interference between the OS and application. Third, we propose that there are components within the OS itself that should be refactored to be both multithreaded and cache topology aware, which in turn, improves the performance and scalability of many-threaded applications

    Interference aware cache designs for operating system execution

    Get PDF
    Journal ArticleLarge-scale chip multiprocessors will likely be heterogeneous. It has been suggested by several groups that it may be worthwhile to implement some cores that are specially tuned to execute common code patterns. One such common application that will execute on all future processors is of course the operating system. Many future workloads will spend a large fraction of their execution time within privileged mode, either executing system calls or pure operating system functionality. Vast transistor budgets and relatively low on-chip communication latencies make it feasible to off-load the execution of privileged instruction sequences on to such a custom core. In this paper, we first examine this off-load approach and attempt to understand its benefits. We then try to architect a solution that captures the benefits of off-loading and eliminates its disadvantages. In essence, the benefits of offloading can be attributed to reduced cache interference, while its disadvantages are the high latency costs for off-load and cache coherence. Our proposed solution employs a special OS cache per core and improves performance by up to 18% for OS-intensive workloads without any significant addition of transistors. We consider several design choices for this OS cache and argue that it is a better use of transistor and power budget than the off-loading approach when both adding to the transistor budget or leaving it unchanged
    • …
    corecore