484 research outputs found

    A lisp oriented architecture

    Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1994. Includes bibliographical references (p. 63-67). By John W.F. McClain. M.S.

    Chapter One – An Overview of Architecture-Level Power- and Energy-Efficient Design Techniques

    Power dissipation and energy consumption have become the primary design constraints for almost all computer systems in the last 15 years. Both computer architects and circuit designers intend to reduce power and energy (without degrading performance) at all design levels, as power is currently the main obstacle to further scaling according to Moore's law. The aim of this survey is to provide a comprehensive overview of state-of-the-art power- and energy-efficient techniques. We classify techniques by the component they apply to, which is the most natural way from a designer's point of view. We further divide the techniques by the component of power/energy they optimize (static or dynamic), covering in that way the complete low-power design flow at the architectural level. At the end, we conclude that only a holistic approach that assumes optimizations at all design levels can lead to significant savings. Peer Reviewed. Postprint (published version)

    The "MIND" Scalable PIM Architecture

    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture.

    Superscalar RISC-V Processor with SIMD Vector Extension

    With the increasing number of digital products on the market, the need for robust and highly configurable processors rises. The demand is met by the stable and extensible open-source RISC-V instruction set architecture. RISC-V processors are becoming popular in many fields of application and research. This thesis presents a dual-issue superscalar RISC-V processor design with dynamic execution. The proposed design employs the global sharing scheme for branch prediction and the Tomasulo algorithm for out-of-order execution. The processor is capable of speculative execution with five checkpoints. Data flow in the instruction dispatch and commit stages is optimized to achieve higher instruction throughput. The superscalar processor is extended with a customized vector instruction set of single-instruction-multiple-data computations specifically to improve performance on machine learning tasks. Following the definition of the proposed vector instruction set, the scratchpad memory and element-wise arithmetic units are implemented in the vector co-processor. Different test programs are evaluated on the fully tested superscalar processor. Compared to the reference work, the proposed design improves average instruction throughput by 18.9% and average prediction hit rate by 4.92%, with a 16.9% higher operating clock frequency when synthesized on the Intel Arria 10 FPGA board. The forward propagation of a convolutional neural network model is evaluated on the standalone superscalar processor and on the integration with the vector co-processor. The vector program with software-level optimizations achieves a 9.53× improvement in instruction throughput and a 10.18× improvement in real-time throughput. Moreover, the integration also provides 2.22× energy efficiency compared with the superscalar processor alone.
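
The branch prediction scheme named in this abstract can be illustrated with a small sketch. It assumes that "global sharing scheme" refers to the well-known gshare predictor, in which the branch address is XORed with a global history register to index a table of 2-bit saturating counters. The table size, the 4-byte instruction alignment, and the names `GsharePredictor` and `hit_rate` are hypothetical and are not taken from the thesis.

```python
class GsharePredictor:
    """Toy gshare predictor: 2-bit counters indexed by PC XOR global history."""

    def __init__(self, index_bits: int = 10):
        # index_bits is an illustrative choice, not a value from the thesis.
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [1] * (1 << index_bits)   # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        # Assumes 4-byte aligned instructions, so the low 2 PC bits are dropped.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def hit_rate(predictor: GsharePredictor, trace) -> float:
    """Fraction of correctly predicted branches over a (pc, taken) trace."""
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)
```

Feeding a trace of `(pc, taken)` pairs to `hit_rate` gives a prediction hit rate of the kind the abstract reports, although the numbers from such a toy model are not comparable to the thesis results.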

    Mitigating the Effect of Misspeculations in Superscalar Processors

    Modern superscalar processors rely heavily on speculative execution, which executes instructions speculatively and verifies them afterwards. If the prediction differs from the execution result, a misspeculation recovery is performed. Misspeculation recovery penalties still account for a substantial amount of performance reduction. This work focuses on techniques to mitigate the effect of recovery penalties and proposes practical mechanisms which are thoroughly implemented and analyzed. In general, we can divide the misspeculation penalty into four parts: misspeculation detection delay, stale instruction elimination delay, state restoration delay, and pipeline fill delay. This dissertation does not consider the detection delay; instead, we design four innovative mechanisms. Some of these mechanisms target a specific recovery delay whereas others target multiple types of delay in a unified algorithm. Mower was designed to address the stale instruction elimination delay and the state restoration delay by using a special walker. When a misprediction is detected, the walker scans and repairs the instructions which are younger than the mispredicted instruction. During the walking procedure, the correct state is restored and the stale instructions are eliminated. Based on Mower, we further simplify the design and develop a Two-Phase recovery mechanism. This mechanism uses only a basic recovery mechanism except for the case in which the retire stage is stalled by a long-latency instruction. When the retire stage is stalled, the second phase is launched and the instructions in the pipeline are re-fetched. The Two-Phase mechanism recovers from an earlier point in the program and overlaps the recovery penalty with the long-latency penalty. In reality, some of the instructions on the wrong path can be reused during the recovery. However, such reuse of misprediction results is not easy and most of the time involves significant complexity.
We design Passing Loop to reduce the pipeline fill delay. We apply the mechanism only to short forward branches, which eliminates a substantial amount of complexity. In terms of memory dependence speculation and the associated delays due to memory ordering violations, we develop a mechanism that optimizes store-queue-free architectures. A store-queue-free architecture experiences more memory dependence mispredictions due to its aggressive approach to speculation. A common solution is to delay the execution of an instruction which is more likely to be mispredicted. We propose a mechanism to dynamically insert predicates for comparing the addresses of memory instructions, which is called "Dynamic Memory Dependence Predication" (DMDP). This mechanism boosts instruction execution to its earliest point and reduces the number of mispredictions.
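
The idea of guarding a load with an address-comparison predicate can be sketched in a few lines. This is only an illustration of predication on a store/load address match, written under the assumption that a matching address means the load should take the in-flight store's data. The function name `predicated_load` and the dictionary model of memory are hypothetical; the abstract does not describe how DMDP is actually implemented.

```python
def predicated_load(load_addr, store_addr, store_data, memory):
    """Return the value a load observes when guarded by an address predicate.

    memory is a dict mapping addresses to committed values (illustrative model).
    """
    # The dynamically inserted predicate: do the two instructions alias?
    same_address = (load_addr == store_addr)
    if same_address:
        # Dependence exists: take the data of the older, in-flight store.
        return store_data
    # No dependence: the load can safely read the committed memory value.
    return memory.get(load_addr, 0)
```

With the predicate in place, the load does not need to wait for a dependence prediction to be verified, since both outcomes of the address comparison yield a correct value.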

    Simultaneous multithreading: Operating system perspective

    Developing a CPU architecture is a very complicated, iterative process that requires significant investments of time and money. The motivation for this work is to find ways to decrease the amount of time and money needed for the development of hardware architectures. The main problem is that it is very difficult to determine the performance of the architecture, since it is impossible to take any performance measurements until the development process is complete. Consequently, it is impossible to improve the performance of the product or to predict the influence of different parts of the architecture on the architecture's overall performance. Another problem is that this type of development does not allow the developed system to be reconfigured or altered without complete re-development. The solution to these problems is software simulators, which allow the architecture to be studied before any silicon is cut. Simultaneous multithreading (SMT) is a modern approach to CPU design. This technique increases system throughput by decreasing both the total instruction delay and the stall times of the CPU. The gain in performance of a typical SMT processor is achieved by allowing the instructions from several threads to be fetched by an operating system into the CPU simultaneously. In order to function successfully, the CPU needs software support. In modern computer systems the influence of an operating system on overall system performance can no longer be ignored. It is important to understand that the union of the CPU and the supporting operating system, and their interdependency, determines the overall performance of any computer system. In a system that has been implemented at the hardware level such analysis is impossible, since the hardware system is neither flexible nor configurable. However, in the SMT architecture, the system is capable of performing some useful work even if a task has generated an error.
A wide range of simulators is described in the literature, and many of them are publicly accessible. The main goal of this work is to modify an existing SimOS/Topsy simulator to achieve a simple, configurable, publicly accessible SMT SimOS/Topsy simulator that must also include an SMT Topsy. The simulator should demonstrate the fetching process of the SMT MIPS, as well as scheduling aspects of the integrated CPU and operating system environment. This work covers a broad range of aspects, among which are: 1) Completion of SMT MIPS and SMT Topsy specifications; 2) Integration of MXS into SimOS/Topsy; 3) Modifications to the fetching unit of MXS that allow it to support SMT; 4) Addition of SMT support to Topsy. This work uses the Topsy/R4000 simulator developed at the Swiss Federal Institute of Technology, and the MXS (R10000) part of the SimOS simulator developed at Stanford University. The development process uses the C high-level language and the Intel and MIPS assembly languages. The result of this work is a complete software simulator of a computer system. The simulator allows performance measurements to be taken and allows reconfiguration of SMT Topsy and the fetching unit of the SMT MXS. The simulator is modular: any of its parts can be substituted with other parts that perform similar functionality. It also means that the whole simulator can be integrated into a larger-scale simulation project. The development of this simulator significantly decreases the amount of time and money needed for the development of hardware architectures and provides new ways of researching the influence of an operating system on the performance of the computer system as a whole.
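
The fetching of instructions from several threads in one cycle, which the abstract describes as the source of the SMT performance gain, can be sketched as follows. The round-robin policy, the per-thread instruction queues, and the name `smt_fetch` are assumptions made for illustration; the abstract does not state which fetch policy the modified MXS fetching unit uses.

```python
from collections import deque


def smt_fetch(threads, fetch_width):
    """Fetch up to fetch_width instructions in one cycle, rotating over threads.

    threads maps a thread id to a deque of pending instructions (illustrative).
    Returns a list of (thread_id, instruction) pairs in fetch order.
    """
    fetched = []
    # Only threads that still have instructions take part in this cycle.
    ready = deque(tid for tid, queue in threads.items() if queue)
    while ready and len(fetched) < fetch_width:
        tid = ready.popleft()
        fetched.append((tid, threads[tid].popleft()))
        if threads[tid]:
            # Thread still has work: move it to the back of the rotation.
            ready.append(tid)
    return fetched
```

For example, with two threads holding three and one pending instructions and a fetch width of four, a single call interleaves the threads until the shorter one runs out and then fills the remaining slots from the other.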

    Anatomy of a message in the Alewife multiprocessor

    Shared memory provides a uniform and attractive mechanism for communication. For efficiency, it is often implemented with a layer of interpretive hardware on top of a message-passing communications network. This interpretive layer is responsible for data location, data movement, and cache coherence. It uses patterns of communication that benefit common programming styles, but which are only heuristics. This suggests that certain styles of communication may benefit from direct access to the underlying communications substrate. The Alewife machine, a shared-memory multiprocessor being built at MIT, provides such an interface. The interface is an integral part of the shared memory implementation and affords direct, user-level access to the network queues, supports an efficient DMA mechanism, and includes fast trap handling for message reception. This paper discusses the design and implementation of the Alewife message-passing interface and addresses the issues and advantages of using such an interface to complement hardware-synthesized shared memory. National Science Foundation (U.S.) (Grant MIP-9012773). United States. Defense Advanced Research Projects Agency (Contract N00014-87-K-0825)