144 research outputs found

    Configurable computer systems can support dataflow computing

    Get PDF
    This work presents a practical implementation of a uni-processor system design. This design, named D2-CPU, satisfies the pure data-driven paradigm, which is a radical alternative to the conventional von Neumann paradigm and exploits the instruction-level parallelism to its full extent. The D2-CPU uses the natural flow of the program, dataflow, by minimizing redundant instructions like fetch, store, and write back. This leads to a design with the better performance, lower power consumption and efficient use of the on-chip resources. This extraordinary performance is the result of a simple, pipelined and superscalar architecture with a very wide data bus and a completely out of order execution of instructions. This creates a program counter less, distributed controlled system design with the realization of intelligent memories. Upon the availability of data, the instructions advance further in the memory hierarchy and ultimately to the execution units by themselves, instead of having the CPU fetch the required instructions from the memory as in controlled flow processors. This application (data) oriented execution process is in contrast to application ignorant CPUs in conventional machines. The D2-CPU solves current architectural challenges and puts into practice a pure data-driven microprocessor. This work employs an FPGA implementation of the D2-CPU to prove the practicability of the data-driven computer paradigm using configurable logic. A relative analysis at the end confirms its superiority in performance, resource utilization and ease of programming over conventional CPUs

    Exploiting address space contiguity to accelerate TLB miss handling

    Get PDF
    The traditional CPU-bound applications of the past have been replaced by multiple concurrent data-driven applications that use lots of memory. These applications, including databases and virtualization, put high stress on the virtual memory system which can have up to a 50% performance overhead for some applications. Virtualization compounds this problem, where the overhead can be upwards of 90%. While much research has been done on reducing the number of TLB misses, they can not be eliminated entirely. This thesis examines three techniques for reducing the cost of TLB miss handling. We test each against real-world workloads and find that the techniques that exploit course-grained locality in virtual address use and contiguity found in page tables show the best performance. The first technique reduces the overhead of multi-level page tables, such as those used in x86-64, with a dedicated MMU cache. We show that the most effective MMU caches are translation caches , which store partial translations and allow the page walk hardware to skip one or more levels of the page table. In recent years, both AMD and Intel processors have implemented MMU caches. However, their implementations are quite different and represent distinct points in the design space. This thesis introduces three new MMU cache structures that round out the design space and directly compares the effectiveness of all five organizations. This comparison shows that two of the newly introduced structures, both of which are translation cache variants, are better than existing structures in many situations. Secondly, this thesis examines the relative effectiveness of different page table organizations. Generally speaking, earlier studies concluded that organizations based on hashing, such as the inverted page table, outperformed organizations based upon radix trees for supporting large virtual address spaces. However, these studies did not take into account the possibility of caching page table entries from the higher levels of the radix tree. This work shows that any of the five MMU cache structures will reduce radix tree page table DRAM accesses far below an inverted page table. Finally, we present a novel device, the SpecTLB, that is able to exploit alignment in the mapping from virtual address to physical address to interpolate translations without any memory accesses at all. Operating system support for automatic page size selection leaves many small pages aligned within large page "reservations". While large pages improve TLB coverage, they limit the control the operating system has over memory allocation and protection. Our device allows the latency penalty of small pages to be avoided while maintaining fine-grained allocation and protection

    Memory system architecture for real-time multitasking systems

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995.Includes bibliographical references (p. 119-120).by Scott Rixner.M.Eng

    Address generator synthesis

    Get PDF

    The 1992 4th NASA SERC Symposium on VLSI Design

    Get PDF
    Papers from the fourth annual NASA Symposium on VLSI Design, co-sponsored by the IEEE, are presented. Each year this symposium is organized by the NASA Space Engineering Research Center (SERC) at the University of Idaho and is held in conjunction with a quarterly meeting of the NASA Data System Technology Working Group (DSTWG). One task of the DSTWG is to develop new electronic technologies that will meet next generation electronic data system needs. The symposium provides insights into developments in VLSI and digital systems which can be used to increase data systems performance. The NASA SERC is proud to offer, at its fourth symposium on VLSI design, presentations by an outstanding set of individuals from national laboratories, the electronics industry, and universities. These speakers share insights into next generation advances that will serve as a basis for future VLSI design

    MMP

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.Includes bibliographical references (p. 129-135).Reliability and security are quickly becoming users' biggest concern due to the increasing reliance on computers in all areas of society. Hardware-enforced, fine-grained memory protection can increase the reliability and security of computer systems, but will be adopted only if the protection mechanism does not compromise performance, and if the hardware mechanism can be used easily by existing software. Mondriaan memory protection (MMP) provides fine-grained memory protection for a linear address space, while supporting an efficient hardware implementation. MMP's use of linear addressing makes it compatible with current software programming models and program binaries, and it is also backwards compatible with current operating systems and instruction sets. MMP can be implemented efficiently because it separates protection information from program data, allowing protection information to be compressed and cached efficiently. This organization is similar to paging hardware, where the translation information for a page of data bytes is compressed to a single translation value and cached in the TLB. MMP stores protection information in tables in protected system memory, just as paging hardware stores translation information in page tables. MMP is well suited to improve the robustness of modern software. Modern software development favors modules (or plugins) as a way to structure and provide extensibility for large systems, like operating systems, web servers and web clients. Protection between modules written in unsafe languages is currently provided only by programmer convention, reducing system stability.(cont.) Device drivers, which are implemented as loadable modules, are now the most frequent source of operating system crashes (e.g., 85% of Windows XP crashes in one study [SBL03]). MMP provides a mechanism to enforce module boundaries, increasing system robustness by isolating modules from each other and making all memory sharing explicit. We implement the MMP hardware in a simulator and modify a version of the Linux 2.4.19 operating system to use it. Linux loads its device drivers as kernel module extensions, and MMP enforces the module boundaries, only allowing the device drivers access to the memory they need to function. The memory isolation provided by MMP increases Linux's resistance to programmer error, and exposed two kernel bugs in common, heavily-tested drivers. Experiments with several benchmarks where MMP was used extensively indicate the space taken by the MMP data structures is less than 11% of the memory used by the kernel, and the kernel's runtime, according to a simple performance model, increases less than 12% (relative to an unmodified kernel).by Emmett Jethro Witchel.Ph.D

    Implementation of a general purpose dataflow multiprocessor

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1988.GRSN 409671Includes bibliographical references (leaves 151-155).by Gregory Michael Papadopoulos.Ph.D

    No Excess Babbage - Design Considerations for the Interface to a Systolic Matrix Processor

    Get PDF
    Pages 174-179 removed for reasons of commercial confidentiality (including print copy)Computational systems used for the solution of large matrix-based numerical problems are rapidly converging on the limits of technology, and novel architectures are now being sought to improve performance. In signal processing and control theory, high performance systems are required which do not contain the physical size penalty of current supercomputers. To achieve performance comparable with current supercomputers, a systolic processing array specifically targeted at matrix applications has been developed at the University of Adelaide. The work of this thesis involves the problem of delivering and receiving the data moving between the processing array and the memory subsystem. This involves the reformatting of existing algorithms to map efficiently onto the matrix array, the design and VLSI layout of a matrix address generation unit using signed digit arithmetic for enhanced performance, and the block level description of a multiport cached memory system. Performance estimates predict a modest configuration will perform selected matrix routines in excess of three GigaFLOPs.Thesis (MESc.) -- University of Adelaide, Department of Electrical and Electronic Engineering, 199

    Scratchpad Memory Management For Multicore Real-Time Embedded Systems

    Get PDF
    Multicore systems will continue to spread in the domain of real-time embedded systems due to the increasing need for high-performance applications. This research discusses some of the challenges associated with employing multicore systems for safety-critical real-time applications. Mainly, this work is concerned with providing: 1) efficient inter-core timing isolation for independent tasks, and 2) predictable task communication for communicating tasks. Principally, we introduce a new task execution model, based on the 3-phase execution model, that exploits the Direct Memory Access (DMA) controllers available in modern embedded platforms along with ScratchPad Memories (SPMs) to enforce strong timing isolation between tasks. The DMA and the SPMs are explicitly managed to pre-load tasks from main memory into the local (private) scratchpad memories. Tasks are then executed from the local SPMs without accessing main memory. This model allows CPU execution to be overlapped with DMA loading/unloading operations from and to main memory. We show that by co-scheduling task execution on CPUs and using DMA to access memory and I/O, we can efficiently hide access latency to physical resources. In turn, this leads to significant improvements in system schedulability, compared to both the case of unregulated contention for access to physical resources and to previous cache and SPM management techniques for real-time systems. The presented SPM-centric scheduling algorithms and analyses cover single-core, partitioned, and global real-time systems. The proposed scheme is also extended to support large tasks that do not fit entirely into the local SPM. Moreover, the schedulability analysis considers the case of recovering from transient soft errors (bit flips caused by a single event upset) in several levels of memories, that cannot be automatically corrected in hardware by the ECC unit. The proposed SPM-centric scheduling is integrated at the OS level; thus it is transparent to applications. The proposed scheme is implemented and evaluated on an FPGA platform and a Commercial-Off-The-Shelf (COTS) platform. In regards to real-time task communication, two types of communication are considered. 1) Asynchronous inter-task communication, between either sequential tasks (single-threaded) or parallel tasks (multi-threaded). 2) Intra-task communication, where parallel threads of the same application exchange data. A new task scheduling model for parallel tasks (Bundled Scheduling) is proposed to facilitate intra-task communication and reduce synchronization overheads. We show that the proposed bundled scheduling model can be applied to several parallel programming models, such as fork-join and DAG-based applications, leading to improved system schedulability. Finally, intra-task communication is governed by a predictable inter-core communication platform. Specifically, we propose HopliteRT, a lean and predictable Network-on-Chip that connects the private SPMs
    corecore