
    How general-purpose can a GPU be?

    The use of graphics processing units (GPUs) in general-purpose computation (GPGPU) is a growing field. GPU instruction sets, while implementing a graphics pipeline, draw on a range of single instruction, multiple data stream (SIMD) architectures characteristic of the heyday of supercomputers. Yet only one of these SIMD approaches was applicable to a wide enough range of problems to survive the era when the full range of supercomputer design variants was being explored: vector instructions. Supercomputers covered a range of exotic designs, such as hypercubes and the Connection Machine (Fox, 1989). The latter is likely the source of the snide comment by Cray: it had thousands of relatively low-speed CPUs (Tucker & Robertson, 1988). Since Cray won, why are we not basing our ideas on his designs (Cray Inc., 2004) rather than those of the losers? The Top 500 supercomputer list is dominated by general-purpose CPUs, and nothing like the Connection Machine that headed the list in 1993 still exists.
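
As a concrete illustration of the vector style referred to above, the loop below is the classic SAXPY kernel in plain C; every iteration is independent, which is exactly the property that lets vector (SIMD) hardware, on GPUs or CPUs, execute it lane-parallel. This is a generic example, not code from the paper.

```c
#include <stddef.h>

/* SAXPY: y[i] = a * x[i] + y[i].  Iterations are independent, so a
 * vectorizing compiler (or a GPU kernel) can map the loop body onto
 * SIMD/vector instructions directly. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```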

    Scratchpad memory management in a multitasking environment

    This paper presents a dynamic scratchpad memory (SPM) code allocation technique for embedded systems running an operating system with preemptive multitasking. Existing SPM allocation schemes either do not support multiple tasks or support only a fixed number of processes known at compile time. These schemes rely on algorithms that select code depending on the size of the SPM. In contemporary portable devices, however, processes are created and terminated on demand and the SPM is shared among them. We introduce a dynamic SPM code allocation technique that supports dynamically created processes. At runtime, an SPM manager (SPMM) loads code pages of the running applications into the SPM on demand. It supports different sharing strategies that determine how the SPM is distributed among the running processes. We analyze several sharing strategies with regard to several preferable properties of multiprocess SPM allocation schemes. We evaluate the proposed multiprocess SPM allocation techniques and compare them to a fully-cached reference system by running several multiprocess benchmarks. The benchmarks comprise multiple embedded applications such as H.264, MP3, MPEG-4, and PGP. On average, we achieve a 47% improvement in throughput and a 32% reduction in energy consumption. A comparison with the unachievable lower bound shows that the best SPM sharing strategy exploits 87% of the runtime improvements and 89% of the energy savings possible.
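
To make the demand-loading idea concrete, here is a minimal C sketch of an SPM manager that, on a miss, copies the needed code page into a scratchpad slot chosen by a simple round-robin sharing strategy. All names and sizes (PAGE_SIZE, SPM_SLOTS, spm_base) are illustrative assumptions, not the paper's interfaces.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 1024u
#define SPM_SLOTS 8u                       /* SPM holds 8 code pages         */

static uint8_t spm_base[SPM_SLOTS * PAGE_SIZE];   /* the scratchpad          */

typedef struct {
    const uint8_t *owner;                  /* main-memory page in this slot  */
    unsigned       pid;                    /* process the slot is charged to */
} spm_slot_t;

static spm_slot_t slots[SPM_SLOTS];
static unsigned   next_victim;             /* round-robin sharing strategy   */

/* Invoked by the miss handler: ensure the code page containing `target`
 * is resident for process `pid` and return its address inside the SPM.
 * A real SPMM would also fix up control transfers and synchronize
 * instruction fetch after the copy. */
void *spmm_load_page(unsigned pid, const uint8_t *target)
{
    const uint8_t *page =
        (const uint8_t *)((uintptr_t)target & ~(uintptr_t)(PAGE_SIZE - 1));

    for (unsigned i = 0; i < SPM_SLOTS; i++)           /* already resident?  */
        if (slots[i].owner == page)
            return spm_base + i * PAGE_SIZE + (target - page);

    unsigned v = next_victim;                          /* pick a victim slot */
    next_victim = (next_victim + 1) % SPM_SLOTS;

    memcpy(spm_base + v * PAGE_SIZE, page, PAGE_SIZE); /* page in on demand  */
    slots[v].owner = page;
    slots[v].pid   = pid;

    return spm_base + v * PAGE_SIZE + (target - page);
}
```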

    Software instruction caching

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 185-193).

    As microprocessor complexities and costs skyrocket, designers are looking for ways to simplify their designs to reduce costs, improve energy efficiency, or squeeze more computational elements on each chip. This is particularly true for the embedded domain, where cost and energy consumption are paramount. Software instruction caches have the potential to provide the required performance while using simpler, more efficient hardware. A software cache consists of a simple array memory (such as a scratchpad) and a software system that is capable of automatically managing that memory as a cache. Software caches have several advantages over traditional hardware caches. Without complex cache-management logic, the processor hardware is cheaper and easier to design, verify and manufacture. The reduced access energy of simple memories can result in a net energy savings if management overhead is kept low. Software caches can also be customized to each individual program's needs, improving performance or eliminating unpredictable timing for real-time embedded applications. The greatest challenge for a software cache is providing good performance using general-purpose instructions for cache management rather than specially-designed hardware. This thesis designs and implements a working system (Flexicache) on an actual embedded processor and uses it to investigate the strengths and weaknesses of software instruction caches. Although both data and instruction caches can be implemented in software, very different techniques are used to optimize performance; this work focuses exclusively on software instruction caches. The Flexicache system consists of two software components: a static off-line preprocessor to add caching to an application and a dynamic runtime system to manage memory during execution. Key interfaces and optimizations are identified and characterized. The system is evaluated in detail from the standpoints of both performance and energy consumption. The results indicate that software instruction caches can perform comparably to hardware caches in embedded processors. On most benchmarks, the overhead relative to a hardware cache is less than 12% and can be as low as 2.4%. At the same time, the software cache uses up to 6% less energy. This is achieved using a simple, directly-addressed memory and without requiring any complex, specialized hardware structures.

    by Jason Eric Miller. Ph.D.
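
The following sketch shows the runtime half of a software instruction cache in the spirit described above: the off-line preprocessor is assumed to have split the program into fixed-size blocks and rewritten control transfers to call into the runtime, which fills a directly-addressed scratchpad array on a miss. Names and sizes are illustrative; the real Flexicache interfaces differ.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 256u
#define NUM_LINES  64u                        /* directly-addressed array    */

static uint8_t cache_mem[NUM_LINES][BLOCK_SIZE];   /* scratchpad storage     */
static const uint8_t *tags[NUM_LINES];        /* block resident in each line */

/* Map a code address in main memory to its copy in the scratchpad,
 * filling the line on a miss (direct-mapped placement).  A real system
 * would also patch branches and synchronize instruction fetch. */
void *icache_translate(const uint8_t *target)
{
    const uint8_t *block =
        (const uint8_t *)((uintptr_t)target & ~(uintptr_t)(BLOCK_SIZE - 1));
    unsigned line = (unsigned)(((uintptr_t)block / BLOCK_SIZE) % NUM_LINES);

    if (tags[line] != block) {                /* software tag check (miss)   */
        memcpy(cache_mem[line], block, BLOCK_SIZE);
        tags[line] = block;
    }
    return cache_mem[line] + (target - block);
}
```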

    How Multithreading Addresses the Memory Wall

    The memory wall is the predicted situation where improvements to processor speed will be masked by the much slower improvement in dynamic random access memory (DRAM) speed. Since the prediction was made in 1995, considerable progress has been made in addressing the memory wall. There have been advances in DRAM organization, improved approaches to the memory hierarchy have been proposed, integrating DRAM onto the processor chip has been investigated, and alternative approaches to organizing the instruction stream have been researched. All of these approaches contribute to reducing the predicted memory wall effect; some can potentially be combined. This paper reviews several approaches with a view to assessing the most promising option. Given the growing CPU-DRAM speed gap, any strategy which finds alternative work to do while waiting for DRAM is likely to be a win.
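
The closing observation, that finding alternative work while waiting for DRAM is likely to be a win, is the essence of hardware multithreading. The toy model below sketches switch-on-stall scheduling: when one context is blocked on a memory access, the cycle goes to another ready context, so the DRAM latency is overlapped rather than exposed. The structure and names are illustrative only.

```c
#include <stdbool.h>

#define NUM_CONTEXTS 4             /* hardware thread contexts              */

typedef struct {
    unsigned pc;                   /* resume point                          */
    unsigned stall_cycles;         /* cycles left on an outstanding miss    */
    bool     active;
} hw_context_t;

static hw_context_t ctx[NUM_CONTEXTS];

/* Advance the model by one cycle: issue from the first ready context and
 * let the stalled ones keep waiting on their in-flight DRAM accesses.
 * Returns the context that ran, or -1 if every context was stalled. */
int schedule_one_cycle(void)
{
    int ran = -1;
    for (int i = 0; i < NUM_CONTEXTS; i++) {
        if (ran < 0 && ctx[i].active && ctx[i].stall_cycles == 0) {
            ctx[i].pc++;                       /* "execute" one instruction */
            ran = i;
        } else if (ctx[i].stall_cycles > 0) {
            ctx[i].stall_cycles--;             /* latency is overlapped     */
        }
    }
    return ran;
}
```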

    Memory management in a distributed system of single address space operating systems supporting quality of service

    The choices provided by an operating system to the application developer for managing memory come in two forms: no choice at all, with the operating system making all decisions about managing memory; or the choice to implement virtual memory management specific to the individual application. The second of these choices is, for all intents and purposes, the same as the first: no choice at all. For many application developers, the cost of implementing a customised virtual memory management system is simply too high. The result is that, regardless of the level of flexibility available, the developer ends up using the system-provided default. Further exacerbating the problem is the tendency for operating system developers to be extremely unimaginative when providing that same default. Advancements in virtual memory techniques such as prefetching, remote paging, compressed caching, and user-level page replacement, coupled with the provision of user-level virtual memory management, should have heralded a new era of choice and an application-centric approach to memory management. Unfortunately, this has failed to materialise. This dissertation describes the design and implementation of the Heracles virtual memory management system. The Heracles approach is one of inclusion rather than exclusion. The main goal of Heracles is to provide an extensible environment that is configurable to the extent of providing application-centric memory management without the need for application developers to implement their own. However, should the application developer wish to provide a more specialised implementation for all or any part of Heracles, the system is constructed around well-defined interfaces that allow new implementations to be "plugged in" where required. The result is a virtual memory management hierarchy that is highly configurable, highly flexible, and can be adapted at run-time to meet new phases in the application's behaviour. Furthermore, different parts of an application's address space can have different hierarchies associated with managing their memory.
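
As a rough sketch of the kind of "plugged in" interface described above, the fragment below defines a policy hook structure and a toy FIFO replacement policy that an application could supply. The actual Heracles interfaces are not shown here; all names are assumptions made for illustration.

```c
#include <stddef.h>

typedef unsigned long vaddr_t;

/* Hooks one level of the memory-management hierarchy exposes; an
 * application supplies implementations for the parts it wants to
 * customise and inherits the system default for the rest. */
typedef struct vm_policy {
    void    (*note_fault)(void *state, vaddr_t page);   /* observe accesses */
    vaddr_t (*pick_victim)(void *state);                /* page replacement */
    void    *state;
} vm_policy_t;

/* Example application-supplied policy: evict pages in FIFO order. */
#define FIFO_CAP 64
typedef struct { vaddr_t pages[FIFO_CAP]; size_t head, count; } fifo_state_t;

static void fifo_note_fault(void *s, vaddr_t page)
{
    fifo_state_t *f = s;
    if (f->count < FIFO_CAP)                  /* ignore overflow in the toy */
        f->pages[(f->head + f->count++) % FIFO_CAP] = page;
}

static vaddr_t fifo_pick_victim(void *s)
{
    fifo_state_t *f = s;
    vaddr_t victim = f->pages[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return victim;
}

/* The hierarchy would call through whichever policy is registered for the
 * faulting region of the address space; registration itself is elided. */
static fifo_state_t fifo_state;
static const vm_policy_t app_fifo_policy = {
    .note_fault  = fifo_note_fault,
    .pick_victim = fifo_pick_victim,
    .state       = &fifo_state,
};
```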

    Hardware-Software Trade-Offs in a Direct Rambus Implementation of the RAMpage Memory Hierarchy

    The RAMpage memory hierarchy is an alternative to the traditional division between cache and main memory: main memory is moved up a level and DRAM is used as a paging device. The idea behind RAMpage is to reduce hardware complexity, even if at the cost of software complexity, with a view to allowing more flexible memory system design. This paper investigates some issues in choosing between RAMpage and a conventional cache architecture, in order to illustrate trade-offs that can be made in choosing whether to place complexity in the memory system in hardware or in software. Performance results in this paper are based on a simple Rambus implementation of DRAM, with performance characteristics of Direct Rambus, which should be available in 1999. The paper explores the conditions under which it becomes feasible to perform a context switch on a miss in the RAMpage model, and the conditions under which RAMpage is a win over a conventional cache architecture as the CPU-DRAM speed gap grows...
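
A back-of-the-envelope way to frame the context-switch-on-miss question the paper explores: switching pays off only when the DRAM miss latency exceeds the switch overhead plus the disturbance the switched-in task causes. The sketch below encodes that comparison; the parameter names are illustrative, not the paper's measurements.

```c
#include <stdbool.h>

typedef struct {
    unsigned miss_latency_cycles;   /* time to page a block in from DRAM      */
    unsigned switch_cost_cycles;    /* context save/restore + scheduling      */
    unsigned pollution_cycles;      /* est. extra misses caused by the switch */
} rampage_params_t;

/* True when overlapping the miss with another task's work is a net win. */
bool switch_on_miss_is_win(const rampage_params_t *p)
{
    return p->miss_latency_cycles >
           p->switch_cost_cycles + p->pollution_cycles;
}
```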