
    Efficient Execution of Sequential Instructions Streams by Physical Machines

    Any computational model that relies on a physical system is subject to the fact that information density and speed have intrinsic, ultimate limits. The RAM model, and in particular its underlying assumption that memory accesses take time independent of memory size, is therefore not physically implementable. This work is set in the field of limiting-technology machines, in which it is somewhat provocatively assumed that technology has reached those physical limits. The ultimate goal is to tackle the intrinsic latencies of physical systems by encouraging scalable organizations for processors and memories. An algorithmic study is presented that describes the implementation of high-concurrency programs for SP and SPE, sequential machine models able to compute direct-flow programs in optimal time. Then, a novel pipelined, hierarchical memory organization is presented, with optimal latency and bandwidth for a physical system. To both take full advantage of the memory's capabilities and exploit the instruction-level parallelism available in the code to be executed, a novel processor model is developed. Particular care is taken in devising an efficient information flow within the processor itself. Both designs are extremely scalable, as they are based on fixed-capacity, fixed-size nodes connected as a multidimensional array. Performance analysis of the resulting machine design shows that latencies internal to the processor can be the dominating source of complexity in instruction flow execution, adding to the effects of processor-memory interaction. A characterization of instruction flows is then developed, based on the topology induced by instruction dependences
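
    A back-of-the-envelope version of the density/speed argument the abstract appeals to (a sketch of the standard reasoning, not taken from the work itself):

```latex
% With at most \rho bits per unit volume and signal speed at most c,
% N bits of memory occupy a volume of at least N/\rho, so in three-dimensional
% space the farthest cell lies at distance \Omega((N/\rho)^{1/3}) and a
% worst-case access costs
\[
  T_{\text{access}}(N) \;=\; \Omega\!\left(\frac{1}{c}\Bigl(\frac{N}{\rho}\Bigr)^{1/3}\right),
\]
% contradicting the RAM assumption that access time is a constant independent
% of the memory size N.
```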

    Partitioning problems in parallel, pipelined and distributed computing

    The problem of optimally assigning the modules of a parallel program over the processors of a multiple-computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple-satellite system: partitioning multiple chain-structured parallel programs, multiple arbitrarily structured serial programs, and single tree-structured parallel programs. In addition, the problems of partitioning chain-structured parallel programs across chain-connected systems and across shared-memory (or shared-bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest
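
    A concrete illustration of the flavor of these partitioning problems (a minimal sketch, not the paper's Sum-Bottleneck path algorithm, and with hypothetical module weights): split a chain-structured program into contiguous segments, one per processor, so that the heaviest segment, the bottleneck, is as light as possible.

```python
# Minimal chain-partitioning illustration (not the Sum-Bottleneck path
# algorithm): place a chain of module weights onto p processors as contiguous
# segments, minimizing the weight of the heaviest segment.
from itertools import accumulate

def chain_partition_bottleneck(weights, p):
    """Return the minimal achievable bottleneck via dynamic programming."""
    n = len(weights)
    prefix = [0] + list(accumulate(weights))   # prefix[j] = sum of first j modules
    INF = float("inf")
    # dp[k][j] = best bottleneck when the first j modules use k segments
    dp = [[INF] * (n + 1) for _ in range(p + 1)]
    dp[0][0] = 0
    for k in range(1, p + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):          # last segment = modules i..j-1
                seg = prefix[j] - prefix[i]
                dp[k][j] = min(dp[k][j], max(dp[k - 1][i], seg))
    return dp[p][n]

# Hypothetical module weights (execution costs) and a 3-processor chain.
print(chain_partition_bottleneck([4, 2, 7, 1, 3, 6], 3))   # -> 9
```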

    Design of testbed and emulation tools

    The research summarized here was concerned with the design of testbed and emulation tools suited to projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software-based set of hierarchical tools was chosen to provide maximum flexibility, to ease the move to new computers as technology improves, and to take advantage of the inherent reliability and availability of commercially available computing systems

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Technological evolution has significantly increased the number of transistors for a given die area and raised switching speeds from a few MHz to the GHz range. This combination of shrinking feature size and rising performance demands lower supply voltages and effective handling of power dissipation in chips with millions of transistors. It has triggered a substantial amount of research into power-reduction techniques for almost every aspect of the chip, and particularly for the processor cores it contains. This paper presents an overview of techniques for achieving power efficiency mainly at the processor core level, but also visits related domains such as buses and memories. Various processor parameters and features, such as supply voltage, clock frequency, cache and pipelining, can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Emerging power-efficient processor architectures are also overviewed, and research activities are discussed that should help the reader identify how these factors in a processor contribute to power consumption. Some of these concepts have already been established, whereas others are still active research areas. © 2009 ACADEMY PUBLISHER
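
    The first-order CMOS power model behind the supply-voltage and clock-frequency knobs surveyed here (a standard textbook relation, not a result of the paper):

```latex
% Standard first-order CMOS power model: alpha is the switching activity,
% C_eff the effective switched capacitance, V_dd the supply voltage,
% f the clock frequency, I_leak the leakage current.
\[
  P \;\approx\; \underbrace{\alpha\, C_{\mathrm{eff}}\, V_{dd}^{2}\, f}_{\text{dynamic (switching)}}
  \;+\; \underbrace{V_{dd}\, I_{\mathrm{leak}}}_{\text{static (leakage)}},
\]
% so halving V_dd cuts switching power by roughly a factor of four, which is
% why dynamic voltage and frequency scaling is such a common optimization knob.
```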

    Pipelined implementation of JPEG image compression using HDL

    This thesis presents the architecture and design of a JPEG compressor for color images using VHDL. The system consists of major parts such as a color space converter, down sampler, 2-D DCT module, quantization, zigzag scanning and entropy coding. The color space conversion transforms the RGB colors to the YCbCr color space. The down sampling operation reduces the sampling rate of the color information (Cb and Cr). The 2-D DCT transforms the pixel data from the spatial domain to the frequency domain. The quantization operation eliminates the high-frequency components and the small-amplitude coefficients of the cosine expansion. Finally, the entropy coding uses run-length encoding (RLE), Huffman coding, variable-length coding (VLC) and differential coding to decrease the number of bits used to represent the image. JPEG compression is lossy, since the downsampling and quantization operations are irreversible, but the losses can be controlled in order to keep the necessary image quality. Architectures for these parts were designed and described in VHDL. The results were observed using the Active-HDL simulator and the code was synthesized using Xilinx ISE for a Virtex-4 FPGA. This pipelined architecture has a minimum latency of 187 clock cycles
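
    A software sketch of the first pipeline stage listed above, RGB to YCbCr conversion (the thesis implements this in VHDL; the coefficients below are the standard JPEG/JFIF ones, not values taken from the thesis):

```python
# RGB -> YCbCr conversion for one pixel, using the standard JPEG/JFIF matrix.
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit YCbCr (JFIF convention)."""
    y  =  0.299  * r + 0.587  * g + 0.114  * b
    cb = -0.1687 * r - 0.3313 * g + 0.5    * b + 128
    cr =  0.5    * r - 0.4187 * g - 0.0813 * b + 128
    # Clamp and round to stay inside the 8-bit range a hardware datapath would use.
    clamp = lambda v: max(0, min(255, int(round(v))))
    return clamp(y), clamp(cb), clamp(cr)

print(rgb_to_ycbcr(255, 0, 0))   # a pure red pixel -> (76, 85, 255)
```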

    ASC: A stream compiler for computing with FPGAs


    Architectural Approaches For Gallium Arsenide Exploitation In High-Speed Computer Design

    Continued advances in the capability of Gallium Arsenide (GaAs) technology have finally drawn serious interest from computer system designers. The recent demonstration of very large scale integration (VLSI) laboratory designs incorporating very fast GaAs logic gates heralds a significant role for GaAs technology in high-speed computer design. In this thesis we investigate design approaches to best exploit this promising technology in high-performance computer systems. We find significant differences between GaAs and Silicon technologies which are of relevance for computer design. The advantage that GaAs enjoys over Silicon in faster transistor switching speed is countered by a lower transistor count capability for GaAs integrated circuits. In addition, inter-chip signal propagation speeds in GaAs systems do not experience the same speedup exhibited by GaAs transistors; thus, GaAs designs are penalized more severely by inter-chip communication. The relatively low density of GaAs chips and the high cost of communication between them are significant obstacles to the full exploitation of the fast transistors of GaAs technology. A fast GaAs processor may be excessively underutilized unless special consideration is given to its information (instructions and data) requirements. Desirable GaAs system design approaches encourage low hardware resource requirements, and either minimize the processor's need for off-chip information, maximize the rate of off-chip information transfer, or overlap off-chip information transfer with useful computation. We show the impact that these considerations have on the design of the instruction format, arithmetic unit, memory system, and compiler for a GaAs computer system. Through a simulation study utilizing a set of widely used benchmark programs, we investigate several candidate instruction pipelines and candidate instruction formats in a GaAs environment. We demonstrate the clear performance advantage of an instruction pipeline based upon a pipelined memory system over a typical Silicon-like pipeline. We also show the performance advantage of packed instruction formats over typical Silicon instruction formats, and present a packed format which performs better than the experimental packed Stanford MIPS format
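
    A toy illustration of why packed instruction formats pay off when off-chip fetches are expensive (all widths below are hypothetical, not those of the thesis): packing several short instructions into one fetched word divides the number of instruction fetches, and hence the off-chip traffic, by the packing factor.

```python
# Hypothetical fetch-count comparison; the widths are illustrative only.
WORD_BITS = 64              # assumed fetch width from the pipelined memory
PACKED_INSTR_BITS = 16      # assumed short/packed encoding
UNPACKED_INSTR_BITS = 32    # assumed conventional fixed-width encoding

def fetches_needed(n_instructions, instr_bits):
    per_word = WORD_BITS // instr_bits      # instructions delivered per fetch
    return -(-n_instructions // per_word)   # ceiling division

n = 1000
print(fetches_needed(n, UNPACKED_INSTR_BITS))  # 500 off-chip fetches
print(fetches_needed(n, PACKED_INSTR_BITS))    # 250 off-chip fetches
```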

    The "MIND" Scalable PIM Architecture

    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture
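
    A conceptual software sketch of message-driven, split-transaction processing in the PIM spirit (an illustration only, not the MIND hardware mechanisms): work travels as a parcel to the memory node that owns the data, is executed there, and the reply returns asynchronously.

```python
# Toy model of parcel-driven execution at memory nodes (illustrative only).
from collections import deque

class MemoryNode:
    def __init__(self, base, size):
        self.base, self.mem = base, [0] * size
        self.inbox = deque()                 # parcels waiting to be processed

    def owns(self, addr):
        return self.base <= addr < self.base + len(self.mem)

    def step(self):
        """Process one pending parcel; return a reply parcel or None."""
        if not self.inbox:
            return None
        op, addr, value, sender = self.inbox.popleft()
        offset = addr - self.base
        if op == "add":                      # read-modify-write done at the memory
            self.mem[offset] += value
        return ("reply", addr, self.mem[offset], sender)

# Two nodes covering addresses 0..99 and 100..199 (sizes are arbitrary here).
nodes = [MemoryNode(0, 100), MemoryNode(100, 100)]

def send(parcel):
    _, addr, _, _ = parcel
    next(n for n in nodes if n.owns(addr)).inbox.append(parcel)

send(("add", 150, 5, "cpu0"))     # the increment travels to the data, not vice versa
send(("read", 150, None, "cpu0"))
for node in nodes:
    while (reply := node.step()) is not None:
        print(reply)              # ('reply', 150, 5, 'cpu0') printed twice
```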

    Towards adaptive balanced computing (ABC) using reconfigurable functional caches (RFCs)

    The general-purpose computing processor performs a wide range of functions. Although the performance of general-purpose processors has been steadily increasing, software technologies such as multimedia and digital signal processing applications demand ever more computing power. Reconfigurable computing has emerged to combine the versatility of general-purpose processors with the customization ability of ASICs. The basic premise of reconfigurability is to provide better performance and higher computing density than fixed-configuration processors. Most of the research in reconfigurable computing is dedicated to on-chip functional logic. If computing resources are adaptable to the computing requirement, the maximum performance can be achieved. To overcome the gap between processor and memory technology, the size of on-chip cache memory has been consistently increasing. A larger cache capacity, though beneficial in general, does not guarantee higher performance for all applications, as they may not utilize all of the cache efficiently. To utilize on-chip resources effectively, and to accelerate the performance of multimedia applications specifically, we propose a new architecture: Adaptive Balanced Computing (ABC). ABC uses dynamic resource configuration of on-chip cache memory by integrating Reconfigurable Functional Caches (RFC). An RFC can work as a conventional cache or as a specialized computing unit when necessary. In order to convert a cache memory into a computing unit, we include additional logic to embed multi-bit output LUTs into the cache structure. We add the reconfigurability of cache memory to a conventional processor with minimal modification to the load/store microarchitecture and with minimal compiler assistance. The ABC architecture utilizes resources more efficiently by dynamically reconfiguring the cache memory into computing units. The area penalty for this reconfiguration is about 50-60% of the area of a memory-cell-array-only cache, with a faster cache access time. In a base array cache (parallel decoding caches), the area penalty is 10-20% of the data array, with a 1-2% increase in the cache access time. However, compared with implementing all these units separately (as ASICs), we save 27% in area for FIR and 44% for DCT/IDCT with respect to the memory cell array cache, and about 80% for both applications with respect to the base array cache. Simulations with multimedia and DSP applications (DCT/IDCT and FIR/IIR) show that resource configuration with the RFC yields speedups ranging from 1.04X to 3.94X in overall applications and from 2.61X to 27.4X in the core computations. Simulations with various parameters indicate that the impact of reconfiguration can be minimized if an appropriate cache organization is selected
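
    A conceptual illustration of how a memory array can double as a computing unit (a sketch of LUT-based evaluation, not the RFC microarchitecture itself): precomputing coefficient times x for every 8-bit input turns an FIR multiply into a table read served from the cache array.

```python
# Illustrative LUT-as-compute sketch; the coefficient and samples are made up.
COEFF = 7                                    # hypothetical fixed filter coefficient
lut = [COEFF * x for x in range(256)]        # 256-entry table held in the "cache"

def fir_tap(sample):
    """One multiply of the FIR inner loop, served by a table lookup."""
    return lut[sample & 0xFF]

samples = [12, 200, 33]
print(sum(fir_tap(s) for s in samples))      # same result as 7*(12+200+33) = 1715
```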