8 research outputs found

    Module Partitioning and Interlaced Data Placement Schemes to Reduce Conflicts in Interleaved Memories

    In interleaved memories, interference between concurrently active vector streams results in memory bank conflicts and reduced bandwidth. In this paper, we present two schemes for reducing inter-vector interference. First, we propose a memory module partitioning technique in which disjoint access sets are created for each of the concurrent vectors. Various properties of the involved address mapping are presented. Then we present an interlaced data placement scheme, in which the simultaneously accessed vectors are interlaced and stored to memory. The performance of the two schemes is evaluated by trace-driven simulation. It is observed that the schemes have significant merit in reducing the interference in interleaved memories and increasing the effective memory bandwidth. The schemes are applicable to memory systems for superscalar processors, vector supercomputers, and parallel processors. Keywords: Address Distribution, Conflict-free Access, Vector Interference, Memory Storage Patterns.
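    The interference problem this abstract addresses can be sketched in a few lines: two vector streams that map onto the same banks of a low-order-interleaved memory collide every cycle, while giving each vector a disjoint set of banks (the module partitioning idea) removes the inter-vector conflicts entirely. All parameters below (8 banks, stride-2 streams, a 4/4 bank split) are illustrative assumptions, not values from the paper.

```python
# Toy model: count cycles where two concurrent vector streams
# target the same memory bank.
NUM_BANKS = 8

def conflicts(stream_a, stream_b, bank_a, bank_b):
    """Number of cycles in which both streams hit the same bank."""
    return sum(1 for a, b in zip(stream_a, stream_b)
               if bank_a(a) == bank_b(b))

# Two stride-2 vector streams starting at the same address:
# a worst case for plain low-order interleaving.
A = [i * 2 for i in range(64)]
B = [i * 2 for i in range(64)]

# Shared low-order interleaving: bank = address mod NUM_BANKS.
shared = lambda addr: addr % NUM_BANKS

# Module partitioning (illustrative): vector A owns banks 0..3,
# vector B owns banks 4..7, so the access sets are disjoint.
part_a = lambda addr: addr % 4
part_b = lambda addr: 4 + addr % 4

print(conflicts(A, B, shared, shared))   # every cycle collides
print(conflicts(A, B, part_a, part_b))   # disjoint sets: no collisions
```

    The trade-off the partitioning scheme makes is visible here: each vector sees only half the banks, so per-vector interleaving width shrinks in exchange for conflict-free concurrent access.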

    Design and VLSI Implementation of an Access Processor for a Decoupled Architecture

    Decoupled computer architectures provide high scalar performance by exploiting the fine-grained parallelism existing between the access and execute functions in a computer program. These architectures employ an access processor to perform data fetches ahead of demand by the execute process. Some decoupled architectures employ identical access and execute processors, but special processors to efficiently access data structures have also been proposed. In this paper, we present the design of an access processor targeted for VLSI implementation. The hardware elements, instruction set, and instruction formats of the processor are presented in detail, and the various trade-offs involved in the design are explained. Details of the implementation of a scaled-down version of the processor are also presented. This work was supported in part by the National Science Foundation under grant number MIP-8912455.
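    The access/execute split described above can be illustrated with a minimal software model: the access side walks the load addresses ahead of demand and deposits operands into a FIFO (a load data queue), which the execute side later drains. The names, the toy memory image, and the summing "execute" workload are invented for illustration; they are not the paper's processor design.

```python
from collections import deque

# Toy memory image: word at address a holds a * 10.
memory = {addr: addr * 10 for addr in range(16)}

def access_process(addresses, load_queue):
    """Access side: slip ahead of execute, fetching each operand
    and enqueueing it in the load data queue."""
    for addr in addresses:
        load_queue.append(memory[addr])

def execute_process(load_queue, n):
    """Execute side: consume n operands from the queue.
    Here the 'computation' is just a running sum."""
    return sum(load_queue.popleft() for _ in range(n))

load_queue = deque()
addrs = [1, 3, 5, 7]
access_process(addrs, load_queue)    # access runs ahead of demand
result = execute_process(load_queue, len(addrs))
print(result)
```

    In real decoupled machines the two sides run concurrently and the queue depth bounds how far the access processor may slip ahead; this sequential sketch only shows the producer/consumer relationship.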

    A Comparative Evaluation of Software Techniques to Hide Memory Latency

    Software-oriented techniques to hide memory latency in superscalar and superpipelined machines include loop unrolling, software pipelining, and software cache prefetching. Issuing the data fetch request prior to the actual need for the data allows overlap of accessing with useful computation. Loop unrolling and software pipelining do not necessitate microarchitecture or instruction set architecture changes, whereas software-controlled prefetching does. While studies on the benefits of the individual techniques have been done, no study evaluates all of these techniques within a consistent framework. This paper attempts to remedy this by providing a comparative evaluation of the features and benefits of the techniques. Loop unrolling and static scheduling of loads is seen to produce significant improvement in performance at lower latencies. Software pipelining is observed to be better than software-controlled prefetching at lower latencies, but at higher latencies, software prefetching outperforms software pipelining.
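    The latency-hiding comparison can be made concrete with a toy cycle-count model: a naive loop stalls for the full memory latency on every load, while a software-prefetched loop issues each fetch some number of iterations early, so only the uncovered portion of the latency stalls the pipeline. The cost parameters and the prefetch-distance model below are invented for illustration and are not the paper's simulation methodology.

```python
def naive_cycles(n, mem_latency, compute=1):
    """Every load stalls for the full memory latency before
    the iteration's computation can proceed."""
    return n * (mem_latency + compute)

def prefetch_cycles(n, mem_latency, compute=1, distance=4):
    """A prefetch issued `distance` iterations early hides latency
    behind the intervening computation; only the uncovered
    remainder of the latency stalls each iteration."""
    covered = distance * compute
    stall = max(0, mem_latency - covered)
    return n * (stall + compute)

# At a higher latency, prefetching wins substantially...
print(naive_cycles(100, mem_latency=8))            # 900 cycles
print(prefetch_cycles(100, mem_latency=8))         # 500 cycles

# ...while at a low latency the prefetch fully covers the miss.
print(prefetch_cycles(100, mem_latency=2))         # 100 cycles
```

    This mirrors the qualitative finding quoted above: the payoff of prefetch-style techniques grows with memory latency, since there is more latency for the overlapped computation to hide.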

    Classification and Performance Evaluation of Instruction Buffering Techniques

    The speed disparity between processor and memory subsystems has been bridged in many existing large-scale scientific computers and microprocessors with the help of instruction buffers or instruction caches. In this paper we classify these buffers into traditional instruction buffers, conventional instruction caches, and prefetch queues; detail their prominent features; and evaluate the performance of buffers in several existing systems using trace-driven simulation. We compare these schemes with a recently proposed queue-based instruction cache memory. An implementation-independent performance metric is proposed for the various organizations and used for the evaluations. We analyze the simulation results and discuss the effect of various parameters such as prefetch threshold, bus width, and buffer size on performance.
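    A prefetch queue of the kind classified above can be sketched as a small FIFO with a refill policy: whenever occupancy drops below the prefetch threshold, a block of sequential instructions is fetched over the bus. The class name, the sizes, the threshold, and the sequential-fetch workload are all illustrative assumptions, not the organizations evaluated in the paper.

```python
from collections import deque

class PrefetchQueue:
    """FIFO instruction buffer refilled whenever occupancy
    falls below a prefetch threshold (illustrative policy)."""
    def __init__(self, size=8, threshold=4, bus_width=4):
        self.buf = deque()
        self.size = size              # total buffer capacity (words)
        self.threshold = threshold    # refill trigger level
        self.bus_width = bus_width    # words per bus transaction
        self.fetches = 0              # bus transactions issued

    def maybe_prefetch(self, next_pc):
        """Fetch one block from `next_pc` if occupancy is low;
        return the PC of the next unfetched word."""
        if len(self.buf) < self.threshold:
            for word in range(next_pc, next_pc + self.bus_width):
                if len(self.buf) < self.size:
                    self.buf.append(word)
            self.fetches += 1
            return next_pc + self.bus_width
        return next_pc

    def issue(self):
        """Hand one instruction to the decoder, if any is buffered."""
        return self.buf.popleft() if self.buf else None

# Straight-line fetch of 16 sequential instructions.
q = PrefetchQueue()
pc, issued = 0, []
while len(issued) < 16:
    pc = q.maybe_prefetch(pc)
    ins = q.issue()
    if ins is not None:
        issued.append(ins)
print(issued, q.fetches)
```

    Varying `threshold` and `bus_width` in this model changes the number of bus transactions needed for the same instruction stream, which is the kind of parameter sensitivity the paper's simulations examine.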

    Architecture For A Non-Deterministic Simulation Machine

    Causality constraints of random discrete simulation make parallel and distributed processing difficult. Methods of applying reconfigurable logic to implement and accelerate simulation service event queues are presented which process simulation events at a rate of one event per 80 nanoseconds. The event generator presented in our previous work (Bumble and Coraor 1998) is also capable of sustaining the 80 ns clock rate, providing overall speedup rates which depend on the software comparison scenario. The software comparison cited in this work provides a two-order-of-magnitude speedup. The speedup factor varies with the size of the software event queue. Field Programmable Gate Arrays (FPGAs) are used to implement and test the service queue design.
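    The function the hardware service queue performs can be modeled in a few lines of software: pending simulation events are ordered by timestamp, and the service loop retires the minimum-time event each step, preserving the causality constraint. The FPGA design does this at one event per 80 ns; this priority-queue sketch, with invented event data, only demonstrates the ordering behavior.

```python
import heapq

def service_events(events):
    """Retire pending simulation events in causal (timestamp) order."""
    heap = list(events)
    heapq.heapify(heap)           # min-heap keyed on timestamp
    retired = []
    while heap:
        retired.append(heapq.heappop(heap))
    return retired

# Pending events as (timestamp, payload) pairs, arrival order scrambled.
pending = [(50, "B"), (10, "A"), (120, "C"), (10, "D")]
order = service_events(pending)
print(order)
```

    A software event queue like this pays O(log n) per operation and grows slower as the queue fills, which is consistent with the observation above that the hardware speedup factor varies with the size of the software event queue.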