    The Performance Advantage of Applying Compression to the Memory System

    The memory system stores information comprising primarily instructions and data and, secondarily, address information such as cache tag fields. It interacts with the processor by supporting the related traffic (again comprising addresses, instructions, and data). Continuing exponential growth in processor performance, combined with technology, architecture, and application trends, places enormous demands on the memory system to support this information storage and exchange at sufficiently high performance (i.e., to provide low-latency, high-bandwidth access to large amounts of information). This paper comprehensively analyzes the redundancy in the information (addresses, instructions, and data) stored in and exchanged between the processor and the memory system, and evaluates the potential of compression for improving memory-system performance. Analysis of traces obtained with Sun Microsystems’ Shade simulator, simulating SPARC executables of nine integer and six floating-point programs in the SPEC CPU2000 benchmark suite, yields impressive results. Well-designed compression schemes may provide performance benefits that far outweigh the extra time and logic for compression and decompression. This will be even more so in the future, since the speed and size of logic (which will be used to perform compression/decompression) are projected to improve at a much higher rate than those of interconnect (which will be used to communicate the information), both on-chip and off-chip.
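    The paper quantifies redundancy from SPEC traces; as a rough, self-contained illustration of the underlying idea (not the paper's method), the sketch below uses zlib as a generic compressor to estimate how much redundancy a stream of 32-bit values exposes. The word streams are invented for the example:

    ```python
    import zlib

    def compression_ratio(words):
        # Pack 32-bit words into bytes and compare raw size to compressed
        # size: a rough proxy for the redundancy in a memory-traffic stream.
        raw = b"".join(w.to_bytes(4, "little") for w in words)
        return len(raw) / len(zlib.compress(raw, 9))

    # Traffic dominated by small or repeated values (common in integer
    # programs) compresses far better than uniformly varied traffic.
    repetitive = [i % 16 for i in range(4096)]
    varied = list(range(4096))
    print(compression_ratio(repetitive), compression_ratio(varied))
    ```

    The gap between the two ratios is the kind of headroom the paper argues a well-designed hardware scheme can convert into effective bandwidth.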

    Scalable Load Balancing Strategies for Parallel A* Algorithms

    In this paper, we develop load balancing strategies for scalable high-performance parallel A* algorithms suitable for distributed-memory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors P used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in Θ(log P) time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almost-linear speedup. The startup and load balancing schemes were employed in parallel ..
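    For context, "nonessential work" is defined relative to the sequential baseline, which expands nodes in nondecreasing f = g + h order. A minimal sequential A* sketch (a generic illustration, not the paper's parallel algorithm; the grid and heuristic are invented for the example):

    ```python
    import heapq

    def astar(start, goal, neighbors, h):
        # Sequential A*: always expand the open node with the smallest
        # f = g + h. A parallel version that deviates from this order may
        # expand "nonessential" nodes with f above the optimal cost.
        open_heap = [(h(start), 0, start)]
        best_g = {start: 0}
        while open_heap:
            f, g, node = heapq.heappop(open_heap)
            if node == goal:
                return g
            if g > best_g.get(node, float("inf")):
                continue  # stale heap entry
            for nbr, cost in neighbors(node):
                ng = g + cost
                if ng < best_g.get(nbr, float("inf")):
                    best_g[nbr] = ng
                    heapq.heappush(open_heap, (ng + h(nbr), ng, nbr))
        return None

    # 4-connected n x n grid with unit edge costs.
    def grid_neighbors(n):
        def nb(p):
            x, y = p
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                if 0 <= x + dx < n and 0 <= y + dy < n:
                    yield (x + dx, y + dy), 1
        return nb

    # Manhattan-distance heuristic; shortest-path cost on a 5x5 grid is 8.
    print(astar((0, 0), (4, 4), grid_neighbors(5),
                lambda p: abs(4 - p[0]) + abs(4 - p[1])))
    ```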

    Efficient Network-Flow Based Techniques for Dynamic Fault Reconfiguration in FPGAs

    In this paper, we consider a "dynamic" node covering framework for incorporating fault tolerance in SRAM-based segmented-array FPGAs with spare row(s) and/or column(s) of cells. Two types of designs are considered: one that can support only node-disjoint (and hence nonintersecting) rectilinear reconfiguration paths, and another that can support edge-disjoint (and hence possibly intersecting) rectilinear reconfiguration paths. The advantage of this approach is that reconfiguration paths are determined dynamically, depending upon the actual set of faults, and track segments are used only as required, resulting in higher reconfigurability and lower track overheads than previously proposed "static" approaches. We provide optimal network-flow based reconfiguration algorithms for both of our designs, and present and analyze a technique for speeding up these algorithms, depending upon the fault size, by as much as 20 times. Finally, we present reconfigurability results for our FPGA designs that show much better fault tolerance than previous approaches: the reconfigurability of the edge-disjoint design is 90% or better, and 100% most of the time, which implies near-optimal spare-cell utilization.
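    The computational core of such network-flow formulations is a maximum-flow problem: finding a maximum set of disjoint paths from faulty cells to spares. The sketch below is a generic Edmonds-Karp max-flow implementation with an invented toy network, not the paper's FPGA-specific flow network:

    ```python
    from collections import defaultdict, deque

    def max_flow(cap, s, t):
        # Edmonds-Karp: repeatedly find a shortest augmenting path by BFS
        # in the residual graph and push the bottleneck capacity along it.
        total = 0
        while True:
            parent = {s: None}
            queue = deque([s])
            while queue and t not in parent:
                u = queue.popleft()
                for v, c in cap[u].items():
                    if c > 0 and v not in parent:
                        parent[v] = u
                        queue.append(v)
            if t not in parent:
                return total
            path, v = [], t
            while parent[v] is not None:
                path.append((parent[v], v))
                v = parent[v]
            bottleneck = min(cap[u][v] for u, v in path)
            for u, v in path:
                cap[u][v] -= bottleneck   # consume forward capacity
                cap[v][u] += bottleneck   # add reverse residual capacity
            total += bottleneck

    def make_graph(edges):
        g = defaultdict(lambda: defaultdict(int))
        for u, v, c in edges:
            g[u][v] += c
        return g

    g = make_graph([("s", "a", 3), ("s", "b", 2),
                    ("a", "t", 2), ("b", "t", 3), ("a", "b", 1)])
    print(max_flow(g, "s", "t"))
    ```

    With unit capacities, the max-flow value equals the number of disjoint reconfiguration paths, which is why a maximum flow certifies whether a given fault set is reconfigurable.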

    Value-Based Bit Ordering for Energy Optimization of On-Chip Global Signal Buses

    In this paper, we present a technique that exploits the statistical behavior of data values transmitted on global signal buses to determine an energy-efficient ordering of bits that minimizes the inter-wire coupling energy and also reduces total bus energy. Statistics are collected for instruction and data bus traffic from eight SPEC CPU2K benchmarks, and an optimization problem is formulated and solved optimally using a publicly available tool. Results obtained using the optimal bit order on large non-overlapping test samples from the same set of benchmarks show that, on average, adjacent inter-wire coupling energies are reduced by about 35.4% for instruction buses and by about 21.6% for data buses using the proposed technique.
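    The bit-ordering idea can be shown in miniature. The sketch below uses a simplified coupling model (counting transitions where adjacent wires switch to opposite values, the worst case for coupling energy) and exhaustive search over orders; the cost model, sample values, and search method are all invented for illustration, whereas the paper solves a full-width optimization formulation with an external tool:

    ```python
    from itertools import permutations

    def coupling_events(samples, order):
        # Count transitions in which two adjacent wires switch to opposite
        # values: the pattern that dominates inter-wire coupling energy.
        width = len(order)
        events = 0
        for prev, cur in zip(samples, samples[1:]):
            p = [(prev >> b) & 1 for b in order]
            c = [(cur >> b) & 1 for b in order]
            for i in range(width - 1):
                if p[i] != c[i] and p[i + 1] != c[i + 1] and c[i] != c[i + 1]:
                    events += 1
        return events

    def best_order(samples, width):
        # Exhaustive search is feasible only at toy widths; at real bus
        # widths this becomes the optimization problem the paper formulates.
        return min(permutations(range(width)),
                   key=lambda o: coupling_events(samples, o))
    ```

    Placing bits whose values tend to move together on adjacent wires is what reduces the opposite-direction switching that this cost function counts.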