4,868 research outputs found

    Access to vectors in multi-module memories

    The poor bandwidth obtained from memory when conflicts arise in the modules or in the interconnection network degrades the performance of computers. Address transformation schemes, such as interleaving, skewing and linear transformations, have been proposed to achieve conflict-free access for streams with constant stride. However, this is achieved only for some strides. In this paper, we summarize a mechanism that requests the elements out of order, which achieves conflict-free access for a larger number of strides. We study the cases of a single vector processor and of a vector multiprocessor system. For the latter case, we propose a synchronous mode of accessing memory that can be applied in SIMD machines or in MIMD systems with decoupled access and execution. Peer Reviewed. Postprint (published version)
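
To see why plain interleaving is conflict-free only for some strides, the following minimal sketch counts which modules a constant-stride stream visits. It assumes low-order interleaving (module = address mod number of modules); the function names are hypothetical and the out-of-order mechanism of the paper is not reproduced here.

```python
def modules_touched(stride, num_modules, length, base=0):
    """Distinct modules visited by a constant-stride stream under
    low-order interleaving (module = address mod num_modules)."""
    return {(base + i * stride) % num_modules for i in range(length)}


def max_requests_per_module(stride, num_modules, window, base=0):
    """Largest number of requests that land on a single module among
    `window` consecutive in-order requests."""
    counts = {}
    for i in range(window):
        module = (base + i * stride) % num_modules
        counts[module] = counts.get(module, 0) + 1
    return max(counts.values())


if __name__ == "__main__":
    for stride in (1, 2, 3, 4, 8):
        used = len(modules_touched(stride, 8, 8))
        worst = max_requests_per_module(stride, 8, 8)
        print(f"stride {stride}: {used} modules used, worst module gets {worst} requests")
```

Under this mapping a stream with stride S uses only M / gcd(M, S) of the M modules, so odd strides spread evenly over 8 modules while stride 8 sends every request to the same one.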

    Access to streams in multiprocessor systems

    When accessing streams in vector multiprocessor machines, degradation in the interconnection network and conflicts in the memory modules are the factors that reduce the efficiency of the system. In this paper, we present a synchronous access mechanism that allows conflict-free access to streams in a SIMD vector multiprocessor system. Each processor accesses the corresponding elements out of order, in such a way that in each cycle the requested elements do not collide in the interconnection network. Moreover, memory modules are accessed so that conflicts are avoided. The use of the proposed mechanism in present-day architectures would allow conflict-free access to streams with the most common strides that appear in real applications. The additional hardware is described and is shown to be of similar complexity to that required for in-order access. Postprint (published version)

    Static locality analysis for cache management

    Most memory references in numerical codes correspond to array references whose indices are affine functions of surrounding loop indices. These array references follow a regular, predictable memory pattern that can be analysed at compile time. This analysis can provide valuable information, such as the locality exhibited by the program, which can be used to implement more intelligent caching strategies. In this paper we propose a static locality analysis oriented to the management of data caches. We show that previous proposals on locality analysis are not appropriate when the programs have a high conflict miss ratio. This paper extends those proposals by introducing a compile-time interference analysis that significantly improves their performance. We first show how this analysis can be used to characterize the dynamic locality properties of numerical codes. This evaluation shows, for instance, that a large percentage of references exhibit any type of locality. This motivates the use of a dual data cache, which has a module specialized to exploit temporal locality, and a selective cache respectively. Then, the performance provided by these two cache organizations is evaluated. In both organizations, the static locality analysis is responsible for tagging each memory instruction according to the particular type(s) of locality that it exhibits. Peer Reviewed. Postprint (published version)
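
As a rough illustration of compile-time tagging for affine references, the sketch below classifies a reference by the coefficient of the innermost loop index in its address expression. This is a textbook simplification, not the paper's analysis: it ignores the interference (conflict) analysis the abstract describes, and the function name, element size and line size are assumptions.

```python
def classify_reference(inner_coeff, elem_bytes=8, line_bytes=32):
    """Tag an affine array reference by the locality it shows in the
    innermost loop.

    inner_coeff is the coefficient of the innermost loop index in the
    element offset, e.g. 1 for A[i][j] and the row length for A[j][i]
    when j is the innermost index.
    """
    stride_bytes = abs(inner_coeff) * elem_bytes
    if stride_bytes == 0:
        return "temporal"   # same address on every inner iteration
    if stride_bytes < line_bytes:
        return "spatial"    # consecutive iterations share a cache line
    return "none"           # every iteration touches a new line


if __name__ == "__main__":
    print(classify_reference(0))    # reference invariant in the inner loop
    print(classify_reference(1))    # unit-stride reference
    print(classify_reference(512))  # column walk over a row-major array
```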

    Improving GPU Shared Memory Access Efficiency

    Graphics Processing Units (GPUs) often employ shared memory to provide efficient storage for threads within a computational block. This shared memory includes multiple banks to improve performance by enabling concurrent accesses across the memory banks. Conflicts occur when multiple memory accesses attempt to simultaneously access a particular bank, resulting in serialized access and a concomitant performance reduction. Identifying and eliminating these memory bank access conflicts becomes critical for achieving high performance on GPUs; however, for common 1D and 2D access patterns, understanding the potential bank conflicts can prove difficult. Current GPUs support memory bank accesses with configurable bit-widths; optimizing these bit-widths could result in data layouts with fewer conflicts and better performance. This dissertation presents a framework for bank conflict analysis and automatic optimization. Given static access pattern information for a kernel, this tool analyzes the conflict number of each pattern, and then searches for an optimized solution for all shared memory buffers. This data layout solution is based on parameters for inter-padding, intra-padding, and the bank access bit-width. The experimental results show that static bank conflict analysis is a practical solution and independent of the workload size of a given access pattern. For 13 kernels from 6 benchmark suites (RODINIA and NVIDIA CUDA SDK) facing shared memory bank conflicts, tests indicated this approach can gain a 5% to 35% improvement in runtime.
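
The effect of padding on bank conflicts can be sketched with a small model. It assumes 32 banks with consecutive words mapped to consecutive banks and 32 threads per access, which is a common configuration but an assumption here; the function names are hypothetical and this is not the dissertation's framework.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption: 32 banks, one word wide each


def conflict_degree(word_indices, num_banks=NUM_BANKS):
    """Number of serialized passes one group of threads needs: the
    largest count of distinct words that map to the same bank.
    Threads reading the same word count once."""
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())


def column_access(col, row_width, num_threads=32):
    """Word indices touched when thread t reads tile[t][col] from a
    row-major tile whose rows are row_width words long."""
    return [t * row_width + col for t in range(num_threads)]


if __name__ == "__main__":
    print(conflict_degree(column_access(0, 32)))  # unpadded 32-wide rows
    print(conflict_degree(column_access(0, 33)))  # one padding word per row
```

With 32-word rows every thread in the column access lands on the same bank; adding one padding word per row shifts each row by one bank, so the 32 threads hit 32 different banks.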

    The Environmental Impact and Formation of Meals from the Pilot Year of a Las Vegas Convention Food Rescue Program

    Annually, millions of tonnes of leftover edible foods are sent to landfill. Not only does this harm the environment by increasing the release of greenhouse gases which contribute to climate change, but it poses a question of ethics given that nearly 16 million households are food insecure in the US, as are hundreds of millions of people around the globe. The purpose of this study was to document the amount of food diverted from landfill in the pilot year of a convention food rescue program and to determine the amount of greenhouse gas (GHG) emissions avoided by the diversion of such food. In the pilot year of the convention food rescue program 24,703 kg of food were diverted. It is estimated that 108 metric tonnes of GHG emissions were avoided as a result, while 45,383 meals for food insecure individuals were produced. These findings have significant implications for public and environmental health, as GHG emissions have a destructive effect on the earth’s atmosphere and rescued food can be redistributed to food insecure individuals.

    Randomized cache placement for eliminating conflicts

    Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache whose miss ratio depends solely on working set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as by cache simulations of individual loop kernels to illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite. Peer Reviewed. Postprint (published version)
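
The idea of changing the cache index function can be illustrated with a deliberately simple stand-in. The sketch below compares conventional modulo indexing with an XOR fold of two address fields; the XOR fold is swapped in for illustration only and is not claimed to be the index function used in the paper. Cache geometry and names are hypothetical.

```python
NUM_SETS = 64    # hypothetical cache geometry
LINE_BYTES = 32


def modulo_index(addr):
    """Conventional index: low-order bits of the block address."""
    return (addr // LINE_BYTES) % NUM_SETS


def xor_index(addr):
    """Illustrative alternative: XOR the low index field with the
    next-higher field of the block address."""
    block = addr // LINE_BYTES
    low = block % NUM_SETS
    high = (block // NUM_SETS) % NUM_SETS
    return low ^ high


def sets_used(index_fn, stride_bytes, count):
    """Number of distinct cache sets touched by a strided access stream."""
    return len({index_fn(i * stride_bytes) for i in range(count)})


if __name__ == "__main__":
    stride = NUM_SETS * LINE_BYTES  # a stride that defeats modulo indexing
    print(sets_used(modulo_index, stride, 64))
    print(sets_used(xor_index, stride, 64))
```

For a stride equal to the cache's set span, modulo indexing maps every access to one set, while the XOR fold spreads the same 64 accesses over 64 sets.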

    Design of a parallel vector access unit for SDRAM memory systems

    Journal Article. Parallel Vector Access (PVA) is a technique that exploits the regularity of vector or stream accesses to perform them efficiently in parallel on a multi-bank memory system. The performance of applications that have vector accesses may be improved using a memory controller that performs scatter/gather operations so that only the vector or stream elements that are accessed by the application are transmitted across the system bus. These scatter/gather operations can be sped up by broadcasting vector operations to all banks of memory in parallel, each of which implements an algorithm to determine which elements of the requested vector it contains. This thesis presents the mathematical foundations behind one such algorithm for controller are investigated. The performance of such a memory controller on vector kernels is studied by gate-level simulation and the results analyzed. Because of the parallel approach, the PVA is able to load elements up to 32.8 times faster than a conventional memory system and 3.3 times faster than a pipelined vector unit, without hurting normal cache line fill performance.
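
The broadcast scheme described above can be sketched as follows: every bank receives the same (base, stride, length) request and decides which elements it holds. The sketch assumes word-interleaved banks and uses a brute-force scan per bank in place of the closed-form algorithm the thesis develops; all names are hypothetical.

```python
def elements_in_bank(bank, base, stride, length, num_banks):
    """Vector indices i whose word address base + i*stride falls in
    `bank`, assuming word interleaving (bank = address mod num_banks)."""
    return [i for i in range(length)
            if (base + i * stride) % num_banks == bank]


def parallel_gather(memory, base, stride, length, num_banks):
    """Gather a strided vector by letting each bank supply its own
    elements. The outer loop stands in for banks working concurrently."""
    result = [None] * length
    for bank in range(num_banks):
        for i in elements_in_bank(bank, base, stride, length, num_banks):
            result[i] = memory[base + i * stride]
    return result


if __name__ == "__main__":
    memory = list(range(100))
    print(elements_in_bank(0, base=0, stride=2, length=8, num_banks=4))
    print(parallel_gather(memory, base=3, stride=5, length=8, num_banks=4))
```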

    Development of Bone Targeting Drugs.

    The skeletal system, comprising bones, ligaments, cartilage and their connective tissues, is critical for the structure and support of the body. Diseases that affect the skeletal system can be difficult to treat, mainly because of the avascular cartilage region. Targeting drugs to the site of action can not only increase efficacy but also reduce toxicity. Bone-targeting drugs are designed with either of two general targeting moieties, aimed at the entire skeletal system or a specific cell type. Most bone-targeting drugs utilize an affinity to hydroxyapatite, a major component of the bone matrix that includes a high concentration of positively-charged Ca(2+). The strategies for designing such targeting moieties can involve synthetic and/or biological components including negatively-charged amino acid peptides or bisphosphonates. Efficient delivery of bone-specific drugs has a significant impact on the treatment of skeletal related disorders including infectious diseases (osteoarthritis, osteomyelitis, etc.), osteoporosis, and metabolic skeletal dysplasia. Despite recent advances, however, both delivering the drug to its target without losing activity and avoiding adverse local effects remain a challenge. In this review, we investigate the current development of bone-targeting moieties, their efficacy and limitations, and discuss future directions for the development of these specific targeted treatments.

    Impulse: building a smarter memory controller

    Journal Article. Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. In this paper we describe the design of the Impulse architecture, and show how an Impulse memory system can be used to improve the performance of memory-bound programs. For the NAS conjugate gradient benchmark, Impulse improves performance by 67%. Because it requires no modification to processor, cache, or bus designs, Impulse can be adopted in conventional systems. In addition to scientific applications, we expect that Impulse will benefit regularly strided, memory-bound applications of commercial importance, such as database and multimedia programs.
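
One way to picture address remapping for strided data is a mapping from a dense alias region to the scattered physical elements, so that a cache line fetched from the alias holds only useful elements. The sketch below is a guess at the general idea under stated assumptions; the term "shadow" for the alias region, the function name and all parameters are hypothetical and do not describe the actual Impulse interface.

```python
def make_strided_remap(shadow_base, phys_base, stride_bytes, elem_bytes):
    """Build a function translating an address in a dense alias
    ("shadow") region into the physical address of the corresponding
    strided element. All names and parameters are hypothetical."""
    def remap(shadow_addr):
        if shadow_addr < shadow_base:
            raise ValueError("address below the shadow region")
        offset = shadow_addr - shadow_base
        # Which element, and which byte within that element.
        index, within = divmod(offset, elem_bytes)
        return phys_base + index * stride_bytes + within
    return remap


if __name__ == "__main__":
    # 8-byte elements stored 64 bytes apart, presented as a dense array.
    remap = make_strided_remap(0x1000, 0x8000, stride_bytes=64, elem_bytes=8)
    for addr in (0x1000, 0x1008, 0x100C, 0x1010):
        print(hex(addr), "->", hex(remap(addr)))
```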