946,105 research outputs found
Exploring Application Performance on Emerging Hybrid-Memory Supercomputers
Next-generation supercomputers will feature more hierarchical and
heterogeneous memory systems with different memory technologies working
side-by-side. A critical question is whether at large scale existing HPC
applications and emerging data-analytics workloads will have performance
improvement or degradation on these systems. We propose a systematic and fair
methodology to identify the trend of application performance on emerging
hybrid-memory systems. We model the memory system of next-generation
supercomputers as a combination of "fast" and "slow" memories. We then analyze
performance and dynamic execution characteristics of a variety of workloads,
from traditional scientific applications to emerging data analytics to compare
traditional and hybrid-memory systems. Our results show that data analytics
applications can clearly benefit from the new system design, especially at
large scale. Moreover, hybrid-memory systems do not penalize traditional
scientific applications, which may also show performance improvement.Comment: 18th International Conference on High Performance Computing and
Communications, IEEE, 201
Command vector memory systems: high performance at low cost
The focus of this paper is on designing both a low cost and high performance, high bandwidth vector memory system that takes advantage of modern commodity SDRAM memory chips. To successfully extract the full bandwidth from SDRAM parts, we propose a new memory system organization based on sending commands to the memory system as opposed to sending individual addresses. A command specifies, in a few bytes, a request for multiple independent memory words. A command is similar to a burst found in DRAM memories, but does not require the memory words to be consecutive. The command is sent to all sections of the memory array simultaneously, thus not requiring a crossbar in the proper sense. Our simulations show that this command based memory system can improve performance over a traditional SDRAM-based memory system by factors that range between 1.15 up to 1.54. Moreover, in many cases, the command memory system outperforms even the best SRAM memory system under consideration. Overall the command based memory system achieves similar or better results than a 10 ns SRAM memory system (a) using fewer banks and (b) using memory devices that are between 15 to 60 times cheaper.Peer ReviewedPostprint (published version
64-bit architechtures and compute clusters for high performance simulations
Simulation of large complex systems remains one of the most demanding
of high performance computer systems both in terms of raw compute performance
and efficient memory management. Recent availability of 64-bit
architectures has opened up the possibilities of commodity computers accessing
more than the 4 Gigabyte memory limit previously enforced by 32-bit
addressing. We report on some performance measurements we have made on
two 64-bit architectures and their consequences for some high performance
simulations. We discuss performance of our codes for simulations of artificial
life models; computational physics models of point particles on lattices; and
with interacting clusters of particles. We have summarised pertinent features
of these codes into benchmark kernels which we discuss in the context of wellknown
benchmark kernels of the 32-bit era. We report on how these these
findings were useful in the context of designing 64-bit compute clusters for
high-performance simulations
GraphH: High Performance Big Graph Analytics in Small Clusters
It is common for real-world applications to analyze big graphs using
distributed graph processing systems. Popular in-memory systems require an
enormous amount of resources to handle big graphs. While several out-of-core
approaches have been proposed for processing big graphs on disk, the high disk
I/O overhead could significantly reduce performance. In this paper, we propose
GraphH to enable high-performance big graph analytics in small clusters.
Specifically, we design a two-stage graph partition scheme to evenly divide the
input graph into partitions, and propose a GAB (Gather-Apply-Broadcast)
computation model to make each worker process a partition in memory at a time.
We use an edge cache mechanism to reduce the disk I/O overhead, and design a
hybrid strategy to improve the communication performance. GraphH can
efficiently process big graphs in small clusters or even a single commodity
server. Extensive evaluations have shown that GraphH could be up to 7.8x faster
compared to popular in-memory systems, such as Pregel+ and PowerGraph when
processing generic graphs, and more than 100x faster than recently proposed
out-of-core systems, such as GraphD and Chaos when processing big graphs
- …