
    Scale-out NUMA

    Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of these accesses is a poor match for commodity networking technologies, including RDMA, which incur latencies 10-1000x higher than local DRAM operations. We introduce Scale-Out NUMA (soNUMA) – an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. soNUMA layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA relies on the remote memory controller – a new architecturally-exposed hardware block integrated into the node’s local coherence hierarchy. Our results, based on cycle-accurate full-system simulation, show that soNUMA performs remote reads at latencies within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core.
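
    The abstract describes an asynchronous, RDMA-inspired programming model in which cores post fine-grained remote operations to the remote memory controller (RMC) and later collect completions. The following is a minimal sketch of what such a queue-pair style interface could look like; the names (RemoteMemoryController, RemoteRead, post, poll) and the software queues are illustrative assumptions, not soNUMA's actual API or hardware behavior.

        // Hypothetical sketch of an asynchronous remote-read interface in the
        // spirit of the queue-pair model the abstract describes. The deques
        // merely model the work/completion queues that, in soNUMA, would live
        // in memory and be serviced by the RMC hardware.
        #include <cstddef>
        #include <cstdint>
        #include <deque>

        struct RemoteRead {                 // one work-queue entry
            std::uint16_t node;             // destination node id
            std::uint64_t remote_offset;    // offset within the remote memory region
            void*         local_buf;        // local buffer receiving the data
            std::size_t   len;              // transfer size (fine-grained, e.g. 64 B)
        };

        struct Completion { std::uint64_t wr_id; bool ok; };

        class RemoteMemoryController {      // software stand-in for the RMC
        public:
            // Post an asynchronous remote read; returns a work-request id.
            std::uint64_t post(const RemoteRead& rr) {
                wq_.push_back(rr);          // models writing a work-queue entry
                return next_id_++;
            }
            // Poll the completion queue without blocking the core.
            bool poll(Completion& c) {
                if (cq_.empty()) return false;
                c = cq_.front();
                cq_.pop_front();
                return true;
            }
        private:
            std::deque<RemoteRead>  wq_;
            std::deque<Completion>  cq_;
            std::uint64_t           next_id_ = 0;
        };

    The split between post and poll is what would let a single core keep many fine-grained remote reads in flight, which is how figures such as 10M remote operations per second per core become plausible without blocking on each access.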

    Multi-GPU Graph Analytics

    We present a single-node, multi-GPU programmable graph processing library that allows programmers to easily extend single-GPU graph algorithms to achieve scalable performance on large graphs with billions of edges. Directly using the single-GPU implementations, our design only requires programmers to specify a few algorithm-dependent concerns, hiding most multi-GPU-related implementation details. We analyze the theoretical and practical limits to scalability in the context of varying graph primitives and datasets. We describe several optimizations, such as direction-optimizing traversal and a just-enough memory allocation scheme, for better performance and smaller memory consumption. Compared to previous work, we achieve best-of-class performance across operations and datasets, including excellent strong and weak scalability on most primitives as we increase the number of GPUs in the system.
    Comment: 12 pages. Final version submitted to IPDPS 201
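
    The claim that single-GPU algorithms carry over with only "a few algorithm-dependent concerns" is easiest to see on a concrete primitive. Below is a minimal host-side sketch of one BFS iteration over a vertex-partitioned graph, assuming a simple modulo partitioning and a push-based traversal; the names and the partitioning scheme are illustrative assumptions, not the library's actual API, and the sketch does not include the direction-optimizing traversal or just-enough allocation optimizations mentioned above.

        // One BFS iteration on one partition. The loop body is ordinary
        // single-device BFS; the only multi-device addition is routing
        // neighbors owned by other partitions into a per-destination outbox
        // that a framework would exchange between devices after each step.
        #include <cstdint>
        #include <vector>

        struct Partition {
            std::vector<std::vector<std::uint32_t>> adj;  // local adjacency lists (global neighbor ids)
            std::vector<int> depth;                       // BFS depth per local vertex, -1 = unvisited
        };

        // Simple modulo vertex partitioning (an assumption for this sketch).
        static std::uint32_t owner(std::uint32_t v, std::uint32_t parts)    { return v % parts; }
        static std::uint32_t local_id(std::uint32_t v, std::uint32_t parts) { return v / parts; }

        void bfs_step(Partition& p, std::uint32_t part_id, std::uint32_t parts, int level,
                      const std::vector<std::uint32_t>& frontier,           // local vertex ids
                      std::vector<std::uint32_t>& next_local,               // next local frontier
                      std::vector<std::vector<std::uint32_t>>& outbox)      // one vector per partition
        {
            for (std::uint32_t u : frontier) {
                for (std::uint32_t v : p.adj[u]) {                          // v is a global vertex id
                    std::uint32_t dst = owner(v, parts);
                    if (dst == part_id) {                                   // neighbor is local
                        std::uint32_t lv = local_id(v, parts);
                        if (p.depth[lv] < 0) { p.depth[lv] = level + 1; next_local.push_back(lv); }
                    } else {
                        outbox[dst].push_back(v);                           // owner visits it next round
                    }
                }
            }
        }

    Between iterations, the framework would merge each partition's outboxes into the owners' incoming frontiers; that exchange, plus the partitioning itself, is the kind of multi-GPU detail the library hides from the programmer.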