Performance Impact of Memory Channels on Sparse and Irregular Algorithms
Graph processing is typically considered to be a memory-bound rather than
compute-bound problem. One common line of thought is that more available memory
bandwidth corresponds to better graph processing performance. However, in this
work we demonstrate that the key factor in the utilization of the memory system
for graph algorithms is not necessarily the raw bandwidth or even the latency
of memory requests. Instead, we show that performance is proportional to the
number of memory channels available to handle small data transfers with limited
spatial locality.
Using several widely used graph frameworks, including Gunrock (on the GPU)
and GAPBS and Ligra (for CPUs), we evaluate key graph analytics kernels using
two unique memory hierarchies, DDR-based and HBM/MCDRAM. Our results show that
the differences in the peak bandwidths of several Pascal-generation GPU memory
subsystems aren't reflected in the performance of various analytics.
Furthermore, our experiments on CPU and Xeon Phi systems demonstrate that the
number of memory channels utilized can be a decisive factor in performance
across several different applications. For CPU systems with smaller thread
counts, the memory channels can be underutilized while systems with high thread
counts can oversaturate the memory subsystem, which leads to limited
performance. Finally, we model the potential performance improvements of adding
more memory channels with narrower access widths than are found in current
platforms, and we analyze performance trade-offs for the two most prominent
types of memory accesses found in graph algorithms, streaming and random
accesses.
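As a rough illustration of why narrower access widths could help random accesses but not streaming ones, the following sketch computes how much of each memory transfer is actually used. It is a toy model under stated assumptions, not the model from the paper: the function names, the 100 GB/s peak figure, and the 4-byte and 4096-byte request sizes are hypothetical, and latency, bank conflicts, and caching are ignored.

```python
# Toy model (an illustrative assumption, not the authors' model):
# a memory channel always moves whole bursts of `access_width_bytes`,
# so a small random request wastes most of the burst it triggers.

def useful_fraction(request_bytes: int, access_width_bytes: int) -> float:
    """Fraction of transferred bytes that the kernel actually requested."""
    # Round the request up to a whole number of bursts (ceiling division).
    bursts = -(-request_bytes // access_width_bytes)
    return request_bytes / (bursts * access_width_bytes)


def effective_bandwidth(peak_gb_s: float, request_bytes: int,
                        access_width_bytes: int) -> float:
    """Peak bandwidth scaled by the useful fraction of each transfer."""
    return peak_gb_s * useful_fraction(request_bytes, access_width_bytes)


if __name__ == "__main__":
    peak = 100.0  # hypothetical peak bandwidth in GB/s
    for width in (64, 32, 16):
        random_bw = effective_bandwidth(peak, 4, width)       # 4-byte random reads
        streaming_bw = effective_bandwidth(peak, 4096, width)  # long sequential reads
        print(f"width={width:2d} B  random={random_bw:6.2f} GB/s  "
              f"streaming={streaming_bw:6.2f} GB/s")
```

In this simplified view, streaming accesses use every byte at any width, while 4-byte random accesses use only 1/16 of a 64-byte transfer and 1/4 of a 16-byte transfer, which is the intuition behind trading access width for a larger number of channels.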
Graph Processing on FPGAs: Taxonomy, Survey, Challenges
Graph processing has become an important part of various areas, such as
machine learning, computational sciences, medical applications, social network
analysis, and many others. Various graphs, for example web or social networks,
may contain up to trillions of edges. The sheer size of such datasets, combined
with the irregular nature of graph processing, poses unique challenges for the
runtime and the consumed power. Field Programmable Gate Arrays (FPGAs) can be
an energy-efficient solution to deliver specialized hardware for graph
processing. This is reflected by the recent interest in developing various
graph algorithms and graph processing frameworks on FPGAs. To facilitate
understanding of this emerging domain, we present the first survey and taxonomy
on graph computations on FPGAs. Our survey describes and categorizes existing
schemes and explains key ideas. Finally, we discuss research and engineering
challenges to outline the future of graph computations on FPGAs.
A Survey on Graph Processing Accelerators: Challenges and Opportunities
The graph is a well-known data structure for representing relationships in a
variety of applications, e.g., data science and machine learning. Despite a
wealth of existing efforts on developing graph processing systems that improve
performance and/or energy efficiency on traditional architectures, dedicated
hardware solutions, also referred to as graph processing accelerators, are
emerging as essential for providing benefits significantly beyond those that
pure software solutions can offer. In this paper, we conduct a systematic
survey of the design and implementation of graph processing accelerators.
Specifically, we review the relevant techniques in
three core components toward a graph processing accelerator: preprocessing,
parallel graph computation and runtime scheduling. We also examine the
benchmarks and results in existing studies for evaluating a graph processing
accelerator. Interestingly, we find that there is no absolute winner across
all three aspects of graph acceleration, due to the diverse characteristics of
graph processing and the complexity of hardware configurations. Finally, we
discuss several challenges in detail and further explore the
opportunities for future research.
Comment: This article has been accepted by Journal of Computer Science and
Technology