Reducing Memory Requirements for the IPU using Butterfly Factorizations
High Performance Computing (HPC) has benefited from many improvements over
recent decades, especially in hardware platforms that provide more processing
power while keeping power consumption at a reasonable level. The Intelligence
Processing Unit (IPU) is a new type of massively parallel processor, designed
to speed up parallel computations with a huge number of processing cores and
on-chip memory components connected by high-speed fabrics. IPUs mainly target
machine learning applications; however, due to the architectural differences
between GPUs and IPUs, especially the significantly smaller memory capacity of
an IPU, methods for reducing model size by sparsification have to be
considered. Butterfly factorizations are well-known replacements for
fully-connected and convolutional layers. In this paper, we examine how
butterfly structures can be implemented on an IPU and study their behavior and
performance compared to a GPU. Experimental results indicate that these
methods can provide a 98.5% compression ratio to reduce the immense need for
memory, and that the IPU implementation benefits from 1.3x and 1.6x
performance improvements for butterfly and pixelated butterfly, respectively.
We also reach a 1.62x training-time speedup on a real-world dataset such as
CIFAR10.
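As a concrete illustration of the idea in this abstract (a minimal pure-Python sketch, not the paper's IPU implementation), the snippet below applies a butterfly product of log2(n) sparse factors to a vector and compares its parameter count to a dense fully-connected layer; the function names, block layout, and the n = 1024 sizing are illustrative assumptions.

```python
def butterfly_apply(factors, x):
    """Apply a butterfly product B_k ... B_1 to a list x of length n = 2^k.

    factors[level] is a list of n // 2 2x2 blocks [[a, b], [c, d]], one per
    index pair at stride 2**level -- the FFT-style butterfly access pattern.
    Each factor touches every element exactly once, so a factor has only
    2n parameters instead of the n^2 of a dense matrix.
    """
    n = len(x)
    y = list(x)
    for level, blocks in enumerate(factors):
        stride = 2 ** level
        out = [0.0] * n
        b = 0
        for start in range(0, n, 2 * stride):
            for j in range(stride):
                i0, i1 = start + j, start + j + stride
                m = blocks[b]
                out[i0] = m[0][0] * y[i0] + m[0][1] * y[i1]
                out[i1] = m[1][0] * y[i0] + m[1][1] * y[i1]
                b += 1
        y = out
    return y

# Identity butterfly factors leave the input unchanged:
n, levels = 8, 3
eye = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(n // 2)]
print(butterfly_apply([eye] * levels, list(range(n))))
# [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# Parameter count vs. a dense N-by-N layer, for N = 1024:
N, K = 1024, 10                  # K = log2(N) butterfly factors
dense_params = N * N             # 1,048,576
bfly_params = 4 * (N // 2) * K   # 20,480 = 2 * N * log2(N)
print(f"compression: {1 - bfly_params / dense_params:.1%}")  # 98.0%
```

The compression ratio grows with n, which is why the paper's larger layers can reach the 98.5% figure quoted above.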
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.
Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC); an extended version of the PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU".
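Gunrock's frontier-centric abstraction can be illustrated with a serial BFS sketch (written here for exposition only; Gunrock executes these steps as data-parallel GPU operators, and the function names below are assumptions, not Gunrock's API): each bulk-synchronous step advances the current vertex frontier to its neighbors and filters out already-visited vertices.

```python
def bfs_frontiers(adj, source):
    """Frontier-based BFS in the spirit of Gunrock's advance/filter operators:
    each bulk-synchronous step expands the current vertex frontier into its
    unvisited neighbors, which become the next frontier.
    """
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:            # "advance": map frontier to neighbors
            for v in adj.get(u, ()):
                if v not in depth:    # "filter": keep only unvisited vertices
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return depth

adj = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
print(bfs_frontiers(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

On a GPU, the advance and filter steps over a frontier become parallel map and compaction operations, which is where Gunrock's optimization strategies apply.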
A Survey of Graph Pre-processing Methods: From Algorithmic to Hardware Perspectives
Graph-related applications have experienced significant growth in academia
and industry, driven by the powerful representational capabilities of graphs.
However, efficiently executing these applications faces various challenges,
such as load imbalance, random memory access, etc. To address these challenges,
researchers have proposed various acceleration systems, including software
frameworks and hardware accelerators, all of which incorporate graph
pre-processing (GPP). GPP serves as a preparatory step before the formal
execution of applications, involving techniques such as sampling, reordering, etc.
However, GPP execution often remains overlooked, as the primary focus is
directed towards enhancing graph applications themselves. This oversight is
concerning, especially considering the explosive growth of real-world graph
data, where GPP becomes essential and even dominates system running overhead.
Furthermore, GPP methods exhibit significant variations across devices and
applications due to high customization. Unfortunately, no comprehensive work
systematically summarizes GPP. To address this gap and foster a better
understanding of GPP, we present a comprehensive survey dedicated to this area.
We propose a double-level taxonomy of GPP, considering both algorithmic and
hardware perspectives. Through listing relevant works, we illustrate our
taxonomy and conduct a thorough analysis and summary of diverse GPP techniques.
Lastly, we discuss challenges in GPP and potential future directions.
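As one concrete example of a GPP technique named above, the sketch below shows degree-based vertex reordering; the function and the tiny graph are illustrative assumptions, not drawn from any particular surveyed system.

```python
def degree_reorder(adj):
    """A common GPP technique: relabel vertices in descending degree order so
    high-degree (frequently touched) vertices get small, nearby IDs, which
    can improve memory locality during later graph processing.
    """
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    relabel = {old: new for new, old in enumerate(order)}
    return {relabel[u]: sorted(relabel[v] for v in nbrs)
            for u, nbrs in adj.items()}

# Vertex 1 is the hub, so it receives the new ID 0:
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(degree_reorder(adj))
```

Sampling, partitioning, and format conversion follow the same pattern: a one-time transformation of the graph paid before the application runs, which is exactly the overhead the survey argues can dominate.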
Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures
Continual demand for memory bandwidth has made it worthwhile for memory
vendors to reassess processing in memory (PIM), which enables higher bandwidth
by placing compute units in/near-memory. As such, memory vendors have recently
proposed commercially viable PIM designs. However, these proposals are largely
driven by the needs of (a narrow set of) machine learning (ML) primitives.
While such proposals are reasonable given the growing importance of ML,
memory is a pervasive component, so there is a case for a more inclusive PIM
design that can accelerate primitives across domains.
In this work, we ascertain the capabilities of commercial PIM proposals to
accelerate various primitives across domains. We begin by outlining a
set of characteristics, termed PIM-amenability-test, which aid in assessing if
a given primitive is likely to be accelerated by PIM. Next, we apply this test
to primitives under study to ascertain efficient data-placement and
orchestration to map the primitives to underlying PIM architecture. We observe
here that, even though primitives under study are largely PIM-amenable,
existing commercial PIM proposals do not realize their performance potential
for these primitives. To address this, we identify bottlenecks that arise in
PIM execution and propose hardware and software optimizations which stand to
broaden the acceleration reach of commercial PIM designs (improving average PIM
speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while we
believe emerging commercial PIM proposals add a necessary and complementary
design point in the application acceleration space, hardware-software co-design
is necessary to deliver their benefits broadly.
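The abstract does not spell out the PIM-amenability-test's criteria, so the sketch below is a hypothetical roofline-style stand-in (my assumption, not the paper's actual test) for the underlying intuition: a primitive is a PIM candidate when it is bandwidth-bound on the host, so extra in-memory bandwidth can help.

```python
def pim_amenable(flops, bytes_moved, host_peak_flops, host_bandwidth):
    """Roofline-style check (an assumption, not the paper's actual test):
    a primitive is a PIM candidate when its arithmetic intensity
    (FLOPs per byte) falls below the host's ridge point
    peak_flops / bandwidth, i.e. when it is memory-bandwidth-bound.
    """
    intensity = flops / bytes_moved
    ridge = host_peak_flops / host_bandwidth
    return intensity < ridge

# Hypothetical numbers: a streaming vector add (low intensity) qualifies,
# a dense GEMM (high intensity) does not.
ridge_args = dict(host_peak_flops=10e12, host_bandwidth=1e12)  # 10 FLOP/B ridge
print(pim_amenable(flops=2e9, bytes_moved=24e9, **ridge_args))   # True
print(pim_amenable(flops=2e12, bytes_moved=12e9, **ridge_args))  # False
```

The paper's point is that passing such a test is necessary but not sufficient: data placement and orchestration on the actual PIM hardware still determine whether the potential speedup is realized.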
Portable performance on heterogeneous architectures
Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine.
To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.
United States. Dept. of Energy (Award DE-SC0005288); United States. Defense Advanced Research Projects Agency (Award HR0011-10-9-0009); National Science Foundation (U.S.) (Award CCF-0632997).
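The empirical-selection step described above can be sketched in miniature (a toy stand-in, not the paper's autotuning framework, which also searches algorithmic choices and CPU/GPU mappings): time each candidate implementation on representative input and keep the fastest.

```python
import time

def timed(fn, *args):
    """Wall-clock one invocation of fn."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def autotune(variants, *args, trials=3):
    """Empirical selection in the spirit of the paper's autotuner: run each
    candidate a few times on representative input and return the name of the
    fastest one (a toy installation-time search).
    """
    best_name, best_time = None, float("inf")
    for name, fn in variants.items():
        t = min(timed(fn, *args) for _ in range(trials))
        if t < best_time:
            best_name, best_time = name, t
    return best_name

# Two candidate algorithms for the same problem (summing squares):
variants = {
    "genexpr": lambda xs: sum(x * x for x in xs),
    "map": lambda xs: sum(map(lambda x: x * x, xs)),
}
data = list(range(10_000))
print(autotune(variants, data))  # picks whichever ran fastest on this machine
```

The real system generalizes this by composing poly-algorithms from many such choices and by searching over device placement, which is what makes a single tuned binary portable across heterogeneous machines.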