Reducing Memory Requirements for the IPU using Butterfly Factorizations
High Performance Computing (HPC) has benefited from many improvements over the
last decades, especially in hardware platforms that provide more processing
power while keeping power consumption at a reasonable level. The Intelligence
Processing Unit (IPU) is a new type of massively parallel processor, designed
to speed up parallel computations with a large number of processing cores and
on-chip memory components connected by high-speed fabrics. IPUs mainly target
machine learning applications; however, due to the architectural differences
between GPUs and IPUs, especially the significantly smaller memory capacity of
an IPU, methods for reducing model size by sparsification
have to be considered. Butterfly factorizations are well-known replacements for
fully-connected and convolutional layers. In this paper, we examine how
butterfly structures can be implemented on an IPU and study their behavior and
performance compared to a GPU. Experimental results indicate that these methods
can provide a 98.5% compression ratio, decreasing the immense need for memory,
while the IPU implementation benefits from 1.3x and 1.6x performance
improvements for butterfly and pixelated butterfly, respectively. We also reach
a 1.62x training time speedup on a real-world dataset, CIFAR-10.
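Since the abstract names the technique without detail, the following is a minimal NumPy sketch of a butterfly-factorized linear map, assuming the standard FFT-style parameterization of log2(n) block-diagonal factors built from learnable 2x2 blocks; it illustrates the structure only, not the paper's IPU kernels or the pixelated variant.

```python
import numpy as np

def butterfly_apply(x, twiddles):
    """Apply a butterfly-factorized linear map to x (length n, a power of 2).

    twiddles[k] has shape (n // 2, 2, 2): one 2x2 block per index pair at
    level k, so the map uses 2 * n * log2(n) parameters instead of the
    n * n of a dense fully-connected layer.
    """
    n = x.shape[0]
    levels = int(np.log2(n))
    y = x.astype(np.float64)
    for k in range(levels):
        stride = 1 << k          # pairs are (i, i + stride) at this level
        out = np.empty_like(y)
        pair = 0
        for base in range(0, n, 2 * stride):
            for j in range(stride):
                i0, i1 = base + j, base + j + stride
                t = twiddles[k][pair]
                out[i0] = t[0, 0] * y[i0] + t[0, 1] * y[i1]
                out[i1] = t[1, 0] * y[i0] + t[1, 1] * y[i1]
                pair += 1
        y = out
    return y

n = 8
rng = np.random.default_rng(0)
twiddles = [rng.standard_normal((n // 2, 2, 2)) for _ in range(int(np.log2(n)))]
print(butterfly_apply(rng.standard_normal(n), twiddles))
```

Each factor is block-diagonal and extremely sparse, which is what yields the compression ratios reported above: the dense weight matrix is never materialized.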
Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU
Dedicated accelerator hardware has become essential for processing AI-based
workloads, leading to the rise of novel accelerator architectures. Furthermore,
fundamental differences in memory architecture and parallelism have made these
accelerators targets for scientific computing.
The sequence alignment problem is fundamental in bioinformatics; we have
implemented the X-Drop algorithm, a heuristic method for pairwise alignment
that reduces the search space, on the Graphcore Intelligence Processing Unit
(IPU) accelerator. The X-Drop algorithm has an irregular computational pattern,
which makes it difficult to accelerate due to load imbalance.
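For context, here is a simplified, ungapped Python sketch of the X-drop pruning rule, with hypothetical toy inputs and scoring parameters; the paper's implementation computes gapped alignments over anti-diagonals on the IPU's tiles.

```python
def xdrop_extend(q, t, x_drop=10, match=1, mismatch=-1):
    """Ungapped X-drop seed extension (simplified sketch).

    Extends an alignment to the right, abandoning the search once the
    running score falls more than x_drop below the best score seen so far.
    Because the stopping point depends on the data, per-pair work is highly
    variable, which is the source of the load imbalance described above.
    """
    best = score = best_len = 0
    for i in range(min(len(q), len(t))):
        score += match if q[i] == t[i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score > x_drop:
            break  # drop-off exceeded X: stop extending this pair
    return best, best_len

# Hypothetical toy sequences: extension stops once the score drops off.
print(xdrop_extend("ACGTACGTTT", "ACGTACGAAA", x_drop=2))
```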
Here, we introduce a graph-based partitioning and a queue-based batch system
to improve load balancing. Our implementation achieves a speedup over a
state-of-the-art GPU implementation as well as over a CPU baseline. In
addition, we introduce a memory-restricted X-Drop algorithm that substantially
reduces the memory footprint and efficiently uses the IPU's limited
low-latency SRAM. This optimization further improves the strong scaling
performance.
Comment: 12 pages, 7 figures, 2 tables
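As a rough illustration of the queue-based batching idea, here is a greedy longest-processing-time scheduler in Python; the per-pair cost model and the exact scheduling policy are assumptions, not the paper's system.

```python
import heapq

def batch_alignments(task_costs, n_workers):
    """Greedy longest-processing-time batching of variable-cost tasks.

    Sorting tasks by estimated cost and always handing the next task to the
    currently least-loaded worker keeps per-worker totals roughly balanced,
    which is the goal of the batch system for irregular alignment workloads.
    """
    heap = [(0, w) for w in range(n_workers)]    # (current load, worker id)
    heapq.heapify(heap)
    batches = [[] for _ in range(n_workers)]
    for task, cost in sorted(enumerate(task_costs), key=lambda p: -p[1]):
        load, w = heapq.heappop(heap)
        batches[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    return batches

# Hypothetical per-pair costs (e.g. sequence-length products) on 3 workers:
print(batch_alignments([9, 7, 6, 5, 4, 3, 2, 1], 3))
```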