
    Reducing Memory Requirements for the IPU using Butterfly Factorizations

    High Performance Computing (HPC) has benefited from many improvements over the last decades, especially in hardware platforms that provide more processing power while keeping power consumption at a reasonable level. The Intelligence Processing Unit (IPU) is a new type of massively parallel processor, designed to speed up parallel computations with a huge number of processing cores and on-chip memory components connected by high-speed fabrics. IPUs mainly target machine learning applications; however, due to the architectural differences between GPUs and IPUs, especially the significantly smaller memory capacity of an IPU, methods for reducing model size by sparsification have to be considered. Butterfly factorizations are well-known replacements for fully-connected and convolutional layers. In this paper, we examine how butterfly structures can be implemented on an IPU and study their behavior and performance compared to a GPU. Experimental results indicate that these methods can provide a 98.5% compression ratio to decrease the immense need for memory, and that the IPU implementation can benefit from 1.3x and 1.6x performance improvements for butterfly and pixelated butterfly, respectively. We also reach a 1.62x training-time speedup on a real-world dataset, CIFAR10.
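    To make the factorization concrete, here is a minimal NumPy sketch, not the paper's IPU implementation: it replaces a dense n x n weight matrix with a product of log2(n) sparse butterfly factors, each coupling index pairs at a fixed stride through a learnable 2x2 block, so the layer stores O(n log n) parameters instead of n^2. The function names and parameter counts below are illustrative assumptions, not figures from the paper.

```python
# Minimal sketch of a butterfly factorization, assuming a square
# weight matrix whose size n is a power of two. All names here
# (butterfly_factor, butterfly_matrix) are hypothetical.
import numpy as np

def butterfly_factor(n: int, stride: int, rng) -> np.ndarray:
    """One sparse factor: a random 2x2 block couples each index pair
    (i, i + stride), giving 2n nonzeros instead of n^2."""
    B = np.zeros((n, n))
    for start in range(0, n, 2 * stride):
        for i in range(start, start + stride):
            a, b, c, d = rng.standard_normal(4)
            j = i + stride
            B[i, i], B[i, j] = a, b
            B[j, i], B[j, j] = c, d
    return B

def butterfly_matrix(n: int, rng) -> np.ndarray:
    """Product of log2(n) butterfly factors: a structured drop-in
    replacement for a dense n x n weight matrix."""
    W = np.eye(n)
    stride = n // 2
    while stride >= 1:
        W = butterfly_factor(n, stride, rng) @ W
        stride //= 2
    return W

rng = np.random.default_rng(0)
n = 256
W = butterfly_matrix(n, rng)
print(W.shape)                                       # (256, 256)
print("dense params:    ", n * n)                    # 65536
print("butterfly params:", 2 * n * int(np.log2(n)))  # 4096, ~16x fewer
```

    The compression grows with the layer width, since the parameter ratio 2n log2(n) / n^2 shrinks as n increases; the paper's 98.5% figure refers to its own models rather than this toy example.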

    Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU

    Dedicated accelerator hardware has become essential for processing AI-based workloads, leading to the rise of novel accelerator architectures. Furthermore, fundamental differences in memory architecture and parallelism have made these accelerators targets for scientific computing. The sequence alignment problem is fundamental in bioinformatics; we have implemented the X-Drop algorithm, a heuristic method for pairwise alignment that reduces the search space, on the Graphcore Intelligence Processing Unit (IPU) accelerator. The X-Drop algorithm has an irregular computational pattern, which makes it difficult to accelerate due to load balancing. Here, we introduce a graph-based partitioning and queue-based batch system to improve load balancing. Our implementation achieves a 10× speedup over a state-of-the-art GPU implementation and up to 4.65× compared to CPU. In addition, we introduce a memory-restricted X-Drop algorithm that reduces the memory footprint by 55× and efficiently uses the IPU's limited low-latency SRAM. This optimization further improves the strong scaling performance by 3.6×.
    Comment: 12 pages, 7 figures, 2 tables
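    To illustrate the pruning idea, here is a minimal Python sketch of X-Drop gapped extension under simple match/mismatch/gap scoring; it is not the paper's IPU implementation, and the function name `xdrop_extend` and the default scores are illustrative assumptions. Cells whose score falls more than X below the best score seen so far are dropped from the dynamic-programming frontier, which keeps the explored band of the matrix narrow.

```python
# Minimal sketch of X-Drop pruning for gapped extension, assuming
# simple match/mismatch/gap scores. The paper's graph-based
# partitioning and queue-based batching are not shown.
NEG_INF = float("-inf")

def xdrop_extend(q: str, t: str, x: int = 20,
                 match: int = 2, mismatch: int = -3, gap: int = -4) -> int:
    """Best extension score; cells scoring more than `x` below the
    running best are dropped from the live frontier."""
    best = 0
    prev = {0: 0}                          # live cells of the previous row
    for i in range(1, len(q) + 1):
        if not prev:                       # frontier emptied: stop early
            break
        cur = {}
        for j in range(min(prev), min(max(prev) + 1, len(t)) + 1):
            s = NEG_INF
            if j in prev:                  # gap in t (vertical move)
                s = max(s, prev[j] + gap)
            if j >= 1 and (j - 1) in prev: # diagonal (mis)match
                s = max(s, prev[j - 1]
                        + (match if q[i - 1] == t[j - 1] else mismatch))
            if (j - 1) in cur:             # gap in q (horizontal move)
                s = max(s, cur[j - 1] + gap)
            if s >= best - x:              # X-drop test: keep or prune
                cur[j] = s
                best = max(best, s)
        prev = cur
    return best

# 7 matches and 1 mismatch: 7*2 - 3 = 11 with these scores
print(xdrop_extend("ACGTACGT", "ACGTTCGT"))
```

    Because only live cells are stored per row, the working set tracks the width of the surviving band rather than the full DP matrix, which suggests why a memory-restricted variant can fit within the IPU's low-latency SRAM.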