Dedicated accelerator hardware has become essential for processing AI-based
workloads, leading to the rise of novel accelerator architectures. Furthermore,
fundamental differences in memory architecture and parallelism have made these
accelerators attractive targets for scientific computing.
The sequence alignment problem is fundamental in bioinformatics. We have
implemented the X-Drop algorithm, a heuristic method for pairwise alignment
that reduces the search space, on the Graphcore Intelligence Processing Unit
(IPU) accelerator. The X-Drop algorithm has an irregular computational pattern,
which makes it difficult to accelerate because of load imbalance.
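
To make the pruning idea concrete, the following is a minimal sketch of an X-Drop extension in Python. It is a simplified row-wise variant for illustration only, not the IPU implementation described here. The function name `xdrop_extend`, the linear gap penalty, and the scoring values are assumptions. A cell is kept only if its score is within `x_drop` of the best score seen so far, so the number of live cells per row varies with the input, which is the source of the irregular workload.

```python
NEG_INF = float("-inf")

def xdrop_extend(a, b, x_drop, match=1, mismatch=-1, gap=-1):
    """Return (best_score, cells_kept) for extending an alignment from a[0], b[0].

    Hypothetical helper: scoring scheme and row-wise traversal are assumptions.
    """
    best = 0
    cells = 0
    prev = {}  # live cells of the previous row: column -> score
    # Row 0 consists of gaps only; stop once the score drops too far.
    for j in range(len(b) + 1):
        s = j * gap
        if s < best - x_drop:
            break
        prev[j] = s
        cells += 1
    for i in range(1, len(a) + 1):
        cur = {}
        lo, hi = min(prev), max(prev)
        j = lo
        while j <= len(b):
            # Past the previous row's band, only a live left neighbour can feed this cell.
            if j > hi + 1 and (j - 1) not in cur:
                break
            s = NEG_INF
            if j in prev:            # gap in b
                s = max(s, prev[j] + gap)
            if (j - 1) in prev:      # match or mismatch
                sub = match if a[i - 1] == b[j - 1] else mismatch
                s = max(s, prev[j - 1] + sub)
            if (j - 1) in cur:       # gap in a
                s = max(s, cur[j - 1] + gap)
            if s >= best - x_drop:   # X-Drop test: prune cells that fall too far behind
                cur[j] = s
                best = max(best, s)
                cells += 1
            j += 1
        if not cur:                  # every cell pruned: the extension terminates
            break
        prev = cur
    return best, cells

score, kept = xdrop_extend("ACGTACGT", "ACGTACGT", x_drop=2)
print(score, kept)
```

In this sketch a smaller `x_drop` narrows the band of live cells and terminates poor extensions earlier, at the cost of possibly missing alignments that recover after a low-scoring region.
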
Here, we introduce graph-based partitioning and a queue-based batch system to
improve load balancing. Our implementation achieves a 10× speedup over a
state-of-the-art GPU implementation and up to 4.65× over a CPU implementation.
In addition, we introduce a memory-restricted X-Drop algorithm that reduces
the memory footprint by 55× and efficiently uses the IPU's limited
low-latency SRAM. This optimization further improves the strong-scaling
performance by 3.6×.

Comment: 12 pages, 7 figures, 2 tables