The Number Theoretic Transform (NTT) is an essential mathematical tool for
computing polynomial multiplication in promising lattice-based cryptography.
However, costly division operations and complex data dependencies make
efficient and flexible hardware design challenging, especially on
resource-constrained edge devices. Existing approaches either focus on only
limited parameter settings or impose substantial hardware overhead. In this
paper, we introduce a hardware-algorithm methodology to efficiently accelerate
NTT in various settings using in-cache computing. By leveraging an optimized
bit-parallel modular multiplication and introducing costless shift operations,
our proposed solution provides up to 29x higher throughput-per-area and
2.8-100x better throughput-per-area-per-joule compared to the state-of-the-art.

Comment: This work is accepted to the 60th Design Automation Conference (DAC),
202
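
For readers unfamiliar with the transform, the role of the NTT in polynomial multiplication can be illustrated with a minimal software sketch. It is not the in-cache design described above: the parameters `Q`, `N`, and `PSI`, and all function names, are toy assumptions chosen for readability, and the transform is the naive O(N^2) definition rather than a fast butterfly network. The sketch multiplies two polynomials in Z_Q[x]/(x^N + 1), the negacyclic ring commonly used in lattice-based schemes, and the repeated `% Q` reductions are the modular operations that are costly to realize in hardware.

```python
# Toy parameters (assumptions for illustration, not the paper's settings).
Q = 17                  # prime modulus with Q = 1 (mod 2N)
N = 8                   # ring is Z_Q[x] / (x^N + 1)
PSI = 3                 # primitive 2N-th root of unity mod Q (3^8 = -1 mod 17)
OMEGA = PSI * PSI % Q   # primitive N-th root of unity mod Q


def ntt(a, root):
    """Naive O(N^2) transform: A[k] = sum_j a[j] * root^(j*k) mod Q."""
    return [
        sum(a[j] * pow(root, j * k, Q) for j in range(N)) % Q
        for k in range(N)
    ]


def negacyclic_mul_ntt(a, b):
    """Multiply a and b modulo (x^N + 1, Q) using forward and inverse NTTs."""
    # Pre-scale by powers of PSI so that cyclic convolution becomes negacyclic.
    a_scaled = [a[i] * pow(PSI, i, Q) % Q for i in range(N)]
    b_scaled = [b[i] * pow(PSI, i, Q) % Q for i in range(N)]

    # Pointwise product in the transform domain.
    fa, fb = ntt(a_scaled, OMEGA), ntt(b_scaled, OMEGA)
    fc = [x * y % Q for x, y in zip(fa, fb)]

    # Inverse transform: use OMEGA^-1, divide by N, then undo the PSI scaling.
    c_scaled = ntt(fc, pow(OMEGA, -1, Q))
    inv_n = pow(N, -1, Q)
    inv_psi = pow(PSI, -1, Q)
    return [c_scaled[i] * inv_n * pow(inv_psi, i, Q) % Q for i in range(N)]


def negacyclic_mul_schoolbook(a, b):
    """Reference O(N^2) multiplication with wrap-around sign flip (x^N = -1)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q
    return c
```

Both routines should agree on any pair of length-`N` coefficient lists with entries in `[0, Q)`. Practical accelerators replace the naive transform with an O(N log N) butterfly schedule and much larger parameters.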