Large language models (LLMs) have significantly improved various aspects of
our daily lives. These models have impacted numerous domains, from healthcare
to education, enhancing productivity, decision-making processes, and
accessibility. As a result, they have influenced and, to some extent, reshaped
people's lifestyles. However, the quadratic complexity of attention in
transformer architectures poses a challenge when scaling up these models for
processing long textual contexts. This issue makes it impractical to train very
large models on lengthy texts or use them efficiently during inference. While a
recent study by [KMZ23] introduced a technique that replaces the softmax with a
polynomial function and polynomial sketching to speed up attention mechanisms,
the theoretical properties of this new approach are not yet well understood.
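To make the complexity point concrete, the NumPy sketch below contrasts the direct computation, which materializes an n × n score matrix (quadratic in sequence length n), with the kernelized order of operations that a degree-β score (q · k)^β admits. The array sizes, the explicit feature map `phi`, and the omission of normalization and of the sketching step are illustrative assumptions, not the construction of [KMZ23].

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, beta = 6, 3, 2                      # beta = 2 keeps the explicit feature map small (d**beta dims)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Direct computation: materializes the n x n score matrix, hence quadratic cost in n.
scores = (Q @ K.T) ** beta                # entry (i, j) is (q_i . k_j) ** beta
direct = scores @ V

# Kernelized computation: (q . k)**beta = <phi(q), phi(k)> with phi(x) = x ⊗ ... ⊗ x (beta copies),
# so the n x n matrix is never formed and the cost grows linearly in n, at the price of d**beta
# feature dimensions (which a polynomial sketch would compress; not shown here).
def phi(X, beta):
    F = X
    for _ in range(beta - 1):
        F = np.einsum("ni,nj->nij", F, X).reshape(X.shape[0], -1)
    return F                              # shape (n, d**beta)

kernelized = phi(Q, beta) @ (phi(K, beta).T @ V)
assert np.allclose(direct, kernelized)
```

The point of the sketch is only the order of matrix products; how the weights are normalized and how the d^β-dimensional features are compressed are precisely the design questions a sketching construction addresses.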
In this paper, we offer a theoretical analysis of the expressive capabilities
of polynomial attention. Our study reveals a disparity between the expressive
power of high-degree and low-degree polynomial attention. Specifically, we
construct two carefully designed datasets, D_0^β and D_1^β, where D_1^β
includes a feature with a significantly larger value than in D_0^β. We
demonstrate that, with a sufficiently high degree β, a single-layer polynomial
attention network can distinguish between D_0^β and D_1^β. However, with a low
degree β, the
network cannot effectively separate the two datasets. This analysis underscores
the greater effectiveness of high-degree polynomials in amplifying large values
and distinguishing between datasets. These results offer insight into the
representational capacity of polynomial attention and provide a rationale for
incorporating higher-degree polynomials in attention mechanisms to capture
intricate linguistic correlations.
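For intuition on why the degree matters, the toy computation below shows how raising attention scores to a higher power β concentrates the weight on a single large entry, while a low power leaves the weights nearly uniform. The score values, the number of keys, and the single-query setup are assumptions for illustration, not the D_0^β / D_1^β construction analyzed in the paper.

```python
import numpy as np

# One key has a much larger score (a0) with the query than the others (b).
a0, b, n = 3.0, 1.0, 16
scores = np.full(n, b)
scores[0] = a0

for beta in (1, 2, 4, 8, 16):
    weights = scores**beta / np.sum(scores**beta)   # polynomial attention weights
    print(f"beta={beta:2d}  weight on the large-value key = {weights[0]:.3f}")

# Low beta spreads attention almost uniformly across the keys; high beta drives the
# weight on the large entry toward 1, which is the amplification effect that lets a
# single-layer polynomial attention network separate the two datasets.
```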