The attention mechanism is the key to large language models, and the
attention matrix serves as an algorithmic and computational bottleneck for such
a scheme. In this paper, we define two problems, motivated by designing fast
algorithms for proxies of the attention matrix and for solving regressions against them.
Given an input matrix $A \in \mathbb{R}^{n \times d}$ with $n \gg d$ and a
response vector $b$, we first consider the matrix exponential of the matrix
$A^\top A$ as a proxy, and we in turn design algorithms for two types of
regression problems: $\min_{x \in \mathbb{R}^d} \|(A^\top A)^j x - b\|_2$ and
$\min_{x \in \mathbb{R}^d} \|A(A^\top A)^j x - b\|_2$ for any positive integer $j$.
Studying algorithms for these regressions is essential, as the matrix exponential
can be approximated term by term via these smaller problems.
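To make this reduction concrete, the connection goes through the Taylor expansion of the matrix exponential; the truncation level $k$ below is an illustrative choice, not a quantity fixed in this abstract:
\[
\exp(A^\top A)\, x \;=\; \sum_{j=0}^{\infty} \frac{(A^\top A)^j}{j!}\, x \;\approx\; \sum_{j=0}^{k} \frac{(A^\top A)^j}{j!}\, x,
\]
so a regression against $\exp(A^\top A)$ decomposes into regressions against the powers $(A^\top A)^j$ defined above.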
The second proxy
applies the exponential entrywise to the Gram matrix, denoted by
$\exp(AA^\top)$, and solves the regression
$\min_{x \in \mathbb{R}^n} \|\exp(AA^\top) x - b\|_2$. We call this problem the attention
kernel regression problem, as the matrix $\exp(AA^\top)$ can be viewed as a
kernel function with respect to $A$. We design fast algorithms for these
regression problems, based on sketching and preconditioning.
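As a rough illustration of the sketch-and-precondition template (a minimal sketch only, not the algorithm of this paper; the Gaussian sketch, the sketch size $4d$, and the use of LSQR are assumptions made for the example), one can precondition an overdetermined least-squares solve with the $R$ factor of a sketched copy of $A$:
\begin{verbatim}
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

def sketch_precondition_lstsq(A, b, sketch_size=None, seed=0):
    """Approximately solve min_x ||Ax - b||_2 for tall A (n >> d)."""
    n, d = A.shape
    m = sketch_size or 4 * d                      # oversampled sketch dimension (assumed)
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian sketch; SRHT/OSNAP fit here too
    _, R = np.linalg.qr(S @ A)                    # R of the sketched matrix as preconditioner
    R_inv = np.linalg.inv(R)
    # The preconditioned problem min_y ||A R^{-1} y - b||_2 is well conditioned,
    # so an iterative solver such as LSQR converges in few iterations.
    op = LinearOperator((n, d), matvec=lambda y: A @ (R_inv @ y),
                        rmatvec=lambda z: R_inv.T @ (A.T @ z), dtype=A.dtype)
    y = lsqr(op, b, atol=1e-12, btol=1e-12)[0]
    return R_inv @ y                              # map back to the original variable

# Toy usage on a tall random instance.
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 50))
b = A @ rng.standard_normal(50)
x_hat = sketch_precondition_lstsq(A, b)
print(np.linalg.norm(A @ x_hat - b))
\end{verbatim}
The sub-problems above, such as $\min_{x} \|(A^\top A)^j x - b\|_2$, can then be attacked by composing solves of this shape with matrix-vector products against $A$ and $A^\top$.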
We hope these efforts will provide an alternative perspective on studying efficient
approximation of attention matrices.