This paper presents an investigation into long-tail video recognition. We
demonstrate that, unlike naturally-collected video datasets and existing
long-tail image benchmarks, current video benchmarks fall short on multiple
long-tailed properties. Most critically, they lack few-shot classes in their
tails. In response, we propose new video benchmarks that better assess
long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
We then propose a method, Long-Tail Mixed Reconstruction (LMR), which reduces
overfitting to instances from few-shot classes by reconstructing them as
weighted combinations of samples from head classes. LMR then employs label
mixing to learn robust decision boundaries. It achieves state-of-the-art
average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and
VideoLT-LT. Benchmarks and code at: tobyperrett.github.io/lmr
Comment: CVPR 202
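
The two steps described for LMR (reconstructing few-shot samples from head-class samples, then mixing labels) can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product softmax weighting, the `few_shot_max` threshold, the fixed `self_weight` mixing coefficient, and all function names are assumptions made for illustration only.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mixed_reconstruction(features, labels, class_counts, num_classes,
                         few_shot_max=20, self_weight=0.5):
    """Return (mixed_features, soft_labels) for one batch.

    few_shot_max and self_weight are hypothetical hyperparameters,
    not values taken from the paper.
    """
    # Indices of batch samples whose class counts as a head class (assumed rule).
    head_idx = [j for j, y in enumerate(labels) if class_counts[y] > few_shot_max]
    mixed, soft = [], []
    for x, y in zip(features, labels):
        one_hot = [1.0 if c == y else 0.0 for c in range(num_classes)]
        if class_counts[y] > few_shot_max or not head_idx:
            # Head-class samples (or batches with no head samples) pass through.
            mixed.append(list(x))
            soft.append(one_hot)
            continue
        # Weights over head-class samples from similarity to the few-shot sample.
        w = softmax([dot(x, features[j]) for j in head_idx])
        # Reconstruction: weighted combination of head-class features.
        recon = [sum(wj * features[j][d] for wj, j in zip(w, head_idx))
                 for d in range(len(x))]
        # Blend the original feature with its reconstruction.
        mixed.append([self_weight * xd + (1 - self_weight) * rd
                      for xd, rd in zip(x, recon)])
        # Label mixing: the soft label uses the same weights as the features.
        label = [self_weight * v for v in one_hot]
        for wj, j in zip(w, head_idx):
            label[labels[j]] += (1 - self_weight) * wj
        soft.append(label)
    return mixed, soft


# Toy batch: classes 0 and 1 are head classes, class 2 is few-shot.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [0, 1, 2]
class_counts = [500, 300, 5]
mixed, soft = mixed_reconstruction(features, labels, class_counts, num_classes=3)
```

In this toy batch only the third sample is altered: its feature moves toward the head-class samples, and its label becomes a soft distribution that still sums to one. Training against such soft labels is one plausible reading of "label mixing"; the paper itself should be consulted for the actual formulation.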