The conventional lottery ticket hypothesis (LTH) claims that there exists a
sparse subnetwork within a dense neural network and a proper random
initialization method, called the winning ticket, such that it can be trained
from scratch to perform almost as well as its dense counterpart. However, the
LTH has scarcely been studied for vision transformers (ViTs). In this
paper, we first show that the conventional winning ticket is hard to find at
the weight level of ViTs with existing methods. Then, inspired by the input
dependence of ViTs, we generalize the LTH for ViTs to input images consisting
of image patches. That is, there exists a subset of input image patches such that a ViT can
be trained from scratch using only this subset of patches and achieve
accuracy similar to that of a ViT trained on all image patches. We call this
subset of input patches the winning tickets, which represent a significant
amount of information in the input. Furthermore, we present a simple yet
effective method to find the winning tickets in input patches for various types
of ViTs, including DeiT, LV-ViT, and Swin Transformers. More specifically, we
use a ticket selector to generate the winning tickets based on the
informativeness of patches. For comparison, we also build a randomly selected
subset of patches, and the experiments show a clear difference between the
performance of models trained with winning tickets and that of models trained
with randomly selected subsets.
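
The contrast between informativeness-based and random patch selection can be illustrated with a minimal sketch. It is not the ticket selector used in this work: the function names, the `keep_ratio` parameter, and in particular the use of pixel variance as a stand-in for patch informativeness are all assumptions made purely for illustration.

```python
import random


def split_into_patches(image, patch):
    """Split a 2D grayscale image (list of rows) into non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[r][left:left + patch] for r in range(top, top + patch)])
    return patches


def variance_score(patch):
    """ASSUMPTION: pixel variance as a crude proxy for informativeness (hypothetical)."""
    pixels = [p for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def select_winning_tickets(patches, keep_ratio, score_fn=variance_score):
    """Return the sorted indices of the highest-scoring patches."""
    k = max(1, round(keep_ratio * len(patches)))
    ranked = sorted(range(len(patches)), key=lambda i: score_fn(patches[i]), reverse=True)
    return sorted(ranked[:k])


def select_random_tickets(patches, keep_ratio, seed=0):
    """Return the sorted indices of a uniformly random subset of the same size."""
    k = max(1, round(keep_ratio * len(patches)))
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(patches)), k))


# Toy 4x4 image: the two top patches are flat, the two bottom patches have structure.
image = [
    [0, 0, 5, 5],
    [0, 0, 5, 5],
    [0, 9, 1, 2],
    [9, 0, 3, 4],
]
patches = split_into_patches(image, patch=2)
winning = select_winning_tickets(patches, keep_ratio=0.5)   # [2, 3]
baseline = select_random_tickets(patches, keep_ratio=0.5)   # random subset of equal size
```

In this toy setting, a model would then be trained from scratch on only the patches indexed by `winning`, and a second model on those indexed by `baseline`, so that the two subsets can be compared at the same patch budget.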