Large-scale pretrained models such as LXMERT are becoming popular for
learning cross-modal representations on text-image pairs for vision-language
tasks. According to the lottery ticket hypothesis, NLP and computer vision
models contain smaller subnetworks capable of being trained in isolation to
full performance. In this paper, we combine these observations to evaluate
whether such trainable subnetworks exist in LXMERT when fine-tuned on the VQA
task. In addition, we perform a cost-benefit analysis of model size by
investigating how much pruning can be done without significant loss in
accuracy. Our experimental results show that LXMERT can be effectively
pruned by 40%-60% in size with 3% loss in accuracy.

Comment: To appear in The Fourth Annual West Coast NLP (WeCNLP) Summi