Recently, visual programming (VisProg) has emerged as a significant framework for executing compositional visual tasks due to its interpretability and flexibility. However, the performance of VisProg on specific Visual Reasoning (VR) tasks is markedly inferior to that of well-trained task-specific models, since its visual sub-modules have limited generalization capabilities. Because VisProg is non-differentiable, it is challenging to improve these visual sub-modules for a specific VR task while maintaining their generalizability to unseen tasks. To overcome these difficulties, we propose CLVP, a Continuous Learning paradigm for VisProg across various visual reasoning tasks.
Specifically, our CLVP distills the capabilities of well-trained task-specific
models into the visual sub-modules in a stepwise and anti-forgetting manner.
This continually improves the performance of VisProg on multiple visual tasks while preserving its flexibility. Extensive experimental results demonstrate that our CLVP obtains significant performance gains on specific VR benchmarks, i.e., GQA (+1.4%) and NLVRv2 (+5.6%), compared to the VisProg baseline, and also maintains promising generalizability for VR on unseen and previously learned tasks.