Instruction tuning is a widely used strategy for aligning Multimodal Large
Language Models (MLLMs) with human instructions and adapting them to new
tasks. Nevertheless, MLLMs struggle to keep pace with users' evolving
knowledge and demands. How to retain existing skills while acquiring new
knowledge therefore needs to be investigated. In this paper, we present a
comprehensive benchmark, namely Continual Instruction tuNing (CoIN), to assess
existing MLLMs in the sequential instruction tuning paradigm. CoIN comprises 10
commonly used datasets spanning 8 task categories, ensuring a diverse range of
instructions and tasks. In addition, the trained model is evaluated from two
aspects, Instruction Following and General Knowledge, which assess
alignment with human intention and the knowledge preserved for reasoning,
respectively. Experiments on CoIN demonstrate that current powerful MLLMs still
suffer from catastrophic forgetting, and that the failure of intention
alignment, rather than the forgetting of knowledge, is mainly responsible. To
this end, we introduce MoELoRA to MLLMs, which is effective at retaining
previous instruction alignment. Experimental results on CoIN consistently show
that this method reduces forgetting.