Despite the rapid development of video Large Language Models (LLMs), a
comprehensive evaluation is still absent. In this paper, we introduce a unified
evaluation that encompasses multiple video tasks, including captioning,
question answering, retrieval, and action recognition. In addition to
conventional metrics, we show that GPT-based evaluation can match human
judgments in assessing response quality across multiple aspects. We propose a
simple baseline: Video-LLaVA, which uses a single linear projection and
outperforms existing video LLMs. Finally, we evaluate video LLMs beyond
academic datasets; the models show encouraging recognition and reasoning
capabilities in driving scenarios after fine-tuning on only hundreds of
video-instruction pairs. We hope our work can serve as a unified evaluation for
video LLMs and help extend them to more practical scenarios. The evaluation
code will be available soon.