While large language models (LLMs) already achieve strong performance on
standard generic summarization benchmarks, their performance on more complex
summarization task settings is less studied. Therefore, we benchmark LLMs on
instruction controllable text summarization, where the model input consists of
both a source article and a natural language requirement for the desired
summary characteristics. To this end, we curate an evaluation-only dataset for
this task setting and conduct human evaluation on 5 LLM-based summarization
systems. We then benchmark LLM-based automatic evaluation for this task with 4
different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods
in total. Our study reveals that instruction controllable text summarization
remains a challenging task for LLMs, since (1) all LLMs evaluated still make
factual and other types of errors in their summaries; (2) all LLM-based
evaluation methods cannot achieve a strong alignment with human annotators when
judging the quality of candidate summaries; (3) different LLMs show large
performance gaps in summary generation and evaluation. We make our collected
benchmark, InstruSum, publicly available to facilitate future research in this
direction.Comment: GitHub Repo: https://github.com/yale-nlp/InstruSu