Opinions in the scientific domain can be divergent, leading to controversy or
consensus among reviewers. However, current opinion summarization datasets
mostly focus on product review domains, which do not account for this
variability under the assumption that the input opinions are non-controversial.
To address this gap, we propose the task of scientific opinion summarization,
where research paper reviews are synthesized into meta-reviews. To facilitate
this task, we introduce a new ORSUM dataset covering 10,989 paper meta-reviews
and 40,903 paper reviews from 39 conferences. Furthermore, we propose the
Checklist-guided Iterative Introspection (CGI2) approach, which breaks down
the task into several stages and iteratively refines the summary under the
guidance of questions from a checklist. We conclude that (1) human-written
summaries are not always reliable since many do not follow the guidelines, and
(2) the combination of task decomposition and iterative self-refinement shows
promising discussion involvement ability and can be applied to other complex
text generation using black-box LLM