SMRT Chatbots: Improving Non-Task-Oriented Dialog with Simulated Multiple Reference Training
Non-task-oriented dialog models suffer from poor quality and non-diverse
responses. To overcome limited conversational data, we apply Simulated Multiple
Reference Training (SMRT; Khayrallah et al., 2020), and use a paraphraser to
simulate multiple responses per training prompt. We find SMRT improves over a
strong Transformer baseline as measured by human and automatic quality scores
and lexical diversity. We also find SMRT is comparable to pretraining in human
evaluation quality, and outperforms pretraining on automatic quality and
lexical diversity, without requiring related-domain dialog data.
Comment: EMNLP 2020 Camera Ready
ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing
Given the rapid ascent of large language models (LLMs), we study the
question: (How) can large language models help in reviewing scientific
papers or proposals? We first conduct some pilot studies where we find that (i)
GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly,
OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to
identify errors) outperforms prompting to simply write a review. With these
insights, we study the use of LLMs (specifically, GPT-4) for three tasks:
1. Identifying errors: We construct 13 short computer science papers each
with a deliberately inserted error, and ask the LLM to check for the
correctness of these papers. We observe that the LLM finds errors in 7 of them,
spanning both mathematical and conceptual errors.
2. Verifying checklists: We task the LLM to verify 16 closed-ended checklist
questions in the respective sections of 15 NeurIPS 2022 papers. We find that
across 119 {checklist question, paper} pairs, the LLM had an 86.6% accuracy.
3. Choosing the "better" paper: We generate 10 pairs of abstracts,
deliberately designing each pair in such a way that one abstract was clearly
superior to the other. The LLM, however, struggled to discern these
relatively straightforward distinctions accurately, committing errors in its
evaluations for 6 out of the 10 pairs.
Based on these experiments, we think that LLMs have a promising use as
reviewing assistants for specific reviewing tasks, but not (yet) for complete
evaluations of papers or proposals.
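
The three headline figures above can be put on a common scale with a few lines of arithmetic. The following is a minimal sketch, not code from the paper: the helper name `rate` is hypothetical, and the count of 103 correct checklist pairs is not stated in the abstract but inferred here as the only whole number out of 119 that rounds to the reported 86.6%.

```python
def rate(successes: int, total: int) -> float:
    """Fraction of successes out of total trials."""
    return successes / total

# Task 1: the LLM found the inserted error in 7 of 13 papers.
error_detection = rate(7, 13)

# Task 2: 86.6% accuracy over 119 {checklist question, paper} pairs.
# Assumption: the underlying count is recovered by rounding; the
# abstract reports only the percentage.
correct_pairs = round(0.866 * 119)
checklist_accuracy = rate(correct_pairs, 119)

# Task 3: the LLM erred on 6 of 10 abstract pairs, so it chose
# correctly on the remaining 4.
pairwise_accuracy = rate(10 - 6, 10)

print(f"Error detection:    {error_detection:.1%}")
print(f"Checklist accuracy: {checklist_accuracy:.1%}")
print(f"Pairwise choice:    {pairwise_accuracy:.1%}")
```

Read this way, the pairwise task is the only one where the LLM was wrong more often than right, even though each pair had exactly two options. With only 10 to 13 items in the first and third tasks, these rates are coarse indications rather than precise estimates.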