Large language models (LLMs) have achieved widespread success on a variety of
in-context few-shot tasks, but this success is typically evaluated via
correctness rather than consistency. We argue that self-consistency is an
important criterion for valid multi-step reasoning in tasks where the solution
is composed of the answers to multiple sub-steps. We propose two types of
self-consistency that are particularly important for multi-step reasoning --
hypothetical consistency (a model's ability to predict what its output would be
in a hypothetical other context) and compositional consistency (consistency of
a model's final outputs when intermediate sub-steps are replaced with the
model's outputs for those steps). We demonstrate that multiple variants of the
GPT-3/-4 models exhibit poor consistency rates across both types
on a variety of tasks.
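
To make the two definitions concrete, the following minimal Python sketch shows one way such checks could be implemented; the `query_model` wrapper, the prompt wording, and the exact-match comparison are illustrative assumptions rather than the paper's actual evaluation protocol.

```python
from typing import Callable, List


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM such as GPT-3/-4 (assumed, not the paper's API)."""
    raise NotImplementedError


def hypothetical_consistency(question: str) -> bool:
    """Check whether the model's prediction of its own answer, posed as a
    hypothetical, matches the answer it actually gives."""
    direct = query_model(question)
    predicted = query_model(
        "If you were asked the following question, what would your answer be?\n"
        + question
    )
    # Exact string match is a simplification; a task-specific comparison
    # (e.g. normalized answers) would likely be used in practice.
    return direct.strip() == predicted.strip()


def compositional_consistency(full_question: str,
                              sub_questions: List[str],
                              compose_prompt: Callable[[List[str]], str]) -> bool:
    """Check whether answering the sub-steps separately and substituting those
    answers back into the final step matches the end-to-end answer."""
    end_to_end = query_model(full_question)
    sub_answers = [query_model(q) for q in sub_questions]
    composed = query_model(compose_prompt(sub_answers))
    return end_to_end.strip() == composed.strip()
```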