Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

Buratti, Luca; Ding, Yangruibo; Jana, Suman; Kaiser, Gail; Min, Marcus J.; Pujar, Saurabh; Ray, Baishakhi

Authors: Luca Buratti
Yangruibo Ding
Suman Jana
Gail Kaiser
Marcus J. Min
Saurabh Pujar
Baishakhi Ray
Publication date: 21 October 2023

Abstract

Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the general accuracy of Code LLMs on individual tasks has been extensively evaluated, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and general accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from general accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool to expose weaknesses of Code LLMs by demonstrating three major weaknesses that we identify in current models using IdentityChain. Our code is available at https://github.com/marcusm117/IdentityChain.
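
The self-consistency idea in the abstract can be illustrated with a small sketch. This is not the paper's formal definition and not the IdentityChain API; every name below (`is_self_consistent`, `toy_nl_to_pl`, `faithful_pl_to_nl`, `drifting_pl_to_nl`) is hypothetical. It assumes a model exposes two directions, specification-to-code and code-to-specification, and it assumes that agreement on a handful of test inputs is an acceptable stand-in for "same semantics".

```python
from typing import Callable, Sequence

Program = Callable[[int], int]

def is_self_consistent(
    nl_to_pl: Callable[[str], Program],
    pl_to_nl: Callable[[Program], str],
    spec: str,
    test_inputs: Sequence[int],
    steps: int = 1,
) -> bool:
    """Alternate code -> spec -> code and check that behavior never changes.

    Assumption: matching outputs on test_inputs approximates semantic equality.
    """
    program = nl_to_pl(spec)
    expected = [program(x) for x in test_inputs]
    for _ in range(steps):
        spec = pl_to_nl(program)      # the model describes its own code
        program = nl_to_pl(spec)      # the model implements its own description
        if [program(x) for x in test_inputs] != expected:
            return False
    return True

# Toy stand-ins for a model (hypothetical, for illustration only).
PROGRAMS = {
    "double the input": lambda x: 2 * x,
    "add two to the input": lambda x: x + 2,
}

def toy_nl_to_pl(spec: str) -> Program:
    return PROGRAMS[spec]

def faithful_pl_to_nl(program: Program) -> str:
    # Returns the specification that actually matches the program.
    return next(s for s, p in PROGRAMS.items() if p is program)

def drifting_pl_to_nl(program: Program) -> str:
    # Always gives the same description, whatever the program does.
    return "add two to the input"
```

With these stand-ins, a chain that starts from "double the input" stays consistent under `faithful_pl_to_nl` and breaks under `drifting_pl_to_nl`, even though the first generated program is correct in both cases. That is the sense in which self-consistency is a separate question from accuracy on a single task.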

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2310.14053

Last time updated on 16/01/2024