Code Large Language Models (Code LLMs) are being increasingly employed in
real-life applications, so evaluating them is critical. While the general
accuracy of Code LLMs on individual tasks has been extensively evaluated, their
self-consistency across different tasks is overlooked. Intuitively, a
trustworthy model should be self-consistent when generating natural language
specifications for its own code and generating code for its own specifications.
Failure to preserve self-consistency reveals a lack of understanding of the
shared semantics underlying natural language and programming language, and
therefore undermines the trustworthiness of a model. In this paper, we first
formally define the self-consistency of Code LLMs and then design a framework,
IdentityChain, which effectively and efficiently evaluates the self-consistency
and general accuracy of a model at the same time. We study eleven Code LLMs and
show that they fail to preserve self-consistency, which is indeed a distinct
aspect from general accuracy. Furthermore, we show that IdentityChain can be
used as a model debugging tool to expose weaknesses of Code LLMs by
demonstrating three major weaknesses that we identify in current models using
IdentityChain. Our code is available at
https://github.com/marcusm117/IdentityChain.Comment: Code available at https://github.com/marcusm117/IdentityChai