In natural language processing, transformer-based large language models (LLMs),
such as OpenAI's GPT series, have revolutionized the field. Despite their
impressive capabilities, these models often struggle with tasks that differ
from their training data, which degrades their performance. To address this,
few-shot learning has emerged as a valuable technique that allows LLMs to adapt
to new tasks with minimal task-specific data. One such strategy,
Chain-of-Thought (CoT) prompting, guides LLMs to make their intermediate
reasoning explicit during multi-step problem solving. In this paper, we propose
Code Chain-of-Thought~(CodeCoT), which
consists of two components: the Vanilla CodeCoT and the Self-exam CodeCoT. The
latter incorporates self-examination, enabling the model to iteratively
generate code, formulate test cases, and refine its outputs. Specifically, the
model generates test cases for the code it is asked to implement; if the code
fails these tests, the model regenerates it, conditioned on the erroneous code
and the associated error types (a minimal sketch of this loop is given below).
Through comprehensive experiments, we observe that both techniques
significantly improve code generation accuracy across various LLM variants. Our
evaluation results show that CodeCoT improves code generation effectiveness,
including an unprecedented pass@1 accuracy of 79.27\% with the Self-exam
CodeCoT approach on the gpt-3.5-turbo-0613 model on the HumanEval dataset.
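
As a minimal illustration, the following Python sketch shows one way the
Self-exam CodeCoT loop could be organized. The \texttt{llm.complete} interface,
the \texttt{run\_tests} helper, and the \texttt{max\_rounds} bound are
hypothetical placeholders rather than part of CodeCoT itself, and the prompts
are simplified stand-ins for the actual prompt templates.

\begin{verbatim}
# Minimal sketch of the Self-exam CodeCoT loop (illustrative only).
# `llm` is assumed to expose a complete(prompt) -> str method, and
# `run_tests(code, tests)` is assumed to return (passed, error_message);
# both are hypothetical stand-ins for the model client and test harness.

def self_exam_codecot(task, llm, run_tests, max_rounds=3):
    # Step 1: generate an initial solution, asking for step-by-step reasoning.
    code = llm.complete(f"Reason step by step, then write Python code for:\n{task}")
    # Step 2: have the model write test cases for the code it must implement.
    tests = llm.complete(f"Write test cases for the following task:\n{task}")
    # Step 3: run the tests; on failure, regenerate the code from the
    # erroneous code and the reported error, up to max_rounds times.
    for _ in range(max_rounds):
        passed, error = run_tests(code, tests)
        if passed:
            break
        code = llm.complete(
            f"The code below failed its tests.\n"
            f"Code:\n{code}\nError:\n{error}\n"
            f"Fix the code and return a corrected version."
        )
    return code
\end{verbatim}

In this sketch the refinement signal is simply the raw error message; the
self-examination described above additionally conditions the regeneration on
the associated error type.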