

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

Authors: Chai Huacan, Du Kounianhua, Fan Longteng, Fang Yuchen, Fu Lingyue, Lei Jiayi, Lin Jianghao, Liu Yifan, Luo Shuang, Qi Siyuan, Rui Renting, Wang Jingkuan, Yu Yong, Zhang Kangning, Zhang Weiming, Zhang Weinan
Publication date: 06/09/2023

With the emergence of Large Language Models (LLMs), the programming capabilities of models have improved significantly, attracting growing attention from researchers. We propose CodeApex, a bilingual benchmark dataset focusing on the programming comprehension and code generation abilities of LLMs. CodeApex comprises three types of multiple-choice questions: conceptual understanding, commonsense reasoning, and multi-hop reasoning, designed to evaluate LLMs on programming comprehension tasks. Additionally, CodeApex uses algorithmic questions and corresponding test cases to assess the quality of code generated by LLMs. We evaluate 14 state-of-the-art LLMs, including both general-purpose and specialized models. GPT exhibits the best programming capabilities, achieving approximate accuracies of 50% and 56% on the two tasks, respectively. There is still significant room for improvement in programming tasks. We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs, further promoting their development and growth. Datasets are released at https://github.com/APEXLAB/CodeApex.git. The CodeApex submission website is https://apex.sjtu.edu.cn/codeapex/.

Comment: 21 pages

arXiv.org e-Print Archive