1 research outputs found
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models
With the emergence of Large Language Models (LLMs), there has been a
significant improvement in the programming capabilities of models, attracting
growing attention from researchers. We propose CodeApex, a bilingual benchmark
dataset focusing on the programming comprehension and code generation abilities
of LLMs. CodeApex comprises three types of multiple-choice questions:
conceptual understanding, commonsense reasoning, and multi-hop reasoning,
designed to evaluate LLMs on programming comprehension tasks. Additionally,
CodeApex utilizes algorithmic questions and corresponding test cases to assess
the code quality generated by LLMs. We evaluate 14 state-of-the-art LLMs,
including both general-purpose and specialized models. GPT exhibits the best
programming capabilities, achieving approximate accuracies of 50% and 56% on
the two tasks, respectively. There is still significant room for improvement in
programming tasks. We hope that CodeApex can serve as a reference for
evaluating the coding capabilities of LLMs, further promoting their development
and growth. Datasets are released at https://github.com/APEXLAB/CodeApex.git.
CodeApex submission website is https://apex.sjtu.edu.cn/codeapex/.Comment: 21 page