PsyBench: a balanced and in-depth Psychological Chinese Evaluation Benchmark for Foundation Models

Authors: He Hongliang, He Shuyuan, Lan Zhenzhong, Li Anqi, Ma Lizhi, Qiu Huachuan, Song Nirui, Zhang Junlei, Zhang Shuai
Publication date: 16/11/2023

As Large Language Models (LLMs) become prevalent in various fields, there is an urgent need for improved NLP benchmarks that encompass all the necessary knowledge of individual disciplines. Many contemporary benchmarks for foundation models cover a broad range of subjects but often fail to include all the critical subjects or the professional knowledge each one requires. This shortfall has led to skewed results, given that LLMs perform differently across subjects and knowledge areas. To address this issue, we present PsyBench, the first comprehensive Chinese evaluation suite that covers all the necessary knowledge required for graduate entrance exams. PsyBench offers a deep evaluation of a model's strengths and weaknesses in psychology through multiple-choice questions. Our findings show significant differences in performance across different sections of a subject, highlighting the risk of skewed results when the knowledge in test sets is not balanced. Notably, only the ChatGPT model reaches an average accuracy above 70%, indicating that there is still plenty of room for improvement. We expect that PsyBench will help to conduct thorough evaluations of base models' strengths and weaknesses and assist in practical applications in the field of psychology.

arXiv.org e-Print Archive