The rapid development of Large Language Models (LLMs) has brought great
strides in capabilities such as reasoning and long-context understanding.
However, as LLMs are able to process longer contexts, it becomes more
challenging to evaluate whether they have acquired certain capabilities, since
the length of text (e.g., 100K tokens) they can process far exceeds what humans
can reliably assess within a reasonable amount of time. In this paper, we propose using
complex synthetic tasks as a proxy evaluation method, and present S3Eval, a
Synthetic, Scalable, Systematic evaluation suite for LLM evaluation. As a
synthetic benchmark, S3Eval enables the creation of an arbitrary number of evaluation
examples that, in principle, LLMs have never seen, mitigating the test set
contamination issue. The synthetic nature of S3Eval gives users full control
over the dataset, allowing them to systematically probe LLM capabilities by
scaling text length and varying task difficulty across diverse scenarios. The
strong correlation between S3Eval performance and scores on real-world
benchmarks like Big-Bench Hard (BBH) demonstrates the soundness of using S3Eval
to evaluate LLMs. Our in-depth analysis also uncovers additional insights,
including a performance drop when answers are sparsely distributed or located
in the middle of the context, as well as some counter-intuitive trends in model
performance.
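To make the "synthetic and scalable" idea concrete, the sketch below shows a toy generator for a key-value retrieval task whose context length and difficulty can be dialed up arbitrarily. This is a hypothetical illustration only; the task, the `make_example` function, and its parameters are assumptions for exposition and are not the actual S3Eval implementation.

```python
import random
import string

# Hypothetical sketch (not the actual S3Eval task): generate a synthetic
# key-value retrieval example whose context length grows with num_entries,
# so evaluation examples can be produced on demand and are unlikely to
# appear verbatim in any training corpus.
def make_example(num_entries: int = 100, key_len: int = 8, seed: int | None = None):
    rng = random.Random(seed)
    keys = ["".join(rng.choices(string.ascii_lowercase, k=key_len))
            for _ in range(num_entries)]
    values = [str(rng.randint(0, 10**6)) for _ in range(num_entries)]
    target = rng.randrange(num_entries)  # position of the queried key
    context = "\n".join(f"{k}: {v}" for k, v in zip(keys, values))
    question = f"What is the value of {keys[target]}?"
    return {"context": context, "question": question, "answer": values[target]}

if __name__ == "__main__":
    # Larger num_entries -> longer context; seed gives reproducible examples.
    ex = make_example(num_entries=1000, seed=0)
    print(ex["question"], "->", ex["answer"])
```

Because the generator is parameterized, one could sweep `num_entries` to scale text length or add distractor structure to vary difficulty, which is the kind of controlled probing the abstract describes.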