The rapid development of Large Language Models (LLMs) has brought great
strides in capabilities such as reasoning and long-context understanding.
However, as LLMs are able to process longer contexts, it becomes more
challenging to evaluate whether they have acquired certain capabilities, since
the length of text (e.g., 100K tokens) they can process far exceeds what humans
can reliably assess within a reasonable amount of time. In this paper, we propose using
complex synthetic tasks as a proxy evaluation method, and present S3Eval, a
Synthetic, Scalable, Systematic evaluation suite for LLM evaluation. As a
synthetic benchmark, S3Eval enables the creation of an arbitrary number of evaluation
examples that, in principle, LLMs have never seen, mitigating the test set
contamination issue. The synthetic nature of S3Eval gives users full control
over the dataset, allowing them to systematically probe LLM capabilities by
scaling text length and varying task difficulty across diverse scenarios. The
strong correlation between S3Eval performance and scores on real-world
benchmarks like Big-Bench Hard (BBH) demonstrates the soundness of using S3Eval
to evaluate LLMs. Our in-depth analysis also uncovers additional insights,
including a performance drop when answers are sparsely distributed or located
in the middle of the context, as well as some counter-intuitive trends in model
performance.
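To make the "synthetic and scalable" idea concrete, the sketch below shows a toy generator for a key-value retrieval task whose context length and difficulty can be dialed up arbitrarily. This is a hypothetical illustration only; the task, the `make_example` function, and its parameters are assumptions for exposition and are not the actual S3Eval implementation.

```python
import random
import string

# Hypothetical sketch (not the actual S3Eval task): generate a synthetic
# key-value retrieval example whose context length grows with num_entries,
# so evaluation examples can be produced on demand and are unlikely to
# appear verbatim in any training corpus.
def make_example(num_entries: int = 100, key_len: int = 8, seed: int | None = None):
    rng = random.Random(seed)
    keys = ["".join(rng.choices(string.ascii_lowercase, k=key_len))
            for _ in range(num_entries)]
    values = [str(rng.randint(0, 10**6)) for _ in range(num_entries)]
    target = rng.randrange(num_entries)  # position of the queried key
    context = "\n".join(f"{k}: {v}" for k, v in zip(keys, values))
    question = f"What is the value of {keys[target]}?"
    return {"context": context, "question": question, "answer": values[target]}

if __name__ == "__main__":
    # Larger num_entries -> longer context; seed gives reproducible examples.
    ex = make_example(num_entries=1000, seed=0)
    print(ex["question"], "->", ex["answer"])
```

Because the generator is parameterized, one could sweep `num_entries` to scale text length or add distractor structure to vary difficulty, which is the kind of controlled probing the abstract describes.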