Learning to judge: LLMs designing and applying evaluation rubrics

Siro, C.N. (Clemencia); Aliannejadi, P. (Pourya); Aliannejadi, M. (Mohammad)

conference paper

oai:cwi.nl:36235

Authors: C.N. (Clemencia) Siro; P. (Pourya) Aliannejadi; M. (Mohammad) Aliannejadi
Publication date: 24 March 2026
Abstract

Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and apply their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable and task-aware evaluation dimensions and apply them consistently within models, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs, consistent within models but fragmented across them, and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability.
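The abstract refers to agreement between evaluators without naming a metric. As a minimal sketch of how agreement between two judges applying the same rubric could be quantified, the block below computes Cohen's kappa over paired scores. The choice of Cohen's kappa, the function name `cohen_kappa`, and the sample scores are all assumptions made for illustration; none of them is taken from the paper.

```python
from collections import Counter


def cohen_kappa(scores_a, scores_b):
    """Cohen's kappa for two equal-length lists of categorical scores.

    Illustrative helper (hypothetical, not from the paper): measures how far
    the observed agreement exceeds the agreement expected by chance.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and equal in length")
    n = len(scores_a)
    # Fraction of items on which both judges gave the same score.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement from each judge's marginal score distribution.
    freq_a = Counter(scores_a)
    freq_b = Counter(scores_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both judges used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical 1-5 rubric scores from two judges on the same eight outputs.
judge_a = [5, 4, 4, 2, 3, 5, 1, 2]
judge_b = [5, 4, 3, 2, 3, 4, 1, 2]
kappa = cohen_kappa(judge_a, judge_b)
```

Kappa treats scores as unordered categories, so a one-point disagreement counts the same as a four-point one; for ordinal rubric scales a weighted variant or a rank correlation may be a better fit.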

info:eu-repo/semantics/conferenceObject

Last time updated on 25/02/2026

This paper was published in CWI's Institutional Repository.