ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles
Large language model (LLM) prompting is a promising new approach for users to
create and customize their own chatbots. However, current methods for steering
a chatbot's outputs, such as prompt engineering and fine-tuning, do not support
users in converting their natural feedback on the model's outputs to changes in
the prompt or model. In this work, we explore how to enable users to
interactively refine model outputs through their feedback, by helping them
convert their feedback into a set of principles (i.e. a constitution) that
dictate the model's behavior. From a formative study, we (1) found that users
needed support converting their feedback into principles for the chatbot and
(2) classified the different principle types desired by users. Inspired by
these findings, we developed ConstitutionMaker, an interactive tool for
converting user feedback into principles, to steer LLM-based chatbots. With
ConstitutionMaker, users can provide either positive or negative feedback in
natural language, select auto-generated feedback, or rewrite the chatbot's
response; each mode of feedback automatically generates a principle that is
inserted into the chatbot's prompt. In a user study with 14 participants, we
compare ConstitutionMaker to an ablated version, where users write their own
principles. With ConstitutionMaker, participants felt that their principles
could better guide the chatbot, that they could more easily convert their
feedback into principles, and that they could write principles more
efficiently, with less mental demand. ConstitutionMaker helped users identify
ways to improve the chatbot, formulate their intuitive responses to the model
into feedback, and convert this feedback into specific and clear principles.
Together, these findings inform future tools that support the interactive
critiquing of LLM outputs.
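A minimal sketch of the feedback-to-principle loop the abstract describes, assuming a generic chat-completion client: `call_llm` is a hypothetical stand-in for any LLM API, and the prompt wording is illustrative, not the paper's own.

```python
# Hypothetical sketch: convert free-form user feedback into a principle,
# then insert the accumulated constitution into the chatbot's prompt.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # assumption

def feedback_to_principle(chatbot_response: str, feedback: str) -> str:
    """Rewrite free-form user feedback as one reusable behavioral principle."""
    prompt = (
        "A user gave feedback on a chatbot response.\n"
        f"Response: {chatbot_response}\n"
        f"Feedback: {feedback}\n"
        "Rewrite the feedback as one clear, general principle the chatbot "
        "should follow in future conversations.\nPrinciple:"
    )
    return call_llm(prompt).strip()

def build_prompt(base_prompt: str, principles: list[str]) -> str:
    """Insert the accumulated constitution into the chatbot's prompt."""
    constitution = "\n".join(f"- {p}" for p in principles)
    return f"{base_prompt}\n\nFollow these principles:\n{constitution}"
```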
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
Automatic side-by-side evaluation has emerged as a promising approach to
evaluating the quality of responses from large language models (LLMs). However,
analyzing the results from this evaluation approach raises scalability and
interpretability challenges. In this paper, we present LLM Comparator, a novel
visual analytics tool for interactively analyzing results from automatic
side-by-side evaluation. The tool supports interactive workflows for users to
understand when and why a model performs better or worse than a baseline model,
and how the responses from two models are qualitatively different. We
iteratively designed and developed the tool by closely working with researchers
and engineers at a large technology company. This paper details the user
challenges we identified, the design and development of the tool, and an
observational study with participants who regularly evaluate their models.
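A minimal sketch of the kind of aggregation that underlies such a tool: tallying automatic side-by-side verdicts to see where model A beats the baseline B, broken down by prompt category. The record fields ("category", "winner") are illustrative assumptions, not the tool's actual schema.

```python
# Hypothetical sketch: compute win/loss/tie rates per prompt category
# from automatic side-by-side verdicts.

from collections import Counter

def summarize_side_by_side(verdicts: list[dict]) -> dict:
    """Compute win/loss/tie rates of model A vs. baseline B per category."""
    by_category: dict[str, Counter] = {}
    for v in verdicts:
        by_category.setdefault(v["category"], Counter())[v["winner"]] += 1
    summary = {}
    for category, counts in by_category.items():
        total = sum(counts.values())
        summary[category] = {w: counts[w] / total for w in ("A", "B", "tie")}
    return summary

print(summarize_side_by_side([
    {"category": "summarization", "winner": "A"},
    {"category": "summarization", "winner": "tie"},
    {"category": "coding", "winner": "B"},
]))
```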
GEMv2: Multilingual NLG benchmarking in a single line of code
Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on an equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
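A hypothetical sketch of the "single line of code" usage the title promises. The `gem_benchmark` module and `evaluate` function below are illustrative names, not the project's confirmed API; consult the GEM documentation for the real entry point.

```python
# Hypothetical sketch: score model outputs for a dataset in one call.

import gem_benchmark  # hypothetical package name, used for illustration

# Map dataset names to model outputs, then evaluate everything in one line.
results = gem_benchmark.evaluate({"xsum": ["A short model-written summary."]})
print(results)
```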