ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles
Large language model (LLM) prompting is a promising new approach for users to
create and customize their own chatbots. However, current methods for steering
a chatbot's outputs, such as prompt engineering and fine-tuning, do not support
users in converting their natural feedback on the model's outputs to changes in
the prompt or model. In this work, we explore how to enable users to
interactively refine model outputs through their feedback, by helping them
convert their feedback into a set of principles (i.e. a constitution) that
dictate the model's behavior. From a formative study, we (1) found that users
needed support converting their feedback into principles for the chatbot and
(2) classified the different principle types desired by users. Inspired by
these findings, we developed ConstitutionMaker, an interactive tool for
converting user feedback into principles, to steer LLM-based chatbots. With
ConstitutionMaker, users can provide either positive or negative feedback in
natural language, select auto-generated feedback, or rewrite the chatbot's
response; each mode of feedback automatically generates a principle that is
inserted into the chatbot's prompt. In a user study with 14 participants, we
compare ConstitutionMaker to an ablated version, where users write their own
principles. With ConstitutionMaker, participants felt that their principles
could better guide the chatbot, that they could more easily convert their
feedback into principles, and that they could write principles more
efficiently, with less mental demand. ConstitutionMaker helped users identify
ways to improve the chatbot, formulate their intuitive responses to the model
into feedback, and convert this feedback into specific and clear principles.
Together, these findings inform future tools that support the interactive
critiquing of LLM outputs.
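A minimal sketch of the feedback-to-principle loop the abstract describes, assuming a generic chat-completion client: `call_llm` is a hypothetical stand-in for any LLM API, and the prompt wording is illustrative, not the paper's own.

```python
# Hypothetical sketch: convert free-form user feedback into a principle,
# then insert the accumulated constitution into the chatbot's prompt.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # assumption

def feedback_to_principle(chatbot_response: str, feedback: str) -> str:
    """Rewrite free-form user feedback as one reusable behavioral principle."""
    prompt = (
        "A user gave feedback on a chatbot response.\n"
        f"Response: {chatbot_response}\n"
        f"Feedback: {feedback}\n"
        "Rewrite the feedback as one clear, general principle the chatbot "
        "should follow in future conversations.\nPrinciple:"
    )
    return call_llm(prompt).strip()

def build_prompt(base_prompt: str, principles: list[str]) -> str:
    """Insert the accumulated constitution into the chatbot's prompt."""
    constitution = "\n".join(f"- {p}" for p in principles)
    return f"{base_prompt}\n\nFollow these principles:\n{constitution}"
```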
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
Automatic side-by-side evaluation has emerged as a promising approach to
evaluating the quality of responses from large language models (LLMs). However,
analyzing the results from this evaluation approach raises scalability and
interpretability challenges. In this paper, we present LLM Comparator, a novel
visual analytics tool for interactively analyzing results from automatic
side-by-side evaluation. The tool supports interactive workflows for users to
understand when and why a model performs better or worse than a baseline model,
and how the responses from two models are qualitatively different. We
iteratively designed and developed the tool by closely working with researchers
and engineers at a large technology company. This paper details the user
challenges we identified, the design and development of the tool, and an
observational study with participants who regularly evaluate their models.
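A minimal sketch of the kind of aggregation that underlies such a tool: tallying automatic side-by-side verdicts to see where model A beats the baseline B, broken down by prompt category. The record fields ("category", "winner") are illustrative assumptions, not the tool's actual schema.

```python
# Hypothetical sketch: compute win/loss/tie rates per prompt category
# from automatic side-by-side verdicts.

from collections import Counter

def summarize_side_by_side(verdicts: list[dict]) -> dict:
    """Compute win/loss/tie rates of model A vs. baseline B per category."""
    by_category: dict[str, Counter] = {}
    for v in verdicts:
        by_category.setdefault(v["category"], Counter())[v["winner"]] += 1
    summary = {}
    for category, counts in by_category.items():
        total = sum(counts.values())
        summary[category] = {w: counts[w] / total for w in ("A", "B", "tie")}
    return summary

print(summarize_side_by_side([
    {"category": "summarization", "winner": "A"},
    {"category": "summarization", "winner": "tie"},
    {"category": "coding", "winner": "B"},
]))
```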
GEMv2: Multilingual NLG benchmarking in a single line of code
Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on an equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
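A hypothetical sketch of the "single line of code" usage the title promises. The `gem_benchmark` module and `evaluate` function below are illustrative names, not the project's confirmed API; consult the GEM documentation for the real entry point.

```python
# Hypothetical sketch: score model outputs for a dataset in one call.

import gem_benchmark  # hypothetical package name, used for illustration

# Map dataset names to model outputs, then evaluate everything in one line.
results = gem_benchmark.evaluate({"xsum": ["A short model-written summary."]})
print(results)
```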