AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap
The rise of powerful large language models (LLMs) brings about tremendous
opportunities for innovation but also looming risks for individuals and society
at large. We have reached a pivotal moment for ensuring that LLMs and
LLM-infused applications are developed and deployed responsibly. However, a
central pillar of responsible AI -- transparency -- is largely missing from the
current discourse around LLMs. It is paramount to pursue new approaches to
provide transparency for LLMs, and years of research at the intersection of AI
and human-computer interaction (HCI) highlight that we must do so with a
human-centered perspective: Transparency is fundamentally about supporting
appropriate human understanding, and this understanding is sought by different
stakeholders with different goals in different contexts. In this new era of
LLMs, we must develop and design approaches to transparency by considering the
needs of stakeholders in the emerging LLM ecosystem, the novel types of
LLM-infused applications being built, and the new usage patterns and challenges
around LLMs, all while building on lessons learned about how people process,
interact with, and make use of information. We reflect on the unique challenges
that arise in providing transparency for LLMs, along with lessons learned from
HCI and responsible AI research that has taken a human-centered perspective on
AI transparency. We then lay out four common approaches that the community has
taken to achieve transparency -- model reporting, publishing evaluation
results, providing explanations, and communicating uncertainty -- and call out
open questions around how these approaches may or may not be applied to LLMs.
We hope this provides a starting point for discussion and a useful roadmap for
future research.
Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective
We address a fundamental challenge in Natural Language Generation (NLG)
model evaluation: the design and validation of evaluation metrics. Recognizing
the limitations of existing metrics and issues with human judgment, we propose
using measurement theory, the foundation of test design, as a framework for
conceptualizing and evaluating the validity and reliability of NLG evaluation
metrics. This approach offers a systematic method for defining "good" metrics,
developing robust metrics, and assessing metric performance. In this paper, we
introduce core concepts in measurement theory in the context of NLG evaluation
and key methods to evaluate the performance of NLG metrics. Through this
framework, we aim to promote the design, evaluation, and interpretation of
valid and reliable metrics, ultimately contributing to the advancement of
robust and effective NLG models in real-world settings.
Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making
Today, AI is being increasingly used to help human experts make decisions in
high-stakes scenarios. In these scenarios, full automation is often
undesirable, not only due to the significance of the outcome, but also because
human experts can draw on their domain knowledge complementary to the model's
to ensure task success. We refer to these scenarios as AI-assisted decision
making, where the individual strengths of the human and the AI come together to
optimize the joint decision outcome. A key to their success is to appropriately
calibrate human trust in the AI on a case-by-case basis; knowing when
to trust or distrust the AI allows the human expert to appropriately apply
their knowledge, improving decision outcomes in cases where the model is likely
to perform poorly. This research conducts a case study of AI-assisted decision
making in which humans and AI have comparable performance alone, and explores
whether features that reveal case-specific model information can calibrate
trust and improve the joint performance of the human and AI. Specifically, we
study the effect of showing a confidence score and a local explanation for a
particular prediction. Through two human experiments, we show that a confidence
score can help calibrate people's trust in an AI model, but trust calibration
alone is not sufficient to improve AI-assisted decision making, which may also
depend on whether the human can bring in enough unique knowledge to complement
the AI's errors. We also highlight the problems in using local explanations for
AI-assisted decision making scenarios and invite the research community to
explore new approaches to explainability for calibrating human trust in AI.