12 research outputs found
AttributionBench: How Hard is Automatic Attribution Evaluation?
Modern generative search engines enhance the reliability of large language
model (LLM) responses by providing cited evidence. However, evaluating the
answer's attribution, i.e., whether every claim within the generated responses
is fully supported by its cited evidence, remains an open problem. This
verification, traditionally dependent on costly human evaluation, underscores
the urgent need for automatic attribution evaluation methods. To bridge the gap
in the absence of standardized benchmarks for these methods, we present
AttributionBench, a comprehensive benchmark compiled from various existing
attribution datasets. Our extensive experiments on AttributionBench reveal the
challenges of automatic attribution evaluation, even for state-of-the-art LLMs.
Specifically, our findings show that even a fine-tuned GPT-3.5 only achieves
around 80% macro-F1 under a binary classification formulation. A detailed
analysis of more than 300 error cases indicates that a majority of failures
stem from the model's inability to process nuanced information, and the
discrepancy between the information the model has access to and that human
annotators do
A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents
Language agents powered by large language models (LLMs) have seen exploding
development. Their capability of using language as a vehicle for thought and
communication lends an incredible level of flexibility and versatility. People
have quickly capitalized on this capability to connect LLMs to a wide range of
external components and environments: databases, tools, the Internet, robotic
embodiment, etc. Many believe an unprecedentedly powerful automation technology
is emerging. However, new automation technologies come with new safety risks,
especially for intricate systems like language agents. There is a surprisingly
large gap between the speed and scale of their development and deployment and
our understanding of their safety risks. Are we building a house of cards? In
this position paper, we present the first systematic effort in mapping
adversarial attacks against language agents. We first present a unified
conceptual framework for agents with three major components: Perception, Brain,
and Action. Under this framework, we present a comprehensive discussion and
propose 12 potential attack scenarios against different components of an agent,
covering different attack strategies (e.g., input manipulation, adversarial
demonstrations, jailbreaking, backdoors). We also draw connections to
successful attack strategies previously applied to LLMs. We emphasize the
urgency to gain a thorough understanding of language agent risks before their
widespread deployment
In Search of the Long-Tail: Systematic Generation of Long-Tail Knowledge via Logical Rule Guided Search
Since large language models have approached human-level performance on many
tasks, it has become increasingly harder for researchers to find tasks that are
still challenging to the models. Failure cases usually come from the long-tail
distribution - data that an oracle language model could assign a probability on
the lower end of its distribution. Current methodology such as prompt
engineering or crowdsourcing are insufficient for creating long-tail examples
because humans are constrained by cognitive bias. We propose a
Logic-Induced-Knowledge-Search (LINK) framework for systematically generating
long-tail knowledge statements. Grounded by a symbolic rule, we search for
long-tail values for each variable of the rule by first prompting a LLM, then
verifying the correctness of the values with a critic, and lastly pushing for
the long-tail distribution with a reranker. With this framework we construct a
dataset, Logic-Induced-Long-Tail (LINT), consisting of 200 symbolic rules and
50K knowledge statements spanning across four domains. Human annotations find
that 84% of the statements in LINT are factually correct. In contrast, ChatGPT
and GPT4 struggle with directly generating long-tail statements under the
guidance of logic rules, each only getting 56% and 78% of their statements
correct. Moreover, their "long-tail" generations in fact fall into the higher
likelihood range, and thus are not really long-tail. Our findings suggest that
LINK is effective for generating data in the long-tail distribution while
enforcing quality. LINT can be useful for systematically evaluating LLMs'
capabilities in the long-tail distribution. We challenge the models with a
simple entailment classification task using samples from LINT. We find that
ChatGPT and GPT4's capability in identifying incorrect knowledge drop by ~3% in
the long-tail distribution compared to head distribution
Introducing v0.5 of the AI Safety Benchmark from MLCommons
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark
Introducing v0.5 of the AI Safety Benchmark from MLCommons
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark
RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning
Transformers have been shown to be able to perform deductive reasoning on a
logical rulebase containing rules and statements written in English natural
language. While the progress is promising, it is currently unclear if these
models indeed perform logical reasoning by understanding the underlying logical
semantics in the language. To this end, we propose RobustLR, a suite of
evaluation datasets that evaluate the robustness of these models to minimal
logical edits in rulebases and some standard logical equivalence conditions. In
our experiments with RoBERTa and T5, we find that the models trained in prior
works do not perform consistently on the different perturbations in RobustLR,
thus showing that the models are not robust to the proposed logical
perturbations. Further, we find that the models find it especially hard to
learn logical negation and disjunction operators. Overall, using our evaluation
sets, we demonstrate some shortcomings of the deductive reasoning-based
language models, which can eventually help towards designing better models for
logical reasoning over natural language.Comment: 13 page
Logic-Induced-Long-Tail (LINT)
<p>Logic-Induced-Long-Tail (LINT) dataset for arxiv paper "IN SEARCH OF THE LONG-TAIL: SYSTEMATIC GENERATION OF LONG-TAIL KNOWLEDGE VIA LOGICAL RULE GUIDED SEARCH."</p>