10 research outputs found
In Search of the Long-Tail: Systematic Generation of Long-Tail Knowledge via Logical Rule Guided Search
Since large language models have approached human-level performance on many
tasks, it has become increasingly harder for researchers to find tasks that are
still challenging to the models. Failure cases usually come from the long-tail
distribution - data that an oracle language model could assign a probability on
the lower end of its distribution. Current methodology such as prompt
engineering or crowdsourcing are insufficient for creating long-tail examples
because humans are constrained by cognitive bias. We propose a
Logic-Induced-Knowledge-Search (LINK) framework for systematically generating
long-tail knowledge statements. Grounded by a symbolic rule, we search for
long-tail values for each variable of the rule by first prompting a LLM, then
verifying the correctness of the values with a critic, and lastly pushing for
the long-tail distribution with a reranker. With this framework we construct a
dataset, Logic-Induced-Long-Tail (LINT), consisting of 200 symbolic rules and
50K knowledge statements spanning across four domains. Human annotations find
that 84% of the statements in LINT are factually correct. In contrast, ChatGPT
and GPT4 struggle with directly generating long-tail statements under the
guidance of logic rules, each only getting 56% and 78% of their statements
correct. Moreover, their "long-tail" generations in fact fall into the higher
likelihood range, and thus are not really long-tail. Our findings suggest that
LINK is effective for generating data in the long-tail distribution while
enforcing quality. LINT can be useful for systematically evaluating LLMs'
capabilities in the long-tail distribution. We challenge the models with a
simple entailment classification task using samples from LINT. We find that
ChatGPT and GPT4's capability in identifying incorrect knowledge drop by ~3% in
the long-tail distribution compared to head distribution
Introducing v0.5 of the AI Safety Benchmark from MLCommons
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark
Introducing v0.5 of the AI Safety Benchmark from MLCommons
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark
RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning
Transformers have been shown to be able to perform deductive reasoning on a
logical rulebase containing rules and statements written in English natural
language. While the progress is promising, it is currently unclear if these
models indeed perform logical reasoning by understanding the underlying logical
semantics in the language. To this end, we propose RobustLR, a suite of
evaluation datasets that evaluate the robustness of these models to minimal
logical edits in rulebases and some standard logical equivalence conditions. In
our experiments with RoBERTa and T5, we find that the models trained in prior
works do not perform consistently on the different perturbations in RobustLR,
thus showing that the models are not robust to the proposed logical
perturbations. Further, we find that the models find it especially hard to
learn logical negation and disjunction operators. Overall, using our evaluation
sets, we demonstrate some shortcomings of the deductive reasoning-based
language models, which can eventually help towards designing better models for
logical reasoning over natural language.Comment: 13 page
Logic-Induced-Long-Tail (LINT)
<p>Logic-Induced-Long-Tail (LINT) dataset for arxiv paper "IN SEARCH OF THE LONG-TAIL: SYSTEMATIC GENERATION OF LONG-TAIL KNOWLEDGE VIA LOGICAL RULE GUIDED SEARCH."</p>
Introducing v0.5 of the AI Safety Benchmark from MLCommons
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark