KPEval: Towards Fine-grained Semantic-based Evaluation of Keyphrase Extraction and Generation Systems
Despite the significant advancements in keyphrase extraction and keyphrase
generation methods, the predominant approach to evaluation relies only on
exact matching with human references and disregards reference-free attributes.
This scheme fails to recognize systems that generate keyphrases that are
semantically equivalent to the references or keyphrases that have practical
utility. To better understand the strengths and weaknesses of different
keyphrase systems, we propose a comprehensive evaluation framework consisting
of six critical dimensions: naturalness, faithfulness, saliency, coverage,
diversity, and utility. For each dimension, we discuss the desiderata and
design semantic-based metrics that align with the evaluation objectives.
Rigorous meta-evaluation studies demonstrate that our evaluation strategy
correlates better with human preferences than a range of previously used
metrics. Using this framework, we re-evaluate 18 keyphrase systems and further
discover that (1) the best model differs across dimensions, with pre-trained
language models performing best in most of them; (2) the
utility in downstream tasks does not always correlate well with reference-based
metrics; and (3) large language models exhibit a strong performance in
reference-free evaluation.
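The core idea of moving beyond exact matching can be illustrated with a minimal sketch. The character-trigram "embedding" below is a toy stand-in for the learned phrase embeddings a framework like KPEval would use, and the 0.6 threshold is an arbitrary illustrative choice, not a value from the paper.

```python
from collections import Counter
import math

def trigram_vector(phrase):
    # Toy embedding: bag of character trigrams (a stand-in for a learned
    # phrase embedding from a pre-trained encoder).
    s = f"  {phrase.lower()}  "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def soft_match_f1(predicted, references, threshold=0.6):
    # Count a prediction as correct if it is semantically close to ANY
    # reference, rather than requiring an exact string match.
    ref_vecs = [trigram_vector(r) for r in references]
    hits = sum(
        1 for p in predicted
        if max((cosine(trigram_vector(p), rv) for rv in ref_vecs),
               default=0.0) >= threshold
    )
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(references) if references else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this scheme a near-paraphrase of a reference keyphrase can still score, whereas exact-match F1 would count it as a miss.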
How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?
Text-to-image generative models have achieved unprecedented success in
generating high-quality images based on natural language descriptions. However,
it is shown that these models tend to favor specific social groups when
prompted with neutral text descriptions (e.g., 'a photo of a lawyer').
Following Zhao et al. (2021), we study the effect on the diversity of the
generated images when adding an ethical intervention that supports equitable
judgment (e.g., 'if all individuals can be a lawyer irrespective of their
gender') in the input prompts. To this end, we introduce an Ethical NaTural
Language Interventions in Text-to-Image GENeration (ENTIGEN) benchmark dataset
to evaluate the change in image generations conditional on ethical
interventions across three social axes -- gender, skin color, and culture.
Through the ENTIGEN framework, we find that the generations from minDALL.E,
DALL.E-mini and Stable Diffusion cover diverse social groups while preserving
the image quality. Preliminary studies indicate that a large change in the
model predictions is triggered by certain phrases such as 'irrespective of
gender' in the context of gender bias in the ethical interventions. We release
code and annotated data at https://github.com/Hritikbansal/entigen_emnlp.
Comment: 13 pages, 8 figures, 6 tables. Accepted as Oral Presentation at EMNLP 202
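Constructing an intervention prompt is mechanically simple; the sketch below mirrors the example phrasing quoted in the abstract ("if all individuals can be a lawyer irrespective of their gender"). The template and the axis strings are illustrative assumptions, not the benchmark's exact prompt set.

```python
# Illustrative intervention clauses for the three social axes studied.
INTERVENTIONS = {
    "gender": "irrespective of their gender",
    "skin color": "irrespective of their skin color",
    "culture": "irrespective of their culture",
}

def add_intervention(prompt, profession, axis):
    # Append an ethical intervention clause to a neutral prompt, e.g.
    # "a photo of a lawyer" ->
    # "a photo of a lawyer if all individuals can be a lawyer
    #  irrespective of their gender"
    return f"{prompt} if all individuals can be a {profession} {INTERVENTIONS[axis]}"
```

The benchmark then compares the diversity of images generated from the neutral prompt against the intervened one.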
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
Instruction tuning has emerged to enhance the capabilities of large language
models (LLMs) to comprehend instructions and generate appropriate responses.
Existing methods either manually annotate or employ LLMs (e.g., the GPT series) to
generate data for instruction tuning. However, they often overlook associating
instructions with existing annotated datasets. In this paper, we propose
Dynosaur, a dynamic growth paradigm for the automatic curation of
instruction-tuning data. Based on the metadata of existing datasets, we use
LLMs to automatically construct instruction-tuning data by identifying relevant
data fields and generating appropriate instructions.
By leveraging the existing annotated datasets, Dynosaur offers several
advantages: 1) it reduces the API cost for generating instructions (e.g., it
costs less than $12 USD to generate 800K instruction-tuning samples by calling
GPT-3.5-turbo); 2) it provides high-quality data for instruction
tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform
with comparable data sizes); and 3) it supports the continuous improvement of
models by generating instruction-tuning data when a new annotated dataset
becomes available. We further investigate a continual learning scheme for
learning with the ever-growing instruction-tuning dataset, and demonstrate that
replaying tasks with diverse instruction embeddings not only helps mitigate
forgetting issues but generalizes to unseen tasks better.
Code and data are available at https://github.com/WadeYin9712/Dynosaur.
Comment: EMNLP 2023.
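The replay idea ("replaying tasks with diverse instruction embeddings") admits a simple realization: pick replay tasks whose instruction embeddings are mutually far apart. The greedy farthest-point selection below is one plausible sketch of such a strategy, not the paper's exact algorithm, and the seed choice (task 0) is arbitrary.

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def diverse_replay(embeddings, k):
    # Greedy farthest-point selection: pick k task indices whose
    # instruction embeddings are mutually far apart, so the replay
    # buffer covers a diverse set of instructions.
    chosen = [0]  # seed with the first task (arbitrary choice)
    while len(chosen) < k:
        best, best_d = None, -1.0
        for i in range(len(embeddings)):
            if i in chosen:
                continue
            # Distance to the closest already-chosen task.
            d = min(euclid(embeddings[i], embeddings[j]) for j in chosen)
            if d > best_d:
                best, best_d = i, d
        chosen.append(best)
    return chosen
```

Maximizing pairwise spread in embedding space is a standard coverage heuristic; near-duplicate instructions add little signal to the replay buffer.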
How Does Data Augmentation Affect Privacy in Machine Learning?
It is observed in the literature that data augmentation can significantly
mitigate membership inference (MI) attacks. However, in this work, we challenge
this observation by proposing new MI attacks that exploit the information in
augmented data. MI attacks are widely used to measure a model's information
leakage about its training set. We establish the optimal membership inference when
the model is trained with augmented data, which inspires us to formulate the MI
attack as a set classification problem, i.e., classifying a set of augmented
instances instead of a single data point, and design input permutation
invariant features. Empirically, we demonstrate that the proposed approach
universally outperforms original methods when the model is trained with data
augmentation. Even further, we show that the proposed approach can achieve
higher MI attack success rates on models trained with some data augmentation
than the existing methods on models trained without data augmentation. Notably,
we achieve a 70.1% MI attack success rate on CIFAR10 against a wide residual
network while the previous best approach only attains 61.9%. This suggests the
privacy risk of models trained with data augmentation could be largely
underestimated.
Comment: AAAI Conference on Artificial Intelligence (AAAI-21). Source code
available at: https://github.com/dayu11/MI_with_D
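The key design point, "input permutation invariant features" over a set of augmented instances, can be sketched as follows. Assume each example yields one loss value per augmentation; sorting (or symmetric statistics) makes the attack classifier's input independent of augmentation order. The specific feature choice below is a hypothetical illustration, not the paper's exact feature set.

```python
import statistics

def set_features(losses):
    # Permutation-invariant features over the per-augmentation losses of
    # one example: sorting plus symmetric statistics (mean/min/max) give
    # the same vector regardless of the order in which augmentations are
    # evaluated, matching the idea of classifying the SET of augmented
    # instances rather than a single data point.
    ordered = sorted(losses)
    return ordered + [statistics.mean(losses), min(losses), max(losses)]

def membership_score(losses):
    # Toy membership score: training members tend to have uniformly low
    # loss across augmentations. The real attack trains a classifier on
    # set_features instead of thresholding a single statistic.
    return -statistics.mean(losses)
```

A set classifier fed these features can use all augmented copies of an example at once, which is what lets the attack recover the information that augmentation was assumed to hide.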
Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs
We introduce Lumos, a novel framework for training language agents that
employs a unified data format and a modular architecture based on open-source
large language models (LLMs). Lumos consists of three distinct modules:
planning, grounding, and execution. The planning module breaks down a task into
a series of high-level, tool-agnostic subgoals, which are then made specific by
the grounding module through a set of low-level actions. These actions are
subsequently executed by the execution module, utilizing a range of
off-the-shelf tools and APIs. In order to train these modules effectively,
high-quality annotations of subgoals and actions were collected and are made
available for fine-tuning open-source LLMs for various tasks such as complex
question answering, web tasks, and math problems. Leveraging this unified data
and modular design, Lumos not only achieves comparable or superior performance
to current, state-of-the-art agents, but also exhibits several key advantages:
(1) Lumos surpasses GPT-4/3.5-based agents in complex question answering and
web tasks, while equalling the performance of significantly larger LLM agents
on math tasks; (2) Lumos outperforms open-source agents created through
conventional training methods and those using chain-of-thoughts training; and
(3) Lumos is capable of effectively generalizing to unseen interactive tasks,
outperforming larger LLM-based agents and even exceeding performance of
specialized agents.
Comment: Project website: https://allenai.github.io/lumos
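The planning/grounding/execution decomposition can be sketched as a three-stage pipeline. In Lumos each module is a fine-tuned LLM; the rule-based stand-ins below are hypothetical and exist only to make the control flow concrete.

```python
def plan(task):
    # Planning module: break the task into high-level, tool-agnostic
    # subgoals. (Stand-in for a fine-tuned LLM planner.)
    return [f"understand: {task}", f"solve: {task}"]

def ground(subgoal):
    # Grounding module: turn each subgoal into a concrete low-level
    # action, here a (tool_name, argument) pair. (Stand-in for a
    # fine-tuned LLM grounder; tool names are illustrative.)
    tool = "calculator" if subgoal.startswith("solve") else "reader"
    return (tool, subgoal)

def execute(action, tools):
    # Execution module: run the action with an off-the-shelf tool or API.
    tool_name, arg = action
    return tools[tool_name](arg)

def run_agent(task, tools):
    # Full pipeline: plan -> ground each subgoal -> execute each action.
    return [execute(ground(sg), tools) for sg in plan(task)]
```

Keeping the planner tool-agnostic is what makes the design modular: swapping the tool set only requires changing the grounding and execution stages.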