AugCSE: contrastive sentence embedding with diverse augmentations
Data augmentation techniques have proven useful in many NLP applications. Most augmentations are task-specific and cannot be used as a general-purpose tool. In our work, we present AugCSE, a unified framework that utilizes diverse sets of data augmentations to achieve a better, general-purpose sentence embedding model. Building upon the latest sentence embedding models, our approach uses a simple antagonistic discriminator that differentiates the augmentation types. With a fine-tuning objective borrowed from domain adaptation, we show that diverse augmentations, which often lead to conflicting contrastive signals, can be tamed to produce a better and more robust sentence representation. Our methods achieve state-of-the-art results on downstream transfer tasks and perform competitively on semantic textual similarity tasks, using only unsupervised data.
University of California, Berkeley. https://aclanthology.org/2022.aacl-main.30/ First author draft
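The recipe the abstract describes (a contrastive objective plus an adversarial discriminator over augmentation types, in the style of domain adaptation) can be sketched as a forward-only loss computation. This is an illustrative toy, not the paper's implementation: the function names, the InfoNCE form, and the subtracted discriminator term (standing in for gradient reversal) are all assumptions.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """Contrastive (InfoNCE) loss: each anchor should be closest to its own
    augmented positive among all positives in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # diagonal = matching pairs

def augcse_style_loss(anchors, positives, aug_logits, aug_labels, lam=0.1):
    """Contrastive loss minus a scaled discriminator cross-entropy.  The minus
    sign mimics adversarial (gradient-reversal) training: the encoder is pushed
    to make augmentation types indistinguishable to the discriminator."""
    z = aug_logits - aug_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    disc_loss = -np.mean(log_probs[np.arange(len(aug_labels)), aug_labels])
    return info_nce(anchors, positives) - lam * disc_loss

rng = np.random.default_rng(0)
B, d, n_aug = 8, 16, 3
anchors = rng.normal(size=(B, d))
positives = anchors + 0.1 * rng.normal(size=(B, d))  # augmented views
aug_logits = rng.normal(size=(B, n_aug))             # discriminator outputs
aug_labels = rng.integers(0, n_aug, size=B)          # which augmentation was used
loss = augcse_style_loss(anchors, positives, aug_logits, aug_labels)
```

In a real training loop the discriminator would be optimized jointly and the sign flip applied only to the encoder's gradients.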
Multi-grained Evidence Inference for Multi-choice Reading Comprehension
Multi-choice Machine Reading Comprehension (MRC) is a major and challenging task in which machines answer questions according to provided options. Answers in multi-choice MRC cannot be directly extracted from the given passages; they essentially require machines to reason over accurately extracted evidence. However, the critical evidence may be as short as a single word or phrase, hidden in a redundant, noisy passage that spans multiple linguistic granularities, from phrase and fragment to sentence and the entire passage. We thus propose a novel general-purpose model enhancement that integrates multi-grained evidence comprehensively, named Multi-grained evidence inferencer (Mugen), to address this limitation. Mugen extracts three different granularities of evidence: coarse-, middle-, and fine-grained evidence, and integrates the evidence with the original passages, achieving significant and consistent performance improvements on four multi-choice MRC benchmarks.
Comment: Accepted by TASLP 2023, vol. 31, pp. 3896-390
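The three evidence granularities can be illustrated with a deliberately simple lexical-overlap extractor: a phrase (fine), a sentence (middle), and a multi-sentence fragment (coarse). This is a toy stand-in for Mugen's learned extraction; the scoring function, phrase splitting, and window size are all assumptions for illustration.

```python
import re

STOP = {"the", "a", "an", "is", "of", "to", "and", "on", "what"}

def overlap_score(text, query):
    """Fraction of (non-stopword) query tokens that also appear in `text`."""
    t = set(re.findall(r"[a-z]+", text.lower())) - STOP
    q = set(re.findall(r"[a-z]+", query.lower())) - STOP
    return len(t & q) / (len(q) or 1)

def extract_multigrained_evidence(sentences, question, option, window=3):
    """Return (fine, middle, coarse) evidence: the best-matching phrase,
    the best-matching sentence, and a multi-sentence fragment around it."""
    query = question + " " + option
    best = max(range(len(sentences)),
               key=lambda i: overlap_score(sentences[i], query))
    middle = sentences[best]                                  # sentence level
    phrases = [p for p in re.split(r",\s*", middle) if p]
    fine = max(phrases, key=lambda p: overlap_score(p, query))  # phrase level
    lo = max(0, best - window // 2)
    coarse = " ".join(sentences[lo:lo + window])              # fragment level
    return fine, middle, coarse

passage = [
    "The pan sat on the stove.",
    "Maria turned on the burner, heating the pan quickly.",
    "She then plated the food.",
]
fine, middle, coarse = extract_multigrained_evidence(
    passage, "What heated the pan?", "the burner")
```

The point of the sketch is only the shape of the output: three pieces of evidence at increasing granularity that a reader model could consume alongside the original passage.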
FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning
LLMs have demonstrated great capabilities in various NLP tasks. Different entities can further improve the performance of these LLMs on their specific downstream tasks by fine-tuning them. When several entities have similar tasks of interest but their data cannot be shared because of privacy concerns and regulations, federated learning (FL) is a mainstream solution for leveraging the data of different entities. However, fine-tuning LLMs in federated settings still lacks adequate support from existing FL frameworks, because it must deal with optimizing the consumption of significant communication and computational resources, data preparation for different tasks, and distinct information protection demands. This paper first discusses these challenges of federated LLM fine-tuning, and introduces our package FS-LLM as a main contribution, which consists of the following components: (1) we build an end-to-end benchmarking pipeline, automating the processes of dataset preprocessing, federated fine-tuning execution, and performance evaluation for federated LLM fine-tuning; (2) we provide comprehensive implementations of federated parameter-efficient fine-tuning algorithms and versatile programming interfaces for future extension in FL scenarios with low communication and computation costs, even without access to the full model; (3) we adopt several acceleration and resource-efficient operators for fine-tuning LLMs with limited resources, along with flexible pluggable sub-routines for interdisciplinary study. We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings, which also yields valuable insights into federated LLM fine-tuning for the research community. To facilitate further research and adoption, we release FS-LLM at https://github.com/alibaba/FederatedScope/tree/llm.
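Why federated parameter-efficient fine-tuning keeps communication cheap can be shown in a few lines: clients update only small adapters over a frozen base model, and the server averages only those adapters (FedAvg). This is a generic sketch of the idea, not the FS-LLM API; the parameter names and shapes are invented for illustration.

```python
import numpy as np

def local_adapter_update(adapter, grads, lr=0.1):
    """One client step: only the small adapter is updated;
    the base LLM stays frozen and is never transmitted."""
    return {name: p - lr * grads[name] for name, p in adapter.items()}

def fedavg_adapters(adapters, weights):
    """Server step: weighted average of client adapter parameters.
    Exchanging adapters instead of full model weights is what keeps
    federated LLM fine-tuning's communication cost low."""
    total = sum(weights)
    return {name: sum(w * a[name] for w, a in zip(weights, adapters)) / total
            for name in adapters[0]}

# Two clients with the same adapter shape but different local states.
client_a = {"lora_down": np.ones((2, 2))}
client_b = {"lora_down": np.zeros((2, 2))}
global_adapter = fedavg_adapters([client_a, client_b], weights=[1.0, 1.0])
```

A real framework additionally handles secure aggregation, per-task data preparation, and resource scheduling, which is exactly the gap FS-LLM addresses.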
GameQA: Gamified Mobile App Platform for Building Multiple-Domain Question-Answering Datasets
Peer reviewed
CoLT5: Faster Long-Range Transformers with Conditional Computation
Many natural language processing tasks benefit from long inputs, but
processing long documents with Transformers is expensive -- not only due to
quadratic attention complexity but also due to applying feedforward and
projection layers to every token. However, not all tokens are equally
important, especially for longer documents. We propose CoLT5, a long-input
Transformer model that builds on this intuition by employing conditional
computation, devoting more resources to important tokens in both feedforward
and attention layers. We show that CoLT5 achieves stronger performance than
LongT5 with much faster training and inference, achieving SOTA on the
long-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably
make use of extremely long inputs, showing strong gains up to 64k input length.
Comment: Added CoDA reference and minor edits to clarify routing
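The conditional-computation intuition is easy to state in code: every token goes through a cheap branch, and only the tokens a router scores as important also go through an expensive branch. A minimal sketch, with invented weights and a linear router standing in for CoLT5's learned routing:

```python
import numpy as np

def colt5_style_ffn(x, w_light, w_heavy, router_w, k):
    """Conditional feedforward: all tokens take the cheap branch; only the
    k tokens the router scores highest also take the expensive branch."""
    scores = x @ router_w                     # (T,) routing scores
    topk = np.sort(np.argsort(scores)[-k:])   # indices of "important" tokens
    out = x @ w_light                         # light projection for every token
    out[topk] += x[topk] @ w_heavy            # extra capacity only where routed
    return out, topk

T, d = 6, 4
x = np.zeros((T, d))
x[:, 0] = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4]    # a token "importance" feature
router_w = np.array([1.0, 0.0, 0.0, 0.0])   # router reads that feature
w_light = np.eye(d)
w_heavy = 10.0 * np.eye(d)
out, topk = colt5_style_ffn(x, w_light, w_heavy, router_w, k=2)
```

Because the heavy matmul touches only k of T tokens, cost grows with k rather than sequence length, which is why the approach pays off most on long inputs.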
Causal Reasoning of Entities and Events in Procedural Texts
Entities and events are crucial to natural language reasoning and common in
procedural texts. Existing work has focused either exclusively on entity state
tracking (e.g., whether a pan is hot) or on event reasoning (e.g., whether one
would burn themselves by touching the pan), while these two tasks are often
causally related. We propose CREPE, the first benchmark on causal reasoning of
event plausibility and entity states. We show that most language models,
including GPT-3, perform close to chance at .35 F1, lagging far behind humans at
.87 F1. We boost model performance to .59 F1 by creatively representing events
in a programming language while prompting language models pretrained on code. By
injecting the causal relations between entities and events as intermediate
reasoning steps in our representation, we further boost the performance to .67
F1. Our findings indicate not only the challenge that CREPE brings to language
models, but also the efficacy of code-like prompting combined with
chain-of-thought prompting for multihop event reasoning.
Comment: In Findings of EACL 202
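The code-like prompting idea can be made concrete with a small prompt builder: procedure steps become assignments, entity states appear as intermediate assignments (the chain-of-thought injection), and the event question is left as an expression for the model to complete. The exact format below is a guess at the spirit of the approach, not CREPE's actual template.

```python
def code_style_prompt(steps, entity_states, event_question):
    """Render a procedure as code-like text for a code-pretrained LM.
    Entity states sit between the steps and the question, serving as
    intermediate reasoning in the spirit of chain-of-thought."""
    lines = ["# Procedure"]
    for i, step in enumerate(steps, 1):
        lines.append(f"step_{i} = {step!r}")
    lines.append("# Entity states (intermediate reasoning)")
    for entity, state in entity_states:
        lines.append(f"{entity} = {state!r}")
    lines.append(f"# Question: {event_question}")
    lines.append("answer =")  # left for the LM to complete
    return "\n".join(lines)

prompt = code_style_prompt(
    steps=["Place the pan on the stove.", "Turn the burner to high."],
    entity_states=[("pan_temperature", "hot")],
    event_question="Would touching the pan burn you?",
)
```

The reported gains come from exactly this kind of reframing: the entity-state assignments make the causal link between steps and the queried event explicit before the model answers.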
Leveraging Feedback in Conversational Question Answering Systems
172 p.
The goal of this thesis is to exploit the interaction that deployed systems have with humans, using human feedback as a learning and adaptation signal for those systems. We focus on the domain shift that conversational systems undergo once deployed. To this end, we study the case of explicit binary feedback, since this is the easiest feedback signal for humans to provide. To improve systems after deployment, we first built DoQA, a dataset of question-answering conversations. The dataset contains 2,437 dialogues collected via crowdsourcing. Compared with previous work, DoQA reflects real information needs, and its conversations are more natural and coherent. After creating the dataset, we designed an algorithm called feedback-weighted learning (FWL), which can improve a pretrained supervised system using only binary feedback. Finally, we analyze the limitations of this algorithm when the collected feedback is noisy, and we adapt FWL to cope with the noisy scenario. The negative results we obtain in this case illustrate the challenge of modeling noisy feedback collected from users, which remains an open research question.
AutoMix: Automatically Mixing Language Models
Large language models (LLMs) are now available in various sizes and
configurations from cloud API providers. While this diversity offers a broad
spectrum of choices, effectively leveraging the options to optimize
computational cost and performance remains challenging. In this work, we
present AutoMix, an approach that strategically routes queries to larger LMs,
based on the approximate correctness of outputs from a smaller LM. Central to
AutoMix is a few-shot self-verification mechanism, which estimates the
reliability of its own outputs without requiring training. Given that
verifications can be noisy, we employ a meta verifier in AutoMix to refine the
accuracy of these assessments. Our experiments using LLAMA2-13/70B on five
context-grounded reasoning datasets demonstrate that AutoMix surpasses
established baselines, improving the incremental benefit per cost by up to 89%.
Our code and data are available at https://github.com/automix-llm/automix.
Comment: The first two authors contributed equally. Work started and partly done during Aman's internship at Google. This version adds results on mixing 3 models, and will be presented at the workshop on robustness of zero/few-shot learning in foundation models, NeurIPS 202
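The routing logic the abstract describes (draft with the small model, self-verify, refine the noisy verification with a meta-verifier, escalate only when confidence is low) fits in a few lines. The stubs and the clamp-style meta-verifier below are placeholders, not AutoMix's actual models or calibration:

```python
def automix_route(query, small_lm, large_lm, verify, meta_verify, threshold=0.7):
    """Answer with the small model first; escalate to the large model only
    when the refined self-verification confidence falls below the threshold."""
    draft = small_lm(query)
    raw = verify(query, draft)       # few-shot self-verification (noisy)
    confidence = meta_verify(raw)    # meta-verifier refines the noisy signal
    if confidence >= threshold:
        return draft, "small"
    return large_lm(query), "large"

# Stub models and verifiers for illustration only.
small_lm = lambda q: "short answer"
large_lm = lambda q: "thorough answer"
meta_verify = lambda raw: min(1.0, max(0.0, raw))  # trivial refinement: clamp

kept, route1 = automix_route("easy question", small_lm, large_lm,
                             verify=lambda q, a: 0.9, meta_verify=meta_verify)
escalated, route2 = automix_route("hard question", small_lm, large_lm,
                                  verify=lambda q, a: 0.3, meta_verify=meta_verify)
```

The cost savings come from the asymmetry: most queries stop at the cheap draft, and the expensive model runs only when verification flags a likely error.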
Machine Reading Comprehension using Case-based Reasoning
We present an accurate and interpretable method for answer extraction in
machine reading comprehension that is reminiscent of case-based reasoning (CBR)
from classical AI. Our method (CBR-MRC) builds on the hypothesis that
contextualized answers to similar questions share semantic similarities with
each other. Given a target question, CBR-MRC retrieves a set of similar
questions from a memory of observed cases and predicts an answer by selecting
the span in the target context that is most similar to the contextualized
representations of answers in the retrieved cases. The semi-parametric nature
of our approach allows CBR-MRC to attribute a prediction to the specific set of
cases used during inference, making it a desirable choice for building reliable
and debuggable QA systems. We show that CBR-MRC achieves high test accuracy
comparable with large reader models, outperforming baselines by 11.5 and 8.4 EM
on NaturalQuestions and NewsQA, respectively. Further, we also demonstrate the
ability of CBR-MRC in identifying not just the correct answer tokens but also
the span with the most relevant supporting evidence. Lastly, we observe that
contexts for certain question types show higher lexical diversity than others
and find CBR-MRC to be robust to these variations while performance using
fully-parametric methods drops.
Comment: 9 pages, 2 figures
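The case-based retrieval-then-select loop can be sketched end to end with a toy encoder: retrieve the k most similar stored questions, then pick the candidate span whose representation is closest to the retrieved cases' answers. The letter-frequency embedding below is a crude stand-in for CBR-MRC's contextualized representations, and all names are illustrative.

```python
import numpy as np

def embed(text):
    """Toy letter-frequency embedding standing in for a contextual encoder."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    n = np.linalg.norm(v)
    return v / n if n else v

def cbr_mrc_answer(question, candidate_spans, case_memory, k=2):
    """Case-based extraction: retrieve the k most similar stored questions,
    then return the candidate span most similar to their stored answers.
    Every prediction is attributable to the specific retrieved cases."""
    q = embed(question)
    cases = sorted(case_memory, key=lambda c: -(q @ embed(c["question"])))[:k]
    answer_vecs = [embed(c["answer"]) for c in cases]
    best = max(candidate_spans,
               key=lambda s: sum(embed(s) @ a for a in answer_vecs))
    return best, cases

memory = [
    {"question": "Who wrote Hamlet?", "answer": "Shakespeare"},
    {"question": "Who wrote Faust?", "answer": "Goethe"},
    {"question": "Where is the Louvre?", "answer": "Paris"},
]
span, used_cases = cbr_mrc_answer(
    "Who wrote Macbeth?",
    candidate_spans=["Shakespeare", "in 1606", "London"],
    case_memory=memory,
)
```

Returning the retrieved cases alongside the span is what makes the method debuggable: a wrong answer can be traced to the specific cases that drove it.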