Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data
Multi-Task Learning (MTL) networks have emerged as a promising method for
transferring learned knowledge across different tasks. However, MTL must deal
with challenges such as overfitting to low-resource tasks, catastrophic
forgetting, and negative task transfer (learning interference). Often, in
Natural Language Processing (NLP), a separate model per task is needed to
obtain the best performance. However, many fine-tuning approaches are both
parameter inefficient, i.e., potentially involving one new model per task, and
highly susceptible to losing knowledge acquired during pretraining. We propose
a novel Transformer architecture consisting of a new conditional attention
mechanism as well as a set of task-conditioned modules that facilitate weight
sharing. Through this construction, we achieve more efficient parameter sharing
and mitigate forgetting by keeping half of the weights of a pretrained model
fixed. We also use a new multi-task data sampling strategy to mitigate the
negative effects of data imbalance across tasks. Using this approach, we are
able to surpass single-task fine-tuning methods while being parameter- and
data-efficient (using around 66% of the data for weight updates). Compared to
other BERT-Large methods on GLUE, our 8-task model surpasses other Adapter
methods by 2.8%, and our 24-task model outperforms models that use MTL and
single-task fine-tuning by 0.7-1.0%. We show that a larger variant of our single multi-task
model approach performs competitively across 26 NLP tasks and yields
state-of-the-art results on a number of test and development sets. Our code is
publicly available at https://github.com/CAMTL/CA-MTL.
Comment: ICLR 2021
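
To make the task-conditioned design concrete, the following is a minimal sketch of an attention layer whose query projection is shifted by a learned function of a task embedding. It is only an illustration of the general idea, not CA-MTL's actual conditional attention mechanism; all names, shapes, and the additive conditioning scheme are assumptions of this example.

import jax
import jax.numpy as jnp

def conditional_attention(params, x, task_emb):
    """x: (seq, d_model) token states; task_emb: (d_task,) task identifier."""
    # Condition the query projection on the task via a learned additive shift.
    task_shift = jnp.tanh(task_emb @ params["W_task"])  # (d_model,)
    q = x @ params["W_q"] + task_shift
    k = x @ params["W_k"]
    v = x @ params["W_v"]
    scores = q @ k.T / jnp.sqrt(x.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v

key = jax.random.PRNGKey(0)
d_model, d_task, seq = 16, 4, 5
ks = jax.random.split(key, 5)
params = {
    "W_q": 0.1 * jax.random.normal(ks[0], (d_model, d_model)),
    "W_k": 0.1 * jax.random.normal(ks[1], (d_model, d_model)),
    "W_v": 0.1 * jax.random.normal(ks[2], (d_model, d_model)),
    "W_task": 0.1 * jax.random.normal(ks[3], (d_task, d_model)),
}
x = jax.random.normal(ks[4], (seq, d_model))
task_emb = jnp.zeros(d_task).at[1].set(1.0)  # one-hot id for task 1
print(conditional_attention(params, x, task_emb).shape)  # (5, 16)

In a full model of this kind, the task-independent projections could stay frozen at their pretrained values while only the task-conditioning parameters are trained, which is the spirit of keeping half of the pretrained weights fixed.
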
On Conditional and Compositional Language Model Differentiable Prompting
Prompts have been shown to be an effective method to adapt a frozen
Pretrained Language Model (PLM) to perform well on downstream tasks. Prompts
can be represented by a human-engineered word sequence or by a learned
continuous embedding. In this work, we investigate conditional and
compositional differentiable prompting. We propose a new model, Prompt
Production System (PRopS), which learns to transform task instructions or
input metadata into continuous prompts that elicit task-specific outputs from the
PLM. Our model uses a modular network structure based on our neural formulation
of Production Systems, which allows the model to learn discrete rules -- neural
functions that learn to specialize in transforming particular prompt input
patterns, making it suitable for compositional transfer learning and few-shot
learning. We present extensive empirical and theoretical analysis and show that
PRopS consistently surpasses other PLM adaptation techniques, and often
improves upon fully fine-tuned models, on compositional generalization tasks,
controllable summarization and multilingual translation, while needing fewer
trainable parameters.
Comment: Accepted at the International Joint Conference on Artificial Intelligence (IJCAI) 2023
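
As a rough illustration of differentiable prompt production, the sketch below maps a pooled instruction encoding to a continuous prompt through a small bank of "rule" networks mixed by a learned gate, echoing the idea that rules specialize to particular input patterns. This is a hedged approximation, not PRopS's actual architecture; the gate, rule count, and all shapes are invented for the example.

import jax
import jax.numpy as jnp

def produce_prompt(params, instr_emb, prompt_len=3):
    """instr_emb: (d,) pooled encoding of the task instruction or metadata."""
    # Soft selection over a bank of discrete rules; sparse, near-one-hot
    # gates would let individual rules specialize.
    gate = jax.nn.softmax(instr_emb @ params["W_gate"])  # (n_rules,)
    # Each rule maps the instruction to a flat prompt of prompt_len vectors.
    rule_outs = jnp.einsum("d,rdk->rk", instr_emb, params["W_rules"])
    prompt = gate @ rule_outs  # (prompt_len * d,)
    return prompt.reshape(prompt_len, -1)  # prepended to the frozen PLM input

key = jax.random.PRNGKey(0)
d, n_rules, prompt_len = 8, 4, 3
k1, k2, k3 = jax.random.split(key, 3)
params = {
    "W_gate": 0.1 * jax.random.normal(k1, (d, n_rules)),
    "W_rules": 0.1 * jax.random.normal(k2, (n_rules, d, prompt_len * d)),
}
instr_emb = jax.random.normal(k3, (d,))
print(produce_prompt(params, instr_emb).shape)  # (3, 8)
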
On Extractive and Abstractive Neural Document Summarization with Transformer Language Models
We present a method to produce abstractive summaries of long documents that
exceed several thousand words via neural abstractive summarization. We perform
a simple extractive step before generating a summary; its output is then used
to condition the transformer language model on relevant information before the
model is tasked with generating a summary. We show that this extractive step
significantly improves summarization results. We also show that this approach
produces more abstractive summaries compared to prior work that employs a copy
mechanism while still achieving higher ROUGE scores. Note: The abstract above
was not written by the authors; it was generated by one of the models presented
in this paper.
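
The two-stage pipeline is easy to picture in code. The sketch below uses a trivial word-overlap heuristic as a stand-in for the paper's trained extractor and only shows how the extracted sentences become the conditioning context for the language model; the separator token and the scoring rule are assumptions of this example.

def extract(document_sentences, k=3):
    # Stand-in extractor: rank sentences by word overlap with the lead
    # sentence; the paper instead trains a neural extractive model.
    lead = set(document_sentences[0].lower().split())
    def score(sent):
        words = set(sent.lower().split())
        return len(words & lead) / max(len(words), 1)
    ranked = sorted(document_sentences[1:], key=score, reverse=True)
    return [document_sentences[0]] + ranked[: k - 1]

def build_lm_input(document_sentences, k=3):
    # The extracted sentences condition the transformer LM, which then
    # generates the abstractive summary after a separator token.
    return " ".join(extract(document_sentences, k)) + " <summarize>"

doc = [
    "Transformer language models can summarize long documents.",
    "An extractive step first selects the most relevant sentences.",
    "The model then generates an abstractive summary from that context.",
    "Unrelated boilerplate sentences are filtered out by the extractor.",
]
print(build_lm_input(doc, k=2))
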
Using Graph Algorithms to Pretrain Graph Completion Transformers
Recent work on Graph Neural Networks has demonstrated that self-supervised
pretraining can further enhance performance on downstream graph, link, and node
classification tasks. However, the efficacy of pretraining tasks has not been
fully investigated for downstream large knowledge graph completion tasks. Using
a contextualized knowledge graph embedding approach, we investigate five
different pretraining signals, constructed using several graph algorithms and
no external data, as well as their combination. We leverage the versatility of
our Transformer-based model to explore graph structure generation pretraining
tasks (i.e., path and k-hop neighborhood generation), which are typically inapplicable to
most graph embedding methods. We further propose a new path-finding algorithm
guided by information gain and find that it is the best-performing pretraining
task across three downstream knowledge graph completion datasets. While using
our new path-finding algorithm as a pretraining signal provides 2-3% MRR
improvements, we show that pretraining on all signals together gives the best
knowledge graph completion results. In a multitask setting that combines all
pretraining tasks, our method surpasses the latest strong-performing knowledge
graph embedding methods on all metrics for FB15K-237, on MRR and Hits@1 for
WN18RR, and on MRR and Hits@10 for JF17K (a knowledge hypergraph dataset).
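
As an illustration of one such self-supervised signal, the sketch below generates the k-hop neighborhood of an entity with a plain breadth-first search and serializes it as a sequence-generation target. The toy graph and serialization format are invented for this example; the paper's generation tasks and its information-gain-guided path-finding algorithm are more involved.

from collections import deque

triples = [
    ("paris", "capital_of", "france"),
    ("france", "in_continent", "europe"),
    ("berlin", "capital_of", "germany"),
    ("germany", "in_continent", "europe"),
]

adj = {}
for h, r, t in triples:
    adj.setdefault(h, []).append((r, t))

def k_hop_neighborhood(start, k):
    """BFS up to k hops; returns (relation, entity) pairs in visit order."""
    seen, out, frontier = {start}, [], deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for rel, nbr in adj.get(node, []):
            out.append((rel, nbr))
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return out

# Pretraining pair: the input names the entity; the target serializes its
# neighborhood for a seq2seq Transformer to generate.
target = " ".join(f"{r} {e}" for r, e in k_hop_neighborhood("paris", k=2))
print("paris ->", target)  # paris -> capital_of france in_continent europe
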
Evaluating Attention Networks for Anaphora Resolution
In this paper, we evaluate the results of using inter- and intra-attention mechanisms from two architectures, a Deep Attention Long Short-Term Memory-Network (LSTM-N) (Cheng et al., 2016) and a Decomposable Attention model (Parikh et al., 2016), for anaphora resolution, i.e., detecting coreference relations between a pronoun and a noun (its antecedent). The models are adapted from an entailment task to address the pronominal coreference resolution task by comparing two pairs of sentences: one with the original sentences containing the antecedent and the pronoun, and another with the pronoun replaced by a correct or an incorrect antecedent. The goal is thus to detect the correct replacements, assuming the original sentence pair entails the one with the correct replacement but not the one with an incorrect replacement. We use the CoNLL-2012 English dataset (Pradhan et al., 2012) to train the models and evaluate their ability to recognize correct and incorrect pronoun replacements in sentence pairs. We find that the Decomposable Attention model performs better while using a much simpler architecture. Furthermore, we focus on two previous studies that use intra- and inter-attention mechanisms, discuss how they relate to each other, and examine how these advances work to identify correct antecedent replacements.
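
A small sketch clarifies how such replacement pairs can be built. The example data, the replacement rule (swap only the first pronoun occurrence), and the labeling are simplified assumptions; the paper derives its pairs from CoNLL-2012 coreference annotations.

def make_pairs(sentences, pronoun, candidates, correct):
    # Pair the original passage with copies whose pronoun is replaced by a
    # candidate antecedent; only the correct candidate yields an entailed pair.
    original = " ".join(sentences)
    return [
        (original, original.replace(pronoun, cand, 1), cand == correct)
        for cand in candidates
    ]

sents = ["The trophy did not fit in the suitcase.", "It was too large."]
for premise, hypothesis, label in make_pairs(
    sents, "It", ["The trophy", "The suitcase"], "The trophy"
):
    print(label, "|", hypothesis)
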
JaxPruner: A concise library for sparsity research
This paper introduces JaxPruner, an open-source JAX-based pruning and sparse
training library for machine learning research. JaxPruner aims to accelerate
research on sparse neural networks by providing concise implementations of
popular pruning and sparse training algorithms with minimal memory and latency
overhead. Algorithms implemented in JaxPruner use a common API and work
seamlessly with the popular optimization library Optax, which, in turn, enables
easy integration with existing JAX-based libraries. We demonstrate this ease of
integration by providing examples in four different codebases: Scenic, t5x,
Dopamine, and FedJAX, and we provide baseline experiments on popular benchmarks.
Comment: JaxPruner is hosted at http://github.com/google-research/jaxpruner
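
To illustrate the kind of algorithm the library packages, here is a minimal sketch of layer-wise magnitude pruning in raw JAX. This is deliberately not JaxPruner's own API (which exposes a common interface and wraps Optax optimizers); it only shows the underlying operation such a library implements.

import jax
import jax.numpy as jnp

def magnitude_prune(params, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of each weight array."""
    def prune_leaf(w):
        k = int(sparsity * w.size)
        if k == 0:
            return w
        threshold = jnp.sort(jnp.abs(w).ravel())[k - 1]
        return jnp.where(jnp.abs(w) > threshold, w, 0.0)
    return jax.tree_util.tree_map(prune_leaf, params)

params = {"dense": jax.random.normal(jax.random.PRNGKey(0), (4, 4))}
pruned = magnitude_prune(params, sparsity=0.5)
print(float((pruned["dense"] == 0).mean()))  # about half the weights zeroed

In a sparse-training loop, a step like this would typically be applied on a schedule between optimizer updates, which is why seamless integration with an optimization library such as Optax matters.
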