Low Resource Multi-Task Sequence Tagging -- Revisiting Dynamic Conditional Random Fields
We compare different models for low-resource multi-task sequence tagging that
leverage dependencies between label sequences for different tasks. Our analysis
is aimed at datasets where each example has labels for multiple tasks. Current
approaches use either a separate model for each task or standard multi-task
learning to learn shared feature representations. However, these approaches
ignore correlations between label sequences, which can provide important
information in settings with small training datasets. To analyze which
scenarios can profit from modeling dependencies between labels in different
tasks, we revisit dynamic conditional random fields (CRFs) and combine them
with deep neural networks. We compare single-task, multi-task and dynamic CRF
setups for three diverse datasets at both sentence and document levels in
English and German low-resource scenarios. We show that including silver labels
from pretrained part-of-speech taggers as auxiliary tasks can improve
performance on downstream tasks. We find that especially in low-resource
scenarios, the explicit modeling of inter-dependencies between task predictions
outperforms single-task as well as standard multi-task models.
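To make the idea of modeling dependencies between label sequences concrete, here is a minimal sketch of joint decoding for two tagging tasks. It assumes a factorial (dynamic) CRF with per-task emission scores, per-task transition scores, and a cross-task potential that couples the two labels at each time step. The function name `joint_viterbi`, the score layout, and the toy numbers are all hypothetical and chosen for illustration; this is not the model or code from the paper, and in a neural setup the emission scores would come from a learned encoder.

```python
from itertools import product

def joint_viterbi(e1, e2, t1, t2, cross):
    """Exact decoding over the product label space of two tasks.

    e1[t][a], e2[t][b]: emission scores for task 1 label a / task 2 label b at step t
    t1[a][a2], t2[b][b2]: within-task transition scores
    cross[a][b]: cross-task score for labels a and b at the same time step
    (All score tables are illustrative assumptions.)
    """
    n1, n2 = len(e1[0]), len(e2[0])
    states = list(product(range(n1), range(n2)))  # joint states (a, b)

    def local(t, s):
        a, b = s
        return e1[t][a] + e2[t][b] + cross[a][b]

    def trans(p, s):
        return t1[p[0]][s[0]] + t2[p[1]][s[1]]

    best = {s: local(0, s) for s in states}
    back = []
    for t in range(1, len(e1)):
        new_best, pointers = {}, {}
        for s in states:
            # Best previous joint state leading into s.
            prev = max(states, key=lambda p: best[p] + trans(p, s))
            new_best[s] = best[prev] + trans(prev, s) + local(t, s)
            pointers[s] = prev
        best = new_best
        back.append(pointers)

    # Follow back-pointers from the best final joint state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return [a for a, _ in path], [b for _, b in path], best[last]

# Toy example: task 2 has uninformative emissions, so its labels are
# decided entirely by the cross-task potential, which rewards agreement.
e1 = [[1.0, 0.0], [0.0, 0.2]]
e2 = [[0.0, 0.0], [0.0, 0.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
cross = [[1.0, 0.0], [0.0, 1.0]]
y1, y2, score = joint_viterbi(e1, e2, zeros, zeros, cross)
```

In this toy setting the second task receives no evidence of its own, which loosely mirrors the low-resource case described above: the cross-task potential lets confident predictions on one task inform the other. The joint state space grows as the product of the label set sizes, so exact decoding like this is only practical for small label sets.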
Modular and Parameter-efficient Fine-tuning of Language Models
Transfer learning has recently become the dominant paradigm of natural language processing. Models pre-trained on unlabeled data can be fine-tuned for downstream tasks based on only a handful of examples. A long-term goal is to develop models that acquire new information at scale without incurring negative transfer and that generalize systematically to new settings. Modular deep learning has emerged as a promising solution to these challenges, by updating parameter-efficient units of computation locally and asynchronously. These units are often implemented as modules that are interlaid between layers, interpolated with pre-trained parameters, or concatenated to the inputs. Conditioned on tasks or examples, information is routed to multiple modules through a fixed or learned function, followed by an aggregation of their outputs. This property enables compositional generalization, by disentangling knowledge and recombining it in new ways.
In this thesis, we provide a unified view of modularity in natural language processing, spanning four dimensions; specifically, we disentangle modularity into computation functions, routing functions, aggregation functions, and the training setting. Along those axes, we propose multiple contributions: a research framework which encompasses all dimensions; a novel attention-based aggregation function which combines the knowledge stored within different modules; routing mechanisms for out-of-distribution generalization in cross-lingual transfer scenarios; a dataset and modular training strategies for multimodal and multilingual transfer learning; and a modular pre-training strategy to tackle catastrophic interference of heterogeneous data.
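The three functional dimensions named above (computation, routing, aggregation) can be illustrated with a small sketch. It assumes fixed routing through a lookup table from task name to active modules, and a simple dot-product attention over module outputs as the aggregation step. Every name here (`modules`, `routing_table`, `route`, `aggregate`, `forward`) is hypothetical, the modules are stand-in functions rather than trained adapters, and the attention shown is a generic form, not the aggregation function proposed in the thesis.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]

def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

# Computation functions: hypothetical stand-ins for parameter-efficient modules.
modules: Dict[str, Callable[[Vector], Vector]] = {
    "scale": lambda h: [2.0 * x for x in h],
    "shift": lambda h: [x + 1.0 for x in h],
    "negate": lambda h: [-x for x in h],
}

# Routing function: a fixed (not learned) mapping from task to active modules.
routing_table: Dict[str, List[str]] = {
    "task_a": ["scale", "shift"],
    "task_b": ["negate"],
}

def route(task: str) -> List[str]:
    return routing_table[task]

def aggregate(query: Vector, outputs: List[Vector]) -> Vector:
    """Weight each module output by softmax(query . output), then sum."""
    weights = softmax([dot(query, o) for o in outputs])
    dim = len(outputs[0])
    return [sum(w * o[i] for w, o in zip(weights, outputs)) for i in range(dim)]

def forward(hidden: Vector, task: str) -> Vector:
    outputs = [modules[name](hidden) for name in route(task)]
    return aggregate(hidden, outputs)

result_a = forward([1.0, 0.0], "task_a")
result_b = forward([1.0, 0.0], "task_b")
```

A learned routing function would replace the lookup table with scores computed from the input, and a learned aggregation would add trainable query, key, and value projections; both are omitted here to keep the sketch short.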