Partial sequence labeling with structured Gaussian Processes
Existing partial sequence labeling models mainly focus on the max-margin framework, which fails to provide an uncertainty estimate for its predictions. Further, the unique-ground-truth disambiguation strategy employed by these models may introduce wrong label information into parameter learning. In this paper, we propose structured Gaussian Processes for partial sequence labeling (SGPPSL), which encodes uncertainty in the prediction and requires no extra effort for model selection and hyperparameter learning. The model employs a factor-as-piece approximation that divides the linear-chain graph structure into a set of pieces, which preserves the basic Markov Random Field structure and effectively avoids handling the large number of candidate output sequences generated by partially annotated data. A confidence measure is then introduced to account for the different contributions of candidate labels, which enables ground-truth label information to be utilized in parameter learning. Based on a derived lower bound of the variational lower bound of the proposed model, variational parameters and confidence measures are estimated in an alternating-optimization framework. Moreover, a weighted Viterbi algorithm is proposed to incorporate the confidence measure into sequence prediction; it accounts for the label ambiguity arising from multiple annotations in the training data and thus helps improve performance. SGPPSL is evaluated on several sequence labeling tasks, and the experimental results show the effectiveness of the proposed model.
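As an illustration of the decoding step described above, the sketch below shows a minimal confidence-weighted Viterbi pass over a linear chain, assuming per-position label scores and per-position confidence weights are already available. The array names and the way confidences are folded into the scores are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def weighted_viterbi(emissions, transitions, confidence):
    """Confidence-weighted Viterbi decoding over a linear chain.

    emissions:   (T, K) per-position label scores (log-space)
    transitions: (K, K) transition scores (log-space); transitions[i, j]
                 scores moving from label i to label j
    confidence:  (T, K) weights in [0, 1] reflecting how much each
                 candidate label at each position is trusted
    Returns the highest-scoring label sequence as a list of indices.
    """
    T, K = emissions.shape
    # Weight each emission score by the confidence of that candidate label
    # (adding log-confidence in log space; clipped to avoid log(0)).
    scores = emissions + np.log(np.clip(confidence, 1e-8, 1.0))

    delta = np.full((T, K), -np.inf)     # best score ending in label k at step t
    backptr = np.zeros((T, K), dtype=int)
    delta[0] = scores[0]

    for t in range(1, T):
        # candidate[i, j]: best path ending in label i, then moving to label j
        candidate = delta[t - 1][:, None] + transitions + scores[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        delta[t] = candidate.max(axis=0)

    # Backtrack from the best final label.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```

Labels with low confidence are thereby penalised during decoding, so ambiguous annotations contribute less to the predicted sequence.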
Efficient Long-Text Understanding with Short-Text Models
Transformer-based pretrained language models (LMs) are ubiquitous across
natural language understanding, but cannot be applied to long sequences such as
stories, scientific articles and long documents, due to their quadratic
complexity. While a myriad of efficient transformer variants have been
proposed, they are typically based on custom implementations that require
expensive pretraining from scratch. In this work, we propose SLED:
SLiding-Encoder and Decoder, a simple approach for processing long sequences
that re-uses and leverages battle-tested short-text pretrained LMs.
Specifically, we partition the input into overlapping chunks, encode each with
a short-text LM encoder and use the pretrained decoder to fuse information
across chunks (fusion-in-decoder). We illustrate through controlled experiments
that SLED offers a viable strategy for long text understanding and evaluate our
approach on SCROLLS, a benchmark with seven datasets across a wide range of
language understanding tasks. We find that SLED is competitive with specialized
models that are up to 50x larger and require a dedicated and expensive
pretraining step.
Comment: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2023.
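To make the chunk-encode-fuse idea concrete, here is a minimal sketch of a SLED-style encoding pass in PyTorch, assuming a generic short-text encoder callable. The chunk length, overlap, and the trimming strategy are assumptions for illustration; the actual SLED implementation handles overlap merging and input prefixes more carefully.

```python
import torch

def sled_encode(token_ids, encoder, chunk_len=256, overlap=64):
    """Encode a long token sequence with a short-text encoder by sliding
    overlapping windows over it, then concatenate the chunk encodings so a
    standard decoder can attend over all of them (fusion-in-decoder).

    token_ids: (1, L) tensor of input token ids
    encoder:   any callable mapping (1, w) ids -> (1, w, d) hidden states
    Returns a (1, L, d) tensor of fused encoder states.
    """
    stride = chunk_len - overlap
    L = token_ids.size(1)
    chunks = []
    for start in range(0, max(L - overlap, 1), stride):
        window = token_ids[:, start:start + chunk_len]
        states = encoder(window)                       # (1, w, d)
        # Keep only the new positions of each subsequent chunk so every
        # input position appears exactly once in the fused memory, while
        # still being encoded with some left context.
        keep_from = 0 if start == 0 else overlap
        chunks.append(states[:, keep_from:, :])
    return torch.cat(chunks, dim=1)                    # decoder cross-attends here
```

The fused states would then be passed to the pretrained decoder as its cross-attention memory (e.g. as the encoder outputs of a BART-style model), which is what lets a short-text decoder integrate information across all chunks.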
LSG Attention: Extrapolation of pretrained Transformers to long sequences
Transformer models achieve state-of-the-art performance on a wide range of
NLP tasks. They however suffer from a prohibitive limitation due to the self-attention mechanism, which induces quadratic complexity with regard to sequence length. To address this limitation we introduce the LSG architecture, which
relies on Local, Sparse and Global attention. We show that LSG attention is
fast, efficient and competitive in classification and summarization tasks on
long documents. Interestingly, it can also be used to adapt existing pretrained
models to efficiently extrapolate to longer sequences with no additional
training. Along with the introduction of the LSG attention mechanism, we
propose tools to train new models and adapt existing ones based on this
mechanism.
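As a rough illustration of the Local and Global components, the sketch below builds a dense boolean attention mask combining a sliding local window with a few global tokens. The parameter names and values are hypothetical, and the real LSG implementation also includes a sparse component and computes attention block-wise rather than materialising a full mask.

```python
import torch

def lsg_style_mask(seq_len, window=128, n_global=8):
    """Boolean (seq_len, seq_len) mask: True where attention is allowed.

    Combines local windowed attention with a handful of global tokens.
    Materialising the full mask is only for clarity; an efficient
    implementation never builds the quadratic matrix.
    """
    idx = torch.arange(seq_len)
    # Local: each token attends to neighbours within +/- window // 2.
    local = (idx[None, :] - idx[:, None]).abs() <= window // 2
    # Global: the first n_global tokens attend to, and are attended by, all tokens.
    glob = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    glob[:n_global, :] = True
    glob[:, :n_global] = True
    return local | glob
```

Such a mask would typically be applied by setting the attention logits of disallowed positions to a large negative value before the softmax.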