Efficient Document Re-Ranking for Transformers by Precomputing Term Representations
Deep pretrained transformer networks are effective at various ranking tasks,
such as question answering and ad-hoc document ranking. However, their
computational expense makes them cost-prohibitive in practice. Our proposed
approach, called PreTTR (Precomputing Transformer Term Representations),
considerably reduces the query-time latency of deep transformer networks (up to
a 42x speedup on web document ranking) making these networks more practical to
use in a real-time ranking scenario. Specifically, we precompute part of the
document term representations at indexing time (without a query), and merge
them with the query representation at query time to compute the final ranking
score. Due to the large size of the token representations, we also propose an
effective approach to reduce the storage requirement by training a compression
layer to match attention scores. Our compression technique reduces the storage
required by up to 95% and can be applied without a substantial degradation in
ranking performance.
Comment: Accepted at SIGIR 2020 (long
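The split between indexing-time and query-time work described in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `embed` is a hypothetical stand-in for the lower transformer layers, `ToyIndex` and its dot-product merge are not the paper's architecture, and the compression layer is omitted. The point is only that document terms are encoded once, without a query, and reused at query time.

```python
from typing import Dict, List

DIM = 4  # toy vector size (assumption, not from the paper)


def embed(token: str) -> List[float]:
    """Hypothetical stand-in for an encoder: a deterministic toy embedding."""
    seed = sum(map(ord, token))
    return [((seed * (i + 3)) % 17) / 17.0 for i in range(DIM)]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class ToyIndex:
    """Stores per-term document vectors computed at indexing time."""

    def __init__(self) -> None:
        self.store: Dict[str, List[List[float]]] = {}
        self.encode_calls = 0  # counts how often the "expensive" encoder runs

    def _encode(self, token: str) -> List[float]:
        self.encode_calls += 1
        return embed(token)

    def add(self, doc_id: str, text: str) -> None:
        # Indexing time: encode document terms once, with no query available.
        self.store[doc_id] = [self._encode(t) for t in text.split()]

    def score(self, query: str, doc_id: str) -> float:
        # Query time: encode only the query terms and reuse the stored vectors.
        q_vecs = [self._encode(t) for t in query.split()]
        d_vecs = self.store[doc_id]
        # Toy merge step: best-matching document term per query term, summed.
        return sum(max(dot(q, d) for d in d_vecs) for q in q_vecs)


index = ToyIndex()
index.add("d1", "neural ranking with transformers")
calls_after_indexing = index.encode_calls
first_score = index.score("neural ranking", "d1")
calls_after_query = index.encode_calls
```

In this sketch the encoder runs four times while indexing and only twice more per query, regardless of document length, which mirrors the latency argument in the abstract.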
LSG Attention: Extrapolation of pretrained Transformers to long sequences
Transformer models achieve state-of-the-art performance on a wide range of
NLP tasks. However, they suffer from a prohibitive limitation due to the
self-attention mechanism, which induces quadratic complexity with regard to
sequence length. To address this limitation, we introduce the LSG architecture, which
relies on Local, Sparse and Global attention. We show that LSG attention is
fast, efficient and competitive in classification and summarization tasks on
long documents. Interestingly, it can also be used to adapt existing pretrained
models to efficiently extrapolate to longer sequences with no additional
training. Along with the introduction of the LSG attention mechanism, we
propose tools to train new models and adapt existing ones based on this
mechanism.
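A combined local, sparse and global attention pattern can be pictured as a boolean mask over query and key positions. The sketch below is a simplified illustration under stated assumptions: the function name `lsg_mask`, the fixed-stride rule for sparse keys, and the choice of the first tokens as global tokens are all hypothetical and are not taken from the LSG implementation.

```python
from typing import List


def lsg_mask(n: int, window: int = 2, stride: int = 4, n_global: int = 1) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j."""
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            local = abs(i - j) <= window              # nearby tokens
            sparse = j % stride == 0                  # assumed sparse rule: every stride-th key
            is_global = i < n_global or j < n_global  # global tokens see and are seen by all
            row.append(local or sparse or is_global)
        mask.append(row)
    return mask


def density(mask: List[List[bool]]) -> float:
    """Fraction of query-key pairs that are kept."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


mask = lsg_mask(16)
kept_fraction = density(mask)
```

With these toy settings, position 8 attends to its neighbours, to keys 0, 4, 8 and 12, and to the global token, but not to position 13, so fewer pairs are scored than under full attention.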
Better, Faster, Stronger Sequence Tagging Constituent Parsers
Sequence tagging models for constituent parsing are faster, but less accurate
than other types of parsers. In this work, we address the following weaknesses
of such constituent parsers: (a) high error rates around closing brackets of
long constituents, (b) large label sets, leading to sparsity, and (c) error
propagation arising from greedy decoding. To effectively close brackets, we
train a model that learns to switch between tagging schemes. To reduce
sparsity, we decompose the label set and use multi-task learning to jointly
learn to predict sublabels. Finally, we mitigate issues from greedy decoding
through auxiliary losses and sentence-level fine-tuning with policy gradient.
Combining these techniques, we clearly surpass the performance of sequence
tagging constituent parsers on the English and Chinese Penn Treebanks, and
reduce their parsing time even further. On the SPMRL datasets, we observe even
greater improvements across the board, including a new state of the art on
Basque, Hebrew, Polish and Swedish.
Comment: NAACL 2019 (long papers). Contains corrigendu
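The idea of decomposing a large label set into sublabels can be illustrated with a small sketch. The `depth_nonterminal` label format, the function names and the example labels are assumptions chosen for illustration; they are not the exact encoding used in the paper.

```python
from typing import List, Tuple


def decompose(label: str) -> Tuple[str, str]:
    """Split a joint tag such as '2_NP' into a depth part and a nonterminal part."""
    depth, _, nonterminal = label.partition("_")
    return depth, nonterminal


def label_set_sizes(labels: List[str]) -> Tuple[int, int, int]:
    """Return (number of joint labels, number of depth sublabels, number of nonterminal sublabels)."""
    joint = set(labels)
    depths = {decompose(label)[0] for label in labels}
    nonterminals = {decompose(label)[1] for label in labels}
    return len(joint), len(depths), len(nonterminals)


# Hypothetical joint labels, one per word.
labels = ["1_S", "2_NP", "1_NP", "2_VP", "1_VP", "2_S"]
sizes = label_set_sizes(labels)
```

Here six joint labels reduce to two depth sublabels and three nonterminal sublabels, so each sublabel is seen more often in training, which is the sparsity argument the abstract makes for predicting sublabels jointly.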
BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction
Although deep pre-trained language models have shown promising benefits in a
large set of industrial scenarios, including Click-Through-Rate (CTR)
prediction, how to integrate pre-trained language models that handle only
textual signals into a prediction pipeline with non-textual features is
challenging.
Up to now two directions have been explored to integrate multi-modal inputs
in fine-tuning of pre-trained language models. One consists of fusing the
outcome of language models and non-textual features through an aggregation
layer, resulting in an ensemble framework, where the cross-information between
textual and non-textual inputs is only learned in the aggregation layer. The
second one consists of splitting non-textual features into fine-grained
fragments and transforming the fragments to new tokens combined with textual
ones, so that they can be fed directly to transformer layers in language
models. However, this approach increases the complexity of learning and
inference because of the numerous additional tokens.
To address these limitations, we propose in this work a novel framework
BERT4CTR, with the Uni-Attention mechanism that can benefit from the
interactions between non-textual and textual features while maintaining low
time-costs in training and inference through a dimensionality reduction.
Comprehensive experiments on both public and commercial data demonstrate that
BERT4CTR can significantly outperform state-of-the-art frameworks for handling
multi-modal inputs and is applicable to CTR prediction.
Transformer Networks for Trajectory Forecasting
Most recent successes in forecasting people's motion are based on LSTM
models, and most recent progress has been achieved by modelling the social
interaction among people and people's interaction with the scene. We question
the use of the LSTM models and propose the novel use of Transformer Networks
for trajectory forecasting. This is a fundamental switch from the sequential
step-by-step processing of LSTMs to the only-attention-based memory mechanisms
of Transformers. In particular, we consider both the original Transformer
Network (TF) and the larger Bidirectional Transformer (BERT), state-of-the-art
on all natural language processing tasks. Our proposed Transformers predict the
trajectories of the individual people in the scene. These are "simple" models
because each person is modelled separately without any complex human-human or
scene interaction terms. In particular, the TF model without bells and whistles
yields the best score on the largest and most challenging trajectory
forecasting benchmark of TrajNet. Additionally, its extension which predicts
multiple plausible future trajectories performs on par with more engineered
techniques on the 5 datasets of ETH + UCY. Finally, we show that Transformers
can deal with missing observations, as may be the case with real sensor
data. Code is available at https://github.com/FGiuliari/Trajectory-Transformer.
Comment: 18 pages, 3 figure