Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation
Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) drastically improves translation performance. In fact, the current state-of-the-art in translation requires gloss-level tokenization in order to work. We introduce a novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information; it simultaneously solves two co-dependent sequence-to-sequence learning problems and leads to significant performance gains.
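To make the joint objective concrete, here is a minimal sketch (not the authors' released code) of how a CTC recognition loss over encoder outputs can be combined with a cross-entropy translation loss over decoder outputs to train one encoder-decoder transformer end to end. The class name, feature dimensions, and loss weights (JointSignTransformer, feat_dim, lambda_ctc, lambda_mt) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointSignTransformer(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, gloss_vocab=1200, text_vocab=3000):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)           # frame features -> model dim
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.gloss_head = nn.Linear(d_model, gloss_vocab)    # recognition head on the encoder
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.text_head = nn.Linear(d_model, text_vocab)      # translation head on the decoder

    def forward(self, frames, text_in):
        memory = self.transformer.encoder(self.embed(frames))
        gloss_logits = self.gloss_head(memory)               # (B, T, gloss_vocab) for CTC
        out = self.transformer.decoder(self.text_embed(text_in), memory)
        text_logits = self.text_head(out)                    # (B, U, text_vocab)
        return gloss_logits, text_logits

def joint_loss(gloss_logits, gloss_targets, frame_lens, gloss_lens,
               text_logits, text_targets, lambda_ctc=1.0, lambda_mt=1.0):
    # CTC aligns encoder outputs to gloss sequences without ground-truth timing.
    ctc = nn.CTCLoss(blank=0)(gloss_logits.log_softmax(-1).transpose(0, 1),
                              gloss_targets, frame_lens, gloss_lens)
    # Standard cross-entropy on the spoken-language decoder output (0 = padding index).
    ce = nn.CrossEntropyLoss(ignore_index=0)(
        text_logits.reshape(-1, text_logits.size(-1)), text_targets.reshape(-1))
    return lambda_ctc * ctc + lambda_mt * ce
```

Because both losses share the same encoder, gradients from the translation objective also shape the recognition features, which is what allows the two co-dependent problems to be solved in a single architecture.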
We evaluate the recognition and translation performance of our approach on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.
Relative Positional Encoding for Speech Recognition and Direct Translation
Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this architecture was tailored for text and is thus less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is the relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that the resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result on the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation.
Comment: Submitted to Interspeech 202
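The core idea of conditioning attention on relative distance rather than absolute position can be sketched as follows. This is a simplified single-head illustration in the style of Shaw et al.'s relative attention, not the paper's implementation; the class name RelativeSelfAttention and the max_dist clipping value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    def __init__(self, d_model=256, max_dist=128):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # One learned embedding per clipped relative distance in [-max_dist, max_dist].
        self.rel_emb = nn.Embedding(2 * max_dist + 1, d_model)
        self.max_dist = max_dist
        self.scale = d_model ** -0.5

    def forward(self, x):                                   # x: (B, T, d_model)
        B, T, _ = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Content-content term: how similar are the states themselves.
        scores = torch.einsum('bid,bjd->bij', q, k)
        # Content-position term: each query also attends to the embedding of (i - j),
        # so the score depends on how far apart two frames are, not where they sit.
        pos = torch.arange(T, device=x.device)
        rel = (pos[:, None] - pos[None, :]).clamp(-self.max_dist, self.max_dist) + self.max_dist
        scores = scores + torch.einsum('bid,ijd->bij', q, self.rel_emb(rel))
        attn = F.softmax(scores * self.scale, dim=-1)
        return torch.einsum('bij,bjd->bid', attn, v)
```

Because the position term depends only on the offset between frames, the same learned pattern applies regardless of utterance length, which is the property that helps with the variable-length acoustic inputs described above.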
Progressive Transformers for End-to-End Sign Language Production
The goal of automatic Sign Language Production (SLP) is to translate spoken language into a continuous stream of sign language video at a level comparable to a human translator. If this were achievable, it would revolutionise Deaf-hearing communication. Previous work, predominantly on isolated SLP, has shown the need for architectures that are better suited to the continuous domain of full sign sequences.
In this paper, we propose Progressive Transformers, a novel architecture that can translate from discrete spoken language sentences to continuous 3D skeleton pose outputs representing sign language. We present two model configurations: an end-to-end network that produces sign directly from text, and a stacked network that utilises a gloss intermediary.
Our transformer network architecture introduces a counter that enables continuous sequence generation at both training and inference time. We also provide several data augmentation processes to overcome the problem of drift and improve the performance of SLP models. We propose a back-translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset and setting baselines for future research.
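As a rough illustration of how a counter can drive continuous generation, the sketch below assumes a decoder that regresses 3D skeleton poses plus a 0-to-1 progress value and stops once that value saturates. The decoder interface, pose_dim, and the exact counter semantics are hypothetical assumptions for illustration, not the authors' released code.

```python
import torch

@torch.no_grad()
def generate_pose_sequence(decoder, text_memory, pose_dim=150, max_len=300):
    """Autoregressively produce continuous pose frames until the counter reaches 1."""
    frame = torch.zeros(1, 1, pose_dim + 1)            # last channel holds the counter value
    frames = []
    for _ in range(max_len):
        # The hypothetical decoder returns the next (pose, counter) frame,
        # conditioned on the text encoding and all previously generated frames.
        next_frame = decoder(frame, text_memory)[:, -1:, :]
        frames.append(next_frame)
        frame = torch.cat([frame, next_frame], dim=1)
        if next_frame[0, 0, -1].item() >= 1.0:          # counter == 1 marks end of sequence
            break
    return torch.cat(frames, dim=1)                      # (1, T, pose_dim + 1)
```

Unlike a discrete end-of-sequence token, the counter gives the model an explicit notion of progress through the sign sequence, which is what allows generation to terminate in a continuous output space.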