Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation
Prior work on Sign Language Translation has shown that having a mid-level
sign gloss representation (effectively recognizing the individual signs)
improves the translation performance drastically. In fact, the current
state-of-the-art in translation requires gloss level tokenization in order to
work. We introduce a novel transformer based architecture that jointly learns
Continuous Sign Language Recognition and Translation while being trainable in
an end-to-end manner. This is achieved by using a Connectionist Temporal
Classification (CTC) loss to bind the recognition and translation problems into
a single unified architecture. This joint approach requires no ground-truth
timing information, solves two co-dependent sequence-to-sequence learning
problems simultaneously, and yields significant performance gains.
We evaluate the recognition and translation performance of our approach on
the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report
state-of-the-art sign language recognition and translation results achieved by
our Sign Language Transformers. Our translation networks outperform both sign
video to spoken language and gloss to spoken language translation models, in
some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 Score). We
also share new baseline translation results using transformer networks for
several other text-to-text sign language translation tasks.
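The joint recognition-and-translation objective described above hinges on a CTC loss over frame-level gloss probabilities. The sketch below implements a toy CTC forward pass in plain Python to illustrate how frame-level distributions collapse to a gloss sequence; it is illustrative only, and the real model combines this recognition term with a translation cross-entropy loss inside a transformer encoder-decoder.

```python
# Toy CTC forward algorithm: probability that frame-level gloss
# distributions collapse to a target gloss sequence (blank = index 0).
# Illustrative only; the actual SLT model adds this CTC recognition
# loss to a translation cross-entropy loss in one joint objective.

def ctc_prob(frame_probs, target, blank=0):
    """frame_probs: list of per-frame distributions over gloss ids.
    target: gloss-id sequence without blanks. Returns P(target | frames)."""
    # Extend the target with blanks: [b, g1, b, g2, ..., b]
    ext = [blank]
    for g in target:
        ext += [g, blank]
    S = len(ext)
    # alpha[s] = probability of all paths ending at extended symbol s
    alpha = [0.0] * S
    alpha[0] = frame_probs[0][blank]
    if S > 1:
        alpha[1] = frame_probs[0][ext[1]]
    for t in range(1, len(frame_probs)):
        new = [0.0] * S
        for s in range(S):
            p = alpha[s]
            if s >= 1:
                p += alpha[s - 1]
            # skip transition allowed between distinct non-blank glosses
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                p += alpha[s - 2]
            new[s] = p * frame_probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

# Two frames, vocabulary {0: blank, 1: "RAIN"}; the paths 1-1, blank-1
# and 1-blank all collapse to the single gloss "RAIN":
# 0.6*0.7 + 0.4*0.7 + 0.6*0.3 = 0.88.
p = ctc_prob([[0.4, 0.6], [0.3, 0.7]], [1])
```

In practice a library implementation (e.g. a framework-provided CTC loss in log space) would be used; the recursion above is the same algorithm without numerical safeguards.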
Sign language translation with pseudo-glosses
Sign Language Translation is an open problem whose goal is to generate written sentences from sign videos. In recent years, many research works developed in this field have mainly addressed the Sign Language Recognition task, which consists of understanding the input signs and transcribing them into sequences of annotations. Moreover, current studies show that taking advantage of the latter task helps to learn meaningful representations and can be seen as an intermediate step towards the end goal of translation.
In this work, we present a method to generate automatic pseudo-glosses from written sentences, which can work as a replacement for real glosses. This addresses the issue of their collection, as they must be manually annotated, which is extremely costly. Furthermore, we introduce a new implementation, built on Fairseq, of the Transformer-model approach introduced by Camgoz et al., which is jointly trained to solve the recognition and translation tasks. In addition, we provide new baseline results on both implementations: first, on the Phoenix dataset, we present results that outperform the ones provided by Camgoz et al. in their work, and, second, on the How2Sign dataset, we present the first results on the translation task. These results can work as a baseline for future research in the field.
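A plausible minimal pipeline for deriving pseudo-glosses from a written sentence is sketched below: lowercase, strip punctuation, drop function words, and emit the remaining content words as uppercase gloss-style tokens. The stopword list and these exact rules are illustrative assumptions, not the authors' pipeline.

```python
import string

# Minimal pseudo-gloss generator: strip punctuation and function words
# from a written sentence and keep content words as uppercase gloss
# tokens. The stopword list and rules are illustrative assumptions,
# not the exact pipeline used in the paper.

STOPWORDS = {"the", "a", "an", "is", "are", "will", "be", "to", "of",
             "and", "in", "it", "on", "at"}

def pseudo_glosses(sentence):
    words = sentence.lower().translate(
        str.maketrans("", "", string.punctuation)).split()
    return [w.upper() for w in words if w not in STOPWORDS]

glosses = pseudo_glosses("The weather will be sunny tomorrow.")
# -> ["WEATHER", "SUNNY", "TOMORROW"]
```

A real system would likely also lemmatize the remaining words so that inflected forms map to a single gloss token.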
Artificial Intelligence for Sign Language Recognition and Translation
In a world where people are more connected, the barriers between deaf people and hearing
people are more visible than ever. A neural sign language translation system would break
many of these barriers. However, there are still many tasks to be solved before fully automatic
sign language translation is possible. Sign Language Translation is a difficult multimodal
machine translation problem with no clear one-to-one mapping to any spoken language.
In this paper I give a review of sign language and its challenges regarding neural machine
translation. I evaluate the state-of-the-art Sign Language Translation approach, and apply
a modified version of the Evolved Transformer to the existing Sign Language Transformer.
I show that the Evolved Transformer encoder produces better results than the
Transformer encoder at lower dimensions.
Leveraging Graph-based Cross-modal Information Fusion for Neural Sign Language Translation
Sign Language (SL), as the mother tongue of the deaf community, is a special
visual language that most hearing people cannot understand. In recent years,
neural Sign Language Translation (SLT), as a possible way of bridging the
communication gap between deaf and hearing people, has attracted
widespread academic attention. We found that current mainstream end-to-end
neural SLT models, which try to learn language knowledge in a weakly
supervised manner, cannot mine enough semantic information under the
condition of low data resources. Therefore, we propose to introduce additional
word-level semantic knowledge of sign language linguistics to assist in
improving current end-to-end neural SLT models. Concretely, we propose a novel
neural SLT model with multi-modal feature fusion based on the dynamic graph, in
which the cross-modal information, i.e. text and video, is first assembled as a
dynamic graph according to their correlation, and then the graph is processed
by a multi-modal graph encoder to generate the multi-modal embeddings for
further usage in the subsequent neural translation models. To the best of our
knowledge, we are the first to introduce graph neural networks, for fusing
multi-modal information, into neural sign language translation models.
Moreover, we conducted experiments on the publicly available, popular SLT
dataset RWTH-PHOENIX-Weather-2014T, and the quantitative results show that
our method improves the model's performance.
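The dynamic cross-modal graph described above can be illustrated as follows: text-token and video-frame nodes are connected whenever their feature vectors correlate strongly. The toy feature vectors, cosine measure, and the 0.7 threshold below are illustrative assumptions; the paper's graph is built from learned embeddings and processed by a multi-modal graph encoder.

```python
import math

# Toy dynamic cross-modal graph: connect text-token and video-frame
# nodes whose feature vectors have cosine similarity above a threshold.
# Features and the 0.7 threshold are illustrative assumptions.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_cross_modal_edges(text_feats, video_feats, threshold=0.7):
    """Return edges between correlated text and video nodes."""
    edges = []
    for i, t in enumerate(text_feats):
        for j, v in enumerate(video_feats):
            if cosine(t, v) >= threshold:
                edges.append((("text", i), ("video", j)))
    return edges

edges = build_cross_modal_edges([[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [0.6, 0.8]])
# connects text node 0 with video node 0 and text node 1 with video node 1
```

Because the edge set is recomputed from the current features, the graph changes with the input, which is what makes it "dynamic" rather than a fixed topology.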
Vector Quantized Diffusion Model with CodeUnet for Text-to-Sign Pose Sequences Generation
Sign Language Production (SLP) aims to translate spoken languages into sign
sequences automatically. The core process of SLP is to transform sign gloss
sequences into their corresponding sign pose sequences (G2P). Most existing G2P
models perform this conditional long-range generation in an
autoregressive manner, which inevitably leads to an accumulation of errors. To
address this issue, we propose a vector quantized diffusion method for
conditional pose sequences generation, called PoseVQ-Diffusion, which is an
iterative non-autoregressive method. Specifically, we first introduce a vector
quantized variational autoencoder (Pose-VQVAE) model to represent a pose
sequence as a sequence of latent codes. Then we model the latent discrete space
by an extension of the recently developed diffusion architecture. To better
leverage the spatial-temporal information, we introduce a novel architecture,
namely CodeUnet, to generate higher-quality pose sequences in the discrete
space. Moreover, taking advantage of the learned codes, we develop a novel
sequential k-nearest-neighbours method to predict the variable lengths of pose
sequences for corresponding gloss sequences. Consequently, compared with the
autoregressive G2P models, our model has a faster sampling speed and produces
significantly better results. Compared with previous non-autoregressive G2P
methods, PoseVQ-Diffusion improves the predicted results with iterative
refinements, thus achieving state-of-the-art results on the SLP evaluation
benchmark.
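The sequential k-nearest-neighbours length prediction mentioned above can be sketched with a deliberate simplification: using only the gloss count as the feature and averaging the pose lengths of the k closest training examples. The paper's method operates over the learned latent codes, so everything below is an illustrative stand-in.

```python
# Sketch of k-nearest-neighbour length prediction: estimate how many
# pose frames a gloss sequence should produce by averaging the pose
# lengths of the k most similar training examples. Using the gloss
# count as the only feature is a deliberate simplification; the
# paper's method works over learned latent codes.

def predict_pose_length(train_pairs, query_gloss_count, k=2):
    """train_pairs: list of (gloss_count, pose_length) examples."""
    ranked = sorted(train_pairs,
                    key=lambda pair: abs(pair[0] - query_gloss_count))
    nearest = ranked[:k]
    return sum(length for _, length in nearest) / len(nearest)

length = predict_pose_length([(2, 40), (3, 60), (5, 100), (6, 120)], 3)
# averages the two closest examples: (60 + 40) / 2 = 50.0
```

The appeal of a retrieval-based length predictor is that it sidesteps regressing sequence length directly, which is hard to learn when pose lengths vary widely for similar gloss inputs.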
Gloss Alignment Using Word Embeddings
Capturing and annotating sign language datasets is a time-consuming and
costly process. Current datasets are orders of magnitude too small to
successfully train unconstrained Sign Language Translation (SLT) models. As a result, research has
turned to TV broadcast content as a source of large-scale training data,
consisting of both the sign language interpreter and the associated audio
subtitle. However, lack of sign language annotation limits the usability of
this data and has led to the development of automatic annotation techniques
such as sign spotting. These spottings are aligned to the video rather than the
subtitle, which often results in a misalignment between the subtitle and
spotted signs. In this paper we propose a method for aligning spottings with
their corresponding subtitles using large spoken language models. Using a
single modality means our method is computationally inexpensive and can be
utilized in conjunction with existing alignment techniques. We quantitatively
demonstrate the effectiveness of our method on the MeineDGS (mDGS) and
BBC-Oxford British Sign Language (BOBSL) datasets, recovering up to a 33.22
BLEU-1 score in word alignment.
Comment: 4 pages, 4 figures, 2023 IEEE International Conference on Acoustics,
Speech, and Signal Processing Workshops (ICASSPW).
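The alignment idea above can be sketched as assigning each spotted gloss to the subtitle containing its most similar word under a word-embedding cosine similarity. The tiny hand-made embedding table below is a stand-in assumption for the large spoken language models used in the paper.

```python
import math

# Toy gloss-to-subtitle alignment: assign each spotted gloss to the
# subtitle whose best word matches it under embedding cosine
# similarity. The tiny embedding table is a hand-made stand-in for
# the large spoken language models used in the paper.

EMB = {"rain": [1.0, 0.0], "raining": [0.9, 0.1],
       "sun": [0.0, 1.0], "sunny": [0.1, 0.9]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def align_glosses(glosses, subtitles):
    """Return {gloss: index of the best-matching subtitle}."""
    assignment = {}
    for gloss in glosses:
        g_vec = EMB.get(gloss.lower(), [0.0, 0.0])
        scores = []
        for sub in subtitles:
            words = [EMB.get(w, [0.0, 0.0]) for w in sub.lower().split()]
            scores.append(max(cosine(g_vec, w) for w in words))
        assignment[gloss] = scores.index(max(scores))
    return assignment

aligned = align_glosses(["RAIN", "SUN"],
                        ["it is raining today", "sunny later"])
# RAIN -> subtitle 0, SUN -> subtitle 1
```

Because this uses the text modality alone, it needs no video features, which is what keeps the method computationally inexpensive and composable with existing alignment techniques.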
Progressive Transformers for End-to-End Sign Language Production
The goal of automatic Sign Language Production (SLP) is to translate spoken
language to a continuous stream of sign language video at a level comparable to
a human translator. If this were achievable, it would revolutionise
Deaf-hearing communication. Previous work on predominantly isolated SLP has shown
the need for architectures that are better suited to the continuous domain of
full sign sequences.
In this paper, we propose Progressive Transformers, a novel architecture that
can translate from discrete spoken language sentences to continuous 3D skeleton
pose outputs representing sign language. We present two model configurations,
an end-to-end network that produces sign directly from text and a stacked network
that utilises a gloss intermediary.
Our transformer network architecture introduces a counter that enables
continuous sequence generation at training and inference. We also provide
several data augmentation processes to overcome the problem of drift and
improve the performance of SLP models. We propose a back translation evaluation
mechanism for SLP, presenting benchmark quantitative results on the challenging
RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset and setting baselines for future
research.