5,590 research outputs found
Progressive Transformers for End-to-End Sign Language Production
The goal of automatic Sign Language Production (SLP) is to translate spoken
language to a continuous stream of sign language video at a level comparable to
a human translator. If this were achievable, it would revolutionise Deaf-hearing
communication. Previous work, predominantly on isolated SLP, has shown
the need for architectures that are better suited to the continuous domain of
full sign sequences.
In this paper, we propose Progressive Transformers, a novel architecture that
can translate from discrete spoken language sentences to continuous 3D skeleton
pose outputs representing sign language. We present two model configurations,
an end-to-end network that produces sign directly from text and a stacked network
that utilises a gloss intermediary.
Our transformer network architecture introduces a counter that enables
continuous sequence generation at training and inference. We also provide
several data augmentation processes to overcome the problem of drift and
improve the performance of SLP models. We propose a back-translation evaluation
mechanism for SLP, presenting benchmark quantitative results on the challenging
RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset and setting baselines for future
research.
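The abstract describes the counter only at a high level; the sketch below is a rough, non-authoritative illustration of how a progress counter can drive variable-length continuous decoding. The layer sizes, the sigmoid counter head, and the stopping threshold are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class CounterDecoder(nn.Module):
    """Illustrative sketch: an autoregressive decoder that emits a continuous
    3D pose vector plus a progress counter in [0, 1] at every step; decoding
    stops once the predicted counter saturates (end of sequence)."""

    def __init__(self, d_model=256, pose_dim=150, n_heads=4, n_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim + 1, d_model)        # pose + counter
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.pose_head = nn.Linear(d_model, pose_dim)           # continuous joints
        self.counter_head = nn.Linear(d_model, 1)               # progress in [0, 1]

    @torch.no_grad()
    def generate(self, text_memory, max_len=100):
        # text_memory: (1, src_len, d_model) encoder output of the spoken sentence.
        pose_dim = self.pose_head.out_features
        prefix = torch.zeros(1, 1, pose_dim + 1)                 # zero pose, zero counter
        poses = []
        for _ in range(max_len):
            hidden = self.decoder(self.in_proj(prefix), text_memory)
            last = hidden[:, -1:]                                 # newest decoding position
            pose = self.pose_head(last)
            counter = torch.sigmoid(self.counter_head(last))
            poses.append(pose)
            prefix = torch.cat([prefix, torch.cat([pose, counter], dim=-1)], dim=1)
            if counter.item() >= 1.0 - 1e-3:                      # progress saturated: stop
                break
        return torch.cat(poses, dim=1)                            # (1, T, pose_dim)

# Example with a dummy encoded sentence of 12 tokens:
decoder = CounterDecoder()
memory = torch.randn(1, 12, 256)
sign_pose_sequence = decoder.generate(memory)                    # (1, T, 150)
```

The idea mirrored here is that the decoder regresses joint positions together with a monotonically increasing progress value, so the model itself signals when the sign sequence is complete rather than relying on a fixed output length.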
Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks
Sign languages are multi-channel visual languages, where signers use a
continuous 3D space to communicate. Sign Language Production (SLP), the
automatic translation from spoken to sign languages, must embody both the
continuous articulation and full morphology of sign to be truly understandable
by the Deaf community. Previous deep learning-based SLP works have produced
only a concatenation of isolated signs focusing primarily on the manual
features, leading to a robotic and non-expressive production.
In this work, we propose a novel Progressive Transformer architecture, the
first SLP model to translate from spoken language sentences to continuous 3D
multi-channel sign pose sequences in an end-to-end manner. Our transformer
network architecture introduces a counter decoding that enables variable length
continuous sequence generation by tracking the production progress over time
and predicting the end of sequence. We present extensive data augmentation
techniques to reduce prediction drift, alongside an adversarial training regime
and a Mixture Density Network (MDN) formulation to produce realistic and
expressive sign pose sequences.
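The MDN formulation is not spelled out in the abstract; the sketch below shows one standard way a mixture density head can replace plain pose regression to counter regression to the mean. The component count, layer sizes, and per-dimension diagonal Gaussians are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    """Illustrative Mixture Density Network head: instead of regressing a single
    pose (which tends toward the mean), predict a Gaussian mixture over the pose
    dimensions and train with negative log-likelihood."""

    def __init__(self, d_model=256, pose_dim=150, n_mixtures=5):
        super().__init__()
        self.n_mixtures, self.pose_dim = n_mixtures, pose_dim
        self.pi = nn.Linear(d_model, n_mixtures)                    # mixture weights
        self.mu = nn.Linear(d_model, n_mixtures * pose_dim)         # component means
        self.log_sigma = nn.Linear(d_model, n_mixtures * pose_dim)  # component spreads

    def forward(self, h):
        # h: (B, T, d_model) decoder hidden states.
        B, T, _ = h.shape
        log_pi = F.log_softmax(self.pi(h), dim=-1)                              # (B, T, K)
        mu = self.mu(h).view(B, T, self.n_mixtures, self.pose_dim)              # (B, T, K, D)
        sigma = self.log_sigma(h).view(B, T, self.n_mixtures, self.pose_dim).exp()
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, target):
    """Negative log-likelihood of the target pose under the predicted mixture."""
    target = target.unsqueeze(2)                                                # (B, T, 1, D)
    comp = torch.distributions.Normal(mu, sigma)
    # Sum log-probabilities over pose dimensions, then mix over components.
    log_prob = comp.log_prob(target).sum(-1) + log_pi                           # (B, T, K)
    return -torch.logsumexp(log_prob, dim=-1).mean()

# Example: score a batch of target poses under the predicted mixture.
head = MDNHead()
loss = mdn_nll(*head(torch.randn(2, 20, 256)), target=torch.randn(2, 20, 150))
```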
We propose a back-translation evaluation mechanism for SLP, presenting
benchmark quantitative results on the challenging PHOENIX14T dataset and
setting baselines for future research. We further provide a user evaluation of
our SLP model, to understand the Deaf reception of our sign pose productions.
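Back-translation evaluation, as described here, scores generated sign pose sequences by translating them back to spoken language and comparing against the source text. The minimal sketch below assumes hypothetical `slp_model.generate` and `sign_to_text_model.translate` interfaces standing in for a trained SLP model and a pre-trained sign-to-text model.

```python
from sacrebleu.metrics import BLEU

def back_translation_score(slp_model, sign_to_text_model, spoken_sentences):
    """Hedged sketch of back-translation evaluation: produce sign pose
    sequences from text, translate them back to text with a pre-trained
    sign-to-text model, and compare against the original sentences."""
    hypotheses = []
    for sentence in spoken_sentences:
        poses = slp_model.generate(sentence)            # text -> 3D pose sequence
        back = sign_to_text_model.translate(poses)      # pose sequence -> text
        hypotheses.append(back)
    # Corpus-level BLEU between back-translated and original spoken sentences.
    return BLEU().corpus_score(hypotheses, [spoken_sentences]).score
```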
Adversarial Training for Multi-Channel Sign Language Production
Sign Languages are rich multi-channel languages, requiring articulation of
both manual (hands) and non-manual (face and body) features in a precise,
intricate manner. Sign Language Production (SLP), the automatic translation
from spoken to sign languages, must embody this full sign morphology to be
truly understandable by the Deaf community. Previous work has mainly focused on
manual feature production, with an under-articulated output caused by
regression to the mean.
In this paper, we propose an Adversarial Multi-Channel approach to SLP. We
frame sign production as a minimax game between a transformer-based Generator
and a conditional Discriminator. Our adversarial discriminator evaluates the
realism of sign production conditioned on the source text, pushing the
generator towards a realistic and articulate output. Additionally, we fully
encapsulate sign articulators with the inclusion of non-manual features,
producing facial features and mouthing patterns.
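The abstract frames production as a minimax game; the sketch below illustrates one plausible shape of that setup, with a conditional discriminator scoring pose sequences given a text encoding. The GRU pose encoder, the MSE regression term, and the loss weighting are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    """Illustrative conditional discriminator: scores the realism of a sign
    pose sequence given an encoding of the source spoken sentence."""

    def __init__(self, pose_dim=150, text_dim=256, hidden=256):
        super().__init__()
        self.pose_enc = nn.GRU(pose_dim, hidden, batch_first=True)
        self.score = nn.Sequential(
            nn.Linear(hidden + text_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1))

    def forward(self, poses, text_embedding):
        # poses: (B, T, pose_dim); text_embedding: (B, text_dim)
        _, h = self.pose_enc(poses)                                   # (1, B, H)
        return self.score(torch.cat([h[-1], text_embedding], dim=-1))  # realism logit

def adversarial_step(generated, real_poses, text_embedding, disc,
                     bce=nn.BCEWithLogitsLoss()):
    """One hedged step of the minimax game between generator and discriminator."""
    # Discriminator: real poses -> 1, generated poses -> 0 (conditioned on text).
    d_real = disc(real_poses, text_embedding)
    d_fake = disc(generated.detach(), text_embedding)
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    # Generator: fool the discriminator, plus a regression term to the ground truth.
    g_adv = bce(disc(generated, text_embedding), torch.ones_like(d_real))
    g_loss = g_adv + nn.functional.mse_loss(generated, real_poses)
    return d_loss, g_loss
```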
We evaluate on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T)
dataset, and report state-of-the-art SLP back-translation performance for
manual production. We set new benchmarks for the production of multi-channel
sign to underpin future research into realistic SLP.
Sign Language Production with Latent Motion Transformer
Sign Language Production (SLP) is the challenging task of generating sign
language videos, typically from a sign gloss sequence. In this research, we
develop a new method to produce high-quality sign videos without using human
poses as an intermediate representation. Our model works in two main stages:
first, a generator learns a latent representation of the video; second, a
separate model learns the ordering of these latent features.
To adapt this method to sign videos, we make several significant
improvements. (i) In the first stage, we adopt an improved 3D VQ-GAN to learn
downsampled latent representations. (ii) In the second stage, we introduce
sequence-to-sequence attention to better leverage conditional information.
(iii) Because separate two-stage training discards the visual semantics of the
latent codes in the second stage, we extend the token-level autoregressive
latent-code learning with a perceptual loss and a reconstruction loss, giving
the prior model visual perception. Compared with previous state-of-the-art
approaches, our model performs consistently better on two word-level sign
language datasets, i.e., WLASL and NMFs-CSL.
Comment: Accepted by WACV202
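As a rough illustration of the second stage described above, the sketch below models a sequence of discrete VQ codebook indices autoregressively, conditioned on gloss tokens, with a standard encoder-decoder transformer. The vocabulary sizes and layer counts are invented, and the perceptual/reconstruction extensions mentioned in the abstract are omitted.

```python
import torch
import torch.nn as nn

class LatentCodePrior(nn.Module):
    """Illustrative second-stage prior: an encoder-decoder transformer that maps
    gloss tokens to a sequence of discrete latent codes (e.g. indices from a
    VQ-GAN codebook), trained autoregressively with cross-entropy."""

    def __init__(self, gloss_vocab=1000, code_vocab=1024, d_model=256):
        super().__init__()
        self.gloss_emb = nn.Embedding(gloss_vocab, d_model)
        self.code_emb = nn.Embedding(code_vocab, d_model)
        self.transformer = nn.Transformer(d_model, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d_model, code_vocab)

    def forward(self, gloss_ids, code_ids):
        # gloss_ids: (B, S) conditioning tokens; code_ids: (B, T) latent codes.
        tgt_mask = self.transformer.generate_square_subsequent_mask(code_ids.size(1))
        h = self.transformer(self.gloss_emb(gloss_ids),
                             self.code_emb(code_ids),
                             tgt_mask=tgt_mask)
        return self.out(h)                                # (B, T, code_vocab) logits

# Teacher-forced training step: cross-entropy over next-code prediction.
model = LatentCodePrior()
gloss = torch.randint(0, 1000, (2, 8))
codes = torch.randint(0, 1024, (2, 16))
logits = model(gloss, codes[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 1024), codes[:, 1:].reshape(-1))
```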
Advancing Text-to-GLOSS Neural Translation Using a Novel Hyper-parameter Optimization Technique
In this paper, we investigate the use of transformers for Neural Machine
Translation of text-to-GLOSS for Deaf and Hard-of-Hearing communication. Due to
the scarcity of available data and limited resources for text-to-GLOSS
translation, we treat the problem as a low-resource language task. We use our
novel hyper-parameter exploration technique to explore a variety of
architectural parameters and build an optimal transformer-based architecture
specifically tailored for text-to-GLOSS translation. The study aims to improve
the accuracy and fluency of the GLOSS generated by Neural Machine Translation. This is
achieved by examining various architectural parameters including layer count,
attention heads, embedding dimension, dropout, and label smoothing to identify
the optimal architecture for improving text-to-GLOSS translation performance.
The experiments conducted on the PHOENIX14T dataset reveal that the optimal
transformer architecture outperforms previous work on the same dataset. The
best model reaches a ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
score of 55.18% and a BLEU-1 (BiLingual Evaluation Understudy 1) score of
63.6%, outperforming state-of-the-art results on the BLEU-1 and ROUGE scores by
8.42 and 0.63, respectively.
Comment: 8 pages, 5 figure
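The paper's exploration technique is described as novel and is not reproduced here; as a generic stand-in, the sketch below runs a plain random search over the same hyper-parameters the abstract names, with `train_and_score` standing in for training a text-to-GLOSS transformer and returning its validation BLEU.

```python
import random

# Candidate values are illustrative, not the ranges used in the paper.
SEARCH_SPACE = {
    "num_layers": [2, 4, 6],
    "attention_heads": [2, 4, 8],
    "embedding_dim": [128, 256, 512],
    "dropout": [0.1, 0.2, 0.3],
    "label_smoothing": [0.0, 0.1, 0.2],
}

def random_search(train_and_score, n_trials=25, seed=0):
    """Sample configurations, train a model for each, and keep the best one."""
    rng = random.Random(seed)
    best_config, best_bleu = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        bleu = train_and_score(config)     # train text-to-GLOSS model, return dev BLEU
        if bleu > best_bleu:
            best_config, best_bleu = config, bleu
    return best_config, best_bleu
```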
SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models
Despite achieving remarkable performance on various vision-language tasks,
Transformer-based Vision-Language Models (VLMs) suffer from redundancy in
inputs and parameters, significantly hampering their efficiency in real-world
applications. Moreover, the degree of redundancy in token representations and
model parameters, such as attention heads, varies significantly for different
inputs. In light of the challenges, we propose SmartTrim, an adaptive
acceleration framework for VLMs, which adjusts the computational overhead per
instance. Specifically, we integrate lightweight modules into the original
backbone to identify and prune redundant token representations and attention
heads within each layer. Furthermore, we devise a self-distillation strategy to
enhance the consistency between the predictions of the pruned model and its
full-capacity counterpart. Experimental results across various vision-language
tasks consistently demonstrate that SmartTrim accelerates the original model by
2-3 times with minimal performance degradation, highlighting its effectiveness
and efficiency compared to previous approaches. Code will be available at
https://github.com/kugwzk/SmartTrim.
Comment: COLING-LREC 202
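The linked repository defines the actual pruning modules; the sketch below is only a conceptual stand-in for the idea of a lightweight, per-layer scorer that keeps a fraction of token representations. The linear scorer and fixed keep ratio are assumptions, not SmartTrim's adaptive mechanism.

```python
import torch
import torch.nn as nn

class TokenTrimmer(nn.Module):
    """Conceptual stand-in for a lightweight per-layer pruning module: scores
    each token representation and keeps only the highest-scoring fraction, so
    later layers process fewer tokens."""

    def __init__(self, d_model=768, keep_ratio=0.6):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)     # lightweight importance predictor
        self.keep_ratio = keep_ratio

    def forward(self, tokens):
        # tokens: (B, N, d_model) hidden states of one layer.
        scores = self.scorer(tokens).squeeze(-1)                       # (B, N)
        keep = max(1, int(tokens.size(1) * self.keep_ratio))
        idx = scores.topk(keep, dim=1).indices.sort(dim=1).values      # keep original order
        return torch.gather(tokens, 1,
                            idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

# Example: trim a 100-token sequence down to 60 tokens before the next layer.
trimmer = TokenTrimmer()
pruned = trimmer(torch.randn(2, 100, 768))      # (2, 60, 768)
```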
ISLTranslate: Dataset for Translating Indian Sign Language
Sign languages are the primary means of communication for many
hard-of-hearing people worldwide. Recently, to bridge the communication gap
between the hard-of-hearing community and the rest of the population, several
sign language translation datasets have been proposed to enable the development
of statistical sign language translation systems. However, there is a dearth of
sign language resources for Indian Sign Language. This resource paper
introduces ISLTranslate, a translation dataset for continuous Indian Sign
Language (ISL) consisting of 31k ISL-English sentence/phrase pairs. To the best
of our knowledge, it is the largest translation dataset for continuous Indian
Sign Language. We provide a detailed analysis of the dataset. To validate the
performance of existing end-to-end sign language to spoken language translation
systems, we benchmark the created dataset with a transformer-based model for
ISL translation.
Comment: Accepted at ACL 2023 Findings, 8 Page
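As a small illustration of how such sentence/phrase pairs might be wrapped for a translation baseline, the sketch below defines a paired dataset class; the TSV layout and field names are assumptions, not the released ISLTranslate format.

```python
import csv
from torch.utils.data import Dataset

class SignTranslationPairs(Dataset):
    """Minimal sketch of a paired dataset for benchmarking sign-to-English
    translation. The file layout (a TSV with a video/feature path and an
    English sentence per row) is hypothetical."""

    def __init__(self, tsv_path, load_features):
        with open(tsv_path, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f, delimiter="\t"))
        self.load_features = load_features      # callable: path -> feature tensor

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        row = self.rows[i]
        return self.load_features(row["video_path"]), row["english_sentence"]
```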
Considerations for meaningful sign language machine translation based on glosses
Automatic sign language processing is gaining popularity in Natural Language
Processing (NLP) research (Yin et al., 2021). In machine translation (MT) in
particular, sign language translation based on glosses is a prominent approach.
In this paper, we review recent works on neural gloss translation. We find that
limitations of glosses in general and limitations of specific datasets are not
discussed in a transparent manner and that there is no common standard for
evaluation.
To address these issues, we put forward concrete recommendations for future
research on gloss translation. Our suggestions advocate awareness of the
inherent limitations of gloss-based approaches, realistic datasets, stronger
baselines, and convincing evaluation.
A Comprehensive Approach to Automated Sign Language Translation
Many sign languages are bona fide natural languages with grammatical rules and lexicons, and hence can benefit from neural machine translation methods. As significant advances are being made in natural language processing (specifically neural machine translation) and in computer vision (specifically image and video captioning), related methods can be further researched to boost automated sign language understanding. This is an especially challenging AI research area due to the involvement of a continuous visual-spatial modality, where meaning is often derived from context. To this end, this thesis is focused on the study and development of new computational methods and training mechanisms to enhance sign language translation in two directions: signs to text and text to signs.

This work introduces a new, realistic phrase-level American Sign Language dataset (ASL/ASLing), and investigates the role of different types of visual features (CNN embeddings, human body keypoints, and optical flow vectors) in translating ASL to spoken American English. Additionally, the research considers the role of multiple features for improved translation via various fusion architectures. As an added benefit, with continuous sign language being challenging to segment, this work also explores the use of overlapping scaled visual segments across the video for simultaneously segmenting and translating signs.

Finally, a quintessential interpreting agent not only understands sign language and translates it to text, but also understands text and translates it to signs. Hence, to facilitate two-way sign language communication, i.e. visual sign to spoken language translation and spoken to visual sign language translation, a dual neural machine translation model, SignNet, is presented. Various training paradigms are investigated for improved translation using SignNet. By exploiting the notion of similarity (and dissimilarity) of visual signs, a metric embedding learning process proved most useful in training SignNet. The resulting models outperformed their state-of-the-art counterparts, showing noteworthy improvements in BLEU-1 through BLEU-4 scores.
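The metric embedding learning credited above with the largest gains is not specified in detail here; the sketch below shows a generic triplet-margin objective over sign clip embeddings as one plausible instance. The `embed` network and the margin value are assumptions.

```python
import torch.nn as nn

# Generic metric-embedding objective: pull embeddings of visually similar signs
# together and push dissimilar signs apart with a margin.
triplet = nn.TripletMarginLoss(margin=0.2)

def metric_embedding_loss(embed, anchor_clip, similar_clip, dissimilar_clip):
    """embed: a network mapping a sign video clip to a fixed-size embedding."""
    return triplet(embed(anchor_clip), embed(similar_clip), embed(dissimilar_clip))
```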