Progressive Transformers for End-to-End Sign Language Production

Authors: H Cooper; K Karpouzis; M Plappert; M Süzgün; S Tamura; SK Ko; T Starner
Publication date: 02/07/2020

The goal of automatic Sign Language Production (SLP) is to translate spoken language to a continuous stream of sign language video at a level comparable to a human translator. If this were achievable, it would revolutionise Deaf-hearing communications. Previous work on predominantly isolated SLP has shown the need for architectures that are better suited to the continuous domain of full sign sequences. In this paper, we propose Progressive Transformers, a novel architecture that can translate from discrete spoken language sentences to continuous 3D skeleton pose outputs representing sign language. We present two model configurations, an end-to-end network that produces sign directly from text and a stacked network that utilises a gloss intermediary. Our transformer network architecture introduces a counter that enables continuous sequence generation at training and inference. We also provide several data augmentation processes to overcome the problem of drift and improve the performance of SLP models. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset and setting baselines for future research.

arXiv.org e-Print Archive

Crossref

University of Surrey

Surrey Research Insight

Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

Authors: Bowden Richard; Camgoz Necati Cihan; Saunders Ben
Publication date: 11/03/2021

Sign languages are multi-channel visual languages, where signers use a continuous 3D space to communicate. Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody both the continuous articulation and full morphology of sign to be truly understandable by the Deaf community. Previous deep learning-based SLP works have produced only a concatenation of isolated signs focusing primarily on the manual features, leading to a robotic and non-expressive production. In this work, we propose a novel Progressive Transformer architecture, the first SLP model to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner. Our transformer network architecture introduces a counter decoding that enables variable length continuous sequence generation by tracking the production progress over time and predicting the end of sequence. We present extensive data augmentation techniques to reduce prediction drift, alongside an adversarial training regime and a Mixture Density Network (MDN) formulation to produce realistic and expressive sign pose sequences. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging PHOENIX14T dataset and setting baselines for future research. We further provide a user evaluation of our SLP model, to understand the Deaf reception of our sign pose productions.

arXiv.org e-Print Archive

University of Surrey
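
The counter decoding mentioned in the abstract above (tracking production progress and predicting the end of sequence) can be illustrated with a small decoding loop. This is a minimal sketch, not the authors' implementation: the `step` callback, the `toy_step` stand-in, the `max_frames` cap, and the convention that the counter runs from 0 to 1 with 1 marking the end of the sequence are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Pose = List[float]
# Hypothetical step function (an assumption, not an API from the paper):
# (source tokens, poses so far, counters so far) -> (next pose, next counter)
StepFn = Callable[[Sequence[str], List[Pose], List[float]], Tuple[Pose, float]]


def counter_decode(
    source: Sequence[str], step: StepFn, max_frames: int = 300
) -> Tuple[List[Pose], List[float]]:
    """Generate pose frames until the predicted counter signals the end.

    Assumes the counter is a progress value in [0, 1]; decoding stops once it
    reaches 1.0, or after max_frames frames as a safety cap.
    """
    poses: List[Pose] = []
    counters: List[float] = []
    for _ in range(max_frames):
        pose, counter = step(source, poses, counters)
        poses.append(pose)
        counters.append(counter)
        if counter >= 1.0:  # end of sequence predicted via the counter
            break
    return poses, counters


def toy_step(
    source: Sequence[str], poses: List[Pose], counters: List[float]
) -> Tuple[Pose, float]:
    """Stand-in for a trained model: emits four frames per source token."""
    total = 4 * len(source)
    t = len(poses) + 1
    return [float(t), 0.0, 0.0], min(t / total, 1.0)
```

Calling `counter_decode(["morgen", "regen"], toy_step)` runs the loop with the toy model; a real system would replace `toy_step` with a trained decoder that predicts both the pose and the counter for each frame.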

Adversarial Training for Multi-Channel Sign Language Production

Authors: Bowden Richard; Camgoz Necati Cihan; Saunders Ben
Publication date: 05/08/2020

Sign Languages are rich multi-channel languages, requiring articulation of both manual (hands) and non-manual (face and body) features in a precise, intricate manner. Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody this full sign morphology to be truly understandable by the Deaf community. Previous work has mainly focused on manual feature production, with an under-articulated output caused by regression to the mean. In this paper, we propose an Adversarial Multi-Channel approach to SLP. We frame sign production as a minimax game between a transformer-based Generator and a conditional Discriminator. Our adversarial discriminator evaluates the realism of sign production conditioned on the source text, pushing the generator towards a realistic and articulate output. Additionally, we fully encapsulate sign articulators with the inclusion of non-manual features, producing facial features and mouthing patterns. We evaluate on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, and report state-of-the-art SLP back-translation performance for manual production. We set new benchmarks for the production of multi-channel sign to underpin future research into realistic SLP.

arXiv.org e-Print Archive

University of Surrey

Surrey Research Insight
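
For orientation, the minimax game between a generator and a conditional discriminator described in the abstract above is commonly written in the standard conditional GAN form below. This is the textbook objective, given as an assumption about the general shape of the loss; the paper's exact formulation may differ. Here $X$ denotes the source text, $Y$ a ground-truth sign pose sequence, $G$ the generator and $D$ the discriminator.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{(X, Y)}\bigl[\log D(Y \mid X)\bigr]
  + \mathbb{E}_{X}\bigl[\log\bigl(1 - D(G(X) \mid X)\bigr)\bigr]
```

The discriminator is rewarded for scoring real sequences high and generated ones low, both conditioned on the source text, while the generator is rewarded for producing sequences the discriminator accepts as real.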

Sign Language Production with Latent Motion Transformer

Authors: Du Yao; Peng Taiyi; Xie Pan; Zhang Qipeng
Publication date: 20/12/2023

Sign Language Production (SLP) is the tough task of turning sign language into sign videos. The main goal of SLP is to create these videos using a sign gloss. In this research, we've developed a new method to make high-quality sign videos without using human poses as a middle step. Our model works in two main parts: first, it learns from a generator and the video's hidden features, and next, it uses another model to understand the order of these hidden features. To make this method even better for sign videos, we make several significant improvements. (i) In the first stage, we use an improved 3D VQ-GAN to learn downsampled latent representations. (ii) In the second stage, we introduce sequence-to-sequence attention to better leverage conditional information. (iii) The separated two-stage training discards the realistic visual semantics of the latent codes in the second stage. To endow the latent sequences with semantic information, we extend the token-level autoregressive latent code learning with perceptual loss and reconstruction loss for the prior model with visual perception. Compared with previous state-of-the-art approaches, our model performs consistently better on two word-level sign language datasets, i.e., WLASL and NMFs-CSL. Comment: Accepted by WACV202

arXiv.org e-Print Archive

Advancing Text-to-GLOSS Neural Translation Using a Novel Hyper-parameter Optimization Technique

Authors: Khattabi Noussaima El; Ouargani Younes
Publication date: 05/09/2023

In this paper, we investigate the use of transformers for Neural Machine Translation of text-to-GLOSS for Deaf and Hard-of-Hearing communication. Due to the scarcity of available data and limited resources for text-to-GLOSS translation, we treat the problem as a low-resource language task. We use our novel hyper-parameter exploration technique to explore a variety of architectural parameters and build an optimal transformer-based architecture specifically tailored for text-to-GLOSS translation. The study aims to improve the accuracy and fluency of Neural Machine Translation generated GLOSS. This is achieved by examining various architectural parameters including layer count, attention heads, embedding dimension, dropout, and label smoothing to identify the optimal architecture for improving text-to-GLOSS translation performance. The experiments conducted on the PHOENIX14T dataset reveal that the optimal transformer architecture outperforms previous work on the same dataset. The best model reaches a ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score of 55.18% and a BLEU-1 (BiLingual Evaluation Understudy 1) score of 63.6%, outperforming state-of-the-art results on the BLEU-1 and ROUGE scores by 8.42 and 0.63 respectively. Comment: 8 pages, 5 figure

arXiv.org e-Print Archive
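
The BLEU-1 metric quoted in the abstract above is unigram precision with clipping, multiplied by a brevity penalty. The sketch below computes it for one hypothesis against one reference. It is an illustration only: the function name `bleu1` is hypothetical, and reported scores such as the 63.6% above are normally corpus-level figures produced by an evaluation toolkit, which this sentence-level version does not reproduce.

```python
import math
from collections import Counter
from typing import Sequence


def bleu1(hypothesis: Sequence[str], reference: Sequence[str]) -> float:
    """Sentence-level BLEU-1 against a single reference.

    Clipped unigram precision times the brevity penalty.
    """
    if not hypothesis:
        return 0.0
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    # Each hypothesis token is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    precision = clipped / len(hypothesis)
    # Brevity penalty: hypotheses shorter than the reference are penalised.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(hypothesis))
    return brevity_penalty * precision
```

For gloss output, the inputs would be token lists such as `"MORGEN REGEN NORD".split()`; the clipping step prevents a hypothesis from scoring well by repeating one correct gloss.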

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

Authors: Chen Jingchang; Liang Jiafeng; Liu Ming; Qin Bing; Shan Liping; Wang Zekun; Xu Dongliang; Yang Qing; Zhou Wangchunshu; Zhu Haichao
Publication date: 26/02/2024

Despite achieving remarkable performance on various vision-language tasks, Transformer-based Vision-Language Models (VLMs) suffer from redundancy in inputs and parameters, significantly hampering their efficiency in real-world applications. Moreover, the degree of redundancy in token representations and model parameters, such as attention heads, varies significantly for different inputs. In light of these challenges, we propose SmartTrim, an adaptive acceleration framework for VLMs, which adjusts the computational overhead per instance. Specifically, we integrate lightweight modules into the original backbone to identify and prune redundant token representations and attention heads within each layer. Furthermore, we devise a self-distillation strategy to enhance the consistency between the predictions of the pruned model and its full-capacity counterpart. Experimental results across various vision-language tasks consistently demonstrate that SmartTrim accelerates the original model by 2-3 times with minimal performance degradation, highlighting the effectiveness and efficiency compared to previous approaches. Code will be available at https://github.com/kugwzk/SmartTrim. Comment: COLING-LREC 202

arXiv.org e-Print Archive

ISLTranslate: Dataset for Translating Indian Sign Language

Authors: Agrawal Susmit; Joshi Abhinav; Modi Ashutosh
Publication date: 11/07/2023

Sign languages are the primary means of communication for many hard-of-hearing people worldwide. Recently, to bridge the communication gap between the hard-of-hearing community and the rest of the population, several sign language translation datasets have been proposed to enable the development of statistical sign language translation systems. However, there is a dearth of sign language resources for Indian Sign Language. This resource paper introduces ISLTranslate, a translation dataset for continuous Indian Sign Language (ISL) consisting of 31k ISL-English sentence/phrase pairs. To the best of our knowledge, it is the largest translation dataset for continuous Indian Sign Language. We provide a detailed analysis of the dataset. To validate the performance of existing end-to-end sign language to spoken language translation systems, we benchmark the created dataset with a transformer-based model for ISL translation. Comment: Accepted at ACL 2023 Findings, 8 Page

arXiv.org e-Print Archive

Considerations for meaningful sign language machine translation based on glosses

Authors: Ebling Sarah; Jiang Zifan; Moryossef Amit; Müller Mathias; Rios Annette
Publication date: 28/11/2022

Automatic sign language processing is gaining popularity in Natural Language Processing (NLP) research (Yin et al., 2021). In machine translation (MT) in particular, sign language translation based on glosses is a prominent approach. In this paper, we review recent works on neural gloss translation. We find that limitations of glosses in general and limitations of specific datasets are not discussed in a transparent manner and that there is no common standard for evaluation. To address these issues, we put forward concrete recommendations for future research on gloss translation. Our suggestions advocate awareness of the inherent limitations of gloss-based approaches, realistic datasets, stronger baselines and convincing evaluation.

arXiv.org e-Print Archive

ZORA

Recommended from our members

UC Berkeley's Cory Hall: Evaluation of Challenges and Potential Applications of Building-to-Grid Implementation

Author: Peffer Therese
Publication venue: eScholarship, University of California
Publication date: 01/01/2010

From September 2009 through June 2010, a team of researchers developed, installed, and tested instrumentation on the energy flows in Cory Hall on the UC Berkeley campus to create a Building-to-Grid testbed. The UC Berkeley team was headed by Professor David Culler, and assisted by members from EnerNex, Lawrence Berkeley National Laboratory, California State University Sacramento, and the California Institute for Energy & Environment. While the Berkeley team mapped the load tree of the building, EnerNex researched types of meters, submeters, monitors, and sensors to be used (Task 1). Next the UC Berkeley team analyzed building needs and designed the network of metering components and data storage/visualization software (Task 2). After meeting with vendors in January, the UCB team procured and installed the components starting in late March (Task 3). Next, the UCB team tested and demonstrated the system (Task 4). Meanwhile, the CSUS team documented the methodology and steps necessary to implement a testbed (Task 5) and Harold Galicer developed a roadmap for the CSUS Smart Grid Center with results from the testbed (Task 5a) and evaluated the Cory Hall implementation process (Task 5b). The CSUS team also worked with local utilities to develop an approach to the energy information communication link between buildings and the utility (Task 6). The UC Berkeley team then prepared a roadmap to outline necessary technology development for Building-to-Grid, and presented the results of the project in early July (Task 7). Finally, CIEE evaluated the implementation, noting challenges and potential applications of Building-to-Grid (Task 8). These deliverables are available at the i4Energy site: http://i4energy.org/

eScholarship - University of California

A Comprehensive Approach to Automated Sign Language Translation

Author: Ananthanarayana Tejaswini
Publication venue: RIT Scholar Works
Publication date: 01/11/2021

Many sign languages are bona fide natural languages with grammatical rules and lexicons, and hence can benefit from neural machine translation methods. As significant advances are being made in natural language processing (specifically neural machine translation) and in computer vision processes, specifically image and video captioning, related methods can be further researched to boost automated sign language understanding. This is an especially challenging AI research area due to the involvement of a continuous visual-spatial modality, where meaning is often derived from context. To this end, this thesis is focused on the study and development of new computational methods and training mechanisms to enhance sign language translation in two directions, signs to texts and texts to signs. This work introduces a new, realistic phrase-level American Sign Language dataset (ASL/ ASLing), and investigates the role of different types of visual features (CNN embeddings, human body keypoints, and optical flow vectors) in translating ASL to spoken American English. Additionally, the research considers the role of multiple features for improved translation, via various fusion architectures. As an added benefit, with continuous sign language being challenging to segment, this work also explores the use of overlapping scaled visual segments, across the video, for simultaneously segmenting and translating signs. Finally, a quintessential interpreting agent not only understands sign language and translates to text, but also understands the text and translates to signs. Hence, to facilitate two-way sign language communication, i.e. visual sign to spoken language translation and spoken to visual sign language translation, a dual neural machine translation model, SignNet, is presented. Various training paradigms are investigated for improved translation, using SignNet. By exploiting the notion of similarity (and dissimilarity) of visual signs, a metric embedding learning process proved most useful in training SignNet. The resulting processes outperformed their state-of-the-art counterparts by showing noteworthy improvements in BLEU 1 - BLEU 4 scores.

RIT Scholar Works