7 research outputs found
Video-based Sign Language Recognition without Temporal Segmentation
Millions of hearing impaired people around the world routinely use some
variants of sign languages to communicate, thus the automatic translation of a
sign language is meaningful and important. Currently, there are two
sub-problems in Sign Language Recognition (SLR), i.e., isolated SLR that
recognizes word by word and continuous SLR that translates entire sentences.
Existing continuous SLR methods typically utilize isolated SLRs as building
blocks, with an extra layer of preprocessing (temporal segmentation) and
another layer of post-processing (sentence synthesis). Unfortunately, temporal
segmentation itself is non-trivial and inevitably propagates errors into
subsequent steps. Worse still, isolated SLR methods typically require strenuous
labeling of each word separately in a sentence, severely limiting the amount of
attainable training data. To address these challenges, we propose a novel
continuous sign recognition framework, the Hierarchical Attention Network with
Latent Space (LS-HAN), which eliminates the preprocessing of temporal
segmentation. The proposed LS-HAN consists of three components: a two-stream
Convolutional Neural Network (CNN) for video feature representation generation,
a Latent Space (LS) for semantic gap bridging, and a Hierarchical Attention
Network (HAN) for latent space based recognition. Experiments are carried out
on two large scale datasets. Experimental results demonstrate the effectiveness
of the proposed framework.
Comment: 32nd AAAI Conference on Artificial Intelligence (AAAI-18), Feb. 2-7,
2018, New Orleans, Louisiana, US
Fully Convolutional Networks for Continuous Sign Language Recognition
Continuous sign language recognition (SLR) is a challenging task that
requires learning on both spatial and temporal dimensions of signing frame
sequences. Most recent work accomplishes this by using CNN and RNN hybrid
networks. However, training these networks is generally non-trivial, and most
of them fail in learning unseen sequence patterns, causing an unsatisfactory
performance for online recognition. In this paper, we propose a fully
convolutional network (FCN) for online SLR to concurrently learn spatial and
temporal features from weakly annotated video sequences with only
sentence-level annotations given. A gloss feature enhancement (GFE) module is
introduced in the proposed network to enforce better sequence alignment
learning. The proposed network is end-to-end trainable without any
pre-training. We conduct experiments on two large scale SLR datasets.
Experiments show that our method for continuous SLR is effective and performs
well in online recognition.
Comment: Accepted to ECCV202
Better Sign Language Translation with STMC-Transformer
Sign Language Translation (SLT) first uses a Sign Language Recognition (SLR)
system to extract sign language glosses from videos. Then, a translation system
generates spoken language translations from the sign language glosses. This
paper focuses on the translation system and introduces the STMC-Transformer
which improves on the current state-of-the-art by over 5 and 7 BLEU
respectively on gloss-to-text and video-to-text translation of the
PHOENIX-Weather 2014T dataset. On the ASLG-PC12 corpus, we report an increase
of over 16 BLEU.
We also demonstrate the problem in current methods that rely on gloss
supervision. The video-to-text translation of our STMC-Transformer outperforms
translation of GT glosses. This contradicts previous claims that GT gloss
translation acts as an upper bound for SLT performance and reveals that glosses
are an inefficient representation of sign language. For future SLT research, we
therefore suggest an end-to-end training of the recognition and translation
models, or using a different sign language annotation scheme.
Comment: Proceedings of the 28th International Conference on Computational
Linguistics (COLING'2020)
GCTW Alignment for isolated gesture recognition
In recent years, there has been increasing interest in developing automatic Sign Language Recognition (SLR) systems because Sign Language (SL) is the main mode of communication among deaf people all over the world. However, most people outside the deaf community do not understand SL, which creates a communication problem between the two communities. Recognizing signs is a challenging problem because manual signing (not taking into account facial gestures) has four components that have to be recognized, namely handshape, movement, location and palm orientation. Even though the appearance and meaning of basic signs are well-defined in sign language dictionaries, in practice many variations arise from factors such as gender, age, education, or regional, social and ethnic background, and these variations make it hard to develop a robust SL recognition system. This project attempts to introduce the alignment of videos into isolated SLR, given that this approach has not been studied deeply even though it shows great potential for correctly recognizing isolated gestures. We also aim for user-independent recognition, which means that the system should achieve good recognition accuracy for signers who were not represented in the data set. The main features used for the alignment are the wrist coordinates that we extracted from the videos using OpenPose. These features are aligned using Generalized Canonical Time Warping. The resulting videos are classified with a 3D CNN. Our experimental results show that the proposed method obtained 65.02% accuracy, which places us 5th in the 2017 ChaLearn LAP isolated gesture recognition challenge, only 2.69% away from first place.
Research work
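As a rough illustration of what temporally aligning two wrist trajectories involves, the sketch below uses plain pairwise dynamic time warping (DTW). This is a simpler, swapped-in technique and not the Generalized Canonical Time Warping used in the abstract above. The function name `dtw_align` and the toy coordinates are hypothetical and are not taken from the paper or from OpenPose output.

```python
from math import hypot, inf


def dtw_align(a, b):
    """Align two sequences of (x, y) points with classic DTW.

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs matching a[i] to b[j]. Illustrative only; not GCTW.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = hypot(a[i - 1][0] - b[j - 1][0], a[i - 1][1] - b[j - 1][1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Hypothetical wrist trajectories: the second holds the first pose one frame longer.
fast = [(0, 0), (1, 1), (2, 2)]
slow = [(0, 0), (0, 0), (1, 1), (2, 2)]
total, path = dtw_align(fast, slow)
```

Here the warping path maps the first frame of `fast` to both held frames of `slow`, which is the kind of timing difference that alignment is meant to absorb before classification.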