Sign language recognition with transformer networks
Sign languages are complex languages. Research into them is ongoing, supported by large video corpora of which only small parts are annotated. Sign language recognition can be used to speed up the annotation process of these corpora, in order to aid research into sign languages and sign language recognition. Previous research has approached sign language recognition in various ways, using feature extraction techniques or end-to-end deep learning. In this work, we apply a combination of feature extraction, using OpenPose for human keypoint estimation, and end-to-end feature learning with Convolutional Neural Networks. The proven multi-head attention mechanism used in transformers is applied to recognize isolated signs in the Flemish Sign Language corpus. Our proposed method significantly outperforms the previous state of the art in sign language recognition on the Flemish Sign Language corpus: we obtain an accuracy of 74.7% on a vocabulary of 100 classes. Our results will be implemented as a suggestion system for sign language corpus annotation.
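As a concrete illustration, here is a minimal sketch, assuming frame-wise OpenPose input of 137 keypoints (body, face, and both hands) as (x, y, confidence) triples, of how multi-head self-attention can be applied to keypoint sequences for isolated sign classification. The model width, depth, pooling choice, and vocabulary of 100 classes are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch (not the authors' code): multi-head self-attention over
# OpenPose keypoint sequences for isolated sign classification.
import torch
import torch.nn as nn

class KeypointTransformerClassifier(nn.Module):
    def __init__(self, n_keypoints=137, d_model=128, n_heads=8,
                 n_layers=2, n_classes=100):
        super().__init__()
        # Each frame: (x, y, confidence) per OpenPose keypoint, flattened.
        self.proj = nn.Linear(n_keypoints * 3, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                # x: (batch, time, n_keypoints * 3)
        h = self.encoder(self.proj(x))   # self-attention across frames
        return self.head(h.mean(dim=1))  # average-pool over time, classify

model = KeypointTransformerClassifier()
logits = model(torch.randn(4, 60, 137 * 3))  # 4 clips of 60 frames each
print(logits.shape)  # torch.Size([4, 100])
```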
Towards automatic sign language corpus annotation using deep learning
Sign classification in sign language corpora is a challenging problem that requires large datasets. Unfortunately, only a small portion of those corpora is labeled. To expedite the annotation process, we propose a gloss suggestion system based on deep learning. We improve upon previous research in three ways. Firstly, we use a proven feature extraction method called OpenPose, rather than learning end-to-end. Secondly, we propose a more suitable and powerful network architecture, based on GRU layers. Finally, we exploit domain and task knowledge to further increase the accuracy.
We show that we greatly outperform the previous state of the art on this dataset. Our method can be used to suggest the top five annotations for a video fragment selected by the corpus annotator. We expect that it will expedite the annotation process, to the benefit of sign language translation research.
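A minimal sketch of such a gloss suggestion model, assuming a two-layer GRU over flattened OpenPose keypoints and a top-5 readout; the input representation and all layer sizes are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a GRU-based gloss suggester returning 5 candidate glosses.
import torch
import torch.nn as nn

class GRUGlossSuggester(nn.Module):
    def __init__(self, in_dim=137 * 3, hidden=256, n_classes=100):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):          # x: (batch, time, in_dim)
        _, h_n = self.gru(x)       # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])  # classify from the final hidden state

model = GRUGlossSuggester()
logits = model(torch.randn(1, 60, 137 * 3))
top5 = torch.topk(logits, k=5, dim=-1).indices  # 5 gloss suggestions
print(top5)
```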
Recurrent Human Pose Estimation
We propose a novel ConvNet model for predicting 2D human body poses in an image. The model regresses a heatmap representation for each body keypoint, and is able to learn and represent both the part appearances and the context of the part configuration. We make the following three contributions: (i) an architecture combining a feed-forward module with a recurrent module, where the recurrent module can be run iteratively to improve the performance; (ii) the model can be trained end-to-end and from scratch, with auxiliary losses incorporated to improve performance; (iii) we investigate whether keypoint visibility can also be predicted. The model is evaluated on two benchmark datasets. The result is a simple architecture that achieves performance on par with the state of the art, but without the complexity of a graphical model stage (or layers).
Comment: FG 2017. More info and demo: http://www.robots.ox.ac.uk/~vgg/software/keypoint_detection
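The feed-forward plus recurrent idea can be sketched as follows, assuming a toy convolutional backbone: a shared refinement module is applied iteratively, consuming the image features together with the previous heatmaps, and every iteration's output can be supervised as an auxiliary loss. All layer sizes are placeholders, not the paper's architecture.

```python
# Illustrative sketch of iterative heatmap refinement with shared weights.
import torch
import torch.nn as nn

class RecurrentPoseNet(nn.Module):
    def __init__(self, n_keypoints=16, feat=64, n_iters=3):
        super().__init__()
        self.n_iters = n_iters
        self.backbone = nn.Sequential(            # feed-forward module
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.init_head = nn.Conv2d(feat, n_keypoints, 1)
        self.refine = nn.Sequential(               # recurrent module (shared)
            nn.Conv2d(feat + n_keypoints, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, n_keypoints, 1))

    def forward(self, img):                        # img: (B, 3, H, W)
        f = self.backbone(img)
        heatmaps = [self.init_head(f)]             # initial prediction
        for _ in range(self.n_iters):              # iterative refinement
            heatmaps.append(self.refine(torch.cat([f, heatmaps[-1]], dim=1)))
        return heatmaps                            # supervise every iteration

out = RecurrentPoseNet()(torch.randn(2, 3, 64, 64))
print(len(out), out[-1].shape)  # 4 torch.Size([2, 16, 64, 64])
```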
Two-Stream Network for Sign Language Recognition and Translation
Sign languages are visual languages using manual articulations and non-manual
elements to convey information. For sign language recognition and translation,
the majority of existing approaches directly encode RGB videos into hidden
representations. RGB videos, however, are raw signals with substantial visual
redundancy, leading the encoder to overlook the key information for sign
language understanding. To mitigate this problem and better incorporate domain
knowledge, such as handshape and body movement, we introduce a dual visual
encoder containing two separate streams to model both the raw videos and the
keypoint sequences generated by an off-the-shelf keypoint estimator. To make
the two streams interact with each other, we explore a variety of techniques,
including bidirectional lateral connection, sign pyramid network with auxiliary
supervision, and frame-level self-distillation. The resulting model is called
TwoStream-SLR, which is competent for sign language recognition (SLR).
TwoStream-SLR is extended to a sign language translation (SLT) model,
TwoStream-SLT, by simply attaching an extra translation network.
Experimentally, our TwoStream-SLR and TwoStream-SLT achieve state-of-the-art
performance on SLR and SLT tasks across a series of datasets including
Phoenix-2014, Phoenix-2014T, and CSL-Daily.Comment: Accepted by NeurIPS 202
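A conceptual sketch of the dual-encoder idea with bidirectional lateral connections, where simple linear layers stand in for the real video and keypoint encoders; this only illustrates how information is exchanged between the two streams, not TwoStream-SLR itself.

```python
# Toy two-stream block: each stream receives a lateral projection of the other.
import torch
import torch.nn as nn

class TwoStreamBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.rgb = nn.Linear(dim, dim)        # stand-in for a video encoder
        self.kp = nn.Linear(dim, dim)         # stand-in for a keypoint encoder
        self.kp_to_rgb = nn.Linear(dim, dim)  # lateral connections
        self.rgb_to_kp = nn.Linear(dim, dim)

    def forward(self, r, k):                  # (batch, time, dim) each
        r2 = torch.relu(self.rgb(r) + self.kp_to_rgb(k))
        k2 = torch.relu(self.kp(k) + self.rgb_to_kp(r))
        return r2, k2

r, k = torch.randn(2, 32, 256), torch.randn(2, 32, 256)
for block in [TwoStreamBlock(), TwoStreamBlock()]:
    r, k = block(r, k)  # information flows in both directions at every block
fused = torch.cat([r, k], dim=-1)  # joint representation for an SLR/SLT head
print(fused.shape)  # torch.Size([2, 32, 512])
```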
Towards the extraction of robust sign embeddings for low resource sign language recognition
Isolated Sign Language Recognition (SLR) has mostly been applied on datasets containing signs executed slowly and clearly by a limited group of signers. In real-world scenarios, however, we are met with challenging visual conditions, coarticulated signing, small datasets, and the need for signer-independent models. To tackle this difficult problem, we require a robust feature extractor to process the sign language videos. One could expect human pose estimators to be ideal candidates. However, due to a domain mismatch with their training sets and challenging poses in sign language, they lack robustness on sign language data, and image-based models often still outperform keypoint-based models. Furthermore, whereas the common practice of transfer learning with image-based models yields even higher accuracy, keypoint-based models are typically trained from scratch on every SLR dataset. These factors limit their usefulness for SLR. From the existing literature, it is also not clear which, if any, pose estimator performs best for SLR. We compare the three most popular pose estimators for SLR: OpenPose, MMPose and MediaPipe. We show that through keypoint normalization, missing keypoint imputation, and learning a pose embedding, we can obtain significantly better results and enable transfer learning. We show that keypoint-based embeddings contain cross-lingual features: they can transfer between sign languages and achieve competitive performance even when fine-tuning only the classifier layer of an SLR model on a target sign language. We furthermore achieve better performance using fine-tuned transferred embeddings than with models trained only on the target sign language. The embeddings can also be learned in a multilingual fashion. The application of these embeddings could prove particularly useful for low resource sign languages in the future.
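The preprocessing steps named in the abstract, keypoint normalization, missing keypoint imputation, and a learned pose embedding, might look roughly like the sketch below; the reference joint, confidence threshold, and embedding sizes are assumptions for illustration only.

```python
# Hedged sketch: normalize keypoints relative to a reference joint, impute
# low-confidence detections by carrying the last observed value forward,
# then map each frame through a learned pose embedding.
import torch
import torch.nn as nn

def normalize(kp, ref_idx=1):
    # kp: (time, n_kp, 2); center on a reference joint (e.g. the neck)
    # and scale by the spread of all keypoints in the frame.
    centered = kp - kp[:, ref_idx:ref_idx + 1, :]
    scale = centered.abs().amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)
    return centered / scale

def impute(kp, conf, thresh=0.1):
    # Replace low-confidence detections with the previous frame's value.
    kp = kp.clone()
    for t in range(1, kp.shape[0]):
        missing = conf[t] < thresh           # (n_kp,) boolean mask
        kp[t][missing] = kp[t - 1][missing]
    return kp

pose_embedding = nn.Sequential(  # frame-wise learned pose embedding
    nn.Linear(137 * 2, 256), nn.ReLU(), nn.Linear(256, 256))

kp = torch.randn(60, 137, 2)                  # 60 frames of 137 keypoints
conf = torch.rand(60, 137)                    # estimator confidences
x = normalize(impute(kp, conf))
emb = pose_embedding(x.flatten(start_dim=1))  # (60, 256), fed to an SLR model
print(emb.shape)
```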