3,369 research outputs found
Sign language recognition with transformer networks
Sign languages are complex languages. Research into them is ongoing, supported by large video corpora of which only small parts are annotated. Sign language recognition can be used to speed up the annotation process of these corpora, in order to aid research into sign languages and sign language recognition. Previous research has approached sign language recognition in various ways, using feature extraction techniques or end-to-end deep learning. In this work, we apply a combination of feature extraction using OpenPose for human keypoint estimation and end-to-end feature learning with Convolutional Neural Networks. The proven multi-head attention mechanism used in transformers is applied to recognize isolated signs in the Flemish Sign Language corpus. Our proposed method significantly outperforms the previous state of the art of sign language recognition on the Flemish Sign Language corpus: we obtain an accuracy of 74.7% on a vocabulary of 100 classes. Our results will be implemented as a suggestion system for sign language corpus annotation
Hidden Two-Stream Convolutional Networks for Action Recognition
Analyzing videos of human actions involves understanding the temporal
relationships among video frames. State-of-the-art action recognition
approaches rely on traditional optical flow estimation methods to pre-compute
motion information for CNNs. Such a two-stage approach is computationally
expensive, storage demanding, and not end-to-end trainable. In this paper, we
present a novel CNN architecture that implicitly captures motion information
between adjacent frames. We name our approach hidden two-stream CNNs because it
only takes raw video frames as input and directly predicts action classes
without explicitly computing optical flow. Our end-to-end approach is 10x
faster than its two-stage baseline. Experimental results on four challenging
action recognition datasets: UCF101, HMDB51, THUMOS14 and ActivityNet v1.2 show
that our approach significantly outperforms the previous best real-time
approaches.Comment: Accepted at ACCV 2018, camera ready. Code available at
https://github.com/bryanyzhu/Hidden-Two-Strea
Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem - unconstrained natural language sentences,
and in the wild videos. Our key contributions are: (1) we compare two models
for lip reading, one using a CTC loss, and the other using a
sequence-to-sequence loss. Both models are built on top of the transformer
self-attention architecture; (2) we investigate to what extent lip reading is
complementary to audio speech recognition, especially when the audio signal is
noisy; (3) we introduce and publicly release a new dataset for audio-visual
speech recognition, LRS2-BBC, consisting of thousands of natural sentences from
British television. The models that we train surpass the performance of all
previous work on a lip reading benchmark dataset by a significant margin.Comment: Accepted for publication by IEEE Transactions on Pattern Analysis and
Machine Intelligenc
- …