239 research outputs found
Large-scale learning of sign language by watching TV
The goal of this work is to automatically learn a large number of signs from sign language-interpreted TV broadcasts. We achieve this by exploiting supervisory information available in the subtitles of the broadcasts. However, this information is both weak and noisy and this leads to a challenging correspondence problem when trying to identify the temporal window of the sign.
We make the following contributions: (i) we show that, somewhat counter-intuitively, mouth patterns are highly informative for isolating words in a language for the Deaf, and their co-occurrence with signing can be used to significantly reduce the correspondence search space; and (ii) we develop a multiple instance learning method using an efficient discriminative search, which determines a candidate list for the sign with both high recall and precision.
We demonstrate the method on videos from BBC TV broadcasts, and achieve higher accuracy and recall than previous methods, despite using much simpler features
Zero-Shot Sign Language Recognition: Can Textual Data Uncover Sign Languages?
We introduce the problem of zero-shot sign language recognition (ZSSLR),
where the goal is to leverage models learned over the seen sign class examples
to recognize the instances of unseen signs. To this end, we propose to utilize
the readily available descriptions in sign language dictionaries as an
intermediate-level semantic representation for knowledge transfer. We introduce
a new benchmark dataset called ASL-Text that consists of 250 sign language
classes and their accompanying textual descriptions. Compared to the ZSL
datasets in other domains (such as object recognition), our dataset consists of
limited number of training examples for a large number of classes, which
imposes a significant challenge. We propose a framework that operates over the
body and hand regions by means of 3D-CNNs, and models longer temporal
relationships via bidirectional LSTMs. By leveraging the descriptive text
embeddings along with these spatio-temporal representations within a zero-shot
learning framework, we show that textual data can indeed be useful in
uncovering sign languages. We anticipate that the introduced approach and the
accompanying dataset will provide a basis for further exploration of this new
zero-shot learning problem.Comment: To appear in British Machine Vision Conference (BMVC) 201
Posterior Regularization for Structured Latent Varaible Models
We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment
BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues
Recent progress in fine-grained gesture and action classification, and
machine translation, point to the possibility of automated sign language
recognition becoming a reality. A key stumbling block in making progress
towards this goal is a lack of appropriate training data, stemming from the
high complexity of sign annotation and a limited supply of qualified
annotators. In this work, we introduce a new scalable approach to data
collection for sign recognition in continuous videos. We make use of
weakly-aligned subtitles for broadcast footage together with a keyword spotting
method to automatically localise sign-instances for a vocabulary of 1,000 signs
in 1,000 hours of video. We make the following contributions: (1) We show how
to use mouthing cues from signers to obtain high-quality annotations from video
data - the result is the BSL-1K dataset, a collection of British Sign Language
(BSL) signs of unprecedented scale; (2) We show that we can use BSL-1K to train
strong sign recognition models for co-articulated signs in BSL and that these
models additionally form excellent pretraining for other sign languages and
benchmarks - we exceed the state of the art on both the MSASL and WLASL
benchmarks. Finally, (3) we propose new large-scale evaluation sets for the
tasks of sign recognition and sign spotting and provide baselines which we hope
will serve to stimulate research in this area.Comment: Appears in: European Conference on Computer Vision 2020 (ECCV 2020).
28 page
Sign Language Recognition
This chapter covers the key aspects of sign-language recognition (SLR), starting with a brief introduction to the motivations and requirements, followed by a précis of sign linguistics and their impact on the field. The types of data available and the relative merits are explored allowing examination of the features which can be extracted. Classifying the manual aspects of sign (similar to gestures) is then discussed from a tracking and non-tracking viewpoint before summarising some of the approaches to the non-manual aspects of sign languages. Methods for combining the sign classification results into full SLR are given showing the progression towards speech recognition techniques and the further adaptations required for the sign specific case. Finally the current frontiers are discussed and the recent research presented. This covers the task of continuous sign recognition, the work towards true signer independence, how to effectively combine the different modalities of sign, making use of the current linguistic research and adapting to larger more noisy data set
Aligning Neural Machine Translation Models: Human Feedback in Training and Inference
Reinforcement learning from human feedback (RLHF) is a recent technique to
improve the quality of the text generated by a language model, making it closer
to what humans would generate. A core ingredient in RLHF's success in aligning
and improving large language models (LLMs) is its reward model, trained using
human feedback on model outputs. In machine translation (MT), where metrics
trained from human annotations can readily be used as reward models, recent
methods using minimum Bayes risk decoding and reranking have succeeded in
improving the final quality of translation. In this study, we comprehensively
explore and compare techniques for integrating quality metrics as reward models
into the MT pipeline. This includes using the reward model for data filtering,
during the training phase through RL, and at inference time by employing
reranking techniques, and we assess the effects of combining these in a unified
approach. Our experimental results, conducted across multiple translation
tasks, underscore the crucial role of effective data filtering, based on
estimated quality, in harnessing the full potential of RL in enhancing MT
quality. Furthermore, our findings demonstrate the effectiveness of combining
RL training with reranking techniques, showcasing substantial improvements in
translation quality.Comment: 14 pages, work-in-progres
Fully Convolutional Networks for Continuous Sign Language Recognition
Continuous sign language recognition (SLR) is a challenging task that
requires learning on both spatial and temporal dimensions of signing frame
sequences. Most recent work accomplishes this by using CNN and RNN hybrid
networks. However, training these networks is generally non-trivial, and most
of them fail in learning unseen sequence patterns, causing an unsatisfactory
performance for online recognition. In this paper, we propose a fully
convolutional network (FCN) for online SLR to concurrently learn spatial and
temporal features from weakly annotated video sequences with only
sentence-level annotations given. A gloss feature enhancement (GFE) module is
introduced in the proposed network to enforce better sequence alignment
learning. The proposed network is end-to-end trainable without any
pre-training. We conduct experiments on two large scale SLR datasets.
Experiments show that our method for continuous SLR is effective and performs
well in online recognition.Comment: Accepted to ECCV202
- …