
    SMILE Swiss German Sign Language Dataset


    Neural Sign Language Recognition and Translation

    Sign languages have been studied by computer vision researchers for the last three decades. One of the end goals of vision-based sign language research is to build systems that can understand and translate sign languages to spoken/written languages or vice versa, to create a more natural medium of communication between the hearing and the Deaf. However, most research to date has mainly focused on isolated sign recognition and spotting, neglecting the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. More recently, Continuous Sign Language Recognition (CSLR) has become feasible with the availability of large benchmark datasets, such as the RWTH-PHOENIX-Weather-2014 Dataset (PHOENIX14), and the development of algorithms that can learn from weak annotations. Although CSLR is able to recognize sign gloss sequences, further progress is required to produce meaningful spoken/written language interpretations of continuous sign language videos.

    In this thesis, we introduce the Sign Language Translation (SLT) problem and lay groundwork for future research on this topic. The objective of SLT is to generate spoken/written language translations from continuous sign language videos, taking into account the different word orders and grammar. We evaluate our approaches on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, the first and currently the only publicly available Continuous SLT dataset aimed at vision-based sign language research. It provides spoken language translations and gloss-level annotations for German Sign Language videos of weather broadcasts. We lay down several evaluation protocols to underpin future research in this newly established field.

    In the first contribution chapter of this thesis, we formalize SLT in the framework of Neural Machine Translation (NMT) and propose the first SLT approach, Neural Sign Language Translation. We combine Convolutional Neural Networks (CNNs) and attention-based encoder-decoder models, which allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language. We investigate different configurations of the proposed network with both end-to-end and pretrained settings (using expert gloss annotations). In our experiments, recognizing glosses and then translating them to spoken languages (Sign2Gloss2Text) drastically outperforms an end-to-end direct translation approach (Sign2Text). Sign2Gloss2Text utilizes a state-of-the-art CSLR model to predict gloss sequences from sign language videos and then solves SLT as a text-to-text translation problem. This suggests that using gloss-level intermediate representations, essentially dividing the process into two stages, is necessary to train accurate SLT models.

    Glosses are incomplete text-based representations of the continuous, multi-channel visual signals that are sign languages. Thus, the best performing two-step configuration of Neural Sign Language Translation has an inherent information bottleneck limiting translation. To address this issue, in the second contribution chapter of this thesis we formulate SLT as a multi-task learning problem. We introduce a novel transformer-based architecture, Sign Language Transformers, which jointly learns CSLR and SLT while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture.
    This joint approach does not require any ground-truth timing information, simultaneously solves two co-dependent sequence-to-sequence learning problems, and leads to significant performance gains. We report state-of-the-art CSLR and SLT results achieved by our Sign Language Transformers. Our translation networks outperform both sign-video-to-spoken-language and gloss-to-spoken-language translation models, in some cases more than doubling the performance of Neural Sign Language Translation (Sign2Text configuration: 9.58 vs. 21.80 BLEU-4 score).

    The models we introduce in the first and second contribution chapters heavily rely on gloss information, either in the form of direct supervision or for pretraining. To realize large-scale sign language translation on par with its spoken/written language counterparts, we require more parallel datasets. However, annotating sign glosses is a laborious task, and acquiring such annotations for large datasets is infeasible. To address this issue, in our last contribution chapter we propose modelling SLT based on sign articulators instead of glosses. Contrary to previous research, which mainly focused on manual features, we incorporate both manual and non-manual features of the sign. We utilize hand shape, mouthings and upper body pose representations to model sign in a holistic manner.

    We propose a novel transformer-based architecture, called Multi-Channel Transformers, aimed at sequence-to-sequence learning problems where the source information is embedded over several channels. This approach allows the networks to model both the inter- and intra-channel relationships between asynchronous source channels. We also introduce a channel anchoring loss to help our models preserve channel-specific information while also regulating training against overfitting.

    We apply Multi-Channel Transformers to the task of SLT and realize the first multi-articulatory translation approach. Our experiments on PHOENIX14T demonstrate that our approach achieves on-par or better translation performance against several baselines, overcoming the reliance on gloss information which underpinned previous approaches. Now that we have broken the dependency upon gloss information, future work will be to scale learning to larger datasets, such as broadcast footage, where gloss information is not available.
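    As a rough illustration of the joint recognition-and-translation objective described above (a CTC loss over gloss sequences sharing an encoder with an attention-based translation decoder), the sketch below shows how the two losses can be bound to one architecture. It assumes PyTorch; the module and variable names (SignLanguageTransformerSketch, gloss_head, joint_loss, the padding index, the loss weights) are illustrative assumptions, not the authors' implementation.

    ```python
    import torch
    import torch.nn as nn

    class SignLanguageTransformerSketch(nn.Module):
        def __init__(self, feat_dim=1024, d_model=512, gloss_vocab=1200, text_vocab=3000):
            super().__init__()
            self.embed = nn.Linear(feat_dim, d_model)              # per-frame spatial features -> model dim
            enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
            self.gloss_head = nn.Linear(d_model, gloss_vocab + 1)  # +1 for the CTC blank symbol
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
            self.text_embed = nn.Embedding(text_vocab, d_model)
            self.text_head = nn.Linear(d_model, text_vocab)

        def forward(self, frame_feats, text_in):
            memory = self.encoder(self.embed(frame_feats))          # (B, T, d_model), shared by both tasks
            gloss_logits = self.gloss_head(memory)                  # recognition branch (CTC)
            L = text_in.size(1)
            causal = torch.triu(torch.full((L, L), float("-inf"), device=text_in.device), diagonal=1)
            dec = self.decoder(self.text_embed(text_in), memory, tgt_mask=causal)
            text_logits = self.text_head(dec)                       # translation branch (autoregressive)
            return gloss_logits, text_logits

    def joint_loss(gloss_logits, gloss_targets, gloss_target_lens,
                   text_logits, text_targets, ctc_weight=1.0, xent_weight=1.0):
        # CTC over sentence-level gloss labels: no frame-level timing annotation needed.
        log_probs = gloss_logits.log_softmax(-1).transpose(0, 1)    # (T, B, V) as CTCLoss expects
        in_lens = torch.full((gloss_logits.size(0),), gloss_logits.size(1), dtype=torch.long)
        ctc = nn.CTCLoss(blank=gloss_logits.size(-1) - 1, zero_infinity=True)(
            log_probs, gloss_targets, in_lens, gloss_target_lens)
        # Cross-entropy over the spoken-language tokens (pad index 0 assumed here).
        xent = nn.CrossEntropyLoss(ignore_index=0)(
            text_logits.flatten(0, 1), text_targets.flatten())
        return ctc_weight * ctc + xent_weight * xent
    ```

    Because the CTC branch needs only sentence-level gloss labels, no frame-level alignment is required, which is what allows the two co-dependent sequence-to-sequence problems to be trained jointly end-to-end.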

    Gated Variational AutoEncoders: Incorporating Weak Supervision to Encourage Disentanglement

    Variational AutoEncoders (VAEs) provide a means to generate representational latent embeddings. Previous research has highlighted the benefits of achieving representations that are disentangled, particularly for downstream tasks. However, there is some debate about how to encourage disentanglement with VAEs, and evidence indicates that existing implementations do not achieve disentanglement consistently. How well a VAE's latent space has been disentangled is often evaluated against our subjective expectations of which attributes should be disentangled for a given problem. Therefore, by definition, we already have domain knowledge of what should be achieved, and yet we use unsupervised approaches to achieve it. We propose a weakly supervised approach that incorporates any available domain knowledge into the training process to form a Gated-VAE. The process involves partitioning the representational embedding and gating backpropagation. All partitions are utilised on the forward pass, but gradients are backpropagated through different partitions according to selected image/target pairings. The approach can be used to modify existing VAE models such as beta-VAE, InfoVAE and DIP-VAE-II. Experiments demonstrate that, using gated backpropagation, latent factors are represented in their intended partitions. The approach is applied to images of faces for the purpose of disentangling head-pose from facial expression. Quantitative metrics show that using Gated-VAE improves average disentanglement, completeness and informativeness, as compared with un-gated implementations. Qualitative assessment of latent traversals demonstrates its disentanglement of head-pose from expression, even when only weak/noisy supervision is available.
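    A minimal sketch of the gating idea described above, assuming PyTorch and an existing VAE encoder/decoder pair: every latent partition contributes to the forward reconstruction, but only the partition selected for the current image/target pairing keeps its gradient path. The function and argument names are hypothetical.

    ```python
    import torch

    def gated_forward(encoder, decoder, x, partition_slices, active_partition):
        """Encode x, then rebuild the latent so that only `active_partition` keeps its
        gradient path; the remaining partitions are detached. All partitions still
        contribute to the forward reconstruction."""
        mu, logvar = encoder(x)                                   # assumed VAE encoder interface
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterisation trick
        pieces = []
        for i, sl in enumerate(partition_slices):
            part = z[:, sl]
            pieces.append(part if i == active_partition else part.detach())
        z_gated = torch.cat(pieces, dim=1)
        return decoder(z_gated), mu, logvar

    # Example: a 10-D latent split into a head-pose partition (dims 0-4) and an
    # expression partition (dims 5-9). For a training pair that differs only in
    # head pose, gradients are routed through partition 0 alone:
    # recon, mu, logvar = gated_forward(enc, dec, x_batch,
    #                                   partition_slices=[slice(0, 5), slice(5, 10)],
    #                                   active_partition=0)
    ```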

    Nested VAE: Isolating Common Factors via Weak Supervision

    Fair and unbiased machine learning is an important and active field of research, as decision processes are increasingly driven by models that learn from data. Unfortunately, any biases present in the data may be learned by the model, thereby inappropriately transferring that bias into the decision making process. We identify the connection between the task of bias reduction and that of isolating factors common between domains whilst encouraging domain specific invariance. To isolate the common factors we combine the theory of deep latent variable models with information bottleneck theory for scenarios whereby data may be naturally paired across domains and no additional supervision is required. The result is the Nested Variational AutoEncoder (NestedVAE). Two outer VAEs with shared weights attempt to reconstruct the input and infer a latent space, whilst a nested VAE attempts to reconstruct the latent representation of one image from the latent representation of its paired image. In so doing, the nested VAE isolates the common latent factors/causes and becomes invariant to unwanted factors that are not shared between paired images. We also propose a new metric to provide a balanced method of evaluating consistency and classifier performance across domains, which we refer to as the Adjusted Parity metric. An evaluation of NestedVAE on both domain and attribute invariance, change detection, and learning common factors for the prediction of biological sex demonstrates that NestedVAE significantly outperforms alternative methods.
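    The nested arrangement described above can be sketched roughly as follows, assuming PyTorch. The interface of the shared-weight outer VAE, the MLP inner VAE, and the choice to detach the outer latents are illustrative assumptions, not the paper's implementation.

    ```python
    import torch
    import torch.nn as nn

    class NestedVAESketch(nn.Module):
        """outer_vae is assumed to map an image to (reconstruction, sampled latent z);
        the same shared-weight module is used for both images of a pair."""
        def __init__(self, outer_vae, z_dim=32, common_dim=8):
            super().__init__()
            self.outer = outer_vae
            self.inner_enc = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                           nn.Linear(64, 2 * common_dim))
            self.inner_dec = nn.Sequential(nn.Linear(common_dim, 64), nn.ReLU(),
                                           nn.Linear(64, z_dim))

        def forward(self, x_a, x_b):
            recon_a, z_a = self.outer(x_a)                      # outer pass, image A
            recon_b, z_b = self.outer(x_b)                      # outer pass, image B (shared weights)
            # Inner pass: predict B's latent from A's latent through a narrow bottleneck,
            # so only factors common to the pair can survive. Detaching the outer latents
            # (a simplifying choice here) trains the inner VAE against fixed targets.
            mu, logvar = self.inner_enc(z_a.detach()).chunk(2, dim=-1)
            c = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            z_b_hat = self.inner_dec(c)
            return recon_a, recon_b, z_b_hat, z_b.detach(), mu, logvar
    ```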

    Approximation of Ensemble Boundary using Spectral Coefficients

    A spectral analysis of a Boolean function is proposed for approximating the decision boundary of an ensemble of classifiers, and an intuitive explanation of computing Walsh coefficients for the functional approximation is provided. It is shown that the difference between first and third order coefficient approximation is a good indicator of optimal base classifier complexity. When combining Neural Networks, experimental results on a variety of artificial and real two-class problems demonstrate under what circumstances ensemble performance can be improved. For tuned base classifiers, first order coefficients provide performance similar to majority vote. However, for weak/fast base classifiers, higher order coefficient approximation may give better performance. It is also shown that higher order coefficient approximation is superior to the Adaboost logarithmic weighting rule when boosting weak Decision Tree base classifiers.
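    As a worked illustration of the spectral view described above, the sketch below (assuming NumPy; the +/-1 coding and the function names are assumptions) estimates low-order Walsh coefficients from base-classifier votes and uses the truncated expansion as the combining rule. Restricting max_order to 1 corresponds to the first-order approximation the abstract compares against majority vote.

    ```python
    import numpy as np
    from itertools import combinations

    def walsh_coefficients(votes, targets, max_order=3):
        """votes: (N, n) base-classifier decisions in {-1, +1}; targets: (N,) labels in
        {-1, +1}. Returns a dict mapping classifier subsets to estimated coefficients."""
        n = votes.shape[1]
        coeffs = {(): float(np.mean(targets))}                      # zeroth-order term
        for order in range(1, max_order + 1):
            for subset in combinations(range(n), order):
                parity = np.prod(votes[:, list(subset)], axis=1)    # Walsh basis function on this subset
                coeffs[subset] = float(np.mean(targets * parity))   # empirical correlation = coefficient
        return coeffs

    def combine(votes, coeffs):
        """Ensemble decision: sign of the truncated Walsh expansion."""
        score = np.zeros(votes.shape[0])
        for subset, c in coeffs.items():
            parity = np.prod(votes[:, list(subset)], axis=1) if subset else 1.0
            score += c * parity
        return np.sign(score)
    ```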

    Particle Filter Based Probabilistic Forced Alignment for Continuous Gesture Recognition

    In this paper, we propose a novel particle filter based probabilistic forced alignment approach for training spatiotemporal deep neural networks using weak border-level annotations. The proposed method jointly learns to localize and recognize isolated instances in continuous streams. This is done by drawing training volumes from a prior distribution of likely regions and training a discriminative 3D-CNN from this data. The classifier is then used to calculate the posterior distribution by scoring the training examples, and this posterior is used as the prior for the next sampling stage. We apply the proposed approach to the challenging task of large-scale user-independent continuous gesture recognition. We evaluate the performance on the popular ChaLearn 2016 Continuous Gesture Recognition (ConGD) dataset. Our method surpasses state-of-the-art results by obtaining Mean Jaccard Index scores of 0.3646 and 0.3744 on the validation and test sets of ConGD, respectively. Furthermore, we participated in the ChaLearn 2017 Continuous Gesture Recognition Challenge and were ranked 3rd. It should be noted that our method is learner independent; it can easily be combined with other approaches.
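    A loose sketch of the alternating sample-train-reweight loop described above, assuming NumPy and an unspecified train/score interface to the 3D-CNN. The segment parameterisation (centre, length), the Gaussian jitter, and the per-round resampling are illustrative assumptions rather than the paper's exact procedure.

    ```python
    import numpy as np

    def forced_alignment_sketch(video_len, n_particles, n_rounds, train_cnn, score_cnn, label):
        # Prior: particles are candidate (centre, length) temporal volumes, uniform over the stream.
        centres = np.random.uniform(0, video_len, n_particles)
        lengths = np.random.uniform(8, 64, n_particles)
        weights = np.full(n_particles, 1.0 / n_particles)

        for _ in range(n_rounds):
            segments = list(zip(centres, lengths))
            train_cnn(segments, label)                       # train the discriminative 3D-CNN on sampled volumes
            scores = np.asarray(score_cnn(segments, label))  # classifier confidence for each sampled volume
            weights = weights * scores                       # posterior is proportional to prior times likelihood
            weights = weights / weights.sum()
            # Resample according to the posterior and jitter; this becomes the next prior.
            idx = np.random.choice(n_particles, size=n_particles, p=weights)
            centres = centres[idx] + np.random.normal(0.0, 2.0, n_particles)
            lengths = np.clip(lengths[idx] + np.random.normal(0.0, 1.0, n_particles), 8, 64)
            weights = np.full(n_particles, 1.0 / n_particles)
        return centres, lengths
    ```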

    Adversarial Training for Multi-Channel Sign Language Production

    Sign Languages are rich multi-channel languages, requiring articulation of both manual (hands) and non-manual (face and body) features in a precise, intricate manner. Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody this full sign morphology to be truly understandable by the Deaf community. Previous work has mainly focused on manual feature production, with an under-articulated output caused by regression to the mean. In this paper, we propose an Adversarial Multi-Channel approach to SLP. We frame sign production as a minimax game between a transformer-based Generator and a conditional Discriminator. Our adversarial discriminator evaluates the realism of sign production conditioned on the source text, pushing the generator towards a realistic and articulate output. Additionally, we fully encapsulate sign articulators with the inclusion of non-manual features, producing facial features and mouthing patterns. We evaluate on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, and report state-of-the-art SLP back-translation performance for manual production. We set new benchmarks for the production of multi-channel sign to underpin future research into realistic SLP.
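    A minimal sketch of the conditional adversarial objective described above, assuming PyTorch: the generator regresses multi-channel sign sequences from text, and the discriminator scores (text, sequence) pairs. The BCE formulation, the auxiliary regression term, and all names are assumptions for illustration.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def adversarial_step(generator, discriminator, text, real_seq, g_opt, d_opt, reg_weight=1.0):
        bce = nn.BCEWithLogitsLoss()

        # Discriminator update: real multi-channel sequences (manual + non-manual pose
        # channels) vs. generated ones, both conditioned on the same source text.
        fake_seq = generator(text).detach()
        d_real = discriminator(text, real_seq)
        d_fake = discriminator(text, fake_seq)
        d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # Generator update: fool the conditional discriminator while staying close to the
        # ground-truth channels; the regression term alone tends to under-articulate,
        # which is what the adversarial term is meant to counteract.
        fake_seq = generator(text)
        d_out = discriminator(text, fake_seq)
        g_loss = bce(d_out, torch.ones_like(d_out)) + reg_weight * F.mse_loss(fake_seq, real_seq)
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
        return d_loss.item(), g_loss.item()
    ```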
