2,768 research outputs found

    DeepASL: Enabling Ubiquitous and Non-Intrusive Word and Sentence-Level Sign Language Translation

    Full text link
    There is an undeniable communication barrier between deaf people and people with normal hearing ability. Although innovations in sign language translation technology aim to tear down this communication barrier, the majority of existing sign language translation systems are either intrusive or constrained by resolution or ambient lighting conditions. Moreover, these existing systems can only perform single-sign ASL translation rather than sentence-level translation, making them much less useful in daily-life communication scenarios. In this work, we fill this critical gap by presenting DeepASL, a transformative deep learning-based sign language translation technology that enables ubiquitous and non-intrusive American Sign Language (ASL) translation at both word and sentence levels. DeepASL uses infrared light as its sensing mechanism to non-intrusively capture the ASL signs. It incorporates a novel hierarchical bidirectional deep recurrent neural network (HB-RNN) and a probabilistic framework based on Connectionist Temporal Classification (CTC) for word-level and sentence-level ASL translation respectively. To evaluate its performance, we have collected 7,306 samples from 11 participants, covering 56 commonly used ASL words and 100 ASL sentences. DeepASL achieves an average 94.5% word-level translation accuracy and an average 8.2% word error rate on translating unseen ASL sentences. Given its promising performance, we believe DeepASL represents a significant step towards breaking the communication barrier between deaf people and hearing majority, and thus has the significant potential to fundamentally change deaf people's lives

    Connectionist Temporal Modeling for Weakly Supervised Action Labeling

    Full text link
    We propose a weakly-supervised framework for action labeling in video, where only the order of occurring actions is required during training time. The key challenge is that the per-frame alignments between the input (video) and label (action) sequences are unknown during training. We address this by introducing the Extended Connectionist Temporal Classification (ECTC) framework to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities. This protects the model from distractions of visually inconsistent or degenerated alignments without the need of temporal supervision. We further extend our framework to the semi-supervised case when a few frames are sparsely annotated in a video. With less than 1% of labeled frames per video, our method is able to outperform existing semi-supervised approaches and achieve comparable performance to that of fully supervised approaches.Comment: To appear in ECCV 201

    Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition

    Get PDF
    Recently, the connectionist temporal classification (CTC) model coupled with recurrent (RNN) or convolutional neural networks (CNN), made it easier to train speech recognition systems in an end-to-end fashion. However in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives, are processed as individual elements, while a natural alternative is to process such components as composed entities. We propose to group such elements in the form of quaternions and to process these quaternions using the established quaternion algebra. Quaternion numbers and quaternion neural networks have shown their efficiency to process multidimensional inputs as entities, to encode internal dependencies, and to solve many tasks with less learning parameters than real-valued models. This paper proposes to integrate multiple feature views in quaternion-valued convolutional neural network (QCNN), to be used for sequence-to-sequence mapping with the CTC model. Promising results are reported using simple QCNNs in phoneme recognition experiments with the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme error rate (PER) with less learning parameters than a competing model based on real-valued CNNs.Comment: Accepted at INTERSPEECH 201

    Improving large vocabulary continuous speech recognition by combining GMM-based and reservoir-based acoustic modeling

    Get PDF
    In earlier work we have shown that good phoneme recognition is possible with a so-called reservoir, a special type of recurrent neural network. In this paper, different architectures based on Reservoir Computing (RC) for large vocabulary continuous speech recognition are investigated. Besides experiments with HMM hybrids, it is shown that a RC-HMM tandem can achieve the same recognition accuracy as a classical HMM, which is a promising result for such a fairly new paradigm. It is also demonstrated that a state-level combination of the scores of the tandem and the baseline HMM leads to a significant improvement over the baseline. A word error rate reduction of the order of 20\% relative is possible

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure

    Deep Learning: Our Miraculous Year 1990-1991

    Full text link
    In 2020, we will celebrate that many of the basic ideas behind the deep learning revolution were published three decades ago within fewer than 12 months in our "Annus Mirabilis" or "Miraculous Year" 1990-1991 at TU Munich. Back then, few people were interested, but a quarter century later, neural networks based on these ideas were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute.Comment: 37 pages, 188 references, based on work of 4 Oct 201

    Models of atypical development must also be models of normal development

    Get PDF
    Functional magnetic resonance imaging studies of developmental disorders and normal cognition that include children are becoming increasingly common and represent part of a newly expanding field of developmental cognitive neuroscience. These studies have illustrated the importance of the process of development in understanding brain mechanisms underlying cognition and including children ill the study of the etiology of developmental disorders
    corecore