
    Visual Speech Recognition using Histogram of Oriented Displacements

    Lip reading is the recognition of spoken words from the visual information of the lips. Automating this process with computer algorithms has been of considerable interest to the Computer Vision and Speech Recognition communities. In this thesis, we develop a novel method that describes visual features using fixed-length descriptors called Histogram of Oriented Displacements, to which we apply Support Vector Machines to recognize spoken words. Using this method on the CUAVE database, we achieve a recognition rate of 81%.
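
To make the idea of a fixed-length displacement descriptor concrete, here is a minimal sketch of one common way to build a histogram of oriented displacements for a single 2D point trajectory (for example, one lip landmark tracked over frames). The function name `hod`, the hard binning, the length weighting, and the default of 8 bins are assumptions for illustration; the abstract does not give the thesis's exact formulation.

```python
import math

def hod(trajectory, n_bins=8):
    """Histogram of displacement directions for one 2D point trajectory.

    Each step between consecutive points votes for the bin that contains
    its direction, weighted by the step length (an assumed weighting).
    """
    hist = [0.0] * n_bins
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # no movement, so no direction to vote for
        angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
        bin_index = int(angle / (2 * math.pi) * n_bins) % n_bins
        hist[bin_index] += length
    total = sum(hist)
    # Normalise so the descriptor does not depend on trajectory length.
    return [h / total for h in hist] if total > 0 else hist

# A short and a long trajectory both yield descriptors of length n_bins.
short_track = [(0, 0), (1, 0), (1, 1)]
long_track = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2)]
print(len(hod(short_track)), len(hod(long_track)))
```

Because the output length depends only on `n_bins`, descriptors from utterances of different durations could be fed to a classifier that expects fixed-size inputs, such as a Support Vector Machine.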

    Lip-reading with Densely Connected Temporal Convolutional Networks

    In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional Networks (TCNs) have recently demonstrated great potential in many vision tasks, their receptive fields are not dense enough to model the complex temporal dynamics in lip-reading scenarios. To address this problem, we introduce dense connections into the network to capture more robust temporal features. Moreover, our approach utilises the Squeeze-and-Excitation block, a light-weight attention mechanism, to further enhance the model's classification power. Without bells and whistles, our DC-TCN method achieves 88.36% accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the LRW-1000 dataset, surpassing all the baseline methods and setting a new state of the art on both datasets. Comment: WACV 202
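
The Squeeze-and-Excitation mechanism mentioned above can be sketched for a temporal feature map as follows. This is a simplified illustration, not the DC-TCN implementation: the function name `squeeze_excite`, the list-of-lists layout (channels by time), and the omission of bias terms are all assumptions.

```python
import math

def squeeze_excite(features, w_reduce, w_expand):
    """Reweight channels of a (channels x time) feature map.

    features: list of channels, each a list of values over time.
    w_reduce: (hidden x channels) weights of the bottleneck layer.
    w_expand: (channels x hidden) weights of the gating layer.
    """
    # Squeeze: summarise each channel by its average over time.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: bottleneck with ReLU, then one sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every time step of a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, features)]

# Two channels, two time steps, a one-unit bottleneck (toy weights).
feats = [[1.0, 3.0], [2.0, 2.0]]
out = squeeze_excite(feats, w_reduce=[[0.5, 0.5]], w_expand=[[1.0], [-1.0]])
print(out)
```

The gates lie in (0, 1), so the block can only attenuate channels relative to one another; the cost is two small matrix products per feature map, which is why it is described as light-weight.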

    Perception of categories: from coding efficiency to reaction times

    Reaction times in perceptual tasks are the subject of many experimental and theoretical studies. With the neural decision-making process as their main focus, most of these works concern discrete (typically binary) choice tasks, implying the identification of the stimulus as an exemplar of a category. Here we address issues specific to the perception of categories (e.g. vowels, familiar faces), making a clear distinction between identifying a category (an element of a discrete set) and estimating a continuous parameter (such as a direction). We exhibit a link between optimal Bayesian decoding and coding efficiency, the latter being measured by the mutual information between the discrete category set and the neural activity. We characterize the properties of the best estimator of the likelihood of the category when this estimator takes its inputs from a large population of stimulus-specific coding cells. Adopting the diffusion-to-bound approach to model the decisional process, this allows us to relate analytically the bias and variance of the diffusion process underlying decision making to macroscopic quantities that are behaviorally measurable. A major consequence is the existence of a quantitative link between reaction times and discrimination accuracy. The resulting analytical expression for mean reaction times during an identification task accounts for empirical facts, both qualitatively (e.g. more time is needed to identify a category from a stimulus at the boundary than from a stimulus lying within a category) and quantitatively (using published experimental data on phoneme identification tasks).
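
The qualitative prediction above (slower and less accurate decisions near a category boundary) can be illustrated with a generic diffusion-to-bound simulation. This sketch treats the drift as a free parameter standing in for how far the stimulus lies from the boundary; it does not derive the drift and variance from a neural population as the paper does. The function names, bound, noise level, time step, and trial count are all assumptions.

```python
import math
import random

def simulate_trial(drift, bound=1.0, noise=1.0, dt=0.01, rng=random,
                   max_time=100.0):
    """Run one diffusion-to-bound trial; return (decision time, correct?)."""
    x, t = 0.0, 0.0
    while abs(x) < bound and t < max_time:
        # Euler step: deterministic drift plus Gaussian noise.
        x += drift * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    # Reaching the upper bound counts as the correct category.
    return t, x >= bound

def summarize(drift, n_trials=500, seed=0):
    """Mean decision time and accuracy over repeated trials."""
    rng = random.Random(seed)
    trials = [simulate_trial(drift, rng=rng) for _ in range(n_trials)]
    mean_rt = sum(t for t, _ in trials) / n_trials
    accuracy = sum(1 for _, ok in trials if ok) / n_trials
    return mean_rt, accuracy

# Small drift ~ stimulus near the boundary; large drift ~ well inside.
for drift in (0.2, 2.0):
    print(drift, summarize(drift))
```

In this toy setting a smaller drift yields both longer mean decision times and lower accuracy, which is the kind of joint dependence between reaction time and discrimination accuracy that the abstract refers to.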

    Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

    In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training. Impressive progress in the domain of speech recognition has been exhibited by audio and audio-visual systems. Nevertheless, there is still much to be explored with regard to visual speech recognition systems due to the visual ambiguity of some phonemes. To this end, the development of visual speech recognition models is crucial given the instability of audio models. The main contributions of this work are i) building on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into these systems; ii) leveraging audio data when training visual models, which has not been done in prior word-based work; iii) proposing Gaussian-shaped averaging in frame-level KD as an efficient technique that aids the model in distilling knowledge at the sequence model encoder. This work proposes a novel and competitive architecture for lip-reading, as we demonstrate a noticeable improvement in performance, setting a new benchmark of 88.64% on the LRW dataset. Comment: arXiv admin note: text overlap with arXiv:2108.0354
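
As a rough illustration of knowledge distillation at the word (sequence) level, the sketch below computes the widely used temperature-softened KL loss between a teacher's and a student's class logits. It is the generic soft-target formulation, assumed here for illustration; the abstract does not state which loss the authors use, and the Gaussian-shaped frame-level averaging is not covered. The function names and the temperature of 2.0 are hypothetical.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher (e.g. audio)
    log_q = log_softmax(student_logits, temperature)  # student (e.g. visual)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2

# Toy three-word vocabulary: the teacher is confident, the student is not.
teacher = [4.0, 1.0, 0.5]
student = [1.0, 1.2, 0.8]
print(distillation_loss(student, teacher))
```

In training, a term like this would typically be added to the ordinary cross-entropy loss on the ground-truth word labels, so the visual model learns from both the labels and the audio model's output distribution.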