Search CORE

2,560 research outputs found

Visual Speech Recognition using Histogram of Oriented Displacements

Author: Kumaravel Sujeeth Selvam
Publication venue: Clemson University Libraries
Publication date: 01/05/2015

Lip reading is the recognition of spoken words from the visual information of the lips. Automating this process with computer algorithms has been of considerable interest in the computer vision and speech recognition communities. In this thesis, we have developed a novel method that describes visual features using fixed-length descriptors called Histogram of Oriented Displacements (HOD) and then applies Support Vector Machines (SVMs) to recognize spoken words. Using this method on the CUAVE database, we have achieved a recognition rate of 81%.
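The abstract does not spell out how HOD turns a variable-length lip-movement sequence into a fixed-length descriptor. Below is a minimal, dependency-free sketch, assuming the usual HOD formulation: the orientation of each frame-to-frame displacement of a tracked point votes into an orientation histogram, weighted by displacement length, and the histogram is normalised. The function names, the 8-bin default, and the idea of concatenating per-landmark histograms into one SVM input vector are illustrative assumptions, not the thesis's exact pipeline.

```python
import math


def hod_descriptor(trajectory, n_bins=8):
    """Histogram of oriented displacements for one tracked 2D point.

    trajectory: list of (x, y) positions, one per video frame (any length).
    Returns n_bins values, so the descriptor length does not depend on
    how many frames the utterance has.
    """
    hist = [0.0] * n_bins
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # no movement between these two frames: no vote
        angle = math.atan2(dy, dx) % (2 * math.pi)  # orientation in [0, 2*pi)
        b = int(angle / (2 * math.pi) * n_bins) % n_bins  # orientation bin
        hist[b] += length  # each vote is weighted by displacement length
    total = sum(hist)
    # Normalise so utterances of different speed or duration are comparable.
    return [h / total for h in hist] if total > 0 else hist


def utterance_descriptor(trajectories, n_bins=8):
    """Hypothetical helper: concatenate per-landmark HODs into one
    fixed-length feature vector that could be fed to an SVM."""
    features = []
    for trajectory in trajectories:
        features.extend(hod_descriptor(trajectory, n_bins))
    return features
```

For example, `hod_descriptor([(0, 0), (1, 0), (2, 0), (2, 1)])` puts weight 2/3 in bin 0 (two unit steps along +x) and 1/3 in bin 2 (one unit step along +y). Vectors built this way for labelled utterances could then be passed to any SVM implementation, such as scikit-learn's `SVC`.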

Clemson University: TigerPrints

Lip-reading with Densely Connected Temporal Convolutional Networks

Author: Ma Pingchuan
Pantic Maja
Petridis Stavros
Shen Jie
Wang Yujiang
Publication date: 11/11/2020

In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional Networks (TCNs) have recently demonstrated great potential in many vision tasks, their receptive fields are not dense enough to model the complex temporal dynamics in lip-reading scenarios. To address this problem, we introduce dense connections into the network to capture more robust temporal features. Moreover, our approach utilises the Squeeze-and-Excitation block, a lightweight attention mechanism, to further enhance the model's classification power. Without bells and whistles, our DC-TCN method achieves 88.36% accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the LRW-1000 dataset, surpassing all the baseline methods and setting a new state of the art on both datasets. Comment: WACV 202
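The abstract names two ingredients: dense connections between temporal convolution layers and a Squeeze-and-Excitation (SE) block. The sketch below illustrates both in plain Python, assuming a DenseNet-style concatenation (each layer sees the block input plus every earlier layer's output along the channel axis) and an SE block applied after each dilated convolution. The layer sizes, the placement of SE, and all function names are illustrative assumptions, not the published DC-TCN configuration.

```python
import math


def conv1d_same(x, kernels, dilation=1):
    """Dilated 1D convolution over time with zero padding ("same" length).

    x: list of C_in channels, each a list of T values.
    kernels: C_out x C_in x K weights, K odd.
    """
    t_len = len(x[0])
    out = []
    for w_out in kernels:
        k = len(w_out[0])
        half = (k // 2) * dilation
        row = []
        for t in range(t_len):
            s = 0.0
            for c, w_c in enumerate(w_out):
                for j in range(k):
                    tt = t + j * dilation - half
                    if 0 <= tt < t_len:
                        s += w_c[j] * x[c][tt]
            row.append(s)
        out.append(row)
    return out


def squeeze_excite(x, w1, w2):
    """Squeeze-and-Excitation on a C x T map: reweight each channel.

    w1: (C // r) x C and w2: C x (C // r) fully connected weights.
    """
    z = [sum(ch) / len(ch) for ch in x]  # squeeze: average over time
    h = [max(0.0, sum(a * b for a, b in zip(row, z))) for row in w1]  # FC + ReLU
    s = [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(row, h)))) for row in w2]  # FC + sigmoid
    return [[v * g for v in ch] for ch, g in zip(x, s)]  # excite: scale channels


def dense_temporal_block(x, layers):
    """Densely connected stack: every layer sees the block input plus the
    outputs of all earlier layers, concatenated along the channel axis.

    layers: list of (kernels, dilation, (w1, w2)) tuples.
    """
    features = [list(ch) for ch in x]
    for kernels, dilation, (w1, w2) in layers:
        y = conv1d_same(features, kernels, dilation)
        y = [[max(0.0, v) for v in ch] for ch in y]  # ReLU
        y = squeeze_excite(y, w1, w2)
        features = features + y  # dense connection (channel concatenation)
    return features
```

Because every layer's output is kept, a block with a 2-channel input and two layers of 2 output channels each returns 6 channels. Using different dilations per layer lets later layers see many different temporal spans of the same input, which is the intuition behind "denser" receptive fields.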

arXiv.org e-Print Archive

MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition

Author: Chen Chen
Chng Eng Siong
Hu Yuchen
Li Ruizhe
Zou Heqing
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/07/2023

Preprint; Publisher PD

Aberdeen University Research

Perception of categories: from coding efficiency to reaction times

Author: Bonnasse-Gahot Laurent
Nadal Jean-Pierre
Publication venue: Elsevier BV
Publication date: 23/02/2011

Reaction times in perceptual tasks are the subject of many experimental and theoretical studies. With the neural decision-making process as their main focus, most of these works concern discrete (typically binary) choice tasks, implying the identification of the stimulus as an exemplar of a category. Here we address issues specific to the perception of categories (e.g. vowels, familiar faces), making a clear distinction between identifying a category (an element of a discrete set) and estimating a continuous parameter (such as a direction). We exhibit a link between optimal Bayesian decoding and coding efficiency, the latter being measured by the mutual information between the discrete category set and the neural activity. We characterize the properties of the best estimator of the likelihood of the category when this estimator takes its inputs from a large population of stimulus-specific coding cells. Adopting the diffusion-to-bound approach to model the decision process, we can then relate the bias and variance of the diffusion process underlying decision making analytically to behaviorally measurable macroscopic quantities. A major consequence is the existence of a quantitative link between reaction times and discrimination accuracy. The resulting analytical expression of mean reaction times during an identification task accounts for empirical facts both qualitatively (e.g. more time is needed to identify a category from a stimulus at the boundary than from a stimulus lying within a category) and quantitatively (on published experimental data from phoneme identification tasks).
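To make the "diffusion-to-bound" link between speed and accuracy concrete, the sketch below uses the textbook drift-diffusion model with symmetric absorbing bounds, not the paper's own derived expression. Evidence starts at 0, drifts at rate `drift` with Gaussian noise of scale `noise`, and the decision ends at `+bound` (correct) or `-bound` (error). The closed forms are error rate 1 / (1 + exp(2·drift·bound/noise²)) and mean decision time (bound/drift)·tanh(drift·bound/noise²). Treating a stimulus near the category boundary as a weaker drift is an illustrative assumption; the function names and parameter values are hypothetical.

```python
import math
import random


def ddm_predictions(drift, bound, noise=1.0):
    """Closed-form error rate and mean decision time of a drift-diffusion
    process that starts at 0 and stops at +bound (correct) or -bound (error).

    Requires drift > 0 (evidence points towards the correct category).
    """
    if drift <= 0:
        raise ValueError("drift must be positive")
    k = drift * bound / noise ** 2
    error_rate = 1.0 / (1.0 + math.exp(2.0 * k))
    mean_decision_time = (bound / drift) * math.tanh(k)
    return error_rate, mean_decision_time


def simulate_ddm(drift, bound, noise=1.0, dt=1e-3, n_trials=1000, seed=0):
    """Euler-Maruyama Monte Carlo check of the closed-form expressions."""
    rng = random.Random(seed)
    step_sd = noise * math.sqrt(dt)
    errors, total_time = 0, 0.0
    for _ in range(n_trials):
        x, t = 0.0, 0.0
        while abs(x) < bound:
            x += drift * dt + rng.gauss(0.0, step_sd)
            t += dt
        errors += x <= -bound  # bool counts as 0 or 1
        total_time += t
    return errors / n_trials, total_time / n_trials
```

Since 1 - 2·error_rate equals tanh(drift·bound/noise²), the mean decision time can be rewritten as (bound/drift)·(1 - 2·error_rate): in this simple model, reaction time and accuracy are tied together by one formula. A weaker drift (e.g. 0.3 instead of 2.0 with `bound=1`) yields both a longer mean decision time and a higher error rate, qualitatively matching the boundary effect described above.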

arXiv.org e-Print Archive

Crossref

Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

Author: Abugabal Omar
Eraqi Hesham M.
Mabrouk Hadeel
Sakr Nourhan
Publication date: 05/06/2022

In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, with the goal of using audio data during lipreading model training. Audio and audio-visual systems have shown impressive progress in speech recognition. Nevertheless, much remains to be explored with regard to visual speech recognition systems because of the visual ambiguity of some phonemes. At the same time, given the instability of audio models, developing visual speech recognition models is crucial. The main contributions of this work are: i) building on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems; ii) leveraging audio data when training visual models, which has not been done in prior word-based work; and iii) proposing Gaussian-shaped averaging in frame-level KD as an efficient technique that helps the model distill knowledge at the sequence model encoder. This work proposes a novel and competitive architecture for lip-reading, demonstrating a noticeable improvement in performance and setting a new benchmark of 88.64% on the LRW dataset. Comment: arXiv admin note: text overlap with arXiv:2108.0354
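The abstract names sequence-level and frame-level KD and "Gaussian-shaped averaging" without giving formulas. The sketch below shows one plausible shape for both terms under stated assumptions. The sequence-level term is the standard temperature-scaled distillation loss (KL divergence from the teacher's softened word posteriors to the student's, scaled by T²). The frame-level term assumes, hypothetically, that Gaussian-shaped averaging resamples the audio teacher's frame features onto the video student's frame rate before a mean-squared-error comparison. That interpretation, the `sigma` parameter, and all function names are assumptions, not the authors' method.

```python
import math


def softmax(logits, temperature=1.0):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard temperature-scaled distillation: KL(teacher || student) on
    softened word posteriors, multiplied by T^2 to keep gradient scale."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def gaussian_average(teacher_frames, n_student_frames, sigma=1.0):
    """HYPOTHETICAL reading of "Gaussian-shaped averaging": map the audio
    teacher's frame features onto the video student's frame rate by a
    Gaussian-weighted average centred on each video frame's aligned time."""
    n_teacher = len(teacher_frames)
    dim = len(teacher_frames[0])
    ratio = n_teacher / n_student_frames
    averaged = []
    for i in range(n_student_frames):
        centre = (i + 0.5) * ratio - 0.5  # aligned position in teacher frames
        w = [math.exp(-((j - centre) ** 2) / (2 * sigma ** 2)) for j in range(n_teacher)]
        w_sum = sum(w)
        averaged.append([sum(w[j] * teacher_frames[j][d] for j in range(n_teacher)) / w_sum
                         for d in range(dim)])
    return averaged


def frame_kd_loss(student_frames, teacher_frames, sigma=1.0):
    """Mean squared error between student encoder features and the
    Gaussian-averaged teacher features (hypothetical frame-level term)."""
    target = gaussian_average(teacher_frames, len(student_frames), sigma)
    n = len(student_frames) * len(student_frames[0])
    return sum((s - t) ** 2
               for s_frame, t_frame in zip(student_frames, target)
               for s, t in zip(s_frame, t_frame)) / n
```

In a training loop, these two terms would typically be added, with some weights, to the student's ordinary cross-entropy loss on the word labels. The Gaussian weighting, rather than picking the single nearest audio frame, smooths over the mismatch between audio and video frame rates.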

arXiv.org e-Print Archive