Visual Speech Recognition using Histogram of Oriented Displacements
Lip reading is the recognition of spoken words from the visual information of the lips. Automating this process with computer algorithms has been of considerable interest in the Computer Vision and Speech Recognition communities. In this thesis, we develop a novel method that describes visual features using fixed-length descriptors called Histograms of Oriented Displacements, to which we apply Support Vector Machines for recognition of spoken words. Using this method on the CUAVE database, we achieve a recognition rate of 81%.
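To make the idea of a fixed-length descriptor concrete, here is a minimal sketch of a histogram over displacement directions for one tracked 2D point. It is an illustration under assumptions, not the thesis's exact formulation: the function name `hod_descriptor`, the 8-bin default, and the length-weighted voting are all hypothetical choices.

```python
import math

def hod_descriptor(points, n_bins=8):
    """Histogram of displacement directions along a 2D trajectory.

    Each displacement between consecutive points votes for the bin that
    contains its direction, weighted by its length (an assumed weighting).
    The result always has n_bins entries, whatever the trajectory length.
    """
    hist = [0.0] * n_bins
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # a stationary point has no direction to vote for
        angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
        bin_index = int(angle / (2 * math.pi) * n_bins) % n_bins
        hist[bin_index] += length
    total = sum(hist)
    # Normalise so that trajectories of different scale are comparable.
    return [h / total for h in hist] if total > 0 else hist
```

Because every trajectory maps to a vector of the same length, descriptors for several lip landmarks could be concatenated and passed to a standard classifier such as a Support Vector Machine.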
Lip-reading with Densely Connected Temporal Convolutional Networks
In this work, we present the Densely Connected Temporal Convolutional Network
(DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional
Networks (TCN) have recently demonstrated great potential in many vision tasks,
their receptive fields are not dense enough to model the complex temporal
dynamics in lip-reading scenarios. To address this problem, we introduce dense
connections into the network to capture more robust temporal features.
Moreover, our approach utilises the Squeeze-and-Excitation block, a
light-weight attention mechanism, to further enhance the model's classification
power. Without bells and whistles, our DC-TCN method has achieved 88.36%
accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the
LRW-1000 dataset, which has surpassed all the baseline methods and is the new
state-of-the-art on both datasets. Comment: WACV 202
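The Squeeze-and-Excitation block mentioned in this abstract can be sketched in a few lines. This is a generic, dependency-free illustration of channel gating on a temporal feature map, not the DC-TCN implementation: the function name, the weight layout, and the omission of bias terms are assumptions.

```python
import math

def squeeze_excite(features, w_reduce, w_expand):
    """Apply Squeeze-and-Excitation gating to a C x T feature map.

    features: list of C channels, each a list of T values.
    w_reduce: R x C weights of the bottleneck layer (hypothetical layout).
    w_expand: C x R weights of the expansion layer (hypothetical layout).
    """
    # Squeeze: average each channel over time to get one value per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation, step 1: bottleneck linear layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    # Excitation, step 2: expansion layer followed by a sigmoid gate per channel.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every time step of a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, features)]
```

With all weights set to zero every gate equals 0.5, so each channel is simply halved; trained weights would instead emphasise some channels and suppress others.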
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
Perception of categories: from coding efficiency to reaction times
Reaction-times in perceptual tasks are the subject of many experimental and
theoretical studies. With the neural decision making process as main focus,
most of these works concern discrete (typically binary) choice tasks, implying
the identification of the stimulus as an exemplar of a category. Here we
address issues specific to the perception of categories (e.g. vowels, familiar
faces, ...), making a clear distinction between identifying a category (an
element of a discrete set) and estimating a continuous parameter (such as a
direction). We exhibit a link between optimal Bayesian decoding and coding
efficiency, the latter being measured by the mutual information between the
discrete category set and the neural activity. We characterize the properties
of the best estimator of the likelihood of the category, when this estimator
takes its inputs from a large population of stimulus-specific coding cells.
Adopting the diffusion-to-bound approach to model the decisional process, this
allows us to relate analytically the bias and variance of the diffusion process
underlying decision making to macroscopic quantities that are behaviorally
measurable. A major consequence is the existence of a quantitative link between
reaction times and discrimination accuracy. The resulting analytical expression
of mean reaction times during an identification task accounts for empirical
facts, both qualitatively (e.g. more time is needed to identify a category from
a stimulus at the boundary compared to a stimulus lying within a category), and
quantitatively (working on published experimental data on phoneme
identification tasks).
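The link between reaction time and accuracy can be illustrated with the textbook diffusion-to-bound model: constant drift, constant noise, symmetric bounds, unbiased start. This sketch uses the standard closed-form results for that simplified setting, not the analytical expression derived in the paper; the function and parameter names are assumptions.

```python
import math

def diffusion_to_bound(drift, noise_sd, bound):
    """Accuracy and mean decision time for dx = drift*dt + noise_sd*dW.

    The process starts at 0 and stops at +bound (correct) or -bound (error).
    drift must be non-zero.
    """
    x = drift * bound / noise_sd ** 2
    # Probability of reaching the correct bound first.
    accuracy = 1.0 / (1.0 + math.exp(-2.0 * x))
    # Mean first-passage time to either bound.
    mean_time = (bound / drift) * math.tanh(x)
    return accuracy, mean_time
```

Since tanh(x) = 2 * accuracy - 1, the mean decision time equals (bound / drift) * (2 * accuracy - 1), which ties reaction time directly to accuracy. A smaller drift, as expected for a stimulus near a category boundary, yields both lower accuracy and a longer mean decision time.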
Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models
In this work, we propose a technique to transfer speech recognition
capabilities from audio speech recognition systems to visual speech
recognizers, where our goal is to utilize audio data during lipreading model
training. Impressive progress in the domain of speech recognition has been
exhibited by audio and audio-visual systems. Nevertheless, there is still much
to be explored with regard to visual speech recognition systems due to the
visual ambiguity of some phonemes. To this end, the development of visual
speech recognition models is crucial given the instability of audio models. The
main contributions of this work are i) building on recent state-of-the-art
word-based lipreading models by integrating sequence-level and frame-level
Knowledge Distillation (KD) into these models; ii) leveraging audio data when
training visual models, which has not been done in prior word-based
work; iii) proposing Gaussian-shaped averaging in frame-level KD, as an
efficient technique that aids the model in distilling knowledge at the sequence
model encoder. This work proposes a novel and competitive architecture for
lip-reading, as we demonstrate a noticeable improvement in performance, setting
a new benchmark of 88.64% on the LRW dataset. Comment: arXiv admin note: text overlap with arXiv:2108.0354
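As a rough illustration of knowledge distillation in general, the sketch below computes the common temperature-softened KL divergence between a teacher's and a student's output scores. It is the generic formulation, not the loss used in this paper, and it does not reproduce the Gaussian-shaped averaging; the function names and the default temperature are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher, e.g. an audio model
    q = softmax(student_logits, temperature)  # student, e.g. a visual model
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

In a cross-modal setting, a loss of this kind could be evaluated once per utterance for sequence-level distillation or once per time step for frame-level distillation.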