The Effect of Speaking Rate on Audio and Visual Speech
The speed at which an utterance is spoken affects both the duration of the speech and the positions of the articulators. Consequently, the sounds that are produced are modified, as are the position and appearance of the lips, teeth, tongue and other visible articulators. We describe an experiment designed to measure the effect of variable speaking rate on audio and visual speech by comparing the sequences of phonemes and dynamic visemes that appear in the same sentences spoken at different speeds. We find that both audio and visual speech production are affected by varying the rate of speech; however, the effect is significantly more prominent in visual speech.
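As a concrete, hypothetical illustration of this kind of comparison, the sketch below scores how much a phoneme (or dynamic viseme) sequence changes between a slow and a fast reading of the same sentence using normalized edit distance; the sequences and the choice of metric are illustrative, not the paper's data or protocol.

```python
# Hypothetical sketch: quantify how much a unit sequence (phonemes or
# dynamic visemes) changes between slow and fast readings of the same
# sentence, using normalized Levenshtein distance as the comparison metric.
# The sequences below are illustrative placeholders, not the paper's data.

def levenshtein(a, b):
    """Edit distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def rate_change(slow_seq, fast_seq):
    """Normalized dissimilarity between slow- and fast-rate sequences."""
    return levenshtein(slow_seq, fast_seq) / max(len(slow_seq), len(fast_seq))

# Toy example: phoneme labels for one sentence at two speaking rates.
slow_phones = ["dh", "ax", "k", "ae", "t", "s", "ae", "t"]
fast_phones = ["dh", "ax", "k", "ae", "s", "ae", "t"]   # one /t/ elided
print(rate_change(slow_phones, fast_phones))  # larger value = bigger effect
```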
A Mouth Full of Words: Visually Consistent Acoustic Redubbing
This paper introduces a method for automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes [1]. For a given utterance, the corresponding dynamic viseme sequence is sampled to construct a graph of possible phoneme sequences that synchronize with the video. When composed with a pronunciation dictionary and language model, this produces a vast number of word sequences that are in sync with the original video, literally putting plausible words into the mouth of the speaker. We demonstrate that traditional, one-to-many, static visemes lack flexibility for this application as they produce significantly fewer word sequences. This work explores the natural ambiguity in visual speech and offers insight for automatic speech recognition, highlighting the importance of language modelling.
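A minimal sketch of the many-to-many idea, under assumed toy mappings (the viseme-to-phoneme table and pronunciation dictionary below are invented for illustration and the language model is omitted): each dynamic viseme expands into several phoneme strings, and composing the resulting lattice with a dictionary yields word candidates that stay in sync with the video.

```python
from itertools import product

# Illustrative sketch of the many-to-many idea (not the paper's system):
# each dynamic viseme admits several phoneme strings, so a viseme sequence
# expands into a lattice of phoneme sequences, which a toy pronunciation
# dictionary then maps onto word candidates consistent with the video.

viseme_to_phones = {                     # hypothetical mapping
    "V1": [("m",), ("b",), ("p",)],
    "V2": [("ae", "t"), ("ae", "d")],
}
pron_dict = {                            # toy pronunciation dictionary
    ("m", "ae", "t"): "mat",
    ("b", "ae", "t"): "bat",
    ("p", "ae", "t"): "pat",
    ("m", "ae", "d"): "mad",
    ("b", "ae", "d"): "bad",
    ("p", "ae", "d"): "pad",
}

def redub_candidates(viseme_seq):
    """Enumerate word candidates consistent with the observed visemes."""
    words = []
    for combo in product(*(viseme_to_phones[v] for v in viseme_seq)):
        phones = tuple(p for chunk in combo for p in chunk)
        if phones in pron_dict:
            words.append(pron_dict[phones])
    return words

print(redub_candidates(["V1", "V2"]))    # ['mat', 'mad', 'bat', 'bad', ...]
```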
Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards
Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation z^{sa} via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from z^{sa}, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83% and 66% of ground-truth-reward policy performance versus only 38% and 21% without the dynamics-aware representation. The performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: https://github.com/apple/ml-reed (CoRL 2023).
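A hedged sketch of the two ingredients described in the abstract, not the released ml-reed code: a state-action encoder trained with a temporal consistency objective, and a reward head on z^{sa} trained from binary preferences with a standard Bradley-Terry loss. Module names, layer sizes, and the predictor network are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAEncoder(nn.Module):
    """Encodes a (state, action) pair into the representation z^{sa}."""
    def __init__(self, state_dim, action_dim, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, z_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def temporal_consistency_loss(encoder, predictor, s, a, s_next, a_next):
    """Self-supervised objective: predict the next step's representation."""
    z = encoder(s, a)
    with torch.no_grad():
        z_next = encoder(s_next, a_next)      # target (stop-gradient)
    return F.mse_loss(predictor(z), z_next)

def preference_loss(reward_head, encoder, seg_1, seg_2, prefers_1):
    """Bradley-Terry loss over two behavior segments of shape (B, T, dim).

    prefers_1 is 1 where segment 1 was preferred by the human, else 0.
    """
    r1 = reward_head(encoder(*seg_1)).sum(dim=1).squeeze(-1)  # segment return
    r2 = reward_head(encoder(*seg_2)).sum(dim=1).squeeze(-1)
    return F.binary_cross_entropy_with_logits(r1 - r2, prefers_1.float())
```

In this sketch the encoder is shared between both losses, so the reward head is bootstrapped from a representation that already reflects the environment dynamics, which is the mechanism the abstract credits for the sample-efficiency gains.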
Mirroring to Build Trust in Digital Assistants
We describe experiments towards building a conversational digital assistant that considers the preferred conversational style of the user. In particular, these experiments are designed to measure whether users prefer and trust an assistant whose conversational style matches their own. To this end we conducted a user study in which subjects interacted with a digital assistant that responded in a way that either matched their conversational style or did not. Using self-reported personality attributes and subjects' feedback on the interactions, we built models that can reliably predict a user's preferred conversational style.
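For illustration only, a toy version of the kind of model hinted at in the last sentence: predicting a preferred conversational style from self-reported personality attributes with a simple classifier. The feature names, style labels, and data below are hypothetical, not the study's variables.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features: e.g. [extraversion, agreeableness, openness] on a
# 1-5 self-report scale. Labels use one common framing of conversational
# style; neither the features nor the labels are taken from the paper.
X = [
    [4.5, 3.0, 4.0],
    [2.0, 4.5, 3.5],
    [4.0, 2.5, 4.5],
    [1.5, 4.0, 3.0],
]
y = ["high-involvement", "high-considerateness",
     "high-involvement", "high-considerateness"]

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
print(model.predict([[3.8, 3.2, 4.1]]))   # predicted preferred style
```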
REALM: Robust Entropy Adaptive Loss Minimization for Improved Single-Sample Test-Time Adaptation
Fully test-time adaptation (F-TTA) can mitigate performance loss due to distribution shifts between train and test data (1) without access to the training data, and (2) without knowledge of the model training procedure. In online F-TTA, a pre-trained model is adapted using a stream of test samples by minimizing a self-supervised objective, such as entropy minimization. However, models adapted online with entropy minimization are unstable, especially in single-sample settings, leading to degenerate solutions and limiting the adoption of TTA inference strategies. Prior works identify noisy, or unreliable, samples as a cause of failure in online F-TTA. One solution is to ignore these samples, but this can lead to bias in the update procedure, slow adaptation, and poor generalization. In this work, we present a general framework for improving the robustness of F-TTA to these noisy samples, inspired by self-paced learning and robust loss functions. Our proposed approach, Robust Entropy Adaptive Loss Minimization (REALM), achieves better adaptation accuracy than previous approaches throughout the adaptation process on corruptions of CIFAR-10 and ImageNet-1K, demonstrating its effectiveness. Accepted at WACV 2024.
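A minimal sketch in the spirit of this approach, not the REALM implementation: one online adaptation step that down-weights unreliable, high-entropy test samples with a soft, self-paced-style weight before updating the model. The weighting function and its threshold are illustrative choices.

```python
import torch

def entropy(logits):
    """Shannon entropy of the model's predictive distribution."""
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1)

def tta_step(model, optimizer, x, ent_threshold=2.0, temperature=0.5):
    """Adapt on a single test sample x (shape [1, ...]) and return its logits.

    Instead of hard-rejecting noisy samples, a sigmoid weight smoothly
    suppresses the entropy loss for unreliable (high-entropy) predictions,
    in the spirit of self-paced learning / robust loss functions.
    """
    logits = model(x)
    ent = entropy(logits)
    weight = torch.sigmoid((ent_threshold - ent) / temperature).detach()
    loss = (weight * ent).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.detach()
```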
Some observations on computer lip-reading: moving from the dream to the reality
In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called "visemes" for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution, pose, lighting and other practical factors. However, the working assumption that visemes, which are the visual equivalent of phonemes, are the best unit for recognition does need further examination. We conclude that visemes, which were defined over a century ago, are unlikely to be optimal for a modern computer lip-reading system.
Resolution limits on visual speech recognition
Visual-only speech recognition is dependent upon a number of factors that can be difficult to control, such as lighting, identity, motion, emotion and expression. But some factors, such as video resolution, are controllable, so it is surprising that there is not yet a systematic study of the effect of resolution on lip-reading. Here we use a new data set, the Rosetta Raven data, to train and test recognizers so we can measure the effect of video resolution on recognition accuracy. We conclude that, contrary to common practice, resolution need not be that great for automatic lip-reading. However, it is highly unlikely that automatic lip-reading can work reliably when the distance between the bottom of the lower lip and the top of the upper lip is less than four pixels at rest.
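A back-of-the-envelope illustration of the four-pixel criterion above, with hypothetical numbers: given the resting lip aperture in the source video, check which downsampled frame heights keep it above the limit.

```python
# Illustrative arithmetic only (not the paper's pipeline): the 240-pixel
# source frame height and 20-pixel resting lip aperture are assumptions.

def aperture_at_resolution(aperture_px, orig_height, target_height):
    """Lip opening in pixels after scaling the frame to target_height."""
    return aperture_px * target_height / orig_height

orig_height, rest_aperture = 240, 20          # hypothetical source video
for target_h in (240, 120, 60, 30):
    a = aperture_at_resolution(rest_aperture, orig_height, target_h)
    status = "ok" if a >= 4 else "below the four-pixel limit"
    print(f"{target_h}px frames: aperture ~{a:.1f}px, {status}")
```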