Visual units and confusion modelling for automatic lip-reading
Automatic lip-reading (ALR) is a challenging task because the visual speech signal is known to be missing some important information, such as voicing. We propose an approach to ALR that acknowledges that this information is missing but assumes that it is substituted or deleted in a systematic way that can be modelled. We describe a system that learns such a model and then incorporates it into decoding, which is realised as a cascade of weighted finite-state transducers. Our results show a small but statistically significant improvement in recognition accuracy. We also investigate the issue of suitable visual units for ALR, and show that visemes are sub-optimal, not because they introduce lexical ambiguity, but because the reduction in modelling units entailed by their use reduces accuracy.
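The lexical ambiguity mentioned above can be illustrated with a minimal sketch. The viseme classes and the toy lexicon below are hypothetical and chosen only for illustration; they are not the units or data used in the paper. The sketch shows how a many-to-one phoneme-to-viseme map collapses distinct words onto the same visual sequence.

```python
# Hypothetical phoneme-to-viseme map (illustrative classes only).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "ae": "V_open",
}

# Hypothetical toy pronunciation dictionary.
LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mad": ["m", "ae", "d"],
    "fan": ["f", "ae", "n"],
    "van": ["v", "ae", "n"],
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


def homophene_groups(lexicon):
    """Group words that share the same viseme sequence."""
    groups = {}
    for word, phonemes in lexicon.items():
        groups.setdefault(to_visemes(phonemes), []).append(word)
    return groups


groups = homophene_groups(LEXICON)
for viseme_seq, words in groups.items():
    print(viseme_seq, "->", sorted(words))
```

In this toy setting five words fall into two visually indistinguishable groups, which is the kind of ambiguity the abstract refers to.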
The Effect of Speaking Rate on Audio and Visual Speech
The speed at which an utterance is spoken affects both the duration of the speech and the position of the articulators. Consequently, the sounds that are produced are modified, as are the position and appearance of the lips, teeth, tongue and other visible articulators. We describe an experiment designed to measure the effect of variable speaking rate on audio and visual speech by comparing sequences of phonemes and dynamic visemes appearing in the same sentences spoken at different speeds. We find that both audio and visual speech production are affected by varying the rate of speech; however, the effect is significantly more prominent in visual speech.
A Mouth Full of Words: Visually Consistent Acoustic Redubbing
This paper introduces a method for automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes [1]. For a given utterance, the corresponding dynamic viseme sequence is sampled to construct a graph of possible phoneme sequences that synchronize with the video. When composed with a pronunciation dictionary and language model, this produces a vast number of word sequences that are in sync with the original video, literally putting plausible words into the mouth of the speaker. We demonstrate that traditional, one-to-many, static visemes lack flexibility for this application as they produce significantly fewer word sequences. This work explores the natural ambiguity in visual speech and offers insight for automatic speech recognition and the importance of language modeling
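As a rough illustration of composing visual units with a pronunciation dictionary, the sketch below expands a sequence of static visemes into every phoneme sequence it could stand for and keeps those that form a dictionary word. This is the simpler static, one-to-many case that the abstract contrasts with dynamic visemes, not the paper's method; the viseme classes and dictionary entries are hypothetical.

```python
from itertools import product

# Hypothetical static viseme classes and the phonemes each may represent.
VISEME_TO_PHONEMES = {
    "V_bilabial": ["p", "b", "m"],
    "V_open": ["ae"],
    "V_alveolar": ["t", "d", "n"],
}

# Hypothetical pronunciation dictionary keyed by phoneme sequence.
PRONUNCIATIONS = {
    ("p", "ae", "t"): "pat",
    ("b", "ae", "t"): "bat",
    ("m", "ae", "n"): "man",
    ("b", "ae", "d"): "bad",
}


def candidate_words(viseme_seq):
    """Return the dictionary words consistent with a viseme sequence."""
    options = [VISEME_TO_PHONEMES[v] for v in viseme_seq]
    return sorted(
        PRONUNCIATIONS[seq] for seq in product(*options) if seq in PRONUNCIATIONS
    )


words = candidate_words(["V_bilabial", "V_open", "V_alveolar"])
print(words)
```

A full system would replace the exhaustive product with a graph search weighted by a language model; the enumeration here is only practical for very short sequences.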
Indiana's Reward-for-Effort School Funding Formula: Issues and Options
Indiana is in the fourth year of a scheduled six-year phase-in of its guaranteed-yield Reward-for-Effort School Funding Formula.
Mirroring to Build Trust in Digital Assistants
We describe experiments towards building a conversational digital assistant
that considers the preferred conversational style of the user. In particular,
these experiments are designed to measure whether users prefer and trust an
assistant whose conversational style matches their own. To this end we
conducted a user study where subjects interacted with a digital assistant that
responded in a way that either matched their conversational style, or did not.
Using self-reported personality attributes and subjects' feedback on the
interactions, we built models that can reliably predict a user's preferred
conversational style. Comment: Preprint
Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards
Preference-based reinforcement learning (PbRL) aligns a robot's behavior with
human preferences via a reward function learned from binary feedback over agent
behaviors. We show that dynamics-aware reward functions improve the sample
efficiency of PbRL by an order of magnitude. In our experiments we iterate
between: (1) learning a dynamics-aware state-action representation (z^{sa}) via
a self-supervised temporal consistency task, and (2) bootstrapping the
preference-based reward function from (z^{sa}), which results in faster policy
learning and better final policy performance. For example, on quadruped-walk,
walker-walk, and cheetah-run, with 50 preference labels we achieve the same
performance as existing approaches with 500 preference labels, and we recover
83% and 66% of ground truth reward policy performance versus only 38% and
21%. The performance gains demonstrate the benefits of explicitly learning a
dynamics-aware reward model. Repo: https://github.com/apple/ml-reed. Comment:
CoRL 2023. arXiv admin note: substantial text overlap with
arXiv:2211.0652
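For readers unfamiliar with learning a reward from binary preference feedback, here is a minimal sketch of one common formulation, the Bradley-Terry model with a cross-entropy loss. The abstract does not say which preference model the authors use, so this choice is an assumption, and the dynamics-aware representation is not shown. The names toy_reward, seg_a and seg_b are hypothetical.

```python
import math


def segment_return(reward_fn, segment):
    """Sum the predicted reward over a list of (state, action) pairs."""
    return sum(reward_fn(state, action) for state, action in segment)


def preference_probability(reward_fn, seg_a, seg_b):
    """Bradley-Terry probability that seg_a is preferred over seg_b."""
    diff = segment_return(reward_fn, seg_a) - segment_return(reward_fn, seg_b)
    return 1.0 / (1.0 + math.exp(-diff))


def preference_loss(reward_fn, seg_a, seg_b, label):
    """Binary cross-entropy; label is 1.0 if seg_a was preferred, else 0.0."""
    p = preference_probability(reward_fn, seg_a, seg_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def toy_reward(state, action):
    """Hypothetical linear reward standing in for a learned network."""
    return 0.5 * state + 1.0 * action


seg_a = [(1.0, 1.0), (2.0, 0.0)]  # predicted return 2.5
seg_b = [(0.0, 0.0), (1.0, 0.0)]  # predicted return 0.5
p = preference_probability(toy_reward, seg_a, seg_b)
print(round(p, 4))
print(preference_loss(toy_reward, seg_a, seg_b, label=1.0))
```

In practice the reward function would be a trainable model and the loss would be minimized over many labelled segment pairs.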
REALM: Robust Entropy Adaptive Loss Minimization for Improved Single-Sample Test-Time Adaptation
Fully-test-time adaptation (F-TTA) can mitigate performance loss due to
distribution shifts between train and test data (1) without access to the
training data, and (2) without knowledge of the model training procedure. In
online F-TTA, a pre-trained model is adapted using a stream of test samples by
minimizing a self-supervised objective, such as entropy minimization. However,
models adapted online using entropy minimization are unstable, especially
in single-sample settings, leading to degenerate solutions and limiting the
adoption of TTA inference strategies. Prior works identify noisy, or
unreliable, samples as a cause of failure in online F-TTA. One solution is to
ignore these samples, which can lead to bias in the update procedure, slow
adaptation, and poor generalization. In this work, we present a general
framework for improving robustness of F-TTA to these noisy samples, inspired by
self-paced learning and robust loss functions. Our proposed approach, Robust
Entropy Adaptive Loss Minimization (REALM), achieves better adaptation accuracy
than previous approaches throughout the adaptation process on corruptions of
CIFAR-10 and ImageNet-1K, demonstrating its effectiveness. Comment: Accepted at WACV 2024, 17 pages, 7 figures, 11 tables
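The entropy objective and the sample-skipping baseline described above can be sketched as follows. This is not the REALM loss; it only illustrates the prediction entropy that online F-TTA minimizes and the simple strategy of ignoring high-entropy samples. The threshold value and the example logits are arbitrary assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_reliable(logits, threshold):
    """Treat a sample as reliable if its prediction entropy is below threshold."""
    return entropy(softmax(logits)) < threshold


confident = [8.0, 0.0, 0.0]   # peaked prediction, low entropy
uncertain = [0.1, 0.0, 0.0]   # near-uniform prediction, high entropy
threshold = 0.4 * math.log(3)  # arbitrary fraction of the maximum entropy

for name, logits in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(entropy(softmax(logits)), 4), is_reliable(logits, threshold))
```

Skipping the uncertain sample avoids a noisy update but, as the abstract notes, discarding samples can bias and slow adaptation.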