Capture, Learning, and Synthesis of 3D Speaking Styles
Audio-driven 3D facial animation has been widely explored, but achieving
realistic, human-like performance is still unsolved. This is due to the lack of
available 3D datasets, models, and standard evaluation metrics. To address
this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans
captured at 60 fps and synchronized audio from 12 speakers. We then train a
neural network on our dataset that factors identity from facial motion. The
learned model, VOCA (Voice Operated Character Animation) takes any speech
signal as input - even speech in languages other than English - and
realistically animates a wide range of adult faces. Conditioning on subject
labels during training allows the model to learn a variety of realistic
speaking styles. VOCA also provides animator controls to alter speaking style,
identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball
rotations) during animation. To our knowledge, VOCA is the only realistic 3D
facial animation model that is readily applicable to unseen subjects without
retargeting. This makes VOCA suitable for tasks like in-game video, virtual
reality avatars, or any scenario in which the speaker, speech, or language is
not known in advance. We make the dataset and model available for research
purposes at http://voca.is.tue.mpg.de. Comment: To appear in CVPR 2019.
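A minimal sketch of the conditioning idea described in the abstract: a decoder that maps a window of audio features plus a one-hot subject label to per-vertex offsets added to a neutral template mesh, so that swapping the label changes the speaking style. The layer sizes, feature dimensions, and class names below are illustrative assumptions, not the published VOCA architecture.

```python
# Illustrative sketch (not the published VOCA architecture): audio features
# plus a one-hot subject label are decoded into 3D offsets per mesh vertex.
# All dimensions below are assumptions.
import torch
import torch.nn as nn

class SpeechToOffsets(nn.Module):
    def __init__(self, audio_dim=29, num_subjects=8, num_vertices=5023):
        super().__init__()
        self.num_vertices = num_vertices
        self.net = nn.Sequential(
            nn.Linear(audio_dim + num_subjects, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, num_vertices * 3),  # one 3D offset per vertex
        )

    def forward(self, audio_feat, subject_onehot, template):
        # audio_feat: (B, audio_dim), subject_onehot: (B, num_subjects)
        # template: (B, num_vertices, 3) neutral face mesh
        x = torch.cat([audio_feat, subject_onehot], dim=-1)
        offsets = self.net(x).view(-1, self.num_vertices, 3)
        return template + offsets  # animated mesh for this frame

# Usage: changing the one-hot label at test time alters the speaking style.
model = SpeechToOffsets()
audio = torch.randn(1, 29)
style = torch.eye(8)[[2]]            # condition on subject 2's style
template = torch.zeros(1, 5023, 3)   # placeholder neutral mesh
mesh = model(audio, style, template)
```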
Bridging RL Theory and Practice with the Effective Horizon
Deep reinforcement learning (RL) works impressively in some environments and
fails catastrophically in others. Ideally, RL theory should be able to provide
an understanding of why this is, i.e. bounds predictive of practical
performance. Unfortunately, current theory does not quite have this ability. We
compare standard deep RL algorithms to prior sample complexity bounds by
introducing a new dataset, BRIDGE. It consists of 155 MDPs from common deep RL
benchmarks, along with their corresponding tabular representations, which
enables us to exactly compute instance-dependent bounds. We find that prior
bounds do not correlate well with when deep RL succeeds vs. fails, but discover
a surprising property that does. When actions with the highest Q-values under
the random policy also have the highest Q-values under the optimal policy, deep
RL tends to succeed; when they don't, deep RL tends to fail. We generalize this
property into a new complexity measure of an MDP that we call the effective
horizon, which roughly corresponds to how many steps of lookahead search are
needed in order to identify the next optimal action when leaf nodes are
evaluated with random rollouts. Using BRIDGE, we show that the effective
horizon-based bounds are more closely reflective of the empirical performance
of PPO and DQN than prior sample complexity bounds across four metrics. We also
show that, unlike existing bounds, the effective horizon can predict the
effects of using reward shaping or a pre-trained exploration policy.
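A hedged sketch of the check underlying the effective horizon, on a small tabular MDP: compute the Q function of the uniformly random policy, then test how many Bellman optimality backups on top of it are needed before acting greedily already matches an optimal policy (ignoring ties). The function names and the simple convergence loop are assumptions, not the BRIDGE code, and this omits the logarithmic terms in the paper's exact definition.

```python
# Toy illustration of the effective-horizon check on a tabular MDP
# (assumed interface, not the BRIDGE code).
import numpy as np

def random_policy_Q(P, R, gamma, iters=1000):
    """P: (S, A, S) transition probs, R: (S, A) rewards."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.mean(axis=1)          # value of the uniformly random policy
        Q = R + gamma * P @ V
    return Q

def optimal_Q(P, R, gamma, iters=1000):
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

def greedy_after_k_backups(P, R, gamma, k):
    """k Bellman optimality backups applied to the random policy's Q."""
    Q = random_policy_Q(P, R, gamma)
    for _ in range(k):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q.argmax(axis=1)

def effective_horizon_estimate(P, R, gamma, k_max=10):
    """Smallest k whose greedy actions match the optimal greedy actions.
    k = 0 corresponds to the surprising property in the abstract: the
    random policy's Q already ranks the optimal action first."""
    opt = optimal_Q(P, R, gamma).argmax(axis=1)
    for k in range(k_max + 1):
        if np.array_equal(greedy_after_k_backups(P, R, gamma, k), opt):
            return k
    return None
```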
Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF
In practice, preference learning from human feedback depends on incomplete
data with hidden context. Hidden context refers to data that affects the
feedback received, but which is not represented in the data used to train a
preference model. This captures common issues of data collection, such as
having human annotators with varied preferences, cognitive processes that
result in seemingly irrational behavior, and combining data labeled according
to different criteria. We prove that standard applications of preference
learning, including reinforcement learning from human feedback (RLHF),
implicitly aggregate over hidden contexts according to a well-known voting rule
called Borda count. We show this can produce counter-intuitive results that are
very different from other methods which implicitly aggregate via expected
utility. Furthermore, our analysis formalizes the way that preference learning
from users with diverse values tacitly implements a social choice function. A
key implication of this result is that annotators have an incentive to
misreport their preferences in order to influence the learned model, leading to
vulnerabilities in the deployment of RLHF. As a step towards mitigating these
problems, we introduce a class of methods called distributional preference
learning (DPL). DPL methods estimate a distribution of possible score values
for each alternative in order to better account for hidden context.
Experimental results indicate that applying DPL to RLHF for LLM chatbots
identifies hidden context in the data and significantly reduces subsequent
jailbreak vulnerability. Our code and data are available at
https://github.com/cassidylaidlaw/hidden-context. Comment: Presented at ICLR 2024.
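A hedged sketch of the distributional idea: instead of a single scalar reward, the preference model predicts a distribution of scores per response (here a Gaussian, one assumed variant), and the pairwise preference likelihood integrates over both distributions. A wide predicted distribution flags hidden context, i.e. annotators or criteria that disagree about that response. The class and function names are assumptions, not the released code.

```python
# Sketch of a distributional preference learning head (assumed Gaussian
# variant, not the released DPL implementation).
import torch
import torch.nn as nn

class DistributionalRewardHead(nn.Module):
    def __init__(self, feat_dim=768):
        super().__init__()
        self.mean = nn.Linear(feat_dim, 1)
        self.log_std = nn.Linear(feat_dim, 1)

    def forward(self, feats):
        mu = self.mean(feats).squeeze(-1)
        std = self.log_std(feats).squeeze(-1).exp()
        return mu, std

def preference_nll(head, feats_chosen, feats_rejected):
    # P(chosen > rejected) when both scores are independent Gaussians:
    # Phi((mu_c - mu_r) / sqrt(std_c^2 + std_r^2))
    mu_c, std_c = head(feats_chosen)
    mu_r, std_r = head(feats_rejected)
    z = (mu_c - mu_r) / torch.sqrt(std_c**2 + std_r**2)
    normal = torch.distributions.Normal(0.0, 1.0)
    return -normal.cdf(z).clamp_min(1e-6).log().mean()

# Usage with placeholder response features:
head = DistributionalRewardHead()
loss = preference_nll(head, torch.randn(4, 768), torch.randn(4, 768))
loss.backward()
```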
The Effective Horizon Explains Deep RL Performance in Stochastic Environments
Reinforcement learning (RL) theory has largely focused on proving minimax
sample complexity bounds. These require strategic exploration algorithms that
use relatively limited function classes for representing the policy or value
function. Our goal is to explain why deep RL algorithms often perform well in
practice, despite using random exploration and much more expressive function
classes like neural networks. Our work arrives at an explanation by showing
that many stochastic MDPs can be solved by performing only a few steps of value
iteration on the random policy's Q function and then acting greedily. When this
is true, we find that it is possible to separate the exploration and learning
components of RL, making it much easier to analyze. We introduce a new RL
algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring
randomly to collect rollouts and then performing a limited number of steps of
fitted-Q iteration over those rollouts. Any regression algorithm that satisfies
basic in-distribution generalization properties can be used in SQIRL to
efficiently solve common MDPs. This can explain why deep RL works, since it is
empirically established that neural networks generalize well in-distribution.
Furthermore, SQIRL explains why random exploration works well in practice. We
leverage SQIRL to derive instance-dependent sample complexity bounds for RL
that are exponential only in an "effective horizon" of lookahead and on the
complexity of the class used for function approximation. Empirically, we also
find that SQIRL performance strongly correlates with PPO and DQN performance in
a variety of stochastic environments, supporting that our theoretical analysis
is predictive of practical performance. Our code and data are available at
https://github.com/cassidylaidlaw/effective-horizon
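A hedged sketch of the SQIRL-style loop the abstract describes: gather rollouts with purely random exploration, then run a small number of fitted-Q iteration rounds using any off-the-shelf regressor with good in-distribution generalization. The environment interface, regressor choice, and function names are assumptions, not the released code.

```python
# Sketch of random-rollout collection followed by a few fitted-Q iteration
# steps (assumed setup, not the released SQIRL code).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def collect_random_rollouts(env, num_episodes, num_actions, rng):
    """Assumes the classic gym API: reset() -> obs, step(a) -> (obs, r, done, info)."""
    data = []  # (state, action, reward, next_state, done)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = int(rng.integers(num_actions))
            s2, r, done, _ = env.step(a)
            data.append((s, a, r, s2, done))
            s = s2
    return data

def fitted_q_iteration(data, num_actions, k, gamma=0.99):
    S = np.array([d[0] for d in data], dtype=float)
    A = np.array([d[1] for d in data], dtype=float)
    R = np.array([d[2] for d in data], dtype=float)
    S2 = np.array([d[3] for d in data], dtype=float)
    D = np.array([d[4] for d in data], dtype=float)
    X = np.column_stack([S, A])
    model, targets = None, R.copy()
    # Only a few backups are needed when the effective horizon is small.
    for _ in range(k):
        model = RandomForestRegressor(n_estimators=50).fit(X, targets)
        next_q = np.max(
            [model.predict(np.column_stack([S2, np.full(len(S2), a)]))
             for a in range(num_actions)], axis=0)
        targets = R + gamma * (1.0 - D) * next_q
    return model  # greedy policy: argmax_a model.predict([state, a])
```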
Toward Computationally Efficient Inverse Reinforcement Learning via Reward Shaping
Inverse reinforcement learning (IRL) is computationally challenging, with
common approaches requiring the solution of multiple reinforcement learning
(RL) sub-problems. This work motivates the use of potential-based reward
shaping to reduce the computational burden of each RL sub-problem. It serves
as a proof of concept that we hope will inspire future developments toward
computationally efficient IRL.
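For reference, potential-based shaping (Ng et al., 1999) replaces the reward r(s, a, s') with r(s, a, s') + gamma * Phi(s') - Phi(s), which leaves the optimal policy unchanged while a well-chosen potential lets each RL sub-problem converge faster. A minimal sketch under an assumed tabular setup (the array layout and names are illustrative, not this paper's implementation):

```python
# Minimal sketch of potential-based reward shaping in a tabular MDP
# (illustrative assumption, not this paper's implementation).
import numpy as np

def shape_reward(R, Phi, P, gamma):
    """R: (S, A) rewards, Phi: (S,) potential, P: (S, A, S) transitions.
    Returns R'(s, a) = R(s, a) + gamma * E[Phi(s')] - Phi(s)."""
    expected_next_potential = P @ Phi            # shape (S, A)
    return R + gamma * expected_next_potential - Phi[:, None]

# A common choice is Phi = the current value estimate, so each RL sub-problem
# inside the IRL loop starts closer to its solution and needs fewer iterations.
```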
Knocking at the gate: The path to publication for entrepreneurship experiments through the lens of gatekeeping theory
We draw on gatekeeping theory to explore the individual and routine-level criticisms that entrepreneurship experimentalists receive during the review process. Using a multi-study approach, we categorize common gatekeeping themes and present illustrative critiques derived from a unique sample of decision letters and a supplemental survey of entrepreneurship editors. In combination, we extend gatekeeping theory by considering how it applies to the scholarly domain, contribute to the literature by exploring an alternative theoretical explanation as to why entrepreneurship experiments might fail to survive the review process, and finally, provide contextualized recommendations for authors and reviewers of experimental research.