13 research outputs found
Learning spontaneity to improve emotion recognition in speech
We investigate the effect and usefulness of spontaneity in speech (i.e., whether a given utterance is spontaneous or not) in the context of emotion recognition. We hypothesize that emotional content in speech is interrelated with its spontaneity, and use spontaneity classification as an auxiliary task to the problem of emotion recognition. We propose two supervised learning settings that utilize spontaneity to improve speech emotion recognition: a hierarchical model that performs spontaneity detection before performing emotion recognition, and a multitask learning model that jointly learns to recognize both spontaneity and emotion. Through various experiments on the well-known IEMOCAP database, we show that by using spontaneity detection as an additional task, significant improvement can be achieved over emotion recognition systems that are unaware of spontaneity. We achieve state-of-the-art emotion recognition accuracy (4-class, 69.1%) on the IEMOCAP database, outperforming several relevant and competitive baselines.
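A minimal sketch of the multitask setting described above, assuming a shared acoustic encoder with a main emotion head and an auxiliary spontaneity head; the module names, feature dimensions, and loss weighting are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class MultitaskSER(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_emotions=4):
        super().__init__()
        # Shared recurrent encoder over frame-level acoustic features.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)   # main task: 4-class emotion
        self.spontaneity_head = nn.Linear(2 * hidden, 2)        # auxiliary task: spontaneous or not

    def forward(self, x):
        # x: (batch, time, feat_dim) acoustic features, e.g. log-mel filterbanks.
        _, h = self.encoder(x)                     # h: (2, batch, hidden)
        pooled = torch.cat([h[0], h[1]], dim=-1)   # concatenate forward/backward final states
        return self.emotion_head(pooled), self.spontaneity_head(pooled)

model = MultitaskSER()
feats = torch.randn(8, 300, 40)                    # dummy batch of 8 utterances
emo_logits, spont_logits = model(feats)
emo_labels, spont_labels = torch.randint(0, 4, (8,)), torch.randint(0, 2, (8,))
# Joint loss; the 0.5 weight on the auxiliary spontaneity term is an assumed hyperparameter.
loss = nn.functional.cross_entropy(emo_logits, emo_labels) \
       + 0.5 * nn.functional.cross_entropy(spont_logits, spont_labels)
loss.backward()
```

In the hierarchical variant described in the abstract, the spontaneity decision would instead precede and condition the emotion classifier rather than being trained jointly.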
Future Person Localization in First-Person Videos
We present a new task that predicts future locations of people observed in
first-person videos. Consider a first-person video stream continuously recorded
by a wearable camera. Given a short clip of a person that is extracted from the
complete stream, we aim to predict that person's location in future frames. To
facilitate this future person localization ability, we make the following three
key observations: a) First-person videos typically involve significant
ego-motion which greatly affects the location of the target person in future
frames; b) Scales of the target person act as a salient cue to estimate a
perspective effect in first-person videos; c) First-person videos often capture
people up-close, making it easier to leverage target poses (e.g., where they
look) for predicting their future locations. We incorporate these three
observations into a prediction framework with a multi-stream
convolution-deconvolution architecture. Experimental results reveal our method
to be effective on our new dataset as well as on a public social interaction
dataset.
Comment: Accepted to CVPR 2018
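A hedged sketch of a multi-stream convolution-deconvolution predictor in the spirit of the framework above: each cue (past location and scale, ego-motion, pose) feeds its own temporal 1D-conv encoder, and a deconvolution decoder emits future locations. All channel counts, stream contents, and layer shapes are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

def conv_stream(in_ch, hidden=64):
    # Temporal encoder: two strided 1D convolutions over the observed frames.
    return nn.Sequential(
        nn.Conv1d(in_ch, hidden, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv1d(hidden, hidden, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    )

class MultiStreamPredictor(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.loc_stream  = conv_stream(3, hidden)    # (x, y, scale) per frame
        self.ego_stream  = conv_stream(6, hidden)    # ego-motion, e.g. rotation + translation
        self.pose_stream = conv_stream(36, hidden)   # 18 joints * (x, y)
        self.decoder = nn.Sequential(                # deconvolution back to full temporal length
            nn.ConvTranspose1d(3 * hidden, hidden, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(hidden, 2, kernel_size=4, stride=2, padding=1),  # future (x, y)
        )

    def forward(self, loc, ego, pose):
        # Each input: (batch, channels, observed_frames); output: (batch, 2, future_frames).
        feats = torch.cat([self.loc_stream(loc), self.ego_stream(ego),
                           self.pose_stream(pose)], dim=1)
        return self.decoder(feats)

model = MultiStreamPredictor()
future = model(torch.randn(4, 3, 12), torch.randn(4, 6, 12), torch.randn(4, 36, 12))
print(future.shape)  # torch.Size([4, 2, 12]) -- predicted locations for 12 future frames
```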
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
We introduce EgoSchema, a very long-form video question-answering dataset,
and benchmark to evaluate long video understanding capabilities of modern
vision and language systems. Derived from Ego4D, EgoSchema consists of over
5000 human curated multiple choice question answer pairs, spanning over 250
hours of real video data, covering a very broad range of natural human activity
and behavior. For each question, EgoSchema requires the correct answer to be
selected between five given options based on a three-minute-long video clip.
While some prior works have proposed video datasets with long clip lengths, we
posit that merely the length of the video clip does not truly capture the
temporal difficulty of the video task that is being considered. To remedy this,
we introduce temporal certificate sets, a general notion for capturing the
intrinsic temporal understanding length associated with a broad range of video
understanding tasks & datasets. Based on this metric, we find EgoSchema to have
intrinsic temporal lengths over 5.7x longer than the second closest dataset and
10x to 100x longer than any other video understanding dataset. Further, our
evaluation of several current state-of-the-art video and language models shows
them to be severely lacking in long-term video understanding capabilities. Even
models with several billions of parameters achieve QA accuracy less than 33%
(random is 20%) on the EgoSchema multi-choice question answering task, while
humans achieve about 76% accuracy. We posit that EgoSchema, with its long
intrinsic temporal structures and diverse complexity, would serve as a valuable
evaluation probe for developing effective long-term video understanding systems
in the future. Data and zero-shot model evaluation code are open-sourced for
both public and commercial use under the Ego4D license at
http://egoschema.github.io
Comment: https://egoschema.github.io
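A small illustrative sketch of the temporal certificate idea described above: a certificate is taken here to be a set of video sub-intervals sufficient to answer a question, and its length is the total duration they cover after merging overlaps. The interval representation is an assumption for exposition; the official definition and evaluation code are on the project page:

```python
def certificate_length(intervals):
    """intervals: list of (start_sec, end_sec) sub-clips deemed sufficient to answer."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:          # overlapping or touching intervals
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return sum(end - start for start, end in merged)

# Example: three marked sub-clips of a three-minute clip, two of them overlapping.
print(certificate_length([(10, 40), (35, 70), (120, 150)]))  # 90 seconds
```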
Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization
Temporal action localization (TAL) requires long-form reasoning to predict
actions of various lengths and complex content. Given limited GPU memory,
training TAL end-to-end on such long-form videos (i.e., from videos to
predictions) is a significant challenge. Most methods can only train on
pre-extracted features without optimizing them for the localization problem,
consequently limiting localization performance. In this work, to extend the
potential in TAL networks, we propose a novel end-to-end method Re2TAL, which
rewires pretrained video backbones for reversible TAL. Re2TAL builds a backbone
with reversible modules, where the input can be recovered from the output such
that the bulky intermediate activations can be cleared from memory during
training. Instead of designing one single type of reversible module, we propose
a network rewiring mechanism, to transform any module with a residual
connection to a reversible module without changing any parameters. This
provides two benefits: (1) a large variety of reversible networks are easily
obtained from existing and even future model designs, and (2) the reversible
models require much less training effort as they reuse the pre-trained
parameters of their original non-reversible versions. Re2TAL reaches 37.01%
average mAP, a new state-of-the-art record on ActivityNet-v1.3, and mAP 64.9%
at tIoU=0.5 on THUMOS-14, without using optical flow.
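A minimal sketch of the reversible rewiring idea, assuming the standard two-stream reversible formulation (y1 = x1 + F(x2), y2 = x2 + G(y1)) in which F and G reuse pretrained residual branches; this illustrates the general technique, not the released Re2TAL code:

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g   # e.g. residual branches taken from pretrained blocks

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Exactly recovers the inputs, which is what allows intermediate
        # activations to be cleared from memory during training.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Placeholder residual branches standing in for pretrained sub-networks.
f = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
g = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
block = ReversibleBlock(f, g)

x1, x2 = torch.randn(2, 64), torch.randn(2, 64)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))  # True True
```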
Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision
We tackle the problem of Human Locomotion Forecasting, a task for jointly
predicting the spatial positions of several keypoints on the human body in the
near future under an egocentric setting. In contrast to the previous work that
aims to solve either the task of pose prediction or trajectory forecasting in
isolation, we propose a framework to unify the two problems and address the
practically useful task of pedestrian locomotion prediction in the wild. Among
the major challenges in solving this task is the scarcity of annotated
egocentric video datasets with dense annotations for pose, depth, or egomotion.
To surmount this difficulty, we use state-of-the-art models to generate (noisy)
annotations and propose robust forecasting models that can learn from this
noisy supervision. We present a method to disentangle the overall pedestrian
motion into easier-to-learn subparts by utilizing a pose completion and a
decomposition module. The completion module fills in the missing key-point
annotations and the decomposition module breaks the cleaned locomotion down to
global (trajectory) and local (pose keypoint movements). Further, with a
Quasi-RNN as our backbone, we propose a novel hierarchical trajectory forecasting
network that utilizes low-level vision domain specific signals like egomotion
and depth to predict the global trajectory. Our method leads to
state-of-the-art results for the prediction of human locomotion in the
egocentric view. Project page: https://karttikeya.github.io/publication/plf/
Comment: Accepted to WACV 2020 (Oral)
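An illustrative sketch of the decomposition step described above, assuming the global trajectory is taken as the per-frame mean of the completed keypoints and the local pose as the residual around it; the exact reference point and representation used in the paper may differ:

```python
import numpy as np

def decompose_locomotion(keypoints):
    """keypoints: array (T, K, 2) of K completed 2D keypoints over T observed frames."""
    trajectory = keypoints.mean(axis=1, keepdims=True)   # global motion, shape (T, 1, 2)
    local_pose = keypoints - trajectory                  # pose residuals, shape (T, K, 2)
    return trajectory.squeeze(1), local_pose

def recompose(trajectory, local_pose):
    # Inverse of the decomposition: add the global trajectory back onto the residuals.
    return local_pose + trajectory[:, None, :]

kps = np.random.rand(30, 17, 2)          # 30 observed frames, 17 COCO-style joints
traj, pose = decompose_locomotion(kps)
assert np.allclose(recompose(traj, pose), kps)
```

The two parts can then be forecast separately, with the global trajectory network additionally consuming ego-motion and depth signals as the abstract describes.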
Big Little Transformer Decoder
The recent emergence of Large Language Models based on the Transformer
architecture has enabled dramatic advancements in the field of Natural Language
Processing. However, these models have long inference latency, which limits
their deployment, and which makes them prohibitively expensive for various
real-time applications. The inference latency is further exacerbated by
autoregressive generative tasks, as models need to run iteratively to generate
tokens sequentially without leveraging token-level parallelization. To address
this, we propose Big Little Decoder (BiLD), a framework that can improve
inference efficiency and latency for a wide range of text generation
applications. The BiLD framework contains two models with different sizes that
collaboratively generate text. The small model runs autoregressively to
generate text with a low inference cost, and the large model is only invoked
occasionally to refine the small model's inaccurate predictions in a
non-autoregressive manner. To coordinate the small and large models, BiLD
introduces two simple yet effective policies: (1) the fallback policy that
determines when to hand control over to the large model; and (2) the rollback
policy that determines when the large model needs to review and correct the
small model's inaccurate predictions. To evaluate our framework across
different tasks and models, we apply BiLD to various text generation scenarios
encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En,
summarization on CNN/DailyMail, and language modeling on WikiText-2. On an
NVIDIA Titan Xp GPU, our framework achieves a speedup of up to 2.13x without
any performance drop, and it achieves up to 2.38x speedup with only ~1 point
degradation. Furthermore, our framework is fully plug-and-play as it does not
require any training or modifications to model architectures. Our code will be
open-sourced.
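A simplified sketch of the fallback/rollback coordination described above, with toy stand-ins for the two models; the interfaces, confidence measure, and thresholds are assumptions for exposition, not the released BiLD implementation:

```python
import random

def bild_decode(prompt, small_step, large_check, max_len=20,
                fallback_conf=0.6, rollback_dist=1.0):
    """small_step(tokens) -> (next_token, confidence) from the cheap autoregressive model.
    large_check(tokens, start) -> [(position, preferred_token, distance), ...] from one
    parallel (non-autoregressive) pass of the large model over positions >= start."""
    tokens, verified = list(prompt), len(prompt)
    while len(tokens) < max_len:
        token, confidence = small_step(tokens)
        tokens.append(token)
        if confidence >= fallback_conf:
            continue                      # keep drafting cheaply with the small model
        # Fallback policy: hand control to the large model, which reviews all
        # not-yet-verified positions in a single parallel pass.
        for pos, preferred, distance in large_check(tokens, verified):
            if distance > rollback_dist:
                # Rollback policy: rewind to the first strong disagreement and keep
                # the large model's token instead of the small model's.
                tokens = tokens[:pos] + [preferred]
                break
        verified = len(tokens)
    return tokens

# Toy stand-ins so the sketch runs end to end; a real system would wrap two language models.
VOCAB = list(range(10))

def toy_small(tokens):
    return random.choice(VOCAB), random.random()

def toy_large(tokens, start):
    return [(i, random.choice(VOCAB), 2 * random.random()) for i in range(start, len(tokens))]

print(bild_decode([1, 2, 3], toy_small, toy_large))
```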