40 research outputs found
Hidden Two-Stream Convolutional Networks for Action Recognition
Analyzing videos of human actions involves understanding the temporal
relationships among video frames. State-of-the-art action recognition
approaches rely on traditional optical flow estimation methods to pre-compute
motion information for CNNs. Such a two-stage approach is computationally
expensive, storage demanding, and not end-to-end trainable. In this paper, we
present a novel CNN architecture that implicitly captures motion information
between adjacent frames. We name our approach hidden two-stream CNNs because it
only takes raw video frames as input and directly predicts action classes
without explicitly computing optical flow. Our end-to-end approach is 10x
faster than its two-stage baseline. Experimental results on four challenging
action recognition datasets: UCF101, HMDB51, THUMOS14 and ActivityNet v1.2 show
that our approach significantly outperforms the previous best real-time
approaches.Comment: Accepted at ACCV 2018, camera ready. Code available at
https://github.com/bryanyzhu/Hidden-Two-Strea
VideoGraph: Recognizing Minutes-Long Human Activities in Videos
Many human activities take minutes to unfold. To represent them, related
works opt for statistical pooling, which neglects the temporal structure.
Others opt for convolutional methods, as CNN and Non-Local. While successful in
learning temporal concepts, they are short of modeling minutes-long temporal
dependencies. We propose VideoGraph, a method to achieve the best of two
worlds: represent minutes-long human activities and learn their underlying
temporal structure. VideoGraph learns a graph-based representation for human
activities. The graph, its nodes and edges are learned entirely from video
datasets, making VideoGraph applicable to problems without node-level
annotation. The result is improvements over related works on benchmarks:
Epic-Kitchen and Breakfast. Besides, we demonstrate that VideoGraph is able to
learn the temporal structure of human activities in minutes-long videos
Video Synthesis from the StyleGAN Latent Space
Generative models have shown impressive results in generating synthetic images. However, video synthesis is still difficult to achieve, even for these generative models. The best videos that generative models can currently create are a few seconds long, distorted, and low resolution. For this project, I propose and implement a model to synthesize videos at 1024x1024x32 resolution that include human facial expressions by using static images generated from a Generative Adversarial Network trained on the human facial images. To the best of my knowledge, this is the first work that generates realistic videos that are larger than 256x256 resolution from single starting images. This model improves the video synthesis in both quantitative and qualitative ways compared to two state-of-the-art models: TGAN and MocoGAN. In a quantitative comparison, this project reaches a best Average Content Distance (ACD) score of 0.167, as compared to 0.305 and 0.201 of TGAN and MocoGAN, respectively