366 research outputs found
3D pose estimation of flying animals in multi-view video datasets
Flying animals such as bats, birds, and moths are actively studied by researchers wanting to better understand these animals’ behavior and flight characteristics. Towards this goal, multi-view videos of flying animals have been recorded both in lab- oratory conditions and natural habitats. The analysis of these videos has shifted over time from manual inspection by scientists to more automated and quantitative approaches based on computer vision algorithms.
This thesis describes a study on the largely unexplored problem of 3D pose estimation of flying animals in multi-view video data. This problem has received little attention in the computer vision community where few flying animal datasets exist. Additionally, published solutions from researchers in the natural sciences have not taken full advantage of advancements in computer vision research. This thesis addresses this gap by proposing three different approaches for 3D pose estimation of flying animals in multi-view video datasets, which evolve from successful pose estimation paradigms used in computer vision. The first approach models the appearance of a flying animal with a synthetic 3D graphics model and then uses a Markov Random Field to model 3D pose estimation over time as a single optimization problem. The second approach builds on the success of Pictorial Structures models and further improves them for the case where only a sparse set of landmarks are annotated in training data. The proposed approach first discovers parts from regions of the training images that are not annotated. The discovered parts are then used to generate more accurate appearance likelihood terms which in turn produce more accurate landmark localizations. The third approach takes advantage of the success of deep learning models and adapts existing deep architectures to perform landmark localization. Both the second and third approaches perform 3D pose estimation by first obtaining accurate localization of key landmarks in individual views, and then using calibrated cameras and camera geometry to reconstruct the 3D position of key landmarks.
This thesis shows that the proposed algorithms generate first-of-a-kind and leading results on real world datasets of bats and moths, respectively. Furthermore, a variety of resources are made freely available to the public to further strengthen the connection between research communities
Improving Semantic Embedding Consistency by Metric Learning for Zero-Shot Classification
This paper addresses the task of zero-shot image classification. The key
contribution of the proposed approach is to control the semantic embedding of
images -- one of the main ingredients of zero-shot learning -- by formulating
it as a metric learning problem. The optimized empirical criterion associates
two types of sub-task constraints: metric discriminating capacity and accurate
attribute prediction. This results in a novel expression of zero-shot learning
not requiring the notion of class in the training phase: only pairs of
image/attributes, augmented with a consistency indicator, are given as ground
truth. At test time, the learned model can predict the consistency of a test
image with a given set of attributes , allowing flexible ways to produce
recognition inferences. Despite its simplicity, the proposed approach gives
state-of-the-art results on four challenging datasets used for zero-shot
recognition evaluation.Comment: in ECCV 2016, Oct 2016, amsterdam, Netherlands. 201
Video Storytelling: Textual Summaries for Events
Bridging vision and natural language is a longstanding goal in computer
vision and multimedia research. While earlier works focus on generating a
single-sentence description for visual content, recent works have studied
paragraph generation. In this work, we introduce the problem of video
storytelling, which aims at generating coherent and succinct stories for long
videos. Video storytelling introduces new challenges, mainly due to the
diversity of the story and the length and complexity of the video. We propose
novel methods to address the challenges. First, we propose a context-aware
framework for multimodal embedding learning, where we design a Residual
Bidirectional Recurrent Neural Network to leverage contextual information from
past and future. Second, we propose a Narrator model to discover the underlying
storyline. The Narrator is formulated as a reinforcement learning agent which
is trained by directly optimizing the textual metric of the generated story. We
evaluate our method on the Video Story dataset, a new dataset that we have
collected to enable the study. We compare our method with multiple
state-of-the-art baselines, and show that our method achieves better
performance, in terms of quantitative measures and user study.Comment: Published in IEEE Transactions on Multimedi
- …