GolfDB: A Video Database for Golf Swing Sequencing
The golf swing is a complex movement requiring considerable full-body
coordination to execute proficiently. As such, it is the subject of frequent
scrutiny and extensive biomechanical analyses. In this paper, we introduce the
notion of golf swing sequencing for detecting key events in the golf swing and
facilitating golf swing analysis. To enable consistent evaluation of golf swing
sequencing performance, we also introduce the benchmark database GolfDB,
consisting of 1400 high-quality golf swing videos, each labeled with event
frames, bounding box, player name and sex, club type, and view type.
Furthermore, to act as a reference baseline for evaluating golf swing
sequencing performance on GolfDB, we propose a lightweight deep neural network
called SwingNet, which possesses a hybrid deep convolutional and recurrent
neural network architecture. SwingNet correctly detects eight golf swing events
at an average rate of 76.1%, and six out of eight events at a rate of 91.8%. In
line with the proposed baseline SwingNet, we advocate the use of
computationally efficient models in future research to promote in-the-field
analysis via deployment on readily-available mobile devices.
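As an illustration of the hybrid convolutional-recurrent design the abstract describes, below is a minimal sketch assuming a MobileNetV2 backbone, a single bidirectional LSTM, and per-frame event classification; the backbone choice, layer sizes, and nine-class output (eight events plus background) are illustrative assumptions, not the published SwingNet configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class SwingEventNet(nn.Module):
    """Hybrid CNN + RNN for per-frame golf swing event classification (a sketch)."""

    def __init__(self, num_events=8, hidden=256):
        super().__init__()
        backbone = models.mobilenet_v2(weights=None)   # assumed lightweight backbone
        self.cnn = backbone.features                   # per-frame feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.rnn = nn.LSTM(1280, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_events + 1)  # +1 "no event" class

    def forward(self, clip):                           # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))           # (B*T, 1280, h, w)
        feats = self.pool(feats).flatten(1).view(b, t, -1)
        seq, _ = self.rnn(feats)                       # temporal context over frames
        return self.head(seq)                          # (B, T, num_events + 1) logits

logits = SwingEventNet()(torch.randn(2, 16, 3, 160, 160))
print(logits.shape)                                    # torch.Size([2, 16, 9])
```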
Human Action Recognition and Prediction: A Survey
Derived from rapid advances in computer vision and machine learning, video
analysis tasks have been moving from inferring the present state to predicting
the future state. Vision-based action recognition and prediction from videos
are such tasks, where action recognition is to infer human actions (present
state) based upon complete action executions, and action prediction to predict
human actions (future state) based upon incomplete action executions. These two
tasks have become particularly prevalent research topics recently because of their
rapidly emerging real-world applications, such as visual surveillance,
autonomous driving, entertainment, and video retrieval. Much effort has been
devoted over the last few decades to building robust and effective frameworks
for action recognition and prediction. In this paper, we provide a complete
survey of state-of-the-art techniques for action recognition and prediction.
Existing models, popular algorithms, technical difficulties, popular action
databases, evaluation protocols, and promising future directions are also
discussed systematically.
Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions
In-depth scene descriptions and question answering tasks have greatly
increased the scope of today's definition of scene understanding. While such
tasks are in principle open ended, current formulations primarily focus on
describing only the current state of the scenes under consideration. In
contrast, in this paper, we focus on the future states of the scenes which are
also conditioned on actions. We posit this as a question answering task, where
an answer has to be given about a future scene state, given observations of the
current scene, and a question that includes a hypothetical action. Our solution
is a hybrid model which integrates a physics engine into a question answering
architecture in order to anticipate future scene states resulting from
object-object interactions caused by an action. We demonstrate first results on
this challenging new problem and compare to baselines, where we outperform
fully data-driven end-to-end learning approaches.
Comment: Paper: 18 pages, 5 figures, 5 tables. Supplementary material: 3 pages, 1 figure, 1 table. To be published in the VLEASE ECCV 2018 workshop.
Learning Spatiotemporal Features via Video and Text Pair Discrimination
Current video representations heavily rely on learning from manually
annotated video datasets which are time-consuming and expensive to acquire. We
observe videos are naturally accompanied by abundant text information such as
YouTube titles and Instagram captions. In this paper, we leverage this
visual-textual connection to learn spatiotemporal features in an efficient
weakly-supervised manner. We present a general cross-modal pair discrimination
(CPD) framework to capture this correlation between a video and its associated
text. Specifically, we adopt noise-contrastive estimation to tackle the
computational issue imposed by the huge amount of pair instance classes and
design a practical curriculum learning strategy. We train our CPD models on
both a standard video dataset (Kinetics-210k) and an uncurated web video dataset
(Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning,
the learnt models obtain competitive results for action classification on
Kinetics under the linear classification protocol. Moreover, our visual model
provides an effective initialization to fine-tune on downstream tasks, which
yields a remarkable performance gain for action recognition on UCF101 and
HMDB51, compared with the existing state-of-the-art self-supervised training
methods. In addition, our CPD model yields a new state of the art for zero-shot
action recognition on UCF101 by directly utilizing the learnt visual-textual
embeddings. The code will be made available at
https://github.com/MCG-NJU/CPD-Video.
Comment: Technical Report.
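A simplified sketch of the cross-modal pair discrimination idea follows. The abstract uses noise-contrastive estimation to cope with the huge number of pair classes, whereas this sketch substitutes an in-batch softmax where the other pairs act as negatives; the temperature and embedding size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pair_discrimination_loss(video_emb, text_emb, temperature=0.07):
    """Pulls each video toward its own title/caption embedding and pushes it
    away from the other pairs in the batch (used here as the noise samples)."""
    v = F.normalize(video_emb, dim=-1)                  # (N, D) video embeddings
    t = F.normalize(text_emb, dim=-1)                   # (N, D) text embeddings
    logits = v @ t.T / temperature                      # all pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)  # matched pair is the positive
    return 0.5 * (F.cross_entropy(logits, targets) +    # video -> text
                  F.cross_entropy(logits.T, targets))   # text  -> video

loss = pair_discrimination_loss(torch.randn(8, 512), torch.randn(8, 512))
```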
Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks
There has been an explosion of multimodal content generated on social media
networks in the last few years, which has necessitated a deeper understanding
of social media content and user behavior. We present a novel
content-independent content-user-reaction model for social multimedia content
analysis. Compared to prior works that generally tackle semantic content
understanding and user behavior modeling in isolation, we propose a generalized
solution to these problems within a unified framework. We embed users, images
and text drawn from open social media in a common multimodal geometric space,
using a novel loss function designed to cope with distant and disparate
modalities, and thereby enable seamless three-way retrieval. Our model not only
outperforms unimodal embedding based methods on cross-modal retrieval tasks but
also shows improvements stemming from jointly solving the two tasks on Twitter
data. We also show that the user embeddings learned within our joint multimodal
embedding model are better at predicting user interests compared to those
learned with unimodal content on Instagram data. Our framework thus goes beyond
the prior practice of using explicit leader-follower link information to
establish affiliations by extracting implicit content-centric affiliations from
isolated users. We provide qualitative results to show that the user clusters
emerging from learned embeddings have consistent semantics and the ability of
our model to discover fine-grained semantics from noisy and unstructured data.
Our work reveals that social multimedia content is inherently multimodal and
possesses a consistent structure, because in social networks meaning is created
through interactions between users and content.
Comment: Preprint submitted to IJC
Modeling Image Virality with Pairwise Spatial Transformer Networks
The study of virality and information diffusion online is a topic gaining
traction rapidly in the computational social sciences. Computer vision and
social network analysis research have also focused on understanding the impact
of content and information diffusion in making content viral, but prior
approaches have not performed as well as they do on more traditional
classification tasks. In this paper, we present a novel pairwise reformulation
of the virality prediction problem as an attribute prediction task and develop
a novel algorithm to model image virality on online media using a pairwise
neural network. Our model provides significant insights into the features that
are responsible for promoting virality and surpasses the existing
state-of-the-art by a 12% average improvement in prediction. We also
investigate the effect of external category supervision on relative attribute
prediction and observe an increase in prediction accuracy for the same across
several attribute learning datasets.
Comment: 9 pages. Accepted as a full paper at the ACM Multimedia Conference (MM) 201
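A minimal sketch of casting virality prediction as a pairwise (relative attribute) task: a shared encoder embeds two images and a comparison head predicts which one is more viral. The tiny convolutional encoder stands in for a real backbone, the paper's spatial transformer components are omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class PairwiseViralityNet(nn.Module):
    """Siamese-style pairwise model: given two images, predict which is more viral."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(              # shared image encoder (toy)
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.compare = nn.Linear(2 * feat_dim, 1)  # logit: P(image A is more viral)

    def forward(self, img_a, img_b):
        fa, fb = self.encoder(img_a), self.encoder(img_b)
        return self.compare(torch.cat([fa, fb], dim=-1)).squeeze(-1)

model = PairwiseViralityNet()
score = model(torch.randn(4, 3, 128, 128), torch.randn(4, 3, 128, 128))
loss = nn.BCEWithLogitsLoss()(score, torch.ones(4))  # label 1: image A more viral
```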
Prediction and Description of Near-Future Activities in Video
Most of the existing works on human activity analysis focus on recognition or
early recognition of the activity labels from complete or partial observations.
Similarly, existing video captioning approaches focus on the observed events in
videos. Predicting the labels and the captions of future activities where no
frames of the predicted activities have been observed is a challenging problem,
with important applications that require anticipatory response. In this work,
we propose a system that can infer the labels and the captions of a sequence of
future activities. Our proposed network for label prediction of a future
activity sequence is similar to a hybrid Siamese network with three branches
where the first branch takes visual features from the objects present in the
scene, the second branch takes observed activity features and the third branch
captures the last observed activity features. The predicted labels and the
observed scene context are then mapped to meaningful captions using a
sequence-to-sequence learning-based method. Experiments on three challenging
activity analysis datasets and a video description dataset demonstrate that
both our label prediction framework and captioning framework outperform the
state of the art.
Comment: 14 pages, 4 figures, 14 tables.
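A rough sketch of the three-branch fusion described for future-activity label prediction, assuming precomputed feature vectors for scene objects, the observed activity history, and the last observed activity; the dimensions, concatenation-based fusion, and class count are illustrative assumptions rather than the paper's exact hybrid Siamese design.

```python
import torch
import torch.nn as nn

class ThreeBranchPredictor(nn.Module):
    """Fuses object, activity-history, and last-activity features to predict
    the label of the next (unobserved) activity."""

    def __init__(self, obj_dim=512, act_dim=512, hidden=256, num_classes=50):
        super().__init__()
        self.obj_branch = nn.Sequential(nn.Linear(obj_dim, hidden), nn.ReLU())
        self.hist_branch = nn.Sequential(nn.Linear(act_dim, hidden), nn.ReLU())
        self.last_branch = nn.Sequential(nn.Linear(act_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(3 * hidden, num_classes)

    def forward(self, obj_feat, hist_feat, last_feat):
        fused = torch.cat([self.obj_branch(obj_feat),
                           self.hist_branch(hist_feat),
                           self.last_branch(last_feat)], dim=-1)
        return self.classifier(fused)  # logits over future activity labels

logits = ThreeBranchPredictor()(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```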
Towards Physics-informed Deep Learning for Turbulent Flow Prediction
While deep learning has shown tremendous success in a wide range of domains,
it remains a grand challenge to incorporate physical principles in a systematic
manner to the design, training, and inference of such models. In this paper, we
aim to predict turbulent flow by learning its highly nonlinear dynamics from
spatiotemporal velocity fields of large-scale fluid flow simulations of
relevance to turbulence modeling and climate modeling. We adopt a hybrid
approach by marrying two well-established turbulent flow simulation techniques
with deep learning. Specifically, we introduce trainable spectral filters in a
coupled model of Reynolds-averaged Navier-Stokes (RANS) and Large Eddy
Simulation (LES), followed by a specialized U-net for prediction. Our approach,
which we call turbulent-Flow Net (TF-Net), is grounded in a principled physics
model, yet offers the flexibility of learned representations. We compare our
model, TF-Net, with state-of-the-art baselines and observe significant
reductions in error for predictions 60 frames ahead. Most importantly, our
method predicts physical fields that obey desirable physical characteristics,
such as conservation of mass, whilst faithfully emulating the turbulent kinetic
energy field and spectrum, which are critical for accurate prediction of
turbulent flows.
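A toy sketch of the decomposition idea behind TF-Net: a trainable spatial filter splits the velocity field into a smoothed large-scale component and a residual, each encoded separately before prediction. The actual TF-Net couples RANS- and LES-style filtering with a specialized U-net; the kernel size, channel counts, and single-step predictor below are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TrainableDecomposition(nn.Module):
    """Learned filter-based decomposition of a 2D velocity field (a sketch)."""

    def __init__(self, channels=2, k=5):
        super().__init__()
        self.filter = nn.Conv2d(channels, channels, k, padding=k // 2,
                                groups=channels, bias=False)
        nn.init.constant_(self.filter.weight, 1.0 / (k * k))  # start as a box filter
        self.enc_mean = nn.Conv2d(channels, 16, 3, padding=1)
        self.enc_res = nn.Conv2d(channels, 16, 3, padding=1)
        self.decode = nn.Conv2d(32, channels, 3, padding=1)

    def forward(self, w):                  # w: (B, 2, H, W) velocity field
        w_bar = self.filter(w)             # learned large-scale component
        w_prime = w - w_bar                # residual (small-scale) component
        h = torch.cat([self.enc_mean(w_bar), self.enc_res(w_prime)], dim=1)
        return self.decode(torch.relu(h))  # next-step velocity prediction

pred = TrainableDecomposition()(torch.randn(1, 2, 64, 64))
```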
Predicting How to Distribute Work Between Algorithms and Humans to Segment an Image Batch
Foreground object segmentation is a critical step for many image analysis
tasks. While automated methods can produce high-quality results, their failures
disappoint users in need of practical solutions. We propose a resource
allocation framework for predicting how best to allocate a fixed budget of
human annotation effort in order to collect higher quality segmentations for a
given batch of images and automated methods. The framework is based on a
prediction module that estimates the quality of given algorithm-drawn
segmentations. We demonstrate the value of the framework for two novel tasks
related to predicting how to distribute annotation efforts between algorithms
and humans. Specifically, we develop two systems that automatically decide, for
a batch of images, when to recruit humans versus computers to create 1) coarse
segmentations required to initialize segmentation tools and 2) final,
fine-grained segmentations. Experiments demonstrate the advantage of relying on
a mix of human and computer efforts over relying on either resource alone for
segmenting objects in images coming from three diverse modalities (visible,
phase contrast microscopy, and fluorescence microscopy).
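A small sketch of the allocation step only, assuming a quality-prediction model already scores each algorithm-drawn segmentation: the fixed human-annotation budget is spent on the images where the algorithm is predicted to do worst. The scoring model itself is not shown.

```python
def allocate_annotation_budget(predicted_quality, budget):
    """Send the `budget` lowest-predicted-quality images to human annotators;
    keep the algorithm's segmentation for the rest."""
    ranked = sorted(range(len(predicted_quality)),
                    key=lambda i: predicted_quality[i])   # worst-predicted first
    to_humans = set(ranked[:budget])
    return ["human" if i in to_humans else "algorithm"
            for i in range(len(predicted_quality))]

print(allocate_annotation_budget([0.9, 0.4, 0.7, 0.2], budget=2))
# ['algorithm', 'human', 'algorithm', 'human']
```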
Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification
Videos are inherently multimodal. This paper studies the problem of how to
fully exploit the abundant multimodal clues for improved video categorization.
We introduce a hybrid deep learning framework that integrates useful clues from
multiple modalities, including static spatial appearance information, motion
patterns within a short time window, audio information as well as long-range
temporal dynamics. More specifically, we utilize three Convolutional Neural
Networks (CNNs) operating on appearance, motion and audio signals to extract
their corresponding features. We then employ a feature fusion network to derive
a unified representation with an aim to capture the relationships among
features. Furthermore, to exploit the long-range temporal dynamics in videos,
we apply two Long Short Term Memory networks with extracted appearance and
motion features as inputs. Finally, we also propose to refine the prediction
scores by leveraging contextual relationships among video semantics. The hybrid
deep learning framework is able to exploit a comprehensive set of multimodal
features for video classification. Through an extensive set of experiments, we
demonstrate that (1) LSTM networks which model sequences in an explicitly
recurrent manner are highly complementary with CNN models; (2) the feature
fusion network which produces a fused representation through modeling feature
relationships outperforms alternative fusion strategies; (3) the semantic
context of video classes can help further refine the predictions for improved
performance. Experimental results on two challenging benchmarks, the UCF-101
and the Columbia Consumer Videos (CCV), provide strong quantitative evidence
that our framework achieves promising results on both the UCF-101 and
the CCV, outperforming competing methods with clear margins.
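A compact sketch of late fusion over the three modality streams the abstract lists (appearance, motion, audio), assuming the per-stream CNN features are precomputed; the projection sizes and single fusion layer are illustrative assumptions rather than the paper's exact fusion network, and the LSTM temporal branch and semantic-context refinement are omitted.

```python
import torch
import torch.nn as nn

class ModalityFusionClassifier(nn.Module):
    """Late fusion of appearance, motion, and audio CNN features (a sketch)."""

    def __init__(self, dims=(2048, 2048, 128), hidden=512, num_classes=101):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        self.fusion = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, appearance, motion, audio):
        parts = [torch.relu(p(x)) for p, x in
                 zip(self.proj, (appearance, motion, audio))]
        return self.classifier(self.fusion(torch.cat(parts, dim=-1)))

logits = ModalityFusionClassifier()(torch.randn(4, 2048),
                                    torch.randn(4, 2048),
                                    torch.randn(4, 128))
```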