Second-order Temporal Pooling for Action Recognition
Deep learning models for video-based action recognition usually generate
features for short clips (consisting of a few frames); such clip-level features
are aggregated to video-level representations by computing statistics on these
features. Typically, zeroth-order (max) or first-order (average) statistics are
used. In this paper, we explore the benefits of using second-order statistics.
Specifically, we propose a novel end-to-end learnable feature aggregation
scheme, dubbed temporal correlation pooling that generates an action descriptor
for a video sequence by capturing the similarities between the temporal
evolution of clip-level CNN features computed across the video. Such a
descriptor, while being computationally cheap, also naturally encodes the
co-activations of multiple CNN features, thereby providing a richer
characterization of actions than their first-order counterparts. We also
propose higher-order extensions of this scheme by computing correlations after
embedding the CNN features in a reproducing kernel Hilbert space. We provide
experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained
datasets such as MPII Cooking activities and JHMDB, as well as the recent
Kinetics-600. Our results demonstrate the advantages of higher-order pooling
schemes, which, when combined with hand-crafted features (as is standard
practice), achieve state-of-the-art accuracy.
Comment: Accepted in the International Journal of Computer Vision (IJCV)
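The core idea of the abstract above, pooling clip-level CNN features into a second-order descriptor via their temporal correlations, can be sketched in plain NumPy. This is an illustrative reading of the abstract, not the authors' implementation; the function name, the normalisation to correlations, and the upper-triangle flattening are assumptions:

```python
import numpy as np

def temporal_correlation_pooling(clip_features):
    """Aggregate T clip-level feature vectors (T x D) into one
    second-order video descriptor: the D x D correlation matrix of
    the features' temporal evolution, flattened to a vector."""
    X = np.asarray(clip_features, dtype=float)    # shape (T, D)
    X = X - X.mean(axis=0, keepdims=True)         # centre each feature over time
    C = X.T @ X / max(X.shape[0] - 1, 1)          # D x D covariance of co-activations
    # normalise to correlations so the descriptor is scale-invariant
    d = np.sqrt(np.clip(np.diag(C), 1e-12, None))
    C = C / np.outer(d, d)
    # the matrix is symmetric, so keep only the upper triangle
    iu = np.triu_indices_from(C)
    return C[iu]

# e.g. 10 clips with 4-dimensional features -> D*(D+1)/2 = 10 values
desc = temporal_correlation_pooling(np.random.rand(10, 4))
```

Unlike average pooling, the descriptor records which feature dimensions co-activate over the video, which is what makes it "second-order".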
Unsupervised Human Action Detection by Action Matching
We propose a new task of unsupervised action detection by action matching.
Given two long videos, the objective is to temporally detect all pairs of
matching video segments. A pair of video segments are matched if they share the
same human action. The task is category independent---it does not matter what
action is being performed---and no supervision is used to discover such video
segments. Unsupervised action detection by action matching allows us to align
videos in a meaningful manner. As such, it can be used to discover new action
categories or as an action proposal technique within, say, an action detection
pipeline. Moreover, it is a useful pre-processing step for generating video
highlights, e.g., from sports videos.
We present an effective and efficient method for unsupervised action
detection. We use an unsupervised temporal encoding method and exploit the
temporal consistency in human actions to obtain candidate action segments. We
evaluate our method on this challenging task using three activity recognition
benchmarks, namely, the MPII Cooking activities dataset, the THUMOS15 action
detection benchmark and a new dataset called the IKEA dataset. On the MPII
Cooking dataset we detect action segments with a precision of 21.6% and recall
of 11.7% over 946 long video pairs and over 5000 ground truth action segments.
Similarly, on the THUMOS dataset we obtain 18.4% precision and 25.1% recall over
5094 ground truth action segment pairs.
Comment: IEEE International Conference on Computer Vision and Pattern Recognition CVPR 2017 Workshop
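The pairwise-matching core of the task above can be illustrated in a few lines: given per-segment descriptors from two videos, report every pair whose cosine similarity clears a threshold. This is a naive sketch only; the paper's method additionally uses unsupervised temporal encoding and temporal consistency, which are omitted here, and the function name and threshold are assumptions:

```python
import numpy as np

def match_segments(desc_a, desc_b, threshold=0.9):
    """Return (i, j) index pairs of candidate segments from two videos
    (arrays of shape n_a x D and n_b x D) whose descriptors have
    cosine similarity above `threshold`."""
    A = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    B = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = A @ B.T                  # n_a x n_b cosine-similarity matrix
    return [(i, j) for i, j in zip(*np.nonzero(sim > threshold))]
```

Because no action labels are used anywhere, the matching stays category-independent, as the abstract requires.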
An Early Holiday Surprise: Cholecystitis Wrapped in Takotsubo Cardiomyopathy
This is a novel case report of a 44-year-old woman who presented to the emergency department with epigastric pain wrapping around to her back. She had no risk factors for cardiac disease, but her initial electrocardiogram (ECG) showed a Wellens syndrome pattern and she was taken urgently to the catheterization lab. After a negative catheterization, she underwent cardiac magnetic resonance imaging, which was positive for Takotsubo cardiomyopathy (TC). Ultimately, abdominal computed tomography revealed that she had cholecystitis, which was likely the cause of her TC and ECG changes.
The Allure of Celebrities: Unpacking Their Polysemic Consumer Appeal
To explain their deep resonance with consumers, this paper unpacks the individual constituents of a celebrity’s polysemic appeal. While celebrities are traditionally theorised as unidimensional ‘semiotic receptacles of cultural meaning’, we conceptualise them here instead as human beings/performers with a multi-constitutional, polysemic consumer appeal.
Supporting evidence is drawn from autoethnographic data collected over a total period of 25 months and structured through a hermeneutic analysis.
In ‘rehumanising’ the celebrity, the study finds that each celebrity offers the individual consumer a unique and very personal parasocial appeal as a) the performer, b) the ‘private’ person behind the public performer, c) the tangible manifestation of either through products, and d) the social link to other consumers. The more strongly these constituents, individually or in combination, appeal to the consumer’s personal desires, the more emotionally attached the consumer feels to this particular celebrity.
Although using autoethnography means that the breadth of collected data is limited, the depth of insight this approach garners sufficiently unpacks the polysemic appeal of celebrities to consumers.
The findings encourage talent agents, publicists and marketing managers to reconsider underlying assumptions in their talent management and/or celebrity endorsement practices. While prior research on celebrity appeal has tended to enshrine celebrities in a “dehumanised” structuralist semiosis, which erases the very idea of individualised consumer meanings, this paper reveals the multi-constitutional polysemy of any particular celebrity’s personal appeal as a performer and human being to any particular consumer.
Guided Open Vocabulary Image Captioning with Constrained Beam Search
Existing image captioning models do not generalize well to out-of-domain
images containing novel scenes or objects. This limitation severely hinders the
use of these models in real world applications dealing with images in the wild.
We address this problem using a flexible approach that enables existing deep
captioning architectures to take advantage of image taggers at test time,
without re-training. Our method uses constrained beam search to force the
inclusion of selected tag words in the output, and fixed, pretrained word
embeddings to facilitate vocabulary expansion to previously unseen tag words.
Using this approach we achieve state-of-the-art results for out-of-domain
captioning on MSCOCO (and improved results for in-domain captioning). Perhaps
surprisingly, our results significantly outperform approaches that incorporate
the same tag predictions into the learning algorithm. We also show that we can
significantly improve the quality of generated ImageNet captions by leveraging
ground-truth labels.
Comment: EMNLP 201
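The constraint mechanism described above, forcing selected tag words into the decoded caption, can be illustrated with a toy beam search in which each hypothesis carries the set of required words it has already emitted, and only fully covered hypotheses may be returned. This is a deliberately simplified sketch, not the paper's algorithm or code; the scorer, names, and fixed-length decoding are invented for illustration:

```python
import heapq

def constrained_beam_search(score_fn, vocab, must_include,
                            beam_size=4, max_len=8):
    """Toy constrained beam search: hypotheses are (neg_logprob,
    sequence, covered_constraints); only hypotheses that have emitted
    every word in `must_include` are eligible as final outputs.
    `score_fn(prefix, word)` returns a log-probability."""
    beam = [(0.0, (), frozenset())]
    for _ in range(max_len):
        candidates = []
        for neg_lp, seq, covered in beam:
            for w in vocab:
                nc = covered | ({w} & must_include)   # track satisfied constraints
                candidates.append((neg_lp - score_fn(seq, w), seq + (w,), nc))
        beam = heapq.nsmallest(beam_size, candidates, key=lambda c: c[0])
    complete = [c for c in beam if c[2] == must_include]
    return min(complete, key=lambda c: c[0])[1] if complete else None

def toy_score(prefix, word):
    # hypothetical log-probabilities: the model prefers "a", but the
    # constraint will force "c" into the output anyway
    return {"a": 0.0, "b": -3.0, "c": -2.0}[word]

best = constrained_beam_search(toy_score, ["a", "b", "c"],
                               must_include=frozenset({"c"}),
                               beam_size=4, max_len=2)
```

An unconstrained search here would emit only "a"s; tracking constraint coverage is what guarantees the tag word appears, mirroring how the paper injects image-tagger outputs at test time without retraining.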