Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications
Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a model once, and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features. This is in contrast to previous video ZSL methods, which use pretrained feature extractors. We also extend the current benchmarking paradigm: Previous techniques aim to make the test task unknown at training time but fall short of this goal. We encourage domain shift across training and test data and disallow tailoring a ZSL model to a specific test dataset. We outperform the state-of-the-art by a wide margin. Our code, evaluation procedure and model weights are available at this http URL
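To make the recipe concrete, here is a minimal sketch of end-to-end video ZSL in the spirit this abstract describes: a trainable 3D CNN regresses clips onto the word-embedding space of class names, and unseen classes are predicted by nearest-neighbor search in that space. The tiny backbone, the cosine loss, and all names (E2EZeroShotVideoNet, zsl_step, zsl_predict) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class E2EZeroShotVideoNet(nn.Module):
    """Illustrative end-to-end ZSL video classifier: a trainable 3D CNN
    regresses clip features onto the word-embedding space of class names."""
    def __init__(self, embed_dim=300):
        super().__init__()
        # Tiny stand-in for the trainable 3D CNN backbone of the paper.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, embed_dim)  # map to the semantic space

    def forward(self, clips):                 # clips: (B, 3, T, H, W)
        return self.head(self.backbone(clips))

def zsl_step(model, clips, class_vecs, labels):
    """One training step: pull clip embeddings toward their class word vectors."""
    pred = F.normalize(model(clips), dim=-1)
    target = F.normalize(class_vecs[labels], dim=-1)
    return (1 - (pred * target).sum(-1)).mean()  # cosine regression loss

def zsl_predict(model, clips, unseen_class_vecs):
    """At test time: nearest unseen-class embedding by cosine similarity."""
    pred = F.normalize(model(clips), dim=-1)
    sims = pred @ F.normalize(unseen_class_vecs, dim=-1).T
    return sims.argmax(-1)
```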
Query Twice: Dual Mixture Attention Meta Learning for Video Summarization
Video summarization aims to select representative frames that retain high-level
information; this is usually done by predicting segment-wise importance scores
via a softmax function. However, the softmax function struggles to retain
high-rank representations of complex visual or sequential information, a
limitation known as the Softmax Bottleneck problem. In this paper, we propose a
novel framework, the Dual Mixture Attention (DMASum) model with Meta Learning,
for video summarization that tackles the softmax bottleneck problem: its
Mixture of Attention (MoA) layer effectively increases model capacity by
applying self-query attention twice, capturing second-order changes in
addition to the initial query-key attention, and a novel Single Frame Meta
Learning rule is then introduced to generalize better to small datasets with
limited training sources. Furthermore, DMASum exploits both visual and
sequential attention, connecting local key-frame and global attention in an
accumulative way. We adopt the new evaluation protocol on two public datasets,
SumMe and TVSum. Both qualitative and quantitative experiments demonstrate
significant improvements over the state-of-the-art
methods.
Comment: This manuscript has been accepted at ACM MM 2020.
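The Softmax Bottleneck refers to a single softmax limiting the rank of the distributions a model can express; mixtures of softmaxes lift that ceiling. Below is a minimal PyTorch sketch of a generic mixture-of-softmaxes attention layer in that spirit; it is not the paper's twice-self-query MoA design, and the dimensions and mixing scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxAttention(nn.Module):
    """Illustrative mixture-of-softmaxes attention: K softmax heads whose
    outputs are mixed by input-dependent weights, raising the rank ceiling
    that a single softmax imposes (the 'softmax bottleneck')."""
    def __init__(self, dim, n_mix=4):
        super().__init__()
        self.queries = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_mix)])
        self.key = nn.Linear(dim, dim)
        self.mix = nn.Linear(dim, n_mix)  # prior over mixture components

    def forward(self, x):                 # x: (B, T, D) frame features
        k = self.key(x)
        pi = F.softmax(self.mix(x.mean(1)), dim=-1)         # (B, K)
        attn = 0
        for i, q_proj in enumerate(self.queries):
            q = q_proj(x)
            scores = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
            attn = attn + pi[:, i, None, None] * scores     # weighted mixture
        return attn @ x                    # attended frame representations
```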
Zero-Shot Sign Language Recognition: Can Textual Data Uncover Sign Languages?
We introduce the problem of zero-shot sign language recognition (ZSSLR),
where the goal is to leverage models learned over the seen sign class examples
to recognize the instances of unseen signs. To this end, we propose to utilize
the readily available descriptions in sign language dictionaries as an
intermediate-level semantic representation for knowledge transfer. We introduce
a new benchmark dataset called ASL-Text that consists of 250 sign language
classes and their accompanying textual descriptions. Compared to the ZSL
datasets in other domains (such as object recognition), our dataset contains a
limited number of training examples for a large number of classes, which
imposes a significant challenge. We propose a framework that operates over the
body and hand regions by means of 3D-CNNs, and models longer temporal
relationships via bidirectional LSTMs. By leveraging the descriptive text
embeddings along with these spatio-temporal representations within a zero-shot
learning framework, we show that textual data can indeed be useful in
uncovering sign languages. We anticipate that the introduced approach and the
accompanying dataset will provide a basis for further exploration of this new
zero-shot learning problem.
Comment: To appear in the British Machine Vision Conference (BMVC) 2019.
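A minimal sketch of the described pipeline, assuming body- and hand-region features are precomputed by 3D-CNNs: bidirectional LSTMs summarize each stream, and the fused video representation is scored against text embeddings of dictionary descriptions. All module names and dimensions are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZSSLRNet(nn.Module):
    """Illustrative ZSSLR-style model: body- and hand-stream features (assumed
    precomputed by 3D-CNNs) are summarized by bidirectional LSTMs and scored
    against text embeddings of sign-language dictionary descriptions."""
    def __init__(self, feat_dim=512, hidden=256, text_dim=300):
        super().__init__()
        self.body_lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.hand_lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(4 * hidden, text_dim)  # body + hand, both directions

    def forward(self, body_feats, hand_feats):       # (B, T, feat_dim) each
        b, _ = self.body_lstm(body_feats)
        h, _ = self.hand_lstm(hand_feats)
        video = torch.cat([b[:, -1], h[:, -1]], dim=-1)  # last-step summaries
        return F.normalize(self.proj(video), dim=-1)

def score_unseen(model, body, hand, text_embs):
    """Rank unseen sign classes by similarity to their description embeddings."""
    v = model(body, hand)
    return v @ F.normalize(text_embs, dim=-1).T      # (B, n_unseen_classes)
```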
ZSTAD: Zero-Shot Temporal Activity Detection
An integral part of video analysis and surveillance is temporal activity
detection, i.e., simultaneously recognizing and localizing activities in long
untrimmed videos. Currently, the most effective methods for temporal activity
detection are based on deep learning, and they typically perform very well when
trained on large-scale annotated videos. However, these methods are limited in
real applications because videos of certain activity classes may be unavailable
and data annotation is time-consuming. To solve this
challenging problem, we propose a novel task setting called zero-shot temporal
activity detection (ZSTAD), where activities that have never been seen in
training can still be detected. We design an end-to-end deep network based on
R-C3D as the architecture for this solution. The proposed network is optimized
with an innovative loss function that considers the embeddings of activity
labels and their super-classes while learning the common semantics of seen and
unseen activities. Experiments on both the THUMOS14 and the Charades datasets
show promising performance in detecting unseen activities.
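A minimal sketch of such a loss, assuming each activity label has an embedding and a known super-class: detected-segment embeddings are aligned with both the label embedding and its super-class embedding, so semantics shared across seen and unseen activities are learned. The names and weighting here are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def zstad_style_loss(seg_emb, labels, class_embs, super_of, super_embs,
                     alpha=0.5):
    """Illustrative loss in the spirit of ZSTAD: align segment embeddings
    with their activity-label embedding and with the embedding of that
    label's super-class. `super_of` maps each class index to its super-class
    index; `alpha` weights the super-class term (both are assumptions)."""
    seg = F.normalize(seg_emb, dim=-1)
    cls_logits = seg @ F.normalize(class_embs, dim=-1).T      # (B, C)
    sup_logits = seg @ F.normalize(super_embs, dim=-1).T      # (B, S)
    loss_cls = F.cross_entropy(cls_logits, labels)
    loss_sup = F.cross_entropy(sup_logits, super_of[labels])  # super-class targets
    return loss_cls + alpha * loss_sup
```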
CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition
Zero-shot action recognition is the task of recognizing action classes
without visual examples, using only a semantic embedding that relates unseen
to seen classes. The problem can be seen as learning a function that generalizes
well to instances of unseen classes without losing discrimination between
classes. Neural networks can model the complex boundaries between visual
classes, which explains their success as supervised models. However, in
zero-shot learning, these highly specialized class boundaries may not transfer
well from seen to unseen classes. In this paper, we propose a clustering-based
model, which considers all training samples at once, instead of optimizing for
each instance individually. We optimize the clustering using reinforcement
learning, which we show is critical for our approach to work. We call the
proposed method CLASTER and observe that it consistently improves over the
state-of-the-art on all standard datasets (UCF101, HMDB51, and Olympic Sports),
in both the standard zero-shot evaluation and the generalized zero-shot
learning setting.
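A minimal sketch of optimizing a clustering with REINFORCE in that spirit: cluster assignment acts as a stochastic policy, classification correctness as the reward, and the centroids are updated by the policy gradient. This is a heavy simplification under assumed details, not the CLASTER implementation.

```python
import torch
import torch.nn.functional as F

def claster_style_step(feats, labels, centroids, class_embs, lr=0.1):
    """Illustrative REINFORCE step: sample a cluster per sample from a
    distance-based policy, represent the sample by its centroid, score
    against class embeddings, and reward correct predictions."""
    centroids = centroids.clone().requires_grad_(True)
    logp = F.log_softmax(-torch.cdist(feats, centroids), dim=-1)  # policy over clusters
    dist = torch.distributions.Categorical(logits=logp)
    k = dist.sample()                                             # sampled cluster per sample
    rep = centroids[k]                                            # cluster-based representation
    logits = F.normalize(rep, dim=-1) @ F.normalize(class_embs, dim=-1).T
    reward = (logits.argmax(-1) == labels).float() * 2 - 1        # +1 correct, -1 wrong
    loss = -(reward.detach() * dist.log_prob(k)).mean()           # REINFORCE objective
    loss.backward()
    return (centroids - lr * centroids.grad).detach()             # updated centroids
```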