NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification
This paper introduces a fast and efficient network architecture, NeXtVLAD, to
aggregate frame-level features into a compact feature vector for large-scale
video classification. Briefly speaking, the basic idea is to decompose a
high-dimensional feature into a group of relatively low-dimensional vectors
with attention before applying NetVLAD aggregation over time. This NeXtVLAD
approach turns out to be both effective and parameter efficient in aggregating
temporal information. In the 2nd YouTube-8M video understanding challenge, a
single NeXtVLAD model with fewer than 80M parameters achieves a GAP score of
0.87846 on the private leaderboard. A mixture of 3 NeXtVLAD models reaches
0.88722, ranking 3rd among 394 teams. The code is publicly available at
https://github.com/linrongc/youtube-8m
Comment: ECCV 2018 workshop
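The grouped decomposition is straightforward to sketch. Below is a minimal PyTorch illustration of the idea described above (expand the frame feature, split it into groups, gate the per-group cluster assignments with attention, then aggregate VLAD-style residuals); all dimensions, names, and initialization choices are ours, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeXtVLADSketch(nn.Module):
    """Illustrative sketch of NeXtVLAD-style aggregation (dims/names are ours)."""
    def __init__(self, dim=1024, clusters=64, groups=8, expansion=2):
        super().__init__()
        self.groups, self.clusters = groups, clusters
        self.gdim = expansion * dim // groups          # per-group feature size
        self.expand = nn.Linear(dim, expansion * dim)  # feature expansion
        self.attn = nn.Linear(expansion * dim, groups) # per-group attention gate
        self.assign = nn.Linear(expansion * dim, groups * clusters)  # soft cluster assignment
        self.centers = nn.Parameter(torch.randn(clusters, self.gdim) * 0.01)

    def forward(self, x):                        # x: (batch, frames, dim)
        b, n, _ = x.shape
        x = self.expand(x)                       # (b, n, expansion*dim)
        attn = torch.sigmoid(self.attn(x))       # (b, n, groups)
        assign = F.softmax(
            self.assign(x).view(b, n, self.groups, self.clusters), dim=-1)
        assign = assign * attn.unsqueeze(-1)     # gate assignments by group attention
        xg = x.view(b, n, self.groups, self.gdim)
        # VLAD-style residual aggregation over frames and groups
        agg = torch.einsum('bngk,bngd->bkd', assign, xg)
        agg = agg - assign.sum(dim=(1, 2)).unsqueeze(-1) * self.centers
        return F.normalize(agg.flatten(1), dim=1)  # compact video-level descriptor
```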
Approach for Video Classification with Multi-label on YouTube-8M Dataset
Video traffic is increasing at a considerable rate due to the spread of
personal media and advancements in media technology. Accordingly, there is a
growing need for techniques to automatically classify videos. This paper
applies the NetVLAD and NetFV models with the Huber loss function to the video
classification problem, using the YouTube-8M dataset to verify the experiments.
We tried various approaches suited to the dataset and optimized
hyperparameters, ultimately obtaining a GAP score of 0.8668.
Comment: Accepted at the 2nd Workshop on YouTube-8M Large-Scale Video
Understanding at ECCV 2018
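The Huber loss is the distinctive ingredient here. A minimal sketch of how a Huber-style penalty can be applied to multi-label video classification, assuming sigmoid outputs and a binary label vector (the threshold delta below is illustrative, not the paper's value):

```python
import torch

def multilabel_huber_loss(logits, labels, delta=0.5):
    """Huber penalty on the probability-label gap; a sketch of the idea,
    not necessarily the paper's exact formulation (delta is illustrative)."""
    probs = torch.sigmoid(logits)
    err = (probs - labels).abs()
    quad = torch.clamp(err, max=delta)
    # Quadratic near zero, linear beyond delta -> less sensitive to outliers.
    loss = 0.5 * quad ** 2 + delta * (err - quad)
    return loss.mean()
```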
Large-Scale Video Classification with Feature Space Augmentation coupled with Learned Label Relations and Ensembling
This paper presents the Axon AI's solution to the 2nd YouTube-8M Video
Understanding Challenge, achieving the final global average precision (GAP) of
88.733% on the private test set (ranked 3rd among 394 teams, not considering
the model size constraint), and 87.287% using a model that meets size
requirement. Two sets of 7 individual models belonging to 3 different families
were trained separately. Then, the inference results of these multiple models
on the training data were aggregated and used to train a compact model that
meets the model size requirement. To further improve performance, we explored
and employed data over-/sub-sampling in feature space, an additional
regularization term during training that exploits label relationships, and
learned weights for ensembling the different individual models.
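A hedged sketch of the distillation step described above, in PyTorch: the compact model is trained against a blend of the ground-truth labels and the ensemble's aggregated predictions on the training data. The blending weight alpha and the function names are our assumptions:

```python
import torch.nn.functional as F

def distill_step(compact, batch, teacher_probs, labels, alpha=0.5):
    """One training step of the compact model on a blend of ground-truth
    labels and the ensemble's aggregated predictions (alpha is illustrative)."""
    student_logits = compact(batch)
    target = alpha * labels + (1 - alpha) * teacher_probs  # soft ensemble target
    return F.binary_cross_entropy_with_logits(student_logits, target)

# teacher_probs would come from aggregating the individual models' inference
# results on the training split, as the abstract describes.
```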
Label Denoising with Large Ensembles of Heterogeneous Neural Networks
Despite recent advances in computer vision based on various convolutional
architectures, video understanding remains an important challenge. In this
work, we present and discuss a top solution for the large-scale video
classification (labeling) problem introduced as a Kaggle competition based on
the YouTube-8M dataset. We show and compare different approaches to
preprocessing, data augmentation, model architectures, and model combination.
Our final model is based on a large ensemble of video- and frame-level models
but fits into rather limiting hardware constraints. We apply an approach based
on knowledge distillation to deal with noisy labels in the original dataset and
the recently developed mixup technique to improve the basic models.
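Mixup itself is a standard technique (Zhang et al., 2018) and easy to reproduce: training examples and their label vectors are replaced by convex combinations of random pairs. A minimal sketch, with alpha set to a typical value rather than the one used in this work:

```python
import torch

def mixup(features, labels, alpha=0.2):
    """Standard mixup: convex combinations of input pairs and their label
    vectors; alpha is a typical choice, not necessarily this paper's."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(features.size(0))
    mixed_x = lam * features + (1 - lam) * features[perm]
    mixed_y = lam * labels + (1 - lam) * labels[perm]
    return mixed_x, mixed_y
```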
Learnable Pooling Methods for Video Classification
We introduce modifications to state-of-the-art approaches to aggregating
local video descriptors by using attention mechanisms and function
approximations. Rather than using ensembles of existing architectures, we
provide insight into creating new architectures. We demonstrate our solutions
in "The 2nd YouTube-8M Video Understanding Challenge", using frame-level
video and audio descriptors. We obtain testing accuracy similar to the state of
the art, while meeting budget constraints, and touch upon strategies to improve
the state of the art. Model implementations are available at
https://github.com/pomonam/LearnablePoolingMethods
Comment: Presented at the YouTube-8M ECCV18 Workshop
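As one concrete instance of the attention-based aggregation family explored here, a minimal learned attention pooling over frame descriptors might look as follows (the 1152-dimensional input matches YouTube-8M's 1024 visual + 128 audio frame features; the single-layer scoring function is our simplification):

```python
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Minimal learned-attention pooling over frame descriptors; one simple
    instance of the attention-based aggregation family, not the paper's model."""
    def __init__(self, dim=1152):  # 1024 visual + 128 audio, as in YouTube-8M
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                    # frames: (batch, n_frames, dim)
        w = F.softmax(self.score(frames), dim=1)  # per-frame attention weights
        return (w * frames).sum(dim=1)            # weighted video-level descriptor
```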
Video Representation Learning and Latent Concept Mining for Large-scale Multi-label Video Classification
We report on CMU Informedia Lab's system used in Google's YouTube 8 Million
Video Understanding Challenge. In this multi-label video classification task,
our pipeline achieved 84.675% and 84.662% GAP on our evaluation split and the
official test set. We attribute the good performance to three components:
1) refined video representation learning with residual links and hypercolumns;
2) latent concept mining, which captures interactions among concepts; and
3) learning with temporal segments and a weighted multi-model ensemble. We
conduct experiments to validate and analyze the contribution of our models. We
also share some unsuccessful trials leveraging conventional approaches, such
as recurrent neural networks, for video representation learning on this
large-scale video dataset. All the code to reproduce our results is publicly
available at https://github.com/Martini09/informedia-yt8m-release.
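For component 1, a residual fully connected block over video-level features, optionally stacked with hypercolumn-style concatenation of intermediate activations, could be sketched as follows; depth and layer sizes are illustrative, not the authors':

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Fully connected block with a residual link (sizes are illustrative)."""
    def __init__(self, dim=1152):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        return F.relu(x + self.fc2(h))  # residual link eases optimization

def hypercolumn(blocks, x):
    """Concatenate every block's activation so the classifier sees features
    from all depths, in the spirit of hypercolumns."""
    feats = []
    for blk in blocks:
        x = blk(x)
        feats.append(x)
    return torch.cat(feats, dim=-1)
```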
Deep Multimodal Learning: An Effective Method for Video Classification
Videos have become ubiquitous on the Internet, and video analysis can provide
rich information for detecting and recognizing objects, as well as help people
understand human actions and interactions with the real world. However, with
data at the terabyte scale, effective methods are required. Recurrent neural
network (RNN) architectures have been widely used for sequential learning
problems such as language modeling and time-series analysis. In this paper, we
propose several RNN variants, such as stacked bidirectional LSTM/GRU networks
with attention mechanisms, to categorize large-scale video data. We also
explore different multimodal fusion methods. Our model combines visual and
audio information at both the video and frame level and achieves strong
results. Ensemble methods are also applied. Because of its multimodal
characteristics, we call this method Deep Multimodal Learning (DML). Our
DML-based model was trained on Google Cloud and our own servers, and was
tested in a well-known video classification competition on Kaggle held by
Google.
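A minimal sketch of a stacked bidirectional LSTM with attention over timesteps and frame-level concatenation fusion of the two modalities, in the spirit of the description above (hidden sizes and the 3862-way output matching YouTube-8M's label vocabulary are our choices, not the paper's configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMAttention(nn.Module):
    """Sketch of a stacked bidirectional LSTM with attention and frame-level
    concatenation fusion of visual and audio streams; sizes are illustrative."""
    def __init__(self, vdim=1024, adim=128, hidden=512, classes=3862):
        super().__init__()
        self.lstm = nn.LSTM(vdim + adim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, classes)

    def forward(self, video, audio):           # (b, t, vdim), (b, t, adim)
        x = torch.cat([video, audio], dim=-1)  # fuse modalities per frame
        h, _ = self.lstm(x)                    # (b, t, 2*hidden)
        w = F.softmax(self.attn(h), dim=1)     # attention over timesteps
        pooled = (w * h).sum(dim=1)
        return self.out(pooled)                # per-class logits
```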
Efficient Video Classification Using Fewer Frames
Recently, there has been a lot of interest in building compact models for
video classification which have a small memory footprint (<1 GB). While these
models are compact, they typically operate by repeated application of a small
weight matrix to all the frames in a video. For example, recurrent neural
network based methods compute a hidden state for every frame of the video
using a recurrent weight matrix. Similarly, cluster-and-aggregate based
methods such as NetVLAD have a learnable clustering matrix which is used to
assign soft clusters to
every frame in the video. Since these models look at every frame in the video,
the number of floating point operations (FLOPs) is still large even though the
memory footprint is small. We focus on building compute-efficient video
classification models which process fewer frames and hence require fewer
FLOPs. Similar to memory-efficient models, we use the idea of distillation,
albeit in a different setting. Specifically, in our case, a compute-heavy
teacher which looks at all the frames in the video is used to train a
compute-efficient student which looks at only a small fraction of frames in the
video. This is in contrast to a typical memory efficient Teacher-Student
setting, wherein both the teacher and the student look at all the frames in the
video but the student has fewer parameters. Our work thus complements the
research on memory-efficient video classification. We perform an extensive
evaluation with three types of models for video classification, viz.
(i) recurrent models, (ii) cluster-and-aggregate models, and
(iii) memory-efficient cluster-and-aggregate models, and show that in each of
these cases a see-it-all teacher can be used to train a compute-efficient
see-very-little student. We show that the proposed student network can reduce
the inference time by 30% and the number of FLOPs by approximately 90% with a
negligible drop in performance.
Comment: To appear in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR 2019)
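The teacher-student setup is easy to sketch: the teacher scores all frames, while the student is trained on a small, uniformly spaced subset of frames and fits both the ground-truth labels and the teacher's soft predictions. The frame budget k and loss weight beta below are illustrative, as is the assumption that both networks map a frame sequence to per-class logits:

```python
import torch
import torch.nn.functional as F

def student_distill_loss(teacher, student, frames, labels, k=30, beta=0.5):
    """Sketch of the see-it-all / see-very-little setup; k and beta are ours."""
    with torch.no_grad():
        t_probs = torch.sigmoid(teacher(frames))          # teacher sees every frame
    idx = torch.linspace(0, frames.size(1) - 1, k).long()
    s_logits = student(frames[:, idx])                    # student sees only k frames
    hard = F.binary_cross_entropy_with_logits(s_logits, labels)
    soft = F.binary_cross_entropy_with_logits(s_logits, t_probs)  # match teacher
    return hard + beta * soft
```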
MOD: A Deep Mixture Model with Online Knowledge Distillation for Large Scale Video Temporal Concept Localization
In this paper, we present and discuss a deep mixture model with online
knowledge distillation (MOD) for large-scale video temporal concept
localization, which is ranked 3rd in the 3rd YouTube-8M Video Understanding
Challenge. Specifically, we find that by enabling knowledge sharing with online
distillation, fine-tuning a mixture model on a smaller dataset can achieve
better evaluation performance. Based on this observation, in our final
solution, we trained and fine-tuned 12 NeXtVLAD models in parallel with a
2-layer online distillation structure. The experimental results show that the
proposed distillation structure can effectively avoid overfitting and exhibits
superior generalization performance. The code is publicly available at:
https://github.com/linrongc/solution_youtube8m_v3
Comment: ICCV 2019 YouTube-8M workshop
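A hedged sketch of the online-distillation idea: each model in the mixture fits the labels while also being pulled toward the (detached) ensemble mean prediction, so knowledge is shared during training rather than after it. The single-level loss below simplifies the paper's 2-layer structure; gamma is our assumption:

```python
import torch
import torch.nn.functional as F

def online_distillation_loss(logits_list, labels, gamma=1.0):
    """Each model fits the labels and is also pulled toward the ensemble mean
    prediction; a simplified sketch of online knowledge sharing (gamma is ours)."""
    probs = [torch.sigmoid(l) for l in logits_list]
    ensemble = torch.stack(probs).mean(dim=0).detach()    # shared soft target
    loss = 0.0
    for l in logits_list:
        loss = loss + F.binary_cross_entropy_with_logits(l, labels)
        loss = loss + gamma * F.binary_cross_entropy_with_logits(l, ensemble)
    return loss / len(logits_list)
```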
Learning to Localize Temporal Events in Large-scale Video Data
We address temporal localization of events in large-scale video data, in the
context of the Youtube-8M Segments dataset. This emerging field within video
recognition can enable applications to identify the precise time a specified
event occurs in a video, which has broad implications for video search. To
address this, we present two separate approaches: (1) a gradient boosted
decision tree model on a crafted dataset and (2) a combination of deep
learning models based on frame-level data, video-level data, and a
localization model. The combination of these two approaches achieved 5th place
in the 3rd YouTube-8M video recognition challenge.
Comment: ICCV 2019, 3rd YouTube-8M Workshop
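Approach (1) is simple to sketch with scikit-learn: a gradient boosted tree classifier over hand-crafted per-segment features. The specific feature set below (segment-mean features plus a video-level class prior) is our illustration, not the authors' crafted dataset:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_segment_gbdt(segment_feats, video_priors, segment_labels):
    """Sketch of a GBDT over crafted per-segment features; the feature choice
    here is illustrative, not the authors' dataset."""
    X = np.hstack([segment_feats, video_priors])   # crafted feature matrix
    clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    clf.fit(X, segment_labels)                     # per-segment binary labels
    return clf
```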