This article describes the final solution of team monkeytyping, who finished
in second place in the YouTube-8M video understanding challenge. The dataset
used in this challenge is a large-scale benchmark for multi-label video
classification. We extend the work in [1] and propose several improvements for
frame sequence modeling. We propose a network structure called Chaining that
can better capture the interactions between labels. We also report our
approaches to handling multi-scale information and attention pooling. In
addition, we find that using the output of a model ensemble as a side target in
training can boost single-model performance. We report our experiments in
bagging, boosting, cascade, and stacking, and propose a stacking algorithm
called attention weighted stacking. Our final submission is an ensemble that
consists of 74 sub-models, all of which are listed in the appendix.

Comment: Submitted to the CVPR 2017 Workshop on YouTube-8M Large-Scale Video
Understanding