Temporal activity detection in untrimmed videos with recurrent neural networks
This work proposes a simple pipeline to classify and temporally localize activities in untrimmed videos. Our system uses features from a 3D Convolutional Neural Network (C3D) as input to train a recurrent neural network (RNN) that learns to classify video clips of 16 frames. After clip prediction, we post-process the output of the RNN to assign a single activity label to each video and to determine the temporal boundaries of the activity within the video. We show how our system can achieve competitive results in both tasks with a simple architecture. We evaluate our method in the ActivityNet Challenge 2016, achieving a 0.5874 mAP and a 0.2237 mAP in the classification and detection tasks, respectively. Our code and models are publicly available at: https://imatge-upc.github.io/activitynet-2016-cvprw/
Peer Reviewed. Postprint (published version)
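The abstract does not spell out the post-processing step, so the following is only a minimal sketch of one plausible way to turn per-clip RNN outputs into a video-level label and temporal boundaries. The mean-probability rule, the `threshold` value, the `background` class index, and the function names are all assumptions made for illustration; they are not taken from the authors' released code.

```python
from typing import List, Tuple

CLIP_LEN = 16  # frames per clip, as stated in the abstract


def video_label(clip_probs: List[List[float]], background: int = 0) -> int:
    """Pick one activity per video: the non-background class with the
    highest mean probability over all clips (assumed rule)."""
    n_classes = len(clip_probs[0])
    means = [
        sum(p[c] for p in clip_probs) / len(clip_probs)
        for c in range(n_classes)
    ]
    candidates = [c for c in range(n_classes) if c != background]
    return max(candidates, key=lambda c: means[c])


def temporal_segments(
    clip_probs: List[List[float]],
    label: int,
    fps: float,
    threshold: float = 0.5,  # hypothetical cut-off
) -> List[Tuple[float, float]]:
    """Merge consecutive clips whose probability for `label` reaches the
    threshold into (start_seconds, end_seconds) spans."""
    segments = []
    start = None
    for i, p in enumerate(clip_probs):
        active = p[label] >= threshold
        if active and start is None:
            start = i  # a new span opens at this clip
        elif not active and start is not None:
            segments.append((start * CLIP_LEN / fps, i * CLIP_LEN / fps))
            start = None
    if start is not None:  # span still open at the end of the video
        segments.append(
            (start * CLIP_LEN / fps, len(clip_probs) * CLIP_LEN / fps)
        )
    return segments


# Toy example: 5 clips, 3 classes (index 0 = background), 16 fps video,
# so each clip covers exactly one second.
probs = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
]
label = video_label(probs)
spans = temporal_segments(probs, label, fps=16.0)
```

In this toy input, class 1 has the highest mean probability among the non-background classes, and the clips that pass the threshold form two spans separated by one background clip.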
Vision-based Detection of Acoustic Timed Events: a Case Study on Clarinet Note Onsets
Acoustic events often have a visual counterpart. Knowledge of visual
information can aid the understanding of complex auditory scenes, even when
only a stereo mixdown is available in the audio domain, e.g., identifying which
musicians are playing in large musical ensembles. In this paper, we consider a
vision-based approach to note onset detection. As a case study we focus on
challenging, real-world clarinetist videos and carry out preliminary
experiments on a 3D convolutional neural network based on multiple streams and
purposely avoiding temporal pooling. We release an audiovisual dataset with 4.5
hours of clarinetist videos together with cleaned annotations which include
about 36,000 onsets and the coordinates for a number of salient points and
regions of interest. By performing several training trials on our dataset, we
learned that the problem is challenging. We found that the CNN model is highly
sensitive to the optimization algorithm and hyper-parameters, and that treating
the problem as binary classification may prevent the joint optimization of
precision and recall. To encourage further research, we publicly share our
dataset, annotations and all models and detail which issues we came across
during our preliminary experiments.
Comment: Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May 2017 (arXiv:1706.08675v1 [cs.NE])
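To make the precision/recall trade-off mentioned in the abstract concrete, here is a small sketch of how detected onsets might be scored against annotated ones. The abstract does not describe the evaluation protocol, so the greedy one-to-one matching, the 50 ms `tolerance`, and the function name `onset_prf` are assumptions for illustration only.

```python
from typing import List, Tuple


def onset_prf(
    reference: List[float],
    detected: List[float],
    tolerance: float = 0.05,  # seconds; assumed matching window
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for onset times in seconds.

    Each reference onset can be matched by at most one detection that
    lies within `tolerance` of it (greedy left-to-right matching).
    """
    ref = sorted(reference)
    det = sorted(detected)
    i = j = true_pos = 0
    while i < len(ref) and j < len(det):
        if abs(ref[i] - det[j]) <= tolerance:
            true_pos += 1  # matched pair: consume both
            i += 1
            j += 1
        elif det[j] < ref[i]:
            j += 1  # spurious detection (false positive)
        else:
            i += 1  # missed reference onset (false negative)
    precision = true_pos / len(det) if det else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy example: four annotated onsets, three detections.
p, r, f = onset_prf([0.50, 1.00, 1.50, 2.00], [0.52, 1.20, 1.49])
```

Two of the three detections fall within the window, so precision is 2/3 while recall is only 2/4. A detector tuned as a plain per-frame binary classifier can move one of these figures at the expense of the other, which is the tension the abstract points to.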