25 research outputs found

    Detecting People in Artwork with CNNs

    CNNs have massively improved performance in object detection in photographs. However, research into object detection in artwork remains limited. We show state-of-the-art performance on a challenging dataset, People-Art, which contains people from photos, cartoons and 41 different artwork movements. We achieve this high performance by fine-tuning a CNN for this task, thus also demonstrating that training CNNs on photos results in overfitting for photos: only the first three or four layers transfer from photos to artwork. Although the CNN's performance is the highest yet, it remains less than 60% AP, suggesting further work is needed for the cross-depiction problem. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-46604-0_57
    Comment: 14 pages, plus 3 pages of references; 7 figures in ECCV 2016 Workshop

    Cross-Modal Supervision for Learning Active Speaker Detection in Video

    © Springer International Publishing AG 2016. In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion: the facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.
    Chakravarty P., Tuytelaars T., "Cross-modal supervision for learning active speaker detection in video", Lecture Notes in Computer Science, vol. 9909, pp. 285-301, 2016 (14th European Conference on Computer Vision - ECCV 2016, October 11-14, 2016, Amsterdam, The Netherlands).