Weakly Supervised Localization using Deep Feature Maps

Author: Bency Archith J.
Karthikeyan S.
Kwon Heesung
Lee Hyungtae
Manjunath B. S.
Publication date: 01/01/2016

Object localization is an important computer vision problem with a variety of applications. The lack of large-scale object-level annotations and the relative abundance of image-level labels make a compelling case for weak supervision in the object localization task. Deep Convolutional Neural Networks are a class of state-of-the-art methods for the related problem of object recognition. In this paper, we describe a novel object localization algorithm which uses classification networks trained on only image labels. This weakly supervised method leverages local spatial and semantic patterns captured in the convolutional layers of classification networks. We propose an efficient beam-search-based approach to detect and localize multiple objects in images. The proposed method significantly outperforms the state of the art on standard object localization datasets, with an 8-point increase in mAP scores.

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Feature and Region Selection for Visual Learning

Author: Cabral Ricardo
De la Torre Fernando
Wang Liantao
Zhao Ji
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 18/01/2016

Visual learning problems such as object classification and action recognition are typically approached using extensions of the popular bag-of-words (BoW) model. Despite its great success, it is unclear what visual features the BoW model is learning: Which regions in the image or video are used to discriminate among classes? Which are the most discriminative visual words? Answering these questions is fundamental for understanding existing BoW models and inspiring better models for visual recognition. To answer these questions, this paper presents a method for feature selection and region selection in the visual BoW model. This allows for an intermediate visualization of the features and regions that are important for visual learning. The main idea is to assign latent weights to the features or regions, and jointly optimize these latent variables with the parameters of a classifier (e.g., support vector machine). There are four main benefits of our approach: (1) Our approach accommodates non-linear additive kernels such as the popular $\chi^2$ and intersection kernel; (2) our approach is able to handle both regions in images and spatio-temporal regions in videos in a unified way; (3) the feature selection problem is convex, and both problems can be solved using a scalable reduced gradient method; (4) we point out strong connections with multiple kernel learning and multiple instance learning approaches. Experimental results in the PASCAL VOC 2007, MSR Action Dataset II and YouTube illustrate the benefits of our approach.
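The two additive kernels named in this abstract can be illustrated with a minimal sketch. It assumes non-negative, L1-normalized bag-of-words histograms and uses one common form of the additive chi-squared kernel; the abstract does not state which variant the authors use, and the function names here are hypothetical.

```python
import numpy as np

def chi2_kernel(x, y, eps=1e-12):
    """Additive chi-squared kernel: sum_i 2 * x_i * y_i / (x_i + y_i).

    Assumes non-negative histogram inputs (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = x + y
    # Bins that are empty in both histograms contribute nothing.
    nonzero = denom > eps
    safe_denom = np.where(nonzero, denom, 1.0)
    return float(np.sum(np.where(nonzero, 2.0 * x * y / safe_denom, 0.0)))

def intersection_kernel(x, y):
    """Histogram intersection kernel: sum_i min(x_i, y_i)."""
    return float(np.sum(np.minimum(x, y)))

# Two toy L1-normalized histograms over three visual words.
h1 = np.array([0.5, 0.5, 0.0])
h2 = np.array([0.25, 0.25, 0.5])

print(chi2_kernel(h1, h2))
print(intersection_kernel(h1, h2))
```

Both kernels are sums of per-bin terms, which is what "additive" means here: each visual word contributes independently to the similarity.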

arXiv.org e-Print Archive

FigShare

A Data-Driven Approach for Tag Refinement and Localization in Web Videos

Author: Ballan Lamberto
Bertini Marco
Del Bimbo Alberto
Serra Giuseppe
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015

Tagging of visual content is becoming more and more widespread as web-based services and social networks have popularized tagging functionalities among their users. These user-generated tags are used to ease browsing and exploration of media collections, e.g. using tag clouds, or to retrieve multimedia content. However, not all media are equally tagged by users. With current systems it is easy to tag a single photo, and even tagging a part of a photo, like a face, has become common in sites like Flickr and Facebook. On the other hand, tagging a video sequence is more complicated and time-consuming, so users just tag the overall content of a video. In this paper we present a method for automatic video annotation that increases the number of tags originally provided by users, and localizes them temporally, associating tags with keyframes. Our approach exploits collective knowledge embedded in user-generated tags and web sources, and visual similarity of keyframes and images uploaded to social sites like YouTube and Flickr, as well as web sources like Google and Bing. Given a keyframe, our method is able to select on the fly from these visual sources the training exemplars that should be the most relevant for this test sample, and proceeds to transfer labels across similar images. Compared to existing video tagging approaches that require training classifiers for each tag, our system has few parameters, is easy to implement and can deal with an open vocabulary scenario. We demonstrate the approach on tag refinement and localization on DUT-WEBV, a large dataset of web videos, and show state-of-the-art results. Comment: Preprint submitted to Computer Vision and Image Understanding (CVIU)
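The idea of transferring labels from visually similar exemplars to a keyframe can be sketched generically as nearest-neighbor tag voting. This is not the authors' method: the feature vectors, cosine similarity, the function name `transfer_tags`, and the parameters `k` and `min_votes` are all assumptions made for illustration.

```python
from collections import Counter
import numpy as np

def transfer_tags(query, exemplars, exemplar_tags, k=3, min_votes=2):
    """Return tags that at least `min_votes` of the k nearest exemplars share.

    query: 1-D feature vector for a keyframe (hypothetical features).
    exemplars: 2-D array, one feature vector per tagged image.
    exemplar_tags: list of tag sets, aligned with `exemplars`.
    """
    query = np.asarray(query, dtype=float)
    exemplars = np.asarray(exemplars, dtype=float)
    # Cosine similarity between the keyframe and every exemplar.
    q = query / np.linalg.norm(query)
    e = exemplars / np.linalg.norm(exemplars, axis=1, keepdims=True)
    sims = e @ q
    nearest = np.argsort(-sims)[:k]
    # Each neighbor votes once for each of its tags.
    votes = Counter(tag for i in nearest for tag in set(exemplar_tags[i]))
    return sorted(tag for tag, count in votes.items() if count >= min_votes)

# Toy data: four tagged images described by 2-D features.
exemplars = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tags = [{"dog", "park"}, {"dog"}, {"car"}, {"car", "street"}]

print(transfer_tags([1.0, 0.05], exemplars, tags, k=2, min_votes=2))
```

Because no per-tag classifier is trained, a scheme of this kind can accept any tag that appears among the neighbors, which is one way to read the abstract's "open vocabulary" remark.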

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Udine

Florence Research

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Archivio istituzionale della ricerca - Università di Padova

Discriminative Segment Annotation in Weakly Labeled Video

Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Weakly-Supervised Temporal Activity Localization and Classification With Web Videos

Author: Dougherty Thomas
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

In this thesis, weakly-supervised temporal activity localization and classification is considered with the use of web videos. Most activity localization methods depend on the availability of frame-wise annotation, which is burdensome to collect. To reduce the effort of manual labeling, learning from weak labels may be used as a potential solution. Recently there has been a substantial influx of tagged videos on the Internet. These can potentially be used as a rich source of data for weakly-supervised training. The following problem is considered. Given only the keyword of an action, can videos be retrieved online and be used to train the Weakly-supervised Temporal Activity Localization and Classification (W-TALC) network? Then, can a re-ranking method be implemented to filter out noisy video data? Action categories of the Thumos14 dataset are used to search for videos online with the YouTube Data API. These videos are used as a training set for the W-TALC network. Given only the video labels, the W-TALC network learns to both localize and classify actions in videos. A re-ranking strategy removes noisy video data, which increases detection performance relative to the original web video dataset. Analysis of the web video dataset and the detection results show promise for the reliable use of web videos for training.

eScholarship - University of California