1,889 research outputs found

    Learning to Detect and Retrieve Objects from Unlabeled Videos

    Full text link
    Learning an object detector or a retrieval model requires a large dataset with manual annotations. Such datasets are expensive and time-consuming to create and are therefore difficult to obtain on a large scale. In this work, we propose to exploit the natural correlation between narrations and the visual presence of objects in video to learn an object detector and a retrieval model without any manual labeling. We pose the problem as weakly supervised learning with noisy labels and propose a novel object detection paradigm under these constraints. We handle background rejection by using contrastive samples and confront the high level of label noise with a new clustering score. Our evaluation is based on a set of 11 manually annotated objects in over 5000 frames. We compare against a weakly-supervised baseline and provide a strongly labeled upper bound. Comment: ICCV 2019 Workshop on Multi-modal Video Analysis and Moments in Time Challenge
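
    A minimal sketch of the kind of clustering-plus-contrast scoring the abstract describes: candidate region embeddings are scored by their closeness to object clusters versus contrastive background samples, and only high-scoring regions are retained as (noisy) labels. Function names, thresholds, and data are illustrative assumptions, not the paper's implementation.

        # Hedged sketch: score candidate region embeddings by how tightly they
        # agree with object clusters relative to contrastive background samples.
        import numpy as np
        from sklearn.cluster import KMeans

        def cluster_noise_score(object_feats, background_feats, n_clusters=5):
            """High score = likely the narrated object; low = background/label noise."""
            km = KMeans(n_clusters=n_clusters, n_init=10).fit(object_feats)
            centroids = km.cluster_centers_

            def sim_to_nearest(x, refs):
                x = x / np.linalg.norm(x, axis=1, keepdims=True)
                r = refs / np.linalg.norm(refs, axis=1, keepdims=True)
                return (x @ r.T).max(axis=1)

            pos = sim_to_nearest(object_feats, centroids)        # agreement with object clusters
            neg = sim_to_nearest(object_feats, background_feats)  # similarity to contrastive background
            return pos - neg

        rng = np.random.default_rng(0)
        obj = rng.normal(size=(200, 128))               # stand-in region embeddings
        bg = rng.normal(loc=1.0, size=(100, 128))       # stand-in contrastive samples
        scores = cluster_noise_score(obj, bg)
        keep = obj[scores > np.median(scores)]          # retain the cleaner half as pseudo-labels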

    Large-Scale Object Discovery and Detector Adaptation from Unlabeled Video

    Full text link
    We explore object discovery and detector adaptation based on unlabeled video sequences captured from a mobile platform. We propose a fully automatic approach for object mining from video which builds upon a generic object tracking approach. By applying this method to three large video datasets from autonomous driving and mobile robotics scenarios, we demonstrate its robustness and generality. Based on the object mining results, we propose a novel approach for unsupervised object discovery by appearance-based clustering. We show that this approach successfully discovers interesting objects relevant to driving scenarios. In addition, we perform self-supervised detector adaptation in order to improve detection performance on the KITTI dataset for existing categories. Our approach has direct relevance for enabling large-scale object learning for autonomous driving. Comment: CVPR'18 submission
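
    As a rough illustration of the mining-to-adaptation step, the sketch below keeps only long, consistently confident tracklets and converts them into weighted pseudo-labels that could be mixed into detector fine-tuning. The track/box dictionary layout and thresholds are hypothetical, not the paper's data format.

        # Hedged sketch: turn mined tracklets into pseudo-labels for self-supervised
        # detector adaptation by filtering on track length and average confidence.
        import numpy as np

        def tracks_to_pseudo_labels(tracks, min_len=10, min_avg_score=0.7):
            """tracks: list of dicts with 'frames', per-frame 'boxes' and 'scores'."""
            pseudo = []
            for t in tracks:
                scores = np.asarray(t["scores"])
                if len(scores) >= min_len and scores.mean() >= min_avg_score:
                    for frame_id, box, s in zip(t["frames"], t["boxes"], scores):
                        pseudo.append({"frame": frame_id, "box": box, "weight": float(s)})
            return pseudo

        track = {"frames": [0, 1, 2],
                 "boxes": [[10, 10, 40, 40]] * 3,
                 "scores": [0.9, 0.8, 0.85]}
        labels = tracks_to_pseudo_labels([track], min_len=3)
        # These weighted boxes would then supplement the labeled data when
        # fine-tuning the detector on the target domain.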

    Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision

    Full text link
    Supervised machine learning based state-of-the-art computer vision techniques are in general data-hungry. Their data curation poses the challenges of expensive human labeling, inadequate computing resources and long experiment turnaround times. Training-data subset selection and active learning techniques have been proposed as possible solutions to these challenges. A special class of subset selection functions naturally models notions of diversity, coverage and representation and can be used to eliminate redundancy, thus lending itself well to training-data subset selection. These functions can also improve the efficiency of active learning, further reducing human labeling effort by selecting a subset of the examples obtained with conventional uncertainty-sampling techniques. In this work, we empirically demonstrate the effectiveness of two diversity models, namely the Facility-Location and Dispersion models, for training-data subset selection and reducing labeling effort. We demonstrate this across the board for a variety of computer vision tasks including Gender Recognition, Face Recognition, Scene Recognition, Object Detection and Object Recognition. Our results show that diversity-based subset selection done in the right way can increase accuracy by up to 5-10% over existing baselines, particularly in settings in which less training data is available. This allows the training of complex machine learning models like Convolutional Neural Networks with much less training data and lower labeling cost while incurring minimal performance loss. Comment: Accepted to WACV 2019. arXiv admin note: substantial text overlap with arXiv:1805.1119
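
    The Facility-Location objective f(S) = sum_i max_{j in S} sim(i, j) admits a standard greedy maximizer, which is the usual way such diversity-based subsets are computed. The sketch below is a generic greedy implementation over a dense cosine-similarity matrix, not the authors' code.

        # Hedged sketch: greedy maximization of the facility-location function
        # to pick a representative, diverse subset of training examples.
        import numpy as np

        def facility_location_greedy(features, budget):
            X = features / np.linalg.norm(features, axis=1, keepdims=True)
            sim = X @ X.T                               # cosine similarities (n, n)
            n = sim.shape[0]
            selected, best_cover = [], np.zeros(n)      # best similarity to any selected item
            for _ in range(budget):
                # marginal gain of adding each candidate j
                gains = np.maximum(sim, best_cover[None, :]).sum(axis=1) - best_cover.sum()
                gains[selected] = -np.inf
                j = int(np.argmax(gains))
                selected.append(j)
                best_cover = np.maximum(best_cover, sim[j])
            return selected

        rng = np.random.default_rng(0)
        feats = rng.normal(size=(500, 64))
        subset = facility_location_greedy(feats, budget=50)   # indices to label/train on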

    Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos

    Full text link
    Human behavior understanding in videos is a complex, still unsolved problem that requires accurately modeling motion at both the local (pixel-wise dense prediction) and global (aggregation of motion cues) levels. Current approaches based on supervised learning require large amounts of annotated data, whose scarce availability is one of the main limiting factors in the development of general solutions. Unsupervised learning can instead leverage the vast amount of videos available on the web and is a promising route around these limitations. In this paper, we propose an adversarial GAN-based framework that learns video representations and dynamics through a self-supervision mechanism in order to perform dense and global prediction in videos. Our approach synthesizes videos by 1) factorizing the process into the generation of static visual content and motion, 2) learning a suitable representation of a motion latent space in order to enforce spatio-temporal coherency of object trajectories, and 3) incorporating motion estimation and pixel-wise dense prediction into the training procedure. Self-supervision is enforced by using the motion masks produced by the generator, as a co-product of its generation process, to supervise the discriminator network in performing dense prediction. Performance evaluation, carried out on standard benchmarks, shows that our approach is able to learn, in an unsupervised way, both local and global video dynamics. The learned representations then support the training of video object segmentation methods with considerably fewer (about 50%) annotations, giving performance comparable to the state of the art. Furthermore, the proposed method achieves promising performance in generating realistic videos, outperforming state-of-the-art approaches especially on motion-related metrics.
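
    A hedged PyTorch sketch of just the self-supervision mechanism described above: the motion mask emitted by the generator is used as a per-pixel target for a discriminator that performs dense prediction. The network sizes, shapes, and random stand-ins for generator outputs are assumptions for illustration only, not the paper's architecture.

        # Hedged sketch: generator-produced motion masks supervise a dense
        # (per-pixel) discriminator via a pixel-wise BCE loss.
        import torch
        import torch.nn as nn

        class DenseDiscriminator(nn.Module):
            def __init__(self, in_ch=3):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 1, 1),          # per-pixel motion logit
                )

            def forward(self, frame):
                return self.net(frame)

        disc = DenseDiscriminator()
        fake_frame = torch.rand(2, 3, 64, 64)                     # stand-in generator output
        motion_mask = (torch.rand(2, 1, 64, 64) > 0.5).float()    # stand-in generator motion mask

        logits = disc(fake_frame)
        # The mask, a co-product of generation, becomes the dense-prediction target.
        dense_loss = nn.functional.binary_cross_entropy_with_logits(logits, motion_mask)
        dense_loss.backward()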

    Multimodal Co-Training for Selecting Good Examples from Webly Labeled Video

    Full text link
    We tackle the problem of learning concept classifiers from videos on the web without using manually labeled data. Although metadata attached to videos (e.g., video titles, descriptions) can help in collecting training data for the target concept, the collected data is often very noisy. The main challenge is therefore how to select good examples from noisy training data. Previous approaches first learn easy examples that are unlikely to be noise and then gradually learn more complex examples. However, hard examples that are very different from easy ones are never learned. In this paper, we propose an approach called multimodal co-training (MMCo) for selecting good examples from noisy training data. MMCo jointly learns classifiers for multiple modalities that complement each other to select good examples. Since MMCo selects examples by consensus of multimodal classifiers, a hard example for one modality can still be used as a training example by exploiting the power of the other modalities. The algorithm is very simple and easily implemented but yields consistent and significant boosts in example selection and classification performance on the FCVID and YouTube8M benchmarks.
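
    A minimal sketch of the selection idea, assuming each modality's classifier has already scored the noisy web examples: an example is kept when the modalities agree it is positive, or when any single modality is highly confident, so that an example that is hard for one modality can still be rescued by another. Thresholds and function names are illustrative, not MMCo's actual criterion.

        # Hedged sketch: consensus-plus-confidence selection across two modalities.
        import numpy as np

        def select_good_examples(p_visual, p_text, agree_thr=0.5, confident_thr=0.9):
            p_visual, p_text = np.asarray(p_visual), np.asarray(p_text)
            consensus = (p_visual > agree_thr) & (p_text > agree_thr)   # both modalities agree
            rescued = (p_visual > confident_thr) | (p_text > confident_thr)  # one is very sure
            return np.where(consensus | rescued)[0]

        idx = select_good_examples([0.2, 0.8, 0.95, 0.4], [0.7, 0.9, 0.1, 0.3])
        # -> examples 1 and 2 would be added to the training set for the next round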

    Self-Training for Domain Adaptive Scene Text Detection

    Full text link
    Though deep learning based scene text detection has achieved great progress, well-trained detectors suffer from severe performance degradation in different domains. In general, a tremendous amount of data is indispensable for training the detector in the target domain, yet data collection and annotation are expensive and time-consuming. To address this problem, we propose a self-training framework to automatically mine hard examples with pseudo-labels from unannotated videos or images. To reduce the noise in these hard examples, a novel text mining module is implemented based on the fusion of detection and tracking results. An image-to-video generation method is then designed for cases where videos are unavailable and only images can be used. Experimental results on standard benchmarks, including ICDAR2015, MSRA-TD500 and ICDAR2017 MLT, demonstrate the effectiveness of our self-training method. A simple Mask R-CNN adapted with self-training and fine-tuned on real data achieves results comparable or even superior to state-of-the-art methods.
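
    As a simplified stand-in for the detection/tracking fusion described above, the sketch below keeps a pseudo-labelled text box only when the tracker corroborates it with sufficient overlap, which is one straightforward way to reduce pseudo-label noise before self-training. Box format and thresholds are assumptions.

        # Hedged sketch: fuse detector and tracker boxes by IoU agreement.
        def iou(a, b):
            """a, b: [x1, y1, x2, y2]"""
            x1, y1 = max(a[0], b[0]), max(a[1], b[1])
            x2, y2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
            return inter / (area(a) + area(b) - inter + 1e-9)

        def fuse_pseudo_labels(det_boxes, track_boxes, iou_thr=0.5):
            # keep only detections corroborated by at least one tracked box
            return [d for d in det_boxes
                    if any(iou(d, t) >= iou_thr for t in track_boxes)]

        dets = [[10, 10, 50, 30], [200, 200, 240, 220]]
        tracks = [[12, 11, 52, 31]]
        print(fuse_pseudo_labels(dets, tracks))   # only the corroborated box survives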

    Learning Spatiotemporal Features via Video and Text Pair Discrimination

    Full text link
    Current video representations rely heavily on learning from manually annotated video datasets, which are time-consuming and expensive to acquire. We observe that videos are naturally accompanied by abundant text information such as YouTube titles and Instagram captions. In this paper, we leverage this visual-textual connection to learn spatiotemporal features in an efficient weakly-supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture the correlation between a video and its associated text. Specifically, we adopt noise-contrastive estimation to tackle the computational cost imposed by the huge number of pair-instance classes and design a practical curriculum learning strategy. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate their effectiveness. Without further fine-tuning, the learned models obtain competitive results for action classification on Kinetics under the linear classification protocol. Moreover, our visual model provides an effective initialization for fine-tuning on downstream tasks, which yields a remarkable performance gain for action recognition on UCF101 and HMDB51 compared with existing state-of-the-art self-supervised training methods. In addition, our CPD model sets a new state of the art for zero-shot action recognition on UCF101 by directly utilizing the learned visual-textual embeddings. The code will be made available at https://github.com/MCG-NJU/CPD-Video. Comment: Technical Report
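
    A hedged PyTorch sketch of the pair-discrimination idea, using an in-batch InfoNCE-style loss as a stand-in for the paper's noise-contrastive estimation over pair-instance classes: embeddings of a video and its own text are pulled together while mismatched pairs in the batch act as negatives. Dimensions and temperature are illustrative.

        # Hedged sketch: cross-modal pair discrimination with an InfoNCE-style loss.
        import torch
        import torch.nn.functional as F

        def pair_discrimination_loss(video_emb, text_emb, temperature=0.07):
            v = F.normalize(video_emb, dim=1)
            t = F.normalize(text_emb, dim=1)
            logits = v @ t.T / temperature        # (B, B): every video vs. every text
            targets = torch.arange(v.size(0))     # matching text sits on the diagonal
            return 0.5 * (F.cross_entropy(logits, targets) +
                          F.cross_entropy(logits.T, targets))

        video_emb = torch.randn(8, 256, requires_grad=True)   # stand-in clip embeddings
        text_emb = torch.randn(8, 256, requires_grad=True)    # stand-in title/caption embeddings
        loss = pair_discrimination_loss(video_emb, text_emb)
        loss.backward()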

    Improving Object Detection with Selective Self-supervised Self-training

    Full text link
    We study how to leverage Web images to augment human-curated object detection datasets. Our approach is two-pronged. On the one hand, we retrieve Web images by image-to-image search, which incurs less domain shift from the curated data than other search methods. The Web images are diverse, supplying a wide variety of object poses, appearances, interactions with context, etc. On the other hand, we propose a novel learning method motivated by two parallel lines of work that explore unlabeled data for image classification: self-training and self-supervised learning. Both fail to improve object detectors in their vanilla forms because of the domain gap between the Web images and the curated datasets. To tackle this challenge, we propose a selective net to rectify the supervision signals in Web images. It not only identifies positive bounding boxes but also creates a safe zone for mining hard negative boxes. We report state-of-the-art results on detecting backpacks and chairs in everyday scenes, along with other challenging object classes. Comment: Accepted to ECCV 202
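
    The sketch below is a much-simplified, score-threshold stand-in for the learned selective net: very confident boxes become pseudo-positives, the low-score region is treated as a safe zone for mining hard negatives, and the uncertain middle band is ignored. Thresholds and names are assumptions for illustration.

        # Hedged sketch: split Web-image boxes into pseudo-positives, safe-zone
        # hard negatives, and an ignored uncertain band.
        import numpy as np

        def split_web_boxes(scores, pos_thr=0.9, neg_thr=0.3):
            scores = np.asarray(scores)
            positives = np.where(scores >= pos_thr)[0]
            # low-score region: safe to mine hard negatives without hitting
            # unlabeled true positives
            hard_negatives = np.where(scores <= neg_thr)[0]
            ignored = np.where((scores > neg_thr) & (scores < pos_thr))[0]
            return positives, hard_negatives, ignored

        pos, neg, skip = split_web_boxes([0.95, 0.6, 0.1, 0.85, 0.05])
        # pos -> [0], neg -> [2, 4], skip -> [1, 3]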

    Long and Short Memory Balancing in Visual Co-Tracking using Q-Learning

    Full text link
    Employing one or more additional classifiers to break the self-learning loop in tracking-by-detection has gained considerable attention. Most such trackers merely use this redundancy to address the accumulating label error in the tracking loop, and suffer from high computational complexity as well as from tracking challenges that can disrupt all classifiers (e.g., temporal occlusions). We propose an active co-tracking framework, in which the main classifier of the tracker labels samples of the video sequence and consults the auxiliary classifier only when it is uncertain. Depending on the source of the uncertainty and the differences between the two classifiers (e.g., accuracy, speed, update frequency), different policies should be used to exchange information between them. Here, we introduce a reinforcement learning approach that finds an appropriate policy by considering the state of the tracker in a specific sequence. The proposed method yields promising results in comparison to the best tracking-by-detection approaches. Comment: Submitted to ICIP 201
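
    A tiny tabular Q-learning sketch of the consult-or-not decision, assuming the tracker's uncertainty is bucketed into discrete states and the action is either "trust the main classifier" or "consult the auxiliary". The state space, reward, and constants are assumptions for illustration, not the paper's formulation.

        # Hedged sketch: epsilon-greedy tabular Q-learning for the consult decision.
        import numpy as np

        N_STATES, N_ACTIONS = 10, 2           # uncertainty buckets; 0 = trust main, 1 = consult
        Q = np.zeros((N_STATES, N_ACTIONS))
        alpha, gamma, eps = 0.1, 0.9, 0.1
        rng = np.random.default_rng(0)

        def choose_action(state):
            if rng.random() < eps:            # occasional exploration
                return int(rng.integers(N_ACTIONS))
            return int(np.argmax(Q[state]))

        def update(state, action, reward, next_state):
            td_target = reward + gamma * Q[next_state].max()
            Q[state, action] += alpha * (td_target - Q[state, action])

        # e.g. reward could be +1 when the chosen label later proves consistent
        # with the track, minus a small cost whenever the auxiliary is consulted.
        s = 3
        a = choose_action(s)
        update(s, a, reward=1.0 - 0.2 * a, next_state=4)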

    Multigrid Predictive Filter Flow for Unsupervised Learning on Videos

    Full text link
    We introduce multigrid Predictive Filter Flow (mgPFF), a framework for unsupervised learning on videos. mgPFF takes as input a pair of frames and outputs per-pixel filters that warp one frame to the other. Compared to optical flow used for warping frames, mgPFF is more powerful in modeling sub-pixel movement and dealing with corruption (e.g., motion blur). We develop a multigrid coarse-to-fine modeling strategy that avoids the need to learn large filters to capture large displacements. This allows us to train an extremely compact model (4.6MB) which operates progressively over multiple resolutions with shared weights. We train mgPFF on unsupervised, free-form videos and show that mgPFF is not only able to estimate long-range flow for frame reconstruction and detect video shot transitions, but is also readily amenable to video object segmentation and pose tracking, where it substantially outperforms the published state of the art without bells and whistles. Moreover, owing to mgPFF's per-pixel filter prediction, we have the unique opportunity to visualize how each pixel evolves while solving these tasks, thus gaining better interpretability. Comment: webpage (https://www.ics.uci.edu/~skong2/mgpff.html)
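
    A hedged PyTorch sketch of the core filter-flow operation: each output pixel is a weighted combination of a k x k neighbourhood of the source frame, with per-pixel (softmax-normalised) weights. The filter-predicting network and the multigrid scheme are omitted; the random filters, kernel size, and shapes here are assumptions.

        # Hedged sketch: apply predicted per-pixel filters to warp one frame
        # toward another, using unfold to gather local neighbourhoods.
        import torch
        import torch.nn.functional as F

        def apply_per_pixel_filters(frame, filters, k=7):
            """frame: (B, C, H, W); filters: (B, k*k, H, W) predicted per pixel."""
            B, C, H, W = frame.shape
            weights = F.softmax(filters, dim=1)                       # normalise each pixel's filter
            patches = F.unfold(frame, kernel_size=k, padding=k // 2)  # (B, C*k*k, H*W)
            patches = patches.view(B, C, k * k, H, W)
            return (patches * weights.unsqueeze(1)).sum(dim=2)        # (B, C, H, W)

        frame1 = torch.rand(1, 3, 64, 64)
        filters = torch.randn(1, 7 * 7, 64, 64)            # would come from the mgPFF network
        warped = apply_per_pixel_filters(frame1, filters)  # approximation of the second frame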