130 research outputs found

    Improving Object Detection with Deep Convolutional Networks via Bayesian Optimization and Structured Prediction

    Object detection systems based on the deep convolutional neural network (CNN) have recently made groundbreaking advances on several object detection benchmarks. While the features learned by these high-capacity neural networks are discriminative for categorization, inaccurate localization is still a major source of error for detection. Building upon high-capacity CNN architectures, we address the localization problem by 1) using a search algorithm based on Bayesian optimization that sequentially proposes candidate regions for an object bounding box, and 2) training the CNN with a structured loss that explicitly penalizes localization inaccuracy. In experiments, we demonstrate that each of the proposed methods improves detection performance over the baseline method on the PASCAL VOC 2007 and 2012 datasets. Furthermore, the two methods are complementary and significantly outperform the previous state of the art when combined. Comment: CVPR 201

    ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

    We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global feature obtained by pooling the local representations learned under an MAE reconstruction loss, and this representation is then used in a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to both video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When trained on videos and images from a diverse combination of datasets, our method maintains a balanced transfer-learning performance between video and image classification benchmarks, coming in as a close second to the best supervised method. Comment: More results on Video and Image datasets, ViC-MAE now supports training on videos and image

    Modeling Structured Dynamics with Deep Neural Networks

    Neural networks have become powerful machinery for identifying patterns in large amounts of raw input data. Research adopting neural networks has excelled in tasks such as object recognition, reinforcement learning, speech recognition, and image in-painting, amongst others. Previous works have notably excelled at inferring information about the input data, either from sequences of frames or from single frames. However, very few works have focused on modeling structured motion dynamics for generative tasks. Structured motion is defined as the constant topological configuration of objects maintained through time. In this thesis, I develop new neural networks that effectively model structured motion dynamics useful for generative tasks such as future motion prediction and transfer. Accurate structured dynamic models are an important piece in achieving general artificial intelligence. It has been shown that agents equipped with such models can learn from environments with far fewer interactions because they are able to predict the consequences of their actions. Additionally, accurate motion dynamic models are useful for applications such as motion editing, motion transfer, and others. Such applications can enhance visual artists' ability to create content for the web or can assist movie makers when transferring motion from actors into movie characters with minimal effort. This thesis initially presents motion dynamics models in two dimensions. I first present a neural network architecture that decomposes video into two information pathways that deal with video dynamics and frame spatial layout separately. The two pathways are later combined to generate future frames that contain highly structured moving objects. Second, I propose to take it a step further by having a motion stream that is visually interpretable. Specifically, there is a motion stream that predicts structured motion dynamics as landmarks of the moving structures that evolve through time, and there is an image generation module that generates future frames given the landmarks and a single frame from the past using image analogy principles. Next, we keep the image analogy principles of our previous work; however, we formulate the video prediction problem such that general features for the structures of moving objects are learned. Finally, by taking advantage of recent advances in computational devices for large-scale deep learning research, I present a study on the effects of maximal capacity and minimal inductive bias in neural-network-based video prediction frameworks. From our thorough evaluation and experimentation, we find that network capacity plays a very important role in the performance of deep networks for video prediction, a finding that can be applied to any of the previously investigated methods. Subsequently, this thesis presents motion dynamics models in three dimensions: I propose a neural kinematics network with adversarial cycle consistency. Specifically, I propose a layer based on the kinematic equations that takes advantage of the backpropagation algorithm used to optimize neural networks to automatically discover rotation angles that represent pure motion, which can be used for motion transfer from one kinematic structure into another. Because of the unsupervised nature of learning, the learned model generalizes to never-before-seen human video from which motion data is extracted using an off-the-shelf algorithm. Overall, this thesis focuses on modeling structured dynamics using the representational power of deep neural networks. Modeling structured dynamics is an important problem both in general artificial intelligence and in applications dealing with video editing, video generation, video understanding, and animation.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/153399/1/rubville_1.pd