Teaching Matters: Investigating the Role of Supervision in Vision Transformers
Vision Transformers (ViTs) have gained significant popularity in recent years
and have proliferated into many applications. However, their behavior under
different learning paradigms is not well explored. We compare ViTs trained
through different methods of supervision, and show that they learn a diverse
range of behaviors in terms of their attention, representations, and downstream
performance. We also discover ViT behaviors that are consistent across
supervision, including the emergence of Offset Local Attention Heads. These are
self-attention heads that attend to a token adjacent to the current token with
a fixed directional offset, a phenomenon that to the best of our knowledge has
not been highlighted in any prior work. Our analysis shows that ViTs are highly
flexible and learn to process local and global information in different orders
depending on their training method. We find that contrastive self-supervised
methods learn features that are competitive with explicitly supervised
features, and they can even be superior for part-level tasks. We also find that
the representations of reconstruction-based models show non-trivial similarity
to contrastive self-supervised models. Project website
(https://www.cs.umd.edu/~sakshams/vit_analysis) and code
(https://www.github.com/mwalmer-umd/vit_analysis) are publicly available.
Comment: The first two authors contributed equally. Accepted to CVPR 2023 as a conference paper.
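To make the Offset Local Attention Head phenomenon concrete, here is a minimal sketch of how such heads could be scored, assuming access to per-head attention maps over the ViT patch-token grid (e.g., collected via a forward hook); the scoring function and threshold are illustrative assumptions, not the paper's exact metric.

```python
import torch

def offset_attention_score(attn, grid_size, offset):
    """Average attention each patch token pays to the token at a fixed
    (dy, dx) grid offset. `attn` is (heads, tokens, tokens) over patch
    tokens only (CLS removed), with tokens == grid_size ** 2."""
    num_heads = attn.shape[0]
    dy, dx = offset
    scores = torch.zeros(num_heads)
    count = 0
    for y in range(grid_size):
        for x in range(grid_size):
            ny, nx = y + dy, x + dx
            if 0 <= ny < grid_size and 0 <= nx < grid_size:
                src = y * grid_size + x
                tgt = ny * grid_size + nx
                scores += attn[:, src, tgt]
                count += 1
    return scores / count  # one score per head

# Toy example with random attention; in practice `attn` would come from a
# forward hook on one ViT attention layer (14x14 patch grid, 12 heads).
attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)
scores = offset_attention_score(attn, grid_size=14, offset=(0, 1))
offset_local_heads = (scores > 0.5).nonzero().flatten()  # illustrative threshold
print(scores, offset_local_heads)
```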
Gen2Det: Generate to Detect
Recently, diffusion models have shown improvements in synthetic image quality
as well as better control over generation. We motivate and present Gen2Det, a
simple modular pipeline to create synthetic training data for object detection
for free by leveraging state-of-the-art grounded image generation methods.
Unlike existing works, which generate individual object instances and require
identifying the foreground before pasting it onto other images, we simplify the process by
directly generating scene-centric images. In addition to the synthetic data,
Gen2Det also proposes a suite of techniques to best utilize the generated data,
including image-level filtering, instance-level filtering, and a better training
recipe to account for imperfections in the generation. Using Gen2Det, we show
healthy improvements on object detection and segmentation tasks under various
settings, agnostic to the detection method. In the long-tailed detection
setting on LVIS, Gen2Det improves the performance on rare categories by a large
margin while also significantly improving the performance on other categories,
e.g. we see an improvement of 2.13 Box AP and 1.84 Mask AP over just training
on real data on LVIS with Mask R-CNN. In the low-data regime setting on COCO,
Gen2Det consistently improves both Box and Mask AP by 2.27 and 1.85 points. In
the most general detection setting, Gen2Det still demonstrates robust
performance gains, e.g. it improves the Box and Mask AP on COCO by 0.45 and
0.32 points.
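As a rough illustration of the pipeline described above (grounded scene-centric generation followed by image-level and instance-level filtering before detector training), here is a hedged sketch; `generator`, `image_scorer`, and `instance_scorer` are hypothetical placeholders and the thresholds are illustrative, not Gen2Det's actual implementation.

```python
from typing import List

def gen2det_style_data(real_dataset, generator, image_scorer, instance_scorer,
                       image_thresh=0.5, instance_thresh=0.3) -> List[dict]:
    """Sketch of a Gen2Det-style pipeline: generate scene-centric images grounded
    on real layouts, then filter bad samples before detector training.
    `generator`, `image_scorer`, and `instance_scorer` are hypothetical callables."""
    synthetic = []
    for sample in real_dataset:                      # real image layout: boxes + labels
        gen_image = generator(sample["boxes"], sample["labels"])  # grounded generation
        if image_scorer(gen_image) < image_thresh:   # image-level filtering
            continue
        kept = [(b, l) for b, l in zip(sample["boxes"], sample["labels"])
                if instance_scorer(gen_image, b, l) >= instance_thresh]  # instance-level filtering
        if kept:
            boxes, labels = zip(*kept)
            synthetic.append({"image": gen_image, "boxes": list(boxes),
                              "labels": list(labels), "is_synthetic": True})
    return synthetic

# During training, the `is_synthetic` flag can be used to adjust losses on
# synthetic images (e.g., ignoring some background regions), reflecting the
# "better training recipe" mentioned in the abstract.
```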
Supervision and Data Dynamics in Vision Across Recognition and Generation Landscapes
This thesis looks at visual perception through the lens of supervision and data dynamics across recognition and generation landscapes. Generative and discriminative modeling form important pillars in computer vision. Depending on the task, the techniques used to learn from and utilize the data and labels can change. Through this work we investigate different tasks along this landscape, focusing on different supervision strategies, highlighting pitfalls in current approaches, and proposing modified architectures and losses to better utilize the data under different settings.
On the recognition side, we start with a comprehensive analysis of Vision Transformers (ViTs) under varied supervision paradigms. We examine a mix of explicit supervision, contrastive self-supervision, and reconstructive self-supervision by delving into attention mechanisms and learned representations. We then look at a more specific case of supervision geared towards object detection, called sparse supervision, where annotations are missing. We propose to utilize self- and semi-supervised techniques to solve this task. Finally, we explore a discovery-style framework with applications to GAN-generated image detection. Unlike the sparse supervision discussed earlier, this scenario handles the case where, at test time, we have an unknown number of new classes. Ours was the first work to propose this problem: instead of just identifying synthetic images, we also try to group them based on their generation source. The exploration of Generative Adversarial Networks (GANs) in an open-world scenario uncovers the intricacies of learning with limited supervision for discovery-style problems.
On the generation side, we delve into different supervision strategies involving decomposing and decoupling representations. In the first work, we tackle the problem of paired Image-to-Image (I2I) translation by decomposing supervision into reconstruction and residuals, and we highlight issues with traditional training approaches. We then look at generating talking-head videos through two different kinds of supervision: video and audio. For driving the generation with a video, we look at decoupling representations for the task of few-shot talking-head synthesis, where supervision is provided using only a few samples (shots). For this task, we factorize the representation into spatial and style components, which helps the learning. To additionally supervise the generation through audio, we look at multimodal supervision for lip-synchronized talking-head generation. For this, we incorporate audio and video modalities to synthesize lifelike talking heads that work even in in-the-wild scenarios.
In the last part, we showcase two works that link our experiences in generation and recognition, exploring generative modeling to improve recognition models. The first work utilizes advancements in diffusion-based image generation to improve recognition models. Given the high fidelity and control that diffusion models have brought to generation, we use their synthetic data and create a suitable pipeline to utilize it effectively to improve detection and segmentation performance. As a follow-up to our ViT analysis, we also propose a new technique that takes off-the-shelf pretrained ViTs and generates high-resolution features using a learned lightweight feature transform. These high-resolution features are especially effective for dense tasks like correspondence, segmentation, detection, and object discovery.
LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors
We present a simple self-supervised method to enhance the performance of ViT
features for dense downstream tasks. Our Lightweight Feature Transform (LiFT)
is a straightforward and compact postprocessing network that can be applied to
enhance the features of any pre-trained ViT backbone. LiFT is fast and easy to
train with a self-supervised objective, and it boosts the density of ViT
features for minimal extra inference cost. Furthermore, we demonstrate that
LiFT can be applied with approaches that use additional task-specific
downstream modules, as we integrate LiFT with ViTDet for COCO detection and
segmentation. Despite the simplicity of LiFT, we find that it is not simply
learning a more complex version of bilinear interpolation. Instead, our LiFT
training protocol leads to several desirable emergent properties that benefit
ViT features in dense downstream tasks, including greater scale invariance of
the features and better object boundary maps. By simply training LiFT for a
few epochs, we show improved performance on keypoint correspondence, detection,
segmentation, and object discovery tasks. Overall, LiFT provides an easy way to
unlock the benefits of denser feature arrays for a fraction of the
computational cost. For more details, refer to our project page at
https://www.cs.umd.edu/~sakshams/LiFT/
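For intuition, here is a minimal PyTorch sketch of the general idea: a compact learned module that upsamples frozen ViT patch features into a denser grid, trained with a self-supervised consistency objective between half-resolution and full-resolution features. The architecture, loss, and class name below are illustrative assumptions, not the released LiFT code.

```python
import torch
import torch.nn as nn

class LiFTLikeUpsampler(nn.Module):
    """Compact post-processing head mapping a (B, C, H, W) grid of frozen
    ViT patch features to a 2x denser (B, C, 2H, 2W) feature map."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 3, padding=1),
            nn.GELU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(hidden, dim, 3, padding=1),
        )

    def forward(self, feats):
        return self.net(feats)

# Assumed self-supervised objective: features of a downsampled image, once
# upsampled by the module, should match the frozen ViT's features on the
# full-resolution image (random tensors stand in for both here).
vit_dim, grid = 768, 14
upsampler = LiFTLikeUpsampler(dim=vit_dim)
low_res_feats = torch.randn(2, vit_dim, grid, grid)         # from half-res input
target_feats = torch.randn(2, vit_dim, 2 * grid, 2 * grid)  # from full-res input
loss = nn.functional.mse_loss(upsampler(low_res_feats), target_feats)
loss.backward()
```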
AI-based learning content generation and learning pathway augmentation to increase learner engagement
Retaining learner engagement is a major challenge in online learning environments, and it is further intensified as learning spaces are increasingly built by combining resources from multiple independent sources. Several researchers have found that narrative-centric learning experiences improve learner engagement. Towards this end, we propose an AI-based approach that generates auxiliary learning content called narrative fragments, which are interspersed into learning pathways to create interactive learning narratives. The proposed approach automatically generates two types of narrative fragments: overviews of learning pathway segments, and reflection quizzes or formative assessments derived from learning resources in any format, including open educational resources. The generation pipeline consists of various components based on different semantic models and a natural language generation (NLG) component based on the pre-trained language model GPT-2 (Generative Pre-trained Transformer 2). Automation enables narrative fragments to be generated on the fly whenever the learning pathway changes, for example due to the reiteration of concepts or the acquisition of prerequisite knowledge, enabling adaptability in the learning pathways. The proposed approach is domain-agnostic, which makes it easily adaptable to different domains. The NLG model is evaluated using ROUGE scores against several baselines, and the automatically generated narrative fragments are evaluated by human evaluators. We obtained encouraging results in both cases.
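For illustration, here is a minimal sketch of the GPT-2 NLG step using the Hugging Face `transformers` library. The prompt format, decoding settings, and example resource text are assumptions for demonstration; the actual system combines the GPT-2-based NLG component with several semantic-model components that are omitted here.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load an off-the-shelf pre-trained GPT-2; any task-specific adaptation used
# by the full system is omitted in this sketch.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical learning-resource snippet and prompt format.
resource_text = ("Gradient descent updates model parameters in the direction "
                 "that reduces the loss.")
prompt = f"Learning resource: {resource_text}\nOverview for learners:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```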