DATA-DRIVEN APPROACH TO IMAGE CLASSIFICATION
Image classification has been a core topic in the computer vision community. Its recent success with convolutional neural network (CNN) algorithms has led to various real-world applications such as large-scale management of photos/videos on cloud and social-media platforms, image-based search for online retailers, self-driving cars, robotics, and healthcare. Image classification can be broadly categorized into binary, multi-class, and multi-label classification problems. Binary classification involves assigning one of two class labels to an instance. In a multi-class classification problem, an instance must be categorized into one of more than two classes. Multi-label classification is a generalization of the multi-class problem in which each image is assigned multiple labels as opposed to a single label.
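The three problem settings above differ only in what a prediction looks like for one image. A minimal illustrative sketch (not code from the work itself; all function names are hypothetical):

```python
# Binary: one of two labels. Multi-class: exactly one of >2 classes.
# Multi-label: any subset of classes whose scores clear a threshold.

def classify_binary(score, threshold=0.5):
    """Binary classification: map a single score to label 0 or 1."""
    return 1 if score >= threshold else 0

def classify_multiclass(scores):
    """Multi-class: pick the single highest-scoring class (argmax)."""
    return max(range(len(scores)), key=lambda i: scores[i])

def classify_multilabel(scores, threshold=0.5):
    """Multi-label: return every class index whose score clears the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

print(classify_binary(0.7))                  # 1
print(classify_multiclass([0.1, 0.6, 0.3]))  # 1
print(classify_multilabel([0.6, 0.2, 0.9]))  # [0, 2]
```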
In this work, we first present various methods that take advantage of deep representations (the fully connected layer of a CNN pre-trained on the ImageNet dataset) and yield better performance on multi-label classification than methods that use over a dozen conventional visual features. Following the success of deep representations, we aim to build a generic end-to-end deep learning framework that addresses all three problem categories of image classification. However, there are still no well-established guidelines (in terms of choosing the depth of the network, the number and size of kernels, the type of regularizer, the choice of non-linearity, etc.) for building an efficient deep neural network, and network architecture design is often specific to a problem or dataset. Hence, we present initial efforts toward a completely data-driven computational framework called Deep Decision Network (DDN). DDN is a tree-like structure built stage-wise. During the learning phase, starting from the root network node, DDN automatically builds a network that splits the data into disjoint clusters of classes, which are then handled by subsequent expert networks. This results in a tree-structured network driven by the data. The proposed approach provides insight into the data by identifying groups of classes that are hard to classify and require more attention than others. This feature is crucial for practitioners with little or no domain knowledge, especially for applications in the medical domain. Initially, we evaluate DDN on a binary classification problem and later extend it to the more challenging multi-class and multi-label classification problems. The extensions of DDN to multi-class and multi-label classification involve some changes but operate under the same underlying principle.
In all three cases, the proposed approach is tested for recognition performance and scalability on publicly available datasets, with comparisons to other methods.
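The routing idea behind DDN can be pictured as follows. This is a hedged toy sketch of the stage-wise principle only, not the authors' implementation; the router, experts, and cluster assignments here are all illustrative:

```python
# DDN-style inference sketch: a root network routes each input to one of
# several disjoint class clusters; an expert network then makes the final
# prediction within that cluster.

def ddn_predict(x, root_router, experts):
    """Route `x` to a cluster via the root node, then let the matching
    expert predict the final class within that cluster."""
    cluster_id = root_router(x)
    return experts[cluster_id](x)

# Toy setup: cluster 0 holds classes {0, 1}; cluster 1 holds classes {2, 3}.
router = lambda x: 0 if x < 10 else 1
experts = {
    0: lambda x: 0 if x < 5 else 1,
    1: lambda x: 2 if x < 15 else 3,
}
print(ddn_predict(3, router, experts))   # 0
print(ddn_predict(17, router, experts))  # 3
```

In the real framework each router/expert would itself be a trained CNN, and the clusters are discovered from the data rather than fixed by hand.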
A Survey on Deep Learning in Medical Image Analysis
Deep learning algorithms, in particular convolutional networks, have rapidly
become a methodology of choice for analyzing medical images. This paper reviews
the major deep learning concepts pertinent to medical image analysis and
summarizes over 300 contributions to the field, most of which appeared in the
last year. We survey the use of deep learning for image classification, object
detection, segmentation, registration, and other tasks and provide concise
overviews of studies per application area. Open challenges and directions for
future research are discussed.
Comment: Revised survey includes expanded discussion section and reworked introductory section on common deep architectures. Added missed papers from before Feb 1st 201
Matryoshka Diffusion Models
Diffusion models are the de facto approach for generating high-quality images
and videos, but learning high-dimensional models remains a formidable task due
to computational and optimization challenges. Existing methods often resort to
training cascaded models in pixel space or using a downsampled latent space of
a separately trained auto-encoder. In this paper, we introduce Matryoshka
Diffusion Models (MDM), an end-to-end framework for high-resolution image and
video synthesis. We propose a diffusion process that denoises inputs at
multiple resolutions jointly and uses a NestedUNet architecture where features
and parameters for small-scale inputs are nested within those of large scales.
In addition, MDM enables a progressive training schedule from lower to higher
resolutions, which leads to significant improvements in optimization for
high-resolution generation. We demonstrate the effectiveness of our approach on
various benchmarks, including class-conditioned image generation,
high-resolution text-to-image, and text-to-video applications. Remarkably, we
can train a single pixel-space model at resolutions of up to 1024x1024 pixels,
demonstrating strong zero-shot generalization using the CC12M dataset, which
contains only 12 million images.
Comment: 28 pages, 18 figures
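The joint multi-resolution idea can be sketched in miniature. This is an illustrative toy, not MDM's code: the nested resolutions are tied together by average pooling, and each resolution has its own denoiser (here identity stand-ins for the nested networks):

```python
# Toy sketch of denoising inputs at multiple resolutions jointly, where the
# low-res view is the average-pooled version of the high-res one.

def downsample(img, factor):
    """Average-pool a square image (list of lists) by `factor`."""
    n = len(img) // factor
    return [[sum(img[i * factor + di][j * factor + dj]
                 for di in range(factor) for dj in range(factor)) / factor**2
             for j in range(n)]
            for i in range(n)]

def joint_denoise_step(noisy_by_res, denoisers):
    """One step that denoises every resolution with its own parameters."""
    return {res: denoisers[res](x) for res, x in noisy_by_res.items()}

hi = [[1, 1, 3, 3], [1, 1, 3, 3], [5, 5, 7, 7], [5, 5, 7, 7]]
lo = downsample(hi, 2)
print(lo)  # [[1.0, 3.0], [5.0, 7.0]]

denoisers = {4: lambda x: x, 2: lambda x: x}  # identity stand-ins
step = joint_denoise_step({4: hi, 2: lo}, denoisers)
print(step[2])  # [[1.0, 3.0], [5.0, 7.0]]
```

The progressive schedule in the paper would train the low-resolution part first, then grow to higher resolutions.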
Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Research on text-to-image generation has witnessed significant progress in
generating diverse and photo-realistic images, driven by diffusion and
auto-regressive models trained on large-scale image-text data. Though
state-of-the-art models can generate high-quality images of common entities,
they often have difficulty generating images of uncommon entities, such as
`Chortai (dog)' or `Picarones (food)'. To tackle this issue, we present the
Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model
that uses retrieved information to produce high-fidelity and faithful images,
even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an
external multi-modal knowledge base to retrieve relevant (image, text) pairs
and uses them as references to generate the image. With this retrieval step,
Re-Imagen is augmented with the knowledge of high-level semantics and low-level
visual details of the mentioned entities, and thus improves its accuracy in
generating the entities' visual appearances. We train Re-Imagen on a
constructed dataset containing (image, text, retrieval) triples to teach the
model to ground on both text prompt and retrieval. Furthermore, we develop a
new sampling strategy to interleave the classifier-free guidance for text and
retrieval conditions to balance the text and retrieval alignment. Re-Imagen
achieves significant gain on FID score over COCO and WikiImage. To further
evaluate the capabilities of the model, we introduce EntityDrawBench, a new
benchmark that evaluates image generation for diverse entities, from frequent
to rare, across multiple object categories including dogs, foods, landmarks,
birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen
can significantly improve the fidelity of generated images, especially on less
frequent entities.
Comment: 9 pages
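The retrieval step described above can be sketched as a simple top-k lookup. This is a hedged stand-in, not Re-Imagen's actual retriever or API: word overlap substitutes for the learned similarity, and all names are placeholders:

```python
# Sketch: given a prompt, fetch the top-k (image, text) pairs from an
# external knowledge base to use as conditioning references.

def retrieve(prompt, knowledge_base, k=2):
    """Rank (image_path, caption) pairs by word overlap with the prompt
    (a toy stand-in for a learned multi-modal retriever); return top-k."""
    words = set(prompt.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda pair: len(words & set(pair[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

kb = [
    ("img_a.png", "a chortai dog running in a field"),
    ("img_b.png", "a plate of picarones food"),
    ("img_c.png", "a city skyline at night"),
]
refs = retrieve("photo of a chortai dog", kb, k=1)
print(refs)  # [('img_a.png', 'a chortai dog running in a field')]
```

In the real system the retrieved pairs then condition the diffusion model alongside the text prompt, with separate classifier-free guidance for each condition.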
RobustLoc: Robust Camera Pose Regression in Challenging Driving Environments
Camera relocalization has various applications in autonomous driving.
Previous camera pose regression models consider only ideal scenarios where
there is little environmental perturbation. To deal with challenging driving
environments that may have changing seasons, weather, illumination, and the
presence of unstable objects, we propose RobustLoc, which derives its
robustness against perturbations from neural differential equations. Our model
uses a convolutional neural network to extract feature maps from multi-view
images, a robust neural differential equation diffusion block module to diffuse
information interactively, and a branched pose decoder with multi-layer
training to estimate the vehicle poses. Experiments demonstrate that RobustLoc
surpasses current state-of-the-art camera pose regression models and achieves
robust performance in various environments. Our code is released at:
https://github.com/sijieaaa/RobustLo
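The neural-differential-equation ingredient can be pictured as features evolving under a learned vector field, integrated by a fixed-step solver. This is an illustrative sketch under that assumption, not the released RobustLoc code:

```python
# Toy neural-ODE-style block: integrate d(feat)/dt = field(feat, t) with
# forward Euler. A smoothing field damps perturbations in the features,
# which is the intuition behind robustness to environmental noise.

def ode_block(feat, vector_field, t0=0.0, t1=1.0, steps=4):
    """Integrate the feature vector from t0 to t1 with Euler steps."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        feat = [f + h * d for f, d in zip(feat, vector_field(feat, t))]
        t += h
    return feat

# Toy field that decays each feature toward zero (smooths perturbations):
# each Euler step multiplies the features by (1 - h) = 0.75.
field = lambda feat, t: [-f for f in feat]
out = ode_block([1.0, -2.0], field)
print(out)  # [0.31640625, -0.6328125], i.e. 0.75**4 times the input
```

In the actual model the vector field is a learned network acting on CNN feature maps from multiple views, and the output feeds a branched pose decoder.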
Deep Constrained Dominant Sets for Person Re-Identification
In this work, we propose an end-to-end constrained clustering scheme to tackle the person re-identification (re-id) problem. Deep neural networks (DNN) have recently proven to be effective on the person re-identification task. In particular, rather than leveraging solely a probe-gallery similarity, diffusing the similarities among the gallery images in an end-to-end manner has proven effective in yielding a robust probe-gallery affinity. However, existing methods do not use the probe image as a constraint and are prone to noise propagation during the similarity diffusion process. To overcome this, we propose a scheme that treats the person-image retrieval problem as a constrained clustering optimization problem, called deep constrained dominant sets (DCDS). Given a probe image and gallery images, we re-formulate the person re-id problem as finding a constrained cluster, where the probe image is taken as a constraint (seed) and each cluster corresponds to a set of images of the same person. By optimizing the constrained clustering in an end-to-end manner, we naturally leverage the contextual knowledge of the set of images corresponding to the given person. We further enhance the performance by integrating an auxiliary net alongside DCDS, which employs a multi-scale ResNet. To validate the effectiveness of our method, we present experiments on several benchmark datasets and show that the proposed method can outperform state-of-the-art methods.
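The constrained-clustering view can be illustrated with a greatly simplified greedy sketch. This is not the authors' dominant-sets optimization (which solves a constrained quadratic program); it only shows the role of the probe as a seed, and all names are hypothetical:

```python
# Toy "probe as seed" clustering: grow a cluster around the probe; a
# gallery image joins if its mean affinity to the current cluster members
# stays above a threshold.

def constrained_cluster(probe, gallery, affinity, tau=0.5):
    """Greedily grow a cluster seeded at `probe` from `gallery` items."""
    cluster = [probe]
    for g in gallery:
        mean_aff = sum(affinity(g, m) for m in cluster) / len(cluster)
        if mean_aff > tau:
            cluster.append(g)
    return cluster

# Toy affinity: images of the same person share an id prefix.
aff = lambda a, b: 1.0 if a.split("_")[0] == b.split("_")[0] else 0.0
result = constrained_cluster("p1_probe", ["p1_a", "p2_a", "p1_b"], aff)
print(result)  # ['p1_probe', 'p1_a', 'p1_b']
```

The key property carried over from the paper is that the probe constrains which cluster is found, so the retrieved set is anchored to the query identity rather than drifting with diffusion noise.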
RePoseDM: Recurrent Pose Alignment and Gradient Guidance for Pose Guided Image Synthesis
Pose-guided person image synthesis task requires re-rendering a reference
image, which should have a photorealistic appearance and flawless pose
transfer. Since person images are highly structured, existing approaches
require dense connections for complex deformations and occlusions because these
are generally handled through multi-level warping and masking in latent space.
But the feature maps generated by convolutional neural networks do not have
equivariance, and hence even the multi-level warping does not have a perfect
pose alignment. Inspired by the ability of the diffusion model to generate
photorealistic images from the given conditional guidance, we propose recurrent
pose alignment to provide pose-aligned texture features as conditional
guidance. Moreover, we propose gradient guidance from pose interaction fields,
which output the distance from the valid pose manifold given a target pose as
input. This helps in learning plausible pose transfer trajectories that result
in photorealism and undistorted texture details. Extensive results on two
large-scale benchmarks and a user study demonstrate the ability of our proposed
approach to generate photorealistic pose transfer under challenging scenarios.
Additionally, we prove the efficiency of gradient guidance in pose-guided image
generation on the HumanArt dataset with fine-tuned Stable Diffusion.
Comment: 10 pages, 4 tables, 7 figures
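The gradient-guidance idea above can be sketched abstractly: a field returns the distance to the valid-pose manifold, and its gradient nudges a sample toward plausible poses. This is an assumed, simplified form (finite-difference gradient on a toy field), not the paper's pose interaction fields:

```python
# One guidance step: estimate the gradient of the manifold-distance field
# by finite differences and move the pose downhill.

def guided_step(pose, distance_field, lr=0.25, eps=1e-4):
    """Gradient-descent step on the distance to the valid-pose manifold."""
    grad = []
    for i in range(len(pose)):
        bumped = list(pose)
        bumped[i] += eps
        grad.append((distance_field(bumped) - distance_field(pose)) / eps)
    return [p - lr * g for p, g in zip(pose, grad)]

# Toy field: valid poses sit at the origin; distance is the squared norm,
# so one step with lr=0.25 halves each coordinate.
dist = lambda p: sum(x * x for x in p)
pose = guided_step([2.0, -1.0], dist)
print(pose)  # roughly [1.0, -0.5]
```

In the diffusion setting this gradient would be added to the denoising update at each step, steering samples along plausible pose-transfer trajectories.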
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Inspired by the fact that human brains can emphasize discriminative parts of the input and suppress irrelevant ones, numerous local mechanisms have been designed to boost the development of computer vision. They can not only focus on target parts to learn discriminative local representations, but also process information selectively to improve efficiency. Local mechanisms have different characteristics across application scenarios and paradigms. In this survey, we provide a systematic review of local mechanisms for various computer vision tasks and approaches, including fine-grained visual recognition, person re-identification, few-/zero-shot learning, multi-modal learning, self-supervised learning, Vision Transformers, and so on. We summarize the categorization of local mechanisms in each field, then analyze the advantages and disadvantages of every category in depth, leaving room for further exploration. Finally, we discuss future research directions for local mechanisms that may benefit subsequent work. To the best of our knowledge, this is the first survey on local mechanisms in computer vision. We hope it sheds light on future research in the field.