501 research outputs found

    A Survey on Deep Learning in Medical Image Analysis

    Full text link
    Deep learning algorithms, in particular convolutional networks, have rapidly become a methodology of choice for analyzing medical images. This paper reviews the major deep learning concepts pertinent to medical image analysis and summarizes over 300 contributions to the field, most of which appeared in the last year. We survey the use of deep learning for image classification, object detection, segmentation, registration, and other tasks and provide concise overviews of studies per application area. Open challenges and directions for future research are discussed.Comment: Revised survey includes expanded discussion section and reworked introductory section on common deep architectures. Added missed papers from before Feb 1st 201

    Matryoshka Diffusion Models

    Full text link
    Diffusion models are the de facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion Models(MDM), an end-to-end framework for high-resolution image and video synthesis. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small-scale inputs are nested within those of large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions, which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024x1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.Comment: 28 pages, 18 figure

    Re-Imagen: Retrieval-Augmented Text-to-Image Generator

    Full text link
    Research on text-to-image generation has witnessed significant progress in generating diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as `Chortai (dog)' or `Picarones (food)'. To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with the knowledge of high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities' visual appearances. We train Re-Imagen on a constructed dataset containing (image, text, retrieval) triples to teach the model to ground on both text prompt and retrieval. Furthermore, we develop a new sampling strategy to interleave the classifier-free guidance for text and retrieval conditions to balance the text and retrieval alignment. Re-Imagen achieves significant gain on FID score over COCO and WikiImage. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple object categories including dogs, foods, landmarks, birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen can significantly improve the fidelity of generated images, especially on less frequent entities.Comment: 9 page

    RobustLoc: Robust Camera Pose Regression in Challenging Driving Environments

    Full text link
    Camera relocalization has various applications in autonomous driving. Previous camera pose regression models consider only ideal scenarios where there is little environmental perturbation. To deal with challenging driving environments that may have changing seasons, weather, illumination, and the presence of unstable objects, we propose RobustLoc, which derives its robustness against perturbations from neural differential equations. Our model uses a convolutional neural network to extract feature maps from multi-view images, a robust neural differential equation diffusion block module to diffuse information interactively, and a branched pose decoder with multi-layer training to estimate the vehicle poses. Experiments demonstrate that RobustLoc surpasses current state-of-the-art camera pose regression models and achieves robust performance in various environments. Our code is released at: https://github.com/sijieaaa/RobustLo

    Deep Constrained Dominant Sets for Person Re-Identification

    Get PDF
    In this work, we propose an end-to-end constrained clustering scheme to tackle the person re-identification (re-id) problem. Deep neural networks (DNN) have recently proven to be effective on person re-identification task. In particular, rather than leveraging solely a probe-gallery similarity, diffusing the similarities among the gallery images in an end-to-end manner has proven to be effective in yielding a robust probe-gallery affinity. However, existing methods do not apply probe image as a constraint, and are prone to noise propagation during the similarity diffusion process. To overcome this, we propose an intriguing scheme which treats person-image retrieval problem as a constrained clustering optimization problem, called deep constrained dominant sets (DCDS). Given a probe and gallery images, we re-formulate person re-id problem as finding a constrained cluster, where the probe image is taken as a constraint (seed) and each cluster corresponds to a set of images corresponding to the same person. By optimizing the constrained clustering in an end-to-end manner, we naturally leverage the contextual knowledge of a set of images corresponding to the given person-images. We further enhance the performance by integrating an auxiliary net alongside DCDS, which employs a multi-scale ResNet. To validate the effectiveness of our method we present experiments on several benchmark datasets and show that the proposed method can outperform state-of-the-art methods

    RePoseDM: Recurrent Pose Alignment and Gradient Guidance for Pose Guided Image Synthesis

    Full text link
    Pose-guided person image synthesis task requires re-rendering a reference image, which should have a photorealistic appearance and flawless pose transfer. Since person images are highly structured, existing approaches require dense connections for complex deformations and occlusions because these are generally handled through multi-level warping and masking in latent space. But the feature maps generated by convolutional neural networks do not have equivariance, and hence even the multi-level warping does not have a perfect pose alignment. Inspired by the ability of the diffusion model to generate photorealistic images from the given conditional guidance, we propose recurrent pose alignment to provide pose-aligned texture features as conditional guidance. Moreover, we propose gradient guidance from pose interaction fields, which output the distance from the valid pose manifold given a target pose as input. This helps in learning plausible pose transfer trajectories that result in photorealism and undistorted texture details. Extensive results on two large-scale benchmarks and a user study demonstrate the ability of our proposed approach to generate photorealistic pose transfer under challenging scenarios. Additionally, we prove the efficiency of gradient guidance in pose-guided image generation on the HumanArt dataset with fine-tuned stable diffusion.Comment: 10 pages, 4 tables, 7 figure

    Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work

    Full text link
    Inspired by the fact that human brains can emphasize discriminative parts of the input and suppress irrelevant ones, substantial local mechanisms have been designed to boost the development of computer vision. They can not only focus on target parts to learn discriminative local representations, but also process information selectively to improve the efficiency. In terms of application scenarios and paradigms, local mechanisms have different characteristics. In this survey, we provide a systematic review of local mechanisms for various computer vision tasks and approaches, including fine-grained visual recognition, person re-identification, few-/zero-shot learning, multi-modal learning, self-supervised learning, Vision Transformers, and so on. Categorization of local mechanisms in each field is summarized. Then, advantages and disadvantages for every category are analyzed deeply, leaving room for exploration. Finally, future research directions about local mechanisms have also been discussed that may benefit future works. To the best our knowledge, this is the first survey about local mechanisms on computer vision. We hope that this survey can shed light on future research in the computer vision field
    corecore