    Edge Guided GANs with Multi-Scale Contrastive Learning for Semantic Image Synthesis

    We propose a novel ECGAN for the challenging semantic image synthesis task. Although the community has achieved considerable improvements recently, the quality of synthesized images remains far from satisfactory due to three largely unresolved challenges: 1) semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures; 2) widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss and thus cannot fully preserve the original semantic information, leading to semantically inconsistent results (e.g., missing small objects); 3) existing semantic image synthesis methods focus on modeling "local" semantic information from a single input semantic layout but ignore the "global" semantic information of multiple input semantic layouts, i.e., semantic cross-relations between pixels across different input layouts. To tackle 1), we propose to use the edge as an intermediate representation, which further guides image generation via a proposed attention-guided edge transfer module. To tackle 2), we design an effective module that selectively highlights class-dependent feature maps according to the original semantic layout to preserve the semantic information. To tackle 3), inspired by current methods in contrastive learning, we propose a novel contrastive learning method that enforces pixel embeddings belonging to the same semantic class to generate more similar image content than those from different classes. We further propose a novel multi-scale contrastive learning method that pushes same-class features from different scales closer together, capturing more semantic relations by explicitly exploring the structures of labeled pixels from multiple input semantic layouts at different scales. Comment: Accepted to TPAMI; an extended version of a paper published in ICLR 2023. arXiv admin note: substantial text overlap with arXiv:2003.1389
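    As a rough illustration of the class-aware pixel contrastive idea described above, the sketch below pulls embeddings of same-class pixels together under an InfoNCE-style loss. The function name, tensor shapes, and temperature are assumptions for exposition, not the paper's implementation.

        # Minimal sketch (assumed, not the paper's code): same-class pixel
        # embeddings are treated as positives in an InfoNCE-style loss.
        import torch
        import torch.nn.functional as F

        def pixel_contrastive_loss(features, labels, temperature=0.1):
            """features: (N, C, H, W) pixel embeddings; labels: (N, H, W) class ids."""
            n, c, h, w = features.shape
            feats = F.normalize(features, dim=1)              # unit-norm embeddings
            feats = feats.permute(0, 2, 3, 1).reshape(-1, c)  # one row per pixel
            labs = labels.reshape(-1)

            sim = feats @ feats.t() / temperature             # pairwise similarities
            not_self = ~torch.eye(len(labs), dtype=torch.bool, device=feats.device)
            pos = (labs.unsqueeze(0) == labs.unsqueeze(1)) & not_self

            # log-softmax over all non-self pairs, averaged over positives
            log_prob = sim - (sim.exp() * not_self).sum(dim=1, keepdim=True).log()
            loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
            return loss.mean()

    The multi-scale variant would apply the same loss to pixels pooled from feature maps at several resolutions, so that same-class features from different scales also act as positives.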

    Contrastive Learning for Diverse Disentangled Foreground Generation

    We introduce a new method for diverse foreground generation with explicit control over various factors. Existing image-inpainting-based foreground generation methods often struggle to generate diverse results and rarely allow users to explicitly control specific factors of variation (e.g., varying the facial identity or expression in face inpainting results). We leverage contrastive learning with latent codes to generate diverse foreground results for the same masked input. Specifically, we define two sets of latent codes, where one controls a pre-defined factor ("known") and the other controls the remaining factors ("unknown"). The latent codes sampled from the two sets jointly bi-modulate the convolution kernels to guide the generator to synthesize diverse results. Experiments demonstrate the superiority of our method over state-of-the-art methods in result diversity and generation controllability. Comment: ECCV 202
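    The bi-modulation mechanism can be pictured as two latent codes jointly rescaling a convolution kernel per sample, in the spirit of StyleGAN2 weight modulation. The class below is an illustrative assumption, not the authors' code; all names and dimensions are hypothetical.

        # Hypothetical sketch: "known" and "unknown" codes jointly modulate
        # per-channel kernel scales, one kernel per sample via grouped conv.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class BiModulatedConv(nn.Module):
            def __init__(self, in_ch, out_ch, z_known_dim, z_unknown_dim, k=3):
                super().__init__()
                self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k))
                self.scale_known = nn.Linear(z_known_dim, in_ch)
                self.scale_unknown = nn.Linear(z_unknown_dim, in_ch)

            def forward(self, x, z_known, z_unknown):
                # combine the two codes multiplicatively into per-channel scales
                s = (1 + self.scale_known(z_known)) * (1 + self.scale_unknown(z_unknown))
                w = self.weight.unsqueeze(0) * s.view(s.size(0), 1, -1, 1, 1)
                b, _, h, wd = x.shape
                out = F.conv2d(x.reshape(1, -1, h, wd),     # fold batch into channels
                               w.reshape(-1, *w.shape[2:]), # one kernel set per sample
                               padding=self.weight.shape[-1] // 2, groups=b)
                return out.reshape(b, -1, h, wd)

    Resampling codes from the "unknown" set while fixing the "known" code would then vary only the uncontrolled factors.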

    Harnessing the power of diffusion models for plant disease image augmentation

    Introduction: The challenges associated with data availability, class imbalance, and the need for data augmentation are well recognized in the field of plant disease detection. Collecting large-scale datasets for plant diseases is particularly demanding due to seasonal and geographical constraints, leading to significant cost and time investments. Traditional data augmentation techniques, such as cropping, resizing, and rotation, have been largely supplanted by more advanced methods. In particular, the use of Generative Adversarial Networks (GANs) to create realistic synthetic images has become a focal point of contemporary research, addressing data scarcity and class imbalance in the training of deep learning models. Recently, the emergence of diffusion models has captivated the scientific community, offering superior and more realistic output than GANs. Despite these advancements, the application of diffusion models in plant science remains an unexplored frontier, presenting an opportunity for groundbreaking contributions.

    Methods: In this study, we delve into the principles of diffusion technology, contrasting its methodology and performance with state-of-the-art GAN solutions; specifically, we examine a guided-inference GAN model, InstaGAN, and a diffusion-based model, RePaint. Both models use segmentation masks to guide the generation process, albeit following distinct principles. For a fair comparison, a subset of the PlantVillage dataset is used, containing two disease classes of tomato leaves and three disease classes of grape leaves, as results on these classes have been published elsewhere.

    Results: Quantitatively, RePaint outperformed InstaGAN, with an average Fréchet Inception Distance (FID) of 138.28 and a Kernel Inception Distance (KID) of 0.089 ± 0.002, against InstaGAN's average FID of 206.02 and KID of 0.159 ± 0.004. For grape leaf diseases, RePaint's FID of 69.05 outperformed other published methods such as DCGAN (309.376), LeafGAN (178.256), and InstaGAN (114.28). For tomato leaf diseases, RePaint achieved an FID of 161.35, surpassing WGAN (226.08), SAGAN (229.7233), and InstaGAN (236.61).

    Discussion: This study offers valuable insights into the potential of diffusion models for data augmentation in plant disease detection, paving the way for future research in this promising field.
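    For reference, the FID numbers above compare Gaussian fits of Inception features from real and generated images. A minimal sketch of the metric, assuming the features are already extracted (this is the standard definition, not the study's exact evaluation code):

        # FID between two feature sets of shape (n, d), e.g., Inception pool features.
        import numpy as np
        from scipy.linalg import sqrtm

        def fid(feats_real, feats_fake):
            mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
            s1 = np.cov(feats_real, rowvar=False)
            s2 = np.cov(feats_fake, rowvar=False)
            covmean = sqrtm(s1 @ s2)
            if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
                covmean = covmean.real
            return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))

    Lower is better, which is why RePaint's average of 138.28 indicates generated distributions closer to the real data than InstaGAN's 206.02.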

    Generative Adversarial Networks (GANs): Challenges, Solutions, and Future Directions

    Generative Adversarial Networks (GANs) are a novel class of deep generative models that has recently gained significant attention. GANs implicitly learn complex, high-dimensional distributions over images, audio, and other data. However, training GANs poses major challenges, namely mode collapse, non-convergence, and instability, arising from inappropriate network architecture design, choice of objective function, and selection of optimization algorithm. Recently, to address these challenges, several solutions for better design and optimization of GANs have been investigated, based on re-engineered network architectures, new objective functions, and alternative optimization algorithms. To the best of our knowledge, no existing survey has particularly focused on the broad and systematic development of these solutions. In this study, we perform a comprehensive survey of the advancements in GAN design and optimization solutions proposed to handle GAN challenges. We first identify key research issues within each design and optimization technique and then propose a new taxonomy that structures solutions by key research issue. In accordance with the taxonomy, we provide a detailed discussion of the different GAN variants proposed within each solution and their relationships. Finally, based on the insights gained, we present promising research directions in this rapidly growing field. Comment: 42 pages, Figure 13, Table
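    The instability the survey addresses stems from the adversarial objective itself. The minimal non-saturating training step below (with placeholder modules, not taken from the survey) shows where the three design choices it taxonomizes, architecture, objective function, and optimizer, enter the loop.

        # One GAN training step with the non-saturating loss; G, D, and the
        # optimizers are placeholders standing in for any concrete design.
        import torch
        import torch.nn.functional as F

        def gan_step(G, D, real, opt_g, opt_d, z_dim=128):
            z = torch.randn(real.size(0), z_dim, device=real.device)
            fake = G(z)

            # discriminator: push D(real) toward 1 and D(fake) toward 0
            real_logits, fake_logits = D(real), D(fake.detach())
            d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
                      + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

            # generator (non-saturating loss): push D(fake) toward 1
            fake_pred = D(fake)
            g_loss = F.binary_cross_entropy_with_logits(fake_pred, torch.ones_like(fake_pred))
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
            return d_loss.item(), g_loss.item()

    Mode collapse shows up here when G maps many different z to near-identical outputs; re-engineered objectives (e.g., Wasserstein losses) and alternative optimizers target exactly these failure modes.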

    Towards Interaction-level Video Action Understanding

    A huge number of videos are created, shared, and viewed daily. Among these massive videos, human actions and activities account for a large part. We desire machines to understand human actions in videos, as this is essential to various applications, including but not limited to autonomous driving, security systems, human-robot interaction, and healthcare. Toward a truly intelligent system able to interact with humans, video understanding must go beyond simply answering "what is the action in the video"; it must grasp what those actions mean to humans and be more in line with human thinking, which we call interaction-level action understanding. This thesis identifies three main challenges on the way to interaction-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via natural human language. For the first challenge, we select video summarization as a representative task, which aims to select informative frames that retain high-level information based on human annotators' experience. Through a self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information for video summarization, the proposed model is capable of understanding video from human consensus (e.g., how humans judge which parts of an action sequence are essential). For the second challenge, our works on action quality assessment utilize transformer decoders to parse the input action into several sub-actions and assess the finer-grained quality of the given action, yielding the capability of action understanding given specific human rules (e.g., how well a diving action is performed, how well a robot performs surgery). The third key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language. We demonstrate the utility of this technique for the video captioning task, which takes an action video as input, outputs natural language, and yields state-of-the-art performance. In conclusion, the research directions and methods introduced in this thesis provide fundamental components toward interaction-level action understanding.
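    As a concrete illustration of the self-attention ingredient used for the video summarization task, a frame scorer can contextualize per-frame features against the whole sequence before scoring importance. This sketch is an assumption for exposition, not the thesis model; the class name and dimensions are hypothetical.

        # Assumed sketch: a transformer encoder scores frame importance for summarization.
        import torch
        import torch.nn as nn

        class FrameScorer(nn.Module):
            def __init__(self, feat_dim=1024, n_heads=8, n_layers=2):
                super().__init__()
                layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                                   batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
                self.head = nn.Linear(feat_dim, 1)

            def forward(self, frame_feats):        # (B, T, feat_dim) per-frame features
                ctx = self.encoder(frame_feats)    # let frames attend to each other
                return self.head(ctx).squeeze(-1)  # (B, T) importance score per frame

    A summary is then formed by selecting the highest-scoring frames under a length budget, approximating the consensus of human annotators.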