75 research outputs found

    Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

    Spatial control is a core capability in controllable image generation. Advancements in layout-guided image generation have shown promising results on in-distribution (ID) datasets with similar spatial configurations. However, it is unclear how these models perform when facing out-of-distribution (OOD) samples with arbitrary, unseen layouts. In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape. We benchmark two recent representative layout-guided image generation methods and observe that good ID layout control may not generalize well to arbitrary layouts in the wild (e.g., objects at the boundary). Next, we propose IterInpaint, a new baseline that generates foreground and background regions step by step via inpainting, demonstrating stronger generalizability than existing models on OOD layouts in LayoutBench. We perform quantitative and qualitative evaluation and fine-grained analysis on the four LayoutBench skills to pinpoint the weaknesses of existing models. Lastly, we present comprehensive ablation studies on IterInpaint, including the training task ratio, crop-and-paste vs. repaint, and generation order. Project website: https://layoutbench.github.io. Comment: 22 pages.
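
    The abstract describes IterInpaint's step-by-step generation only at a high level. As a rough illustration, the following is a minimal Python sketch of layout-guided generation via iterative inpainting, built on an off-the-shelf Stable Diffusion inpainting pipeline from the diffusers library rather than the authors' released model; the box format, region prompts, and background prompt are illustrative assumptions.

        # Hedged sketch of step-by-step, layout-guided generation via inpainting,
        # in the spirit of IterInpaint (not the paper's released code).
        import torch
        from PIL import Image, ImageDraw
        from diffusers import StableDiffusionInpaintPipeline

        pipe = StableDiffusionInpaintPipeline.from_pretrained(
            "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
        ).to("cuda")

        def box_mask(size, box):
            """White inside `box` (region to repaint), black elsewhere."""
            mask = Image.new("L", size, 0)
            ImageDraw.Draw(mask).rectangle(box, fill=255)
            return mask

        def iterative_inpaint(layout, size=(512, 512)):
            canvas = Image.new("RGB", size, "gray")      # blank starting canvas
            covered = Image.new("L", size, 0)            # union of generated boxes
            for box, phrase in layout:                   # foreground, one region at a time
                canvas = pipe(prompt=phrase, image=canvas,
                              mask_image=box_mask(size, box)).images[0]
                ImageDraw.Draw(covered).rectangle(box, fill=255)
            bg_mask = Image.eval(covered, lambda v: 255 - v)   # regions not yet generated
            return pipe(prompt="a realistic background", image=canvas,
                        mask_image=bg_mask).images[0]

        layout = [((50, 60, 200, 220), "a red bus"),
                  ((260, 300, 460, 480), "a small dog")]
        image = iterative_inpaint(layout)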

    An Empirical Study of Multimodal Model Merging

    Model merging (e.g., via interpolation or task arithmetic) fuses multiple models trained on different tasks to generate a multi-task solution. The technique has proven successful in previous studies, where the models were trained on similar tasks and from the same initialization. In this paper, we expand on this concept in a multimodal setup by merging transformers trained on different modalities. Furthermore, we pursue a novel goal: merging the vision, language, and cross-modal transformers of a modality-specific architecture to create a parameter-efficient, modality-agnostic architecture. Through comprehensive experiments, we systematically investigate the key factors impacting model performance after merging, including initialization, merging mechanisms, and model architectures. Our analysis leads to an effective training recipe for matching the performance of the modality-agnostic baseline (i.e., pre-trained from scratch) via model merging. Our code is available at: https://github.com/ylsung/vl-mergin
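
    To make the core operation concrete, here is a minimal Python sketch of merging by weight interpolation, the basic mechanism the study builds on; the 0.5 ratio and the model variables in the usage comment are illustrative assumptions, and the paper also examines task arithmetic and other merging mechanisms.

        def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
            """Element-wise interpolation of two models with identical architectures."""
            assert sd_a.keys() == sd_b.keys(), "models must share the same parameter names"
            return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

        # Usage: merge a vision-tuned and a language-tuned copy of the same transformer,
        # assuming both were fine-tuned from the same initialization (hypothetical names).
        # merged = interpolate_state_dicts(vision_model.state_dict(),
        #                                  language_model.state_dict(), alpha=0.5)
        # vision_model.load_state_dict(merged)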

    Clinical characteristics of cystic encephalomalacia in children

    Purpose: To investigate the primary causes and clinical characteristics of cystic encephalomalacia (CE) in children. Methods: The clinical data of 50 children who were admitted to our hospital due to CE between January 2008 and December 2020 were retrospectively reviewed. Their primary causes, clinical manifestations, and cranial magnetic resonance imaging features were analyzed. Results: Among all patients, 5 had prematurity, 19 had hypoxic-ischemic encephalopathy (HIE), 13 had intracranial infection, 14 had traumatic brain injury and hemorrhage, 4 had cerebral infarction, 2 had congenital genetic diseases, and 1 had hypoglycemia. The average time from primary disease onset to CE diagnosis was 70.1 ± 61.0 days. The clinical manifestations included speech or motor developmental delay (n = 33), epilepsy (n = 31), dystonia (n = 27), limb paralysis (n = 16), and visual or auditory impairment (n = 5). Patients with HIE as the primary cause of CE had a significantly higher occurrence of dystonia, while a significantly higher incidence of paralysis was observed in those with cerebral infarction as the primary cause. Conclusion: CE in children is mainly caused by HIE, intracranial infection, and cerebral hemorrhage. The major clinical manifestations included speech or motor developmental delay, epilepsy, and dystonia. Magnetic resonance imaging is an important tool for the diagnosis of CE.

    Prompting GPT-3 To Be Reliable

    Large language models (LLMs) show impressive abilities via few-shot prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world language applications. However, the crucial problem of how to improve the reliability of GPT-3 is still under-explored. While reliability is a broad and vaguely defined term, we decompose reliability into four main facets that correspond to the existing framework of ML safety and are well recognized to be important: generalizability, social biases, calibration, and factuality. Our core contribution is to establish simple and effective prompts that improve GPT-3's reliability so that it: 1) generalizes out-of-distribution, 2) balances demographic distribution and uses natural language instructions to reduce social biases, 3) calibrates output probabilities, and 4) updates the LLM's factual knowledge and reasoning chains. With appropriate prompts, GPT-3 is more reliable than smaller-scale supervised models on all these facets. We release all processed datasets, evaluation scripts, and model predictions. Our systematic empirical study not only sheds new light on the reliability of prompting LLMs but, more importantly, shows that our prompting strategies can help practitioners use LLMs like GPT-3 more reliably. Comment: ICLR 2023.
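
    Of the four facets, calibration is the one most directly expressed as a formula. As a small illustration, the Python sketch below computes expected calibration error (ECE) from per-answer confidences, e.g. answer-token probabilities returned by an LLM API; the equal-width binning and the toy inputs are illustrative assumptions, not the paper's exact evaluation setup.

        import numpy as np

        def expected_calibration_error(confidences, correct, n_bins=10):
            """Weighted average gap between confidence and accuracy across confidence bins."""
            confidences = np.asarray(confidences, dtype=float)
            correct = np.asarray(correct, dtype=float)
            bins = np.linspace(0.0, 1.0, n_bins + 1)
            ece = 0.0
            for lo, hi in zip(bins[:-1], bins[1:]):
                in_bin = (confidences > lo) & (confidences <= hi)
                if in_bin.any():
                    gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                    ece += in_bin.mean() * gap
            return ece

        # Toy example: three answers with model confidence vs. whether each was correct.
        print(expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 1]))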

    An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

    Masked visual modeling (MVM) has recently been proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that can largely benefit downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model, VIOLETv2. Empirically, we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval. Comment: CVPR'23; the first two authors contributed equally; code is available at https://github.com/tsujuifu/pytorch_empirical-mv
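
    For readers unfamiliar with MVM, the following is a minimal PyTorch sketch of the simplest variant compared in the paper: masking a subset of video patches and reconstructing their raw pixel values with an MSE loss. The shapes, 15% mask ratio, and the tiny encoder/decoder are illustrative assumptions, not the VIOLETv2 architecture.

        import torch
        import torch.nn as nn

        B, T, N, D = 2, 4, 196, 768            # batch, frames, patches per frame, patch dim
        patches = torch.randn(B, T * N, D)     # flattened video patches (reconstruction targets)

        mask = torch.rand(B, T * N) < 0.15     # randomly mask ~15% of the patches
        inputs = patches.clone()
        inputs[mask] = 0.0                     # blank out masked patches in the input

        encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
            num_layers=2)
        decoder = nn.Linear(D, D)              # predict the original patch content

        pred = decoder(encoder(inputs))
        loss = ((pred[mask] - patches[mask]) ** 2).mean()   # MSE on masked patches only
        loss.backward()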

    ReCo: Region-Controlled Text-to-Image Generation

    Recently, large-scale text-to-image (T2I) models have shown impressive performance in generating high-fidelity images, but with limited controllability, e.g., precisely specifying the content of a specific region with a free-form text description. In this paper, we propose an effective technique for such regional control in T2I generation. We augment T2I models' inputs with an extra set of position tokens, which represent quantized spatial coordinates. Each region is specified by four position tokens representing the top-left and bottom-right corners, followed by an open-ended natural language regional description. We then fine-tune a pre-trained T2I model with this new input interface. Our model, dubbed ReCo (Region-Controlled T2I), enables region control for arbitrary objects described by open-ended regional texts rather than by object labels from a constrained category set. Empirically, ReCo achieves better image quality than the T2I model strengthened by positional words (FID: 8.82->7.36, SceneFID: 15.54->6.51 on COCO), together with more accurately placed objects, amounting to a 20.40% region classification accuracy improvement on COCO. Furthermore, we demonstrate that ReCo can better control the object count, spatial relationship, and region attributes such as color/size with the free-form regional description. Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate than the T2I model in generating images with the correct object count and spatial relationship.
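
    To illustrate the input interface described above, here is a hedged Python sketch that turns a region into ReCo-style conditioning: four quantized position tokens for the top-left and bottom-right corners, followed by the free-form regional text. The number of bins and the <bin_k> token naming are illustrative assumptions, not the released tokenizer.

        def box_to_position_tokens(box, image_w, image_h, num_bins=1000):
            """Quantize (x0, y0, x1, y1) pixel coordinates into discrete position tokens."""
            x0, y0, x1, y1 = box
            coords = [x0 / image_w, y0 / image_h, x1 / image_w, y1 / image_h]
            return [f"<bin_{min(int(c * num_bins), num_bins - 1)}>" for c in coords]

        def build_region_prompt(caption, regions, image_w, image_h):
            """Image caption followed by (position tokens + regional description) per region."""
            parts = [caption]
            for box, text in regions:
                parts += box_to_position_tokens(box, image_w, image_h) + [text]
            return " ".join(parts)

        print(build_region_prompt(
            "a park on a sunny day",
            [((32, 240, 200, 480), "a golden retriever lying on the grass")],
            image_w=512, image_h=512))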

    K-LITE: Learning Transferable Visual Models with External Knowledge

    Recent state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This free form of supervision ensures high generality and usability of the learned visual models, based on extensive heuristics in data collection to cover as many visual concepts as possible. Alternatively, learning with external knowledge about images is a promising way to leverage a much more structured source of supervision. In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy for leveraging external knowledge to build transferable visual systems: in training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that can understand both visual concepts and their knowledge; in evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones), enabling zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Comment: Preprint. The first three authors contribute equally.
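
    As a concrete picture of the training-time augmentation, the Python sketch below appends a WordNet gloss to each class name before it is used in a text prompt; the prompt template and the fall-back behavior are illustrative assumptions, and the Wiktionary side of K-LITE is not shown.

        import nltk
        from nltk.corpus import wordnet

        nltk.download("wordnet", quiet=True)

        def enrich_with_wordnet(class_name):
            """Return 'name, gloss' using the first WordNet sense, or the bare name."""
            synsets = wordnet.synsets(class_name.replace(" ", "_"))
            if not synsets:
                return class_name
            return f"{class_name}, {synsets[0].definition()}"

        for label in ["tench", "goldfish", "container ship"]:
            print(f"a photo of a {enrich_with_wordnet(label)}")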

    Toward an intensive understanding of sewer sediment prokaryotic community assembly and function

    Prokaryotic communities play important roles in sewer sediment ecosystems, but the community composition, functional potential, and assembly mechanisms of sewer sediment prokaryotic communities are still poorly understood. Here, we studied the sediment prokaryotic communities in different urban functional areas (multifunctional, commercial, and residential areas) through 16S rRNA gene amplicon sequencing. Our results suggested that the compositions of prokaryotic communities varied significantly among functional areas. Desulfomicrobium, Desulfovibrio, and Desulfobacter, which are involved in the sulfur cycle, and some hydrolytic fermentation bacteria were enriched in the multifunctional area, while Methanospirillum and Methanoregulaceae, which are related to methane metabolism, were significantly discriminant taxa in the commercial area. Physicochemical properties were closely related to overall community changes (p < 0.001), especially the nutrient levels of the sediments (i.e., total nitrogen and total phosphorus) and sediment pH. Network analysis revealed that the prokaryotic community network of the residential area sediment was more complex than those of the other functional areas, suggesting higher stability of the prokaryotic community in the residential area. Stochastic processes dominated the assembly of the prokaryotic community. These results expand our understanding of the characteristics of prokaryotic communities in sewer sediment, providing a new perspective for studying sewer sediment prokaryotic community structure.

    A comprehensive AI model development framework for consistent Gleason grading

    Background: Artificial Intelligence (AI)-based solutions for Gleason grading hold promise for pathologists, but image quality inconsistency, continuous data integration needs, and limited generalizability hinder their adoption and scalability. Methods: We present a comprehensive digital pathology workflow for AI-assisted Gleason grading. It incorporates A!MagQC (image quality control), A!HistoClouds (cloud-based annotation), and Pathologist-AI Interaction (PAI) for continuous model improvement. Trained on Akoya-scanned images only, the model utilizes color augmentation and image appearance migration to address scanner variations. We evaluate it on Whole Slide Images (WSI) from five additional scanners and conduct validations with pathologists to assess AI efficacy and PAI. Results: Our model achieves an average F1 score of 0.80 on annotations and a Quadratic Weighted Kappa of 0.71 on WSIs for Akoya-scanned images. Applying our generalization solution increases the average F1 score for Gleason pattern detection from 0.73 to 0.88 on images from other scanners. The model accelerates Gleason scoring time by 43% while maintaining accuracy. Additionally, PAI improves annotation efficiency by 2.5 times and leads to further improvements in model performance. Conclusions: This pipeline represents a notable advancement in AI-assisted Gleason grading for improved consistency, accuracy, and efficiency. Unlike previous methods limited by scanner specificity, our model achieves outstanding performance across diverse scanners. This improvement paves the way for its seamless integration into clinical workflows.
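
    For reference, the slide-level agreement metric reported above can be computed as in the short Python sketch below; the grade-group labels are illustrative, not data from the study.

        from sklearn.metrics import cohen_kappa_score

        pathologist = [1, 2, 2, 3, 4, 5, 3, 1]   # reference Gleason grade groups per WSI
        model_preds = [1, 2, 3, 3, 4, 4, 3, 1]   # AI-predicted grade groups

        qwk = cohen_kappa_score(pathologist, model_preds, weights="quadratic")
        print(f"Quadratic Weighted Kappa: {qwk:.2f}")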