Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
Spatial control is a core capability in controllable image generation.
Advancements in layout-guided image generation have shown promising results on
in-distribution (ID) datasets with similar spatial configurations. However, it
is unclear how these models perform when facing out-of-distribution (OOD)
samples with arbitrary, unseen layouts. In this paper, we propose LayoutBench,
a diagnostic benchmark for layout-guided image generation that examines four
categories of spatial control skills: number, position, size, and shape. We
benchmark two recent representative layout-guided image generation methods and
observe that good ID layout control may not generalize well to arbitrary
layouts in the wild (e.g., objects at the boundary). Next, we propose
IterInpaint, a new baseline that generates foreground and background regions in
a step-by-step manner via inpainting, demonstrating stronger generalizability
than existing models on OOD layouts in LayoutBench. We perform quantitative and
qualitative evaluation and fine-grained analysis on the four LayoutBench skills
to pinpoint the weaknesses of existing models. Lastly, we show comprehensive
ablation studies on IterInpaint, including training task ratio, crop&paste vs.
repaint, and generation order. Project website: https://layoutbench.github.io
Comment: 22 pages; Project website: https://layoutbench.github.io
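The step-by-step foreground-then-background procedure can be pictured with a toy sequential painter. This is an illustrative sketch only: the list-based canvas and the `paint` callable are stand-ins for the paper's diffusion-based inpainting model, and all names are ours.

```python
def iter_inpaint(canvas, layout, paint):
    """Toy sketch of iterative layout-guided generation: fill each
    foreground region in turn, then paint every remaining cell as
    background. `paint` stands in for an inpainting-model call."""
    covered = set()
    for cells, label in layout:           # one foreground object per step
        for c in cells:
            canvas[c] = paint(label)
        covered |= set(cells)
    for c in range(len(canvas)):          # final background pass
        if c not in covered:
            canvas[c] = paint("background")
    return canvas
```

Generating one region at a time is what lets the model handle layouts it never saw as a whole during training.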
An Empirical Study of Multimodal Model Merging
Model merging (e.g., via interpolation or task arithmetic) fuses multiple
models trained on different tasks to generate a multi-task solution. The
technique has been proven successful in previous studies, where the models are
trained on similar tasks and with the same initialization. In this paper, we
expand on this concept to a multimodal setup by merging transformers trained on
different modalities. Furthermore, we conduct our study for a novel goal where
we can merge vision, language, and cross-modal transformers of a
modality-specific architecture to create a parameter-efficient
modality-agnostic architecture. Through comprehensive experiments, we
systematically investigate the key factors impacting model performance after
merging, including initialization, merging mechanisms, and model architectures.
Our analysis leads to an effective training recipe for matching the performance
of the modality-agnostic baseline (i.e. pre-trained from scratch) via model
merging. Our code is available at: https://github.com/ylsung/vl-mergin
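As a concrete illustration of the two merging mechanisms named above (interpolation and task arithmetic), here is a minimal sketch over plain parameter dictionaries; real usage would operate on framework tensors, and the function names are ours, not the paper's.

```python
def interpolate(theta_a, theta_b, alpha=0.5):
    """Element-wise weight interpolation between two fine-tuned models."""
    return {k: alpha * theta_a[k] + (1 - alpha) * theta_b[k] for k in theta_a}

def task_arithmetic(theta_base, theta_a, theta_b, lam=1.0):
    """Add the sum of task vectors (fine-tuned minus shared init) to the base."""
    return {
        k: theta_base[k]
        + lam * ((theta_a[k] - theta_base[k]) + (theta_b[k] - theta_base[k]))
        for k in theta_base
    }
```

Both mechanisms implicitly assume the models share an initialization, which is why initialization is one of the key factors the study investigates.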
Clinical characteristics of cystic encephalomalacia in children
Purpose: To investigate the primary causes and clinical characteristics of cystic encephalomalacia (CE) in children.
Methods: The clinical data of 50 children who were admitted to our hospital due to CE between January 2008 and December 2020 were retrospectively reviewed. Their primary causes, clinical manifestations and cranial magnetic resonance imaging features were analyzed.
Results: Among all patients, 5 had prematurity, 19 had hypoxic-ischemic encephalopathy (HIE), 13 had intracranial infection, 14 had traumatic brain injury and hemorrhage, 4 had cerebral infarction, 2 had congenital genetic diseases, and 1 had hypoglycemia. The average time from primary disease onset to CE diagnosis was 70.1 ± 61.0 days. The clinical manifestations included speech or motor developmental delay (n = 33), epilepsy (n = 31), dystonia (n = 27), limb paralysis (n = 16), and visual or auditory impairment (n = 5). Patients with HIE as the primary cause of CE had a significantly higher occurrence of dystonia, while a significantly higher incidence of paralysis was observed in those with cerebral infarction as the primary cause.
Conclusion: CE in children is mainly caused by HIE, intracranial infection, and cerebral hemorrhage. The major clinical manifestations included speech or motor developmental delay, epilepsy, and dystonia. Magnetic resonance imaging is an important tool for the diagnosis of CE.
Prompting GPT-3 To Be Reliable
Large language models (LLMs) show impressive abilities via few-shot
prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use
in real-world language applications. However, the crucial problem of how to
improve the reliability of GPT-3 is still under-explored. While reliability is
a broad and vaguely defined term, we decompose reliability into four main
facets that correspond to the existing framework of ML safety and are
well-recognized to be important: generalizability, social biases, calibration,
and factuality. Our core contribution is to establish simple and effective
prompts that improve GPT-3's reliability as it: 1) generalizes
out-of-distribution, 2) balances demographic distribution and uses natural
language instructions to reduce social biases, 3) calibrates output
probabilities, and 4) updates the LLM's factual knowledge and reasoning chains.
With appropriate prompts, GPT-3 is more reliable than smaller-scale supervised
models on all these facets. We release all processed datasets, evaluation
scripts, and model predictions. Our systematic empirical study not only sheds
new light on the reliability of prompting LLMs; more importantly, our
prompting strategies can help practitioners use LLMs like GPT-3 more reliably.
Comment: ICLR 2023
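One of the four facets, calibration, is commonly measured with expected calibration error (ECE). The following is a generic sketch of that metric, not code from the paper:

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Bin predictions by confidence and average |accuracy - confidence|
    per bin, weighted by bin size. Lower is better calibrated."""
    n = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        lo, hi = b / num_bins, (b + 1) / num_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - avg_conf)
    return ece
```

A model whose stated probabilities match its empirical accuracy scores zero on this metric.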
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Masked visual modeling (MVM) has been recently proven effective for visual
pre-training. While similar reconstructive objectives on video inputs (e.g.,
masked frame modeling) have been explored in video-language (VidL)
pre-training, previous studies fail to find a truly effective MVM strategy that
can largely benefit the downstream performance. In this work, we systematically
examine the potential of MVM in the context of VidL learning. Specifically, we
base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where
the supervision from MVM training can be backpropagated to the video pixel
space. In total, eight different reconstructive targets of MVM are explored,
from low-level pixel values and oriented gradients to high-level depth maps,
optical flow, discrete visual tokens, and latent visual features. We conduct
comprehensive experiments and provide insights into the factors leading to
effective MVM training, resulting in an enhanced model VIOLETv2. Empirically,
we show that VIOLETv2, pre-trained with the MVM objective, achieves notable
improvements on 13 VidL benchmarks, ranging from video question answering and
video captioning to text-to-video retrieval.
Comment: CVPR'23; the first two authors contributed equally; code is available at https://github.com/tsujuifu/pytorch_empirical-mv
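The masking step shared by all MVM variants can be sketched as random patch selection; the reconstruction target (pixels, gradients, depth, flow, tokens, or features) is left abstract, and the names and ratio below are illustrative, not the paper's.

```python
import random

def mask_patches(patches, mask_ratio=0.4, seed=0):
    """Split patches into visible and masked index sets; the model is
    then trained to reconstruct a chosen target for the masked patches
    from the visible ones."""
    rng = random.Random(seed)
    n_mask = int(len(patches) * mask_ratio)
    masked = sorted(rng.sample(range(len(patches)), n_mask))
    visible = [p for i, p in enumerate(patches) if i not in masked]
    return visible, masked
```

The choice of reconstruction target is exactly the design axis the eight explored variants differ on.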
ReCo: Region-Controlled Text-to-Image Generation
Recently, large-scale text-to-image (T2I) models have shown impressive
performance in generating high-fidelity images, but with limited
controllability, e.g., precisely specifying the content in a specific region
with a free-form text description. In this paper, we propose an effective
technique for such regional control in T2I generation. We augment T2I models'
inputs with an extra set of position tokens, which represent the quantized
spatial coordinates. Each region is specified by four position tokens to
represent the top-left and bottom-right corners, followed by an open-ended
natural language regional description. Then, we fine-tune a pre-trained T2I
model with this new input interface. Our model, dubbed ReCo
(Region-Controlled T2I), enables region control for arbitrary objects
described by open-ended regional texts rather than by object labels from a
constrained category set. Empirically, ReCo achieves better image quality than
the T2I model strengthened by positional words (FID: 8.82->7.36, SceneFID:
15.54->6.51 on COCO), together with objects being more accurately placed,
amounting to a 20.40% region classification accuracy improvement on COCO.
Furthermore, we demonstrate that ReCo can better control the object count,
spatial relationship, and region attributes such as color/size, with the
free-form regional description. Human evaluation on PaintSkill shows that ReCo
is +19.28% and +17.21% more accurate in generating images with correct object
count and spatial relationship than the T2I model
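The position-token encoding described above can be illustrated as follows; this is a hedged sketch in which the token format and bin count are assumptions, not ReCo's exact vocabulary.

```python
def box_to_position_tokens(box, num_bins=1000):
    """Quantize a normalized (x0, y0, x1, y1) box into four discrete
    position tokens for the top-left and bottom-right corners."""
    return [
        f"<pos_{min(int(v * num_bins), num_bins - 1)}>"  # clamp v == 1.0 into the last bin
        for v in box
    ]
```

A region's full input would then be these four tokens followed by its free-form text description, so the fine-tuned model reads layout and language through one token stream.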
K-LITE: Learning Transferable Visual Models with External Knowledge
Recent state-of-the-art computer vision systems are trained from natural
language supervision, ranging from simple object category names to descriptive
captions. This free form of supervision ensures high generality and usability
of the learned visual models, but relies on extensive data-collection heuristics
to cover as many visual concepts as possible. Alternatively, learning with
external knowledge about images is a promising approach that leverages a much
more structured source of supervision. In this paper, we propose K-LITE
(Knowledge-augmented Language-Image Training and Evaluation), a simple strategy
to leverage external knowledge to build transferable visual systems: In
training, it enriches entities in natural language with WordNet and Wiktionary
knowledge, leading to an efficient and scalable approach to learning image
representations that can understand both visual concepts and their knowledge;
In evaluation, the natural language is also augmented with external knowledge
and then used to reference learned visual concepts (or describe new ones) to
enable zero-shot and few-shot transfer of the pre-trained models. We study the
performance of K-LITE on two important computer vision problems, image
classification and object detection, benchmarking on 20 and 13 different
existing datasets, respectively. The proposed knowledge-augmented models show
significant improvement in transfer learning performance over existing methods.
Comment: Preprint. The first three authors contribute equally
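The training-time enrichment can be pictured as appending a dictionary gloss to each entity before encoding it. The lookup table below is a stand-in for the WordNet/Wiktionary queries, and all names are illustrative:

```python
# Illustrative stand-in for WordNet/Wiktionary definitions.
KNOWLEDGE = {
    "mallard": "a wild duck that is the ancestor of most domestic ducks",
}

def augment_with_knowledge(category, knowledge=KNOWLEDGE):
    """Enrich a category name with external knowledge, falling back to
    the plain prompt when no definition is found."""
    prompt = f"a photo of a {category}"
    definition = knowledge.get(category)
    return f"{prompt}, {definition}" if definition else prompt
```

Applying the same augmentation at evaluation time is what lets the model reference learned concepts, or describe unseen ones, for zero-shot and few-shot transfer.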
Toward an intensive understanding of sewer sediment prokaryotic community assembly and function
Prokaryotic communities play important roles in sewer sediment ecosystems, but the community composition, functional potential, and assembly mechanisms of sewer sediment prokaryotic communities are still poorly understood. Here, we studied the sediment prokaryotic communities in different urban functional areas (multifunctional, commercial, and residential areas) through 16S rRNA gene amplicon sequencing. Our results suggested that the compositions of prokaryotic communities varied significantly among functional areas. Desulfomicrobium, Desulfovibrio, and Desulfobacter, which are involved in the sulfur cycle, and some hydrolytic fermentation bacteria were enriched in the multifunctional area, while Methanospirillum and Methanoregulaceae, which are related to methane metabolism, were significantly discriminant taxa in the commercial area. Physicochemical properties were closely related to overall community changes (p < 0.001), especially the nutrient levels of sediments (i.e., total nitrogen and total phosphorus) and sediment pH. Network analysis revealed that the prokaryotic community network of the residential area sediment was more complex than those of the other functional areas, suggesting higher stability of the prokaryotic community in the residential area. Stochastic processes dominated the assembly of the prokaryotic community. These results expand our understanding of the characteristics of prokaryotic communities in sewer sediment, providing a new perspective for studying sewer sediment prokaryotic community structure.
A comprehensive AI model development framework for consistent Gleason grading
Background: Artificial Intelligence (AI)-based solutions for Gleason grading hold promise for pathologists, but image quality inconsistency, continuous data integration needs, and limited generalizability hinder their adoption and scalability.
Methods: We present a comprehensive digital pathology workflow for AI-assisted Gleason grading. It incorporates A!MagQC (image quality control), A!HistoClouds (cloud-based annotation), and Pathologist-AI Interaction (PAI) for continuous model improvement. Trained on Akoya-scanned images only, the model utilizes color augmentation and image appearance migration to address scanner variations. We evaluate it on Whole Slide Images (WSI) from another five scanners and conduct validations with pathologists to assess AI efficacy and PAI.
Results: Our model achieves an average F1 score of 0.80 on annotations and 0.71 Quadratic Weighted Kappa on WSIs for Akoya-scanned images. Applying our generalization solution increases the average F1 score for Gleason pattern detection from 0.73 to 0.88 on images from other scanners. The model accelerates Gleason scoring time by 43% while maintaining accuracy. Additionally, PAI improved annotation efficiency by 2.5 times and led to further improvements in model performance.
Conclusions: This pipeline represents a notable advancement in AI-assisted Gleason grading for improved consistency, accuracy, and efficiency. Unlike previous methods limited by scanner specificity, our model achieves outstanding performance across diverse scanners. This improvement paves the way for its seamless integration into clinical workflows.