Hierarchical Side-Tuning for Vision Transformers
Fine-tuning pre-trained Vision Transformers (ViT) has consistently
demonstrated promising performance in the realm of visual recognition. However,
adapting large pre-trained models to various tasks poses a significant
challenge. This challenge arises from the need for each model to undergo an
independent and comprehensive fine-tuning process, leading to substantial
computational and memory demands. While recent advancements in
Parameter-efficient Transfer Learning (PETL) have demonstrated their ability to
achieve superior performance compared to full fine-tuning with a smaller subset
of parameter updates, they tend to overlook dense prediction tasks such as
object detection and segmentation. In this paper, we introduce Hierarchical
Side-Tuning (HST), a novel PETL approach that enables ViT transfer to various
downstream tasks effectively. Diverging from existing methods that exclusively
fine-tune parameters within input spaces or certain modules connected to the
backbone, we tune a lightweight and hierarchical side network (HSN) that
leverages intermediate activations extracted from the backbone and generates
multi-scale features to make predictions. To validate HST, we conducted
extensive experiments encompassing diverse visual tasks, including
classification, object detection, instance segmentation, and semantic
segmentation. Notably, our method achieves state-of-the-art average Top-1
accuracy of 76.0% on VTAB-1k, all while fine-tuning a mere 0.78M parameters.
When applied to object detection on the COCO test-dev benchmark, HST even
surpasses full fine-tuning, achieving 49.7 box AP and 43.2 mask AP with
Cascade Mask R-CNN.
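To make the side-tuning idea concrete, below is a minimal sketch in PyTorch,
assuming a frozen ViT-style backbone whose intermediate activations feed a
small trainable side network that emits multi-scale feature maps; the module
names, tap layers, and sizes are illustrative assumptions, not the HST
architecture itself.

    # Minimal side-tuning sketch: the backbone is frozen; only the small
    # hierarchical side network is trained. All sizes are illustrative.
    import torch
    import torch.nn as nn

    class FrozenViTBackbone(nn.Module):
        def __init__(self, dim=384, depth=12, heads=6):
            super().__init__()
            self.patch_embed = nn.Linear(768, dim)   # stand-in for patch embedding
            self.blocks = nn.ModuleList([
                nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
                for _ in range(depth)
            ])
            for p in self.parameters():              # backbone stays frozen
                p.requires_grad_(False)

        def forward(self, patches):
            x, taps = self.patch_embed(patches), []
            for i, blk in enumerate(self.blocks):
                x = blk(x)
                if i in (3, 7, 11):                  # intermediate activations for the side network
                    taps.append(x)
            return taps

    class HierarchicalSideNetwork(nn.Module):
        """Lightweight trainable side network producing multi-scale features."""
        def __init__(self, dim=384, side_dim=64, grid=14):
            super().__init__()
            self.grid = grid
            self.reduce = nn.ModuleList([nn.Linear(dim, side_dim) for _ in range(3)])
            self.down = nn.ModuleList([
                nn.Conv2d(side_dim, side_dim, 3, stride=2, padding=1)
                for _ in range(3)
            ])

        def forward(self, taps):
            x, feats = None, []
            for tap, proj, conv in zip(taps, self.reduce, self.down):
                t = proj(tap).transpose(1, 2)
                t = t.reshape(tap.size(0), -1, self.grid, self.grid)
                if x is None:
                    x = t
                else:                                # fuse the new tap at the current scale
                    x = x + nn.functional.adaptive_avg_pool2d(t, x.shape[-2:])
                feats.append(x)
                x = conv(x)                          # downsample for the next, coarser stage
            return feats                             # multi-scale maps for a task head

    backbone, side = FrozenViTBackbone(), HierarchicalSideNetwork()
    patches = torch.randn(2, 196, 768)               # dummy patchified input, 14x14 grid
    print([f.shape for f in side(backbone(patches))])

Only the side network (plus a task head) would be optimized, which is what
keeps the number of tunable parameters small.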
Inferring the Direction of Introgression Using Genomic Sequence Data
Genomic data are informative about the history of species divergence and interspecific gene flow, including the direction, timing, and strength of gene flow. However, gene flow in opposite directions generates similar patterns in multilocus sequence data, such as reduced sequence divergence between the hybridizing species. As a result, inference of the direction of gene flow is challenging. Here we investigate the information about the direction of gene flow present in genomic sequence data using likelihood-based methods under the multispecies-coalescent-with-introgression model. We analyze the case of two species, and use simulation to examine cases with three or four species. We find that it is easier to infer gene flow from a small population to a large one than in the opposite direction, and easier to infer inflow (gene flow from an outgroup species to an ingroup species) than outflow (gene flow from an ingroup species to an outgroup species). It is also easier to infer gene flow if there is a longer time of separate evolution between the initial divergence and subsequent introgression. When introgression is assumed to occur in the wrong direction, the time of introgression tends to be correctly estimated and the Bayesian test of gene flow is often significant, while estimates of introgression probability can be even greater than the true probability. We analyze genomic sequences from Heliconius butterflies to demonstrate that typical genomic datasets are informative about the direction of interspecific gene flow, as well as its timing and strength.
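The core difficulty, that opposite directions of gene flow produce similar
reductions in between-species divergence, can be illustrated with a small
pairwise coalescent simulation; this is only a hedged illustration with assumed
parameter values, not the likelihood machinery used in the paper.

    # Monte Carlo illustration (not the paper's likelihood method): mean A-B
    # sequence divergence under introgression in either direction, compared
    # with no gene flow. Times and thetas are in expected substitutions per
    # site; all values are assumptions chosen for illustration.
    import random

    def pair_coalescent_time(tau_div, tau_intro, phi, theta_A, theta_B,
                             theta_anc, direction):
        """Coalescent time of one A sequence and one B sequence with a single
        introgression event at tau_intro occurring with probability phi."""
        # Tracing backwards in time, one lineage may switch populations at tau_intro.
        if direction == "B->A":      # gene flow into A: the A lineage may trace back into B
            shared_theta = theta_B if random.random() < phi else None
        else:                        # "A->B": the B lineage may trace back into A
            shared_theta = theta_A if random.random() < phi else None
        if shared_theta is not None:
            wait = random.expovariate(2.0 / shared_theta)   # pair coalesces at rate 2/theta
            if tau_intro + wait < tau_div:
                return tau_intro + wait
        # Otherwise the pair can only coalesce in the ancestral population.
        return tau_div + random.expovariate(2.0 / theta_anc)

    def mean_divergence(direction, n=100_000):
        # Pairwise divergence is about twice the coalescent time under a clock.
        total = sum(2 * pair_coalescent_time(0.01, 0.002, 0.3, 0.002, 0.02,
                                             0.01, direction) for _ in range(n))
        return total / n

    random.seed(1)
    print("no gene flow (analytic):", 2 * (0.01 + 0.01 / 2))
    print("introgression B -> A   :", round(mean_divergence("B->A"), 5))
    print("introgression A -> B   :", round(mean_divergence("A->B"), 5))

Both directions reduce the expected divergence below the no-gene-flow value,
which is why multilocus patterns alone often leave the direction of gene flow
weakly identified.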
BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image Synthesis
Recently, diffusion-based deep generative models (e.g., Stable Diffusion)
have shown impressive results in text-to-image synthesis. However, current
text-to-image models often require multiple passes of prompt engineering by
humans in order to produce satisfactory results for real-world applications. We
propose BeautifulPrompt, a deep generative model to produce high-quality
prompts from very simple raw descriptions, which enables diffusion-based models
to generate more beautiful images. In our work, we first fine-tuned the
BeautifulPrompt model on collected pairs of low-quality and high-quality
prompts. Then, to ensure that our generated prompts lead to more beautiful
images, we further propose a Reinforcement Learning with Visual AI Feedback
technique to fine-tune our model to maximize the reward values of the generated
prompts, where the reward values are calculated based on the PickScore and the
Aesthetic Scores. Our results demonstrate that learning from visual AI feedback
significantly improves the quality of the generated prompts and images. We
further showcase the integration of BeautifulPrompt into a
cloud-native AI platform to provide better text-to-image generation service in
the cloud.
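As a rough illustration of the reinforcement-learning step, the sketch below
combines a preference score and an aesthetic score into a scalar reward and
applies a REINFORCE-style update to a causal-LM prompt generator. The functions
generate_image, pick_score, and aesthetic_score are hypothetical stand-ins, and
the actual BeautifulPrompt training recipe may differ.

    # REINFORCE-style sketch of fine-tuning a prompt generator with visual AI
    # feedback. The three stub functions are hypothetical stand-ins for a
    # frozen diffusion model, a PickScore preference model, and an aesthetic
    # predictor; this is not the paper's exact algorithm.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def generate_image(prompt_text):        # would call a text-to-image model
        return None

    def pick_score(prompt_text, image):     # would call a PickScore-style scorer
        return 0.5

    def aesthetic_score(image):             # would call an aesthetic predictor
        return 0.5

    def reward(image, prompt_text, w_pick=0.5, w_aes=0.5):
        # Reward combines preference and aesthetic signals into one scalar.
        return w_pick * pick_score(prompt_text, image) + w_aes * aesthetic_score(image)

    def reinforce_step(policy, tokenizer, raw_description, optimizer):
        # 1) Sample a beautified prompt from the current policy.
        inputs = tokenizer(raw_description, return_tensors="pt")
        sampled = policy.generate(**inputs, do_sample=True, max_new_tokens=64)
        prompt_text = tokenizer.decode(sampled[0], skip_special_tokens=True)
        # 2) Render an image with the frozen generator and score it.
        with torch.no_grad():
            r = reward(generate_image(prompt_text), prompt_text)
        # 3) Scale the log-likelihood of the sampled tokens by the reward.
        logits = policy(sampled).logits[:, :-1, :]
        targets = sampled[:, 1:]
        logp = torch.log_softmax(logits, dim=-1)
        logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        loss = -(r * logp.sum())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return prompt_text, r

    policy = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in base LM
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)
    print(reinforce_step(policy, tokenizer, "a cat on a beach", optimizer))

In practice a reward baseline and a KL penalty toward the supervised model
would typically be added to stabilize such training.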
DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion
Self-attention-based vision transformers (ViTs) have emerged as a highly
competitive architecture in computer vision. Unlike convolutional neural
networks (CNNs), ViTs are capable of global information sharing. With the
development of various structures of ViTs, ViTs are increasingly advantageous
for many vision tasks. However, the quadratic complexity of self-attention
renders ViTs computationally intensive, and their lack of inductive biases of
locality and translation equivariance demands larger model sizes compared to
CNNs to effectively learn visual features. In this paper, we propose a
light-weight and efficient vision transformer model called DualToken-ViT that
leverages the advantages of CNNs and ViTs. DualToken-ViT effectively fuses
tokens carrying local information, obtained by a convolution-based structure,
with tokens carrying global information, obtained by a self-attention-based
structure, to achieve an efficient attention structure. In addition, we use
position-aware global tokens throughout all stages to enrich the global
information, which further strengthens the effectiveness of DualToken-ViT.
Position-aware global tokens
also contain the position information of the image, which makes our model
better for vision tasks. We conducted extensive experiments on image
classification, object detection and semantic segmentation tasks to demonstrate
the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of
different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G
FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T using
global tokens by 0.7%.
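A minimal sketch of the dual-token idea follows: local tokens from a
convolutional branch are fused with global information aggregated into a small
set of learnable, position-aware global tokens through attention. The layer
sizes and the exact fusion rule are assumptions for illustration, not the
DualToken-ViT design.

    # Dual-token fusion sketch: a depthwise-conv local branch plus a global
    # branch built around learnable position-aware global tokens. Dimensions
    # and the fusion rule are illustrative assumptions.
    import torch
    import torch.nn as nn

    class DualTokenBlock(nn.Module):
        def __init__(self, dim=64, num_global=8, heads=4, grid=14):
            super().__init__()
            # Position-aware global tokens: learnable content plus a learnable
            # position embedding.
            self.global_tokens = nn.Parameter(0.02 * torch.randn(1, num_global, dim))
            self.global_pos = nn.Parameter(0.02 * torch.randn(1, num_global, dim))
            # Local branch: depthwise convolution over the spatial token map.
            self.local = nn.Sequential(
                nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
                nn.BatchNorm2d(dim),
                nn.GELU(),
            )
            # Global branch: gather into global tokens, then scatter back.
            self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.scatter = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.grid = grid

        def forward(self, x):                    # x: (B, N, C) image tokens
            B, N, C = x.shape
            xmap = x.transpose(1, 2).reshape(B, C, self.grid, self.grid)
            local = self.local(xmap).flatten(2).transpose(1, 2)
            g = self.global_tokens.expand(B, -1, -1) + self.global_pos
            g, _ = self.gather(g, x, x)          # global tokens query the image tokens
            glob, _ = self.scatter(x, g, g)      # image tokens read the global tokens back
            return x + local + glob              # fuse local and global token streams

    block = DualTokenBlock()
    tokens = torch.randn(2, 14 * 14, 64)          # dummy image tokens on a 14x14 grid
    print(block(tokens).shape)                    # torch.Size([2, 196, 64])

Keeping the number of global tokens small is what keeps the global branch cheap
relative to full self-attention over all image tokens.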