GenText: Unsupervised Artistic Text Generation via Decoupled Font and Texture Manipulation
Automatic artistic text generation is an emerging topic that has received
increasing attention due to its wide range of applications. Artistic text can be
divided into three components: content, font, and texture.
Existing artistic text generation models usually focus on manipulating only one
of these components, which is sub-optimal for controllable general artistic
text generation. To remedy this issue, we propose a novel approach, GenText,
which achieves general artistic text style transfer by separately migrating the
font and texture styles from different source images to the target image in an
unsupervised manner. Specifically, our framework incorporates three stages,
stylization, destylization, and font transfer, into a unified platform with a
single shared encoder network and two separate style generator networks: one
for font transfer, the other for stylization and destylization. The
destylization stage first extracts the font style from the font reference
image, and the font transfer stage then generates the target content with the
desired font style. Finally, the stylization stage renders the resulting font
image with the texture style of the reference image. Moreover, considering the
difficulty of acquiring paired artistic text images, our model is designed for
the unsupervised setting, where all stages can be effectively optimized from
unpaired data. Qualitative and quantitative evaluations on artistic text
benchmarks demonstrate the superior performance of the proposed model. The code
and models will be made publicly available.
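To make the three-stage flow above concrete, the following is a minimal PyTorch sketch of the control flow only: a shared encoder feeds two style generators, with destylization, font transfer, and stylization chained in sequence. The module definitions, feature sizes, and the zero "plain" style code are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the destylization -> font transfer -> stylization flow.
# All shapes, layers, and the zero "plain" style code are illustrative assumptions.

class Encoder(nn.Module):
    """Shared encoder mapping an image to a compact style/content code."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.net(x).flatten(1)                 # (B, dim)

class Generator(nn.Module):
    """Style generator decoding an image from a content code and a style code."""
    def __init__(self, dim=64, out_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (dim, 8, 8)),
            nn.Upsample(scale_factor=out_size // 8, mode="nearest"),
            nn.Conv2d(dim, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, content_code, style_code):
        return self.net(torch.cat([content_code, style_code], dim=1))

encoder = Encoder()
font_gen = Generator()   # font transfer generator
tex_gen = Generator()    # stylization / destylization generator

def gentext_transfer(content_img, font_ref, texture_ref):
    # 1) Destylization: re-render the font reference with a "plain" style code
    #    to expose its font shape, then read off the font style.
    plain_style = torch.zeros_like(encoder(font_ref))
    font_style = encoder(tex_gen(encoder(font_ref), plain_style))
    # 2) Font transfer: render the target content in the desired font.
    font_img = font_gen(encoder(content_img), font_style)
    # 3) Stylization: paint the font image with the reference texture style.
    return tex_gen(encoder(font_img), encoder(texture_ref))

out = gentext_transfer(torch.rand(1, 3, 64, 64),
                       torch.rand(1, 3, 64, 64),
                       torch.rand(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 3, 64, 64])
```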
MOSAIC: Multi-Object Segmented Arbitrary Stylization Using CLIP
Style transfer driven by text prompts has opened a new path for creatively
stylizing images without collecting an actual style image. Despite promising
results, text-driven stylization gives the user no control over how the
stylization is applied. A user who wants to create an artistic image requires
fine control over the stylization of the various entities in the content image
individually, which is not addressed by current state-of-the-art approaches.
Diffusion-based style transfer methods suffer from the same issue, because
regional control over the stylized output is ineffective. To address this
problem, we propose a new method, Multi-Object Segmented Arbitrary Stylization
Using CLIP (MOSAIC), that can apply styles to different objects in the image
based on the context extracted from the input prompt. Text-based segmentation
and stylization modules, both built on vision transformer architectures, are
used to segment and stylize the objects. Our method extends to arbitrary
objects and styles and produces higher-quality images than current
state-of-the-art methods. To our knowledge, this is the first attempt to
perform text-guided, arbitrary object-wise stylization. We demonstrate the
effectiveness of our approach through qualitative and quantitative analysis,
showing that it can generate visually appealing stylized images with enhanced
control over stylization and the ability to generalize to unseen object classes.
Comment: Camera ready, New Ideas in Vision Transformers workshop, ICCV 202
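Conceptually, the object-wise control described above reduces to producing a mask for each object named in the prompt, stylizing the frame for that object's style phrase, and compositing the result only inside the mask. Below is a minimal sketch of that masking-and-compositing step; the segment and stylize functions are toy placeholders standing in for the paper's transformer-based segmentation and CLIP-guided stylization modules.

```python
import torch

# Minimal sketch of per-object, text-guided stylization by mask compositing.
# `segment` and `stylize` are toy placeholders, not the modules used in MOSAIC.

def segment(image, object_name):
    """Placeholder segmenter: returns a soft mask (1, 1, H, W) in [0, 1]."""
    _, _, h, w = image.shape
    mask = torch.zeros(1, 1, h, w)
    mask[..., : h // 2, :] = 1.0 if object_name == "sky" else 0.0
    mask[..., h // 2 :, :] = 1.0 if object_name != "sky" else 0.0
    return mask

def stylize(image, style_text):
    """Placeholder stylizer: a deterministic color shift keyed on the prompt."""
    shift = (hash(style_text) % 100) / 500.0
    return (image + shift).clamp(0.0, 1.0)

def mosaic_like_stylization(image, object_styles):
    """Composite a separately stylized copy of the image into each object's mask."""
    output = image.clone()
    for obj_name, style_text in object_styles.items():
        mask = segment(image, obj_name)                   # where this object is
        styled = stylize(image, style_text)               # frame stylized for it
        output = mask * styled + (1.0 - mask) * output    # blend only inside mask
    return output

img = torch.rand(1, 3, 256, 256)
result = mosaic_like_stylization(
    img, {"sky": "van Gogh's Starry Night", "dog": "a charcoal sketch"})
print(result.shape)  # torch.Size([1, 3, 256, 256])
```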
Text2Scene: Text-driven Indoor Scene Stylization with Part-aware Details
We propose Text2Scene, a method to automatically create realistic textures
for virtual scenes composed of multiple objects. Guided by a reference image
and text descriptions, our pipeline adds detailed texture on labeled 3D
geometries in the room such that the generated colors respect the hierarchical
structure or semantic parts that are often composed of similar materials.
Instead of applying flat stylization to the entire scene in a single step, we
obtain weak semantic cues from geometric segmentation, which are further
clarified by assigning initial colors to the segmented parts. We then add
texture details to individual objects such that their projections onto image
space exhibit feature embeddings aligned with the embedding of the input. This
decomposition keeps the entire pipeline tractable with a moderate amount of
computational resources and memory. As our framework utilizes the existing
resources of image and text embedding, it does not require dedicated datasets
with high-quality textures designed by skillful artists. To the best of our
knowledge, it is the first practical and scalable approach that can create
detailed and realistic textures of the desired style that maintain structural
context for scenes with multiple objects.
Comment: Accepted to CVPR 202
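The per-object step described above can be read as an optimization that nudges each part's texture, starting from its assigned initial color, until the embedding of its rendered projections aligns with the embedding of the input. The sketch below illustrates that loop under strong simplifications: the renderer and the image encoder are toy placeholders, not the models used in the paper.

```python
import torch
import torch.nn.functional as F

# Sketch of per-object texture optimization against a target embedding.
# `embed_image` and `render_views` are stand-ins for a CLIP-style image
# encoder and a differentiable renderer; both are illustrative assumptions.

def embed_image(images):
    """Placeholder image encoder: globally pooled pixel values as 'features'."""
    return F.normalize(images.flatten(2).mean(dim=2), dim=-1)      # (B, 3)

def render_views(texture, n_views=4):
    """Placeholder renderer: each 'view' is the texture plus a little noise."""
    views = texture.unsqueeze(0).repeat(n_views, 1, 1, 1)
    return views + 0.01 * torch.randn(n_views, *texture.shape)

def optimize_object_texture(target_embedding, init_color, steps=100, lr=0.05):
    """Start from the part's initial color and adjust its texture so that the
    rendered views' embeddings align with the target embedding."""
    texture = init_color.clone().requires_grad_(True)               # (3, H, W)
    opt = torch.optim.Adam([texture], lr=lr)
    for _ in range(steps):
        views = render_views(texture)                               # (V, 3, H, W)
        sim = F.cosine_similarity(embed_image(views), target_embedding, dim=-1)
        loss = -sim.mean()                                          # maximize alignment
        opt.zero_grad()
        loss.backward()
        opt.step()
    return texture.detach()

target = F.normalize(torch.tensor([[0.9, 0.2, 0.1]]), dim=-1)       # toy "style" embedding
init = torch.full((3, 64, 64), 0.5)                                 # initial flat color
tex = optimize_object_texture(target, init)
print(tex.shape)  # torch.Size([3, 64, 64])
```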
Visual Representation Learning with Limited Supervision
The quality of a computer vision system is proportional to the rigor of the data representation it is built upon. Learning expressive representations of images is therefore the centerpiece of almost every computer vision application, including image search, object detection and classification, human re-identification, object tracking, pose understanding, image-to-image translation, and embodied agent navigation, to name a few. Deep neural networks are the most common modern method of representation learning. Their limitation, however, is that deep representation learning methods require extremely large amounts of manually labeled data for training. Annotating vast amounts of images for various environments is infeasible due to cost and time constraints, and this requirement for labeled data is a prime restriction on the pace of development of visual recognition systems.
In order to cope with the exponentially growing amounts of visual data generated daily, machine learning algorithms have to at least strive to scale at a similar rate.
The second challenge is that the learned representations must generalize to novel objects, classes, environments, and tasks in order to accommodate the diversity of the visual world.
Despite the ever-growing number of recent publications tangentially addressing the topic of learning generalizable representations, efficient generalization is yet to be achieved. This dissertation attempts to tackle the problem of learning visual representations that can generalize to novel settings while requiring few labeled examples.
In this research, we study the limitations of existing supervised representation learning approaches and propose a framework that improves the generalization of learned features by exploiting visual similarities between images that are not captured by the provided manual annotations. Furthermore, to mitigate the common requirement for large-scale manually annotated datasets, we propose several approaches that learn expressive representations without human-attributed labels, in a self-supervised fashion, by grouping highly similar samples into surrogate classes based on progressively learned representations.
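The surrogate-class idea in the last paragraph can be illustrated in a few lines: group mutually similar samples under the current features, treat each group as a pseudo-class, and train the network on those pseudo-labels before regrouping with the improved features. The sketch below shows a single round of that loop; the use of k-means, the network sizes, and the training schedule are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# One round of surrogate-class self-supervision: cluster the current features
# into pseudo-classes, then train the network to predict them. The network,
# k-means clustering, and the single-round loop are illustrative assumptions.

def surrogate_class_round(encoder, classifier, images, n_classes=10, epochs=5):
    # 1) Group highly similar samples under the *current* representation.
    with torch.no_grad():
        feats = encoder(images).cpu().numpy()
    pseudo_labels = torch.as_tensor(
        KMeans(n_clusters=n_classes, n_init=10).fit_predict(feats),
        dtype=torch.long)

    # 2) Treat each group as a surrogate class and train as if supervised.
    params = list(encoder.parameters()) + list(classifier.parameters())
    opt = torch.optim.SGD(params, lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits = classifier(encoder(images))
        loss = loss_fn(logits, pseudo_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pseudo_labels          # regroup with the improved features in the next round

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
classifier = nn.Linear(128, 10)
images = torch.rand(256, 3, 32, 32)           # unlabeled images
pseudo = surrogate_class_round(encoder, classifier, images)
```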
The development of computer vision as a science is predicated on the ability of a machine to record and disentangle image attributes that were long thought to be perceivable only by humans. We therefore devote particular attention to analyzing the means of artistic expression and style, a more complex task than merely breaking an image down into colors and pixels. The ultimate test of this ability is style transfer, which involves altering the style of an image while keeping its content. An effective solution to style transfer requires learning an image representation that allows disentangling the style of an image from its content.
Moreover, particular artistic styles come with idiosyncrasies that affect which content details should be preserved and which discarded.
Another pitfall here is that it is impossible to get pixel-wise annotations of style and how the style should be altered.
We address this problem by proposing an unsupervised approach that encodes the image content in the particular way required by a given style.
The proposed approach exchanges the style of an input image by first extracting the content representation in a style-aware way and then rendering it in a new style using a style-specific decoder network, achieving compelling results in image and video stylization.
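Schematically, this corresponds to a content encoder whose output is conditioned on the target style, followed by a decoder trained specifically for that style. The sketch below shows that forward pass only; the layer choices and the per-style gating are illustrative assumptions, and the training needed to obtain such models is omitted.

```python
import torch
import torch.nn as nn

# Schematic of style-aware content extraction followed by a style-specific
# decoder. Layer sizes and the per-style decoder list are assumptions.

class StyleAwareContentEncoder(nn.Module):
    """Extracts content features conditioned on the target style, so details
    irrelevant to that style can already be discarded here."""
    def __init__(self, n_styles, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, 3, padding=1)
        self.style_gain = nn.Embedding(n_styles, dim)   # per-style channel gating

    def forward(self, image, style_id):
        feats = torch.relu(self.conv(image))
        gain = self.style_gain(style_id).view(-1, feats.shape[1], 1, 1)
        return feats * torch.sigmoid(gain)

class StyleDecoder(nn.Module):
    """One decoder per target style renders the content features as an image."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(dim, 3, 3, padding=1), nn.Tanh())

    def forward(self, feats):
        return self.net(feats)

n_styles = 3
encoder = StyleAwareContentEncoder(n_styles)
decoders = nn.ModuleList([StyleDecoder() for _ in range(n_styles)])

def transfer(image, target_style_id):
    style = torch.tensor([target_style_id])
    content = encoder(image, style)             # style-aware content representation
    return decoders[target_style_id](content)   # render it in the chosen style

out = transfer(torch.rand(1, 3, 128, 128), target_style_id=1)
print(out.shape)  # torch.Size([1, 3, 128, 128])
```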
Finally, we combine supervised and self-supervised representation learning techniques for the task of human and animal pose understanding. The proposed method enables transfer of the representation learned for human pose recognition to proximal mammal species without using labeled animal images. This approach is not limited to dense pose estimation and could potentially enable autonomous agents, from robots to self-driving cars, to retrain themselves and adapt to novel environments based on learning from previous experiences.
Dual Stage Stylization Modulation for Domain Generalized Semantic Segmentation
Obtaining sufficient labeled data for training deep models is often
challenging in real-life applications. To address this issue, we propose a
novel solution for single-source domain generalized semantic segmentation.
Recent approaches have explored data diversity enhancement using hallucination
techniques. However, excessive hallucination can degrade performance,
particularly for imbalanced datasets. As shown in our experiments, minority
classes are more susceptible than majority classes to the performance
degradation caused by hallucination. To tackle this challenge, we introduce a
dual-stage Feature Transform (dFT) layer within the Adversarial Semantic
Hallucination+ (ASH+) framework. The ASH+ framework performs a dual-stage
manipulation of hallucination strength. By leveraging semantic information for
each pixel, our approach adaptively adjusts the pixel-wise hallucination
strength, thus providing fine-grained control over hallucination. We validate
the effectiveness of our proposed method through comprehensive experiments on
publicly available semantic segmentation benchmark datasets (Cityscapes and
SYNTHIA). Quantitative and qualitative comparisons demonstrate that our
approach is competitive with state-of-the-art methods for the Cityscapes
dataset and surpasses existing solutions for the SYNTHIA dataset. Code for our
framework will be made readily available to the research community.
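The pixel-wise control described in the abstract can be pictured as scaling a hallucination (style perturbation) per pixel by a weight looked up from the semantic label map, so that vulnerable minority classes receive a gentler perturbation. The snippet below sketches only that idea; the noise-based perturbation, the class weights, and the single transform stage are assumptions and do not reproduce the proposed dFT layer.

```python
import torch

# Sketch: modulate hallucination strength per pixel using the semantic label map.
# Minority classes receive a smaller perturbation so they are not washed out.

def pixelwise_hallucination(features, label_map, class_strength):
    """features:       (B, C, H, W) source-domain features
    label_map:       (B, H, W) integer semantic labels (e.g. Cityscapes ids)
    class_strength:  (num_classes,) hallucination strength per class in [0, 1]
    """
    strength = class_strength[label_map].unsqueeze(1)        # (B, 1, H, W)
    hallucination = torch.randn_like(features)               # stand-in style perturbation
    return features + strength * hallucination               # fine-grained, per-pixel control

num_classes = 19                                             # Cityscapes-style label space
class_strength = torch.full((num_classes,), 0.5)
class_strength[[6, 7, 17]] = 0.1     # hypothetical minority classes: hallucinate less

feats = torch.rand(2, 64, 128, 256)
labels = torch.randint(0, num_classes, (2, 128, 256))
out = pixelwise_hallucination(feats, labels, class_strength)
print(out.shape)  # torch.Size([2, 64, 128, 256])
```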