LSTM Pose Machines
We observed that recent state-of-the-art results on single image human pose
estimation were achieved by multi-stage Convolution Neural Networks (CNN).
Notwithstanding their superior performance on static images, applying these
models to videos is not only computationally intensive but also suffers from
performance degradation and flickering. Such suboptimal results are mainly
attributed to the inability to impose sequential geometric consistency, to
handle severe image quality degradation (e.g. motion blur and occlusion), and
to capture the temporal correlation among video frames.
In this paper, we proposed a novel recurrent network to tackle these problems.
We showed that if we impose a weight-sharing scheme on the multi-stage CNN, it
can be rewritten as a Recurrent Neural Network (RNN).
This property decouples the dependency among the multiple network stages and
results in a significant speed-up when invoking the network on videos. It
also enables the adoption of Long Short-Term Memory (LSTM) units between video
frames. We found that such a memory-augmented RNN is very effective in imposing
geometric consistency among frames. It also handles input quality degradation
in videos well while stabilizing the sequential outputs. The
experiments showed that our approach significantly outperformed current
state-of-the-art methods on two large-scale video pose estimation benchmarks.
We also explored the memory cells inside the LSTM and provided insights on why
such a mechanism would benefit the prediction for video-based pose estimation.
Comment: Poster in IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 201
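For concreteness, below is a minimal PyTorch sketch of the core idea above: a
single weight-shared refinement stage applied recurrently over video frames,
with a convolutional LSTM carrying memory between frames. It is not the
authors' implementation; all module names, channel sizes, and the 14-joint
output are illustrative assumptions.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell used to propagate memory across video frames."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class RecurrentPoseNet(nn.Module):
    """One weight-shared stage unrolled over time, i.e. a multi-stage CNN as an RNN."""
    def __init__(self, feat_ch=32, hid_ch=32, n_joints=14):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU())
        self.cell = ConvLSTMCell(feat_ch + n_joints, hid_ch)  # shared refinement stage
        self.head = nn.Conv2d(hid_ch, n_joints, 1)            # joint heatmap prediction

    def forward(self, frames):
        # frames: (T, B, 3, H, W); heatmaps are refined frame by frame.
        T, B, _, H, W = frames.shape
        h = frames.new_zeros(B, self.cell.hid_ch, H, W)
        c = torch.zeros_like(h)
        belief = frames.new_zeros(B, self.head.out_channels, H, W)
        outputs = []
        for t in range(T):
            feat = self.encoder(frames[t])
            h, c = self.cell(torch.cat([feat, belief], dim=1), h, c)
            belief = self.head(h)
            outputs.append(belief)
        return torch.stack(outputs)  # (T, B, n_joints, H, W)

heatmaps = RecurrentPoseNet()(torch.randn(4, 2, 3, 64, 64))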
RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs
Blind face restoration aims at recovering high-quality face images from those
with unknown degradations. Current algorithms mainly introduce priors to
complement high-quality details and achieve impressive progress. However, most
of these algorithms ignore abundant contextual information in the face and its
interplay with the priors, leading to sub-optimal performance. Moreover, they
pay little attention to the gap between synthetic and real-world scenarios,
limiting their robustness and generalization in real-world applications. In this
work, we propose RestoreFormer++, which on the one hand introduces
fully-spatial attention mechanisms to model the contextual information and the
interplay with the priors, and on the other hand, explores an extending
degrading model to help generate more realistic degraded face images to
alleviate the synthetic-to-real-world gap. Compared with current algorithms,
RestoreFormer++ has several crucial benefits. First, instead of using a
multi-head self-attention mechanism like the traditional visual transformer, we
introduce multi-head cross-attention over multi-scale features to fully explore
spatial interactions between corrupted information and high-quality priors. In
this way, RestoreFormer++ can restore face images with higher
realness and fidelity. Second, in contrast to the recognition-oriented
dictionary, we learn a reconstruction-oriented dictionary as priors, which
contains more diverse high-quality facial details and better accords with the
restoration target. Third, we introduce an extending degrading model that
contains more realistic degraded scenarios for training data synthesis, which
helps to enhance the robustness and generalization of our RestoreFormer++
model. Extensive experiments show that RestoreFormer++ outperforms
state-of-the-art algorithms on both synthetic and real-world datasets.
Comment: Submitted to TPAMI. An extension of RestoreFormer
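For illustration, the following is a minimal PyTorch sketch of the kind of
multi-head cross-attention described above, with degraded-face features as
queries and dictionary priors as keys and values. It is not the released
RestoreFormer++ code; the shapes, dimensions, and residual fusion are
assumptions.

import torch
import torch.nn as nn

class PriorCrossAttention(nn.Module):
    """Degraded-face features attend to high-quality dictionary priors."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, degraded_feat, prior_feat):
        # degraded_feat: (B, H*W, C) flattened spatial features of the corrupted face
        # prior_feat:    (B, N, C)   high-quality dictionary entries used as priors
        fused, _ = self.attn(query=degraded_feat, key=prior_feat, value=prior_feat)
        return self.norm(degraded_feat + fused)  # residual fusion of image and priors

B, HW, N, C = 2, 16 * 16, 512, 256
out = PriorCrossAttention()(torch.randn(B, HW, C), torch.randn(B, N, C))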
StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation
This paper presents a LoRA-free method for stylized image generation that
takes a text prompt and style reference images as inputs and produces an output
image in a single pass. Unlike existing methods that rely on training a
separate LoRA for each style, our method can adapt to various styles with a
unified model. However, this poses two challenges: 1) the prompt loses
controllability over the generated content, and 2) the output image inherits
both the semantic and style features of the style reference image, compromising
its content fidelity. To address these challenges, we introduce StyleAdapter, a
model that comprises two components: a two-path cross-attention module (TPCA)
and three decoupling strategies. These components enable our model to process
the prompt and style reference features separately and reduce the strong
coupling between the semantic and style information in the style references.
StyleAdapter can generate high-quality images that match the content of the
prompts and adopt the style of the references (even for unseen styles) in a
single pass, which is more flexible and efficient than previous methods.
Experiments have been conducted to demonstrate the superiority of our method
over previous works.
Comment: AIG
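As a rough illustration, the sketch below shows a two-path cross-attention
block in the spirit of TPCA: the diffusion latent attends to the prompt
features and to the style-reference features through two separate paths. It is
not the StyleAdapter implementation; all dimensions, the learnable style scale,
and the additive fusion are assumptions.

import torch
import torch.nn as nn

class TwoPathCrossAttention(nn.Module):
    """The latent attends to prompt features and style features on separate paths."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.style_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.style_scale = nn.Parameter(torch.tensor(1.0))  # learnable style strength

    def forward(self, latent, text_feat, style_feat):
        # latent:     (B, L, C) diffusion U-Net tokens
        # text_feat:  (B, T, C) prompt embeddings
        # style_feat: (B, S, C) style-reference embeddings
        q = self.norm(latent)
        text_out, _ = self.text_attn(q, text_feat, text_feat)
        style_out, _ = self.style_attn(q, style_feat, style_feat)
        # Summing the two paths keeps content control (text) separate from style injection.
        return latent + text_out + self.style_scale * style_out

x = TwoPathCrossAttention()(torch.randn(1, 64, 320),
                            torch.randn(1, 77, 320),
                            torch.randn(1, 16, 320))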
Diffusion-based Blind Text Image Super-Resolution
Recovering degraded low-resolution text images is challenging, especially for
Chinese text images with complex strokes and severe degradation in real-world
scenarios. Ensuring both text fidelity and style realness is crucial for
high-quality text image super-resolution. Recently, diffusion models have
achieved great success in natural image synthesis and restoration due to their
powerful data distribution modeling abilities and data generation capabilities.
In this work, we propose an Image Diffusion Model (IDM) to restore text images
with realistic styles. Diffusion models are suitable not only for modeling
realistic image distributions but also for learning the text distribution.
Since existing work has shown that a text prior is important for guaranteeing
the correctness of the restored text structure, we also propose a Text
Diffusion Model (TDM) for text recognition which can guide IDM to generate text
images with correct structures. We further propose a Mixture of Multi-modality
module (MoM) to make these two diffusion models cooperate with each other in
all the diffusion steps. Extensive experiments on synthetic and real-world
datasets demonstrate that our Diffusion-based Blind Text Image Super-Resolution
(DiffTSR) can restore text images with more accurate text structures as well as
more realistic appearances simultaneously.
Comment: Accepted by CVPR202
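The cooperation pattern can be sketched schematically as below, with toy
stand-in modules (not the actual IDM, TDM, or MoM networks): at every diffusion
step the image branch and the text branch exchange conditions produced by a
mixing module. All shapes and module definitions are illustrative assumptions.

import torch
import torch.nn as nn

class ToyIDM(nn.Module):
    def __init__(self, ch=3, cond_dim=32):
        super().__init__()
        self.net = nn.Conv2d(ch + cond_dim, ch, 3, padding=1)

    def forward(self, img_t, text_cond):
        # Broadcast the text condition over the spatial grid, then denoise the image.
        cond = text_cond[:, :, None, None].expand(-1, -1, *img_t.shape[-2:])
        return self.net(torch.cat([img_t, cond], dim=1))

class ToyTDM(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Linear(2 * dim, dim)

    def forward(self, text_t, img_cond):
        return self.net(torch.cat([text_t, img_cond], dim=-1))

class ToyMoM(nn.Module):
    """Produces the cross-modal conditions each branch feeds to the other."""
    def __init__(self, ch=3, dim=32):
        super().__init__()
        self.img_to_text = nn.Linear(ch, dim)
        self.text_to_img = nn.Linear(dim, dim)

    def forward(self, img_t, text_t):
        img_cond = self.img_to_text(img_t.mean(dim=(-2, -1)))  # pooled image summary
        text_cond = self.text_to_img(text_t)
        return text_cond, img_cond

idm, tdm, mom = ToyIDM(), ToyTDM(), ToyMoM()
img, text = torch.randn(1, 3, 32, 128), torch.randn(1, 32)
for step in range(5):  # both chains are advanced jointly at every diffusion step
    text_cond, img_cond = mom(img, text)
    img = idm(img, text_cond)
    text = tdm(text, img_cond)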
Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection
Multi-label image classification is a fundamental but challenging task
towards general visual understanding. Existing methods have found that
region-level cues (e.g., features from RoIs) can facilitate multi-label
classification.
Nevertheless, such methods usually require laborious object-level annotations
(i.e., object labels and bounding boxes) for effective learning of the
object-level visual features. In this paper, we propose a novel and efficient
deep framework to boost multi-label classification by distilling knowledge from
a weakly-supervised detection task without bounding box annotations.
Specifically, given the image-level annotations, (1) we first develop a
weakly-supervised detection (WSD) model, and then (2) construct an end-to-end
multi-label image classification framework augmented by a knowledge
distillation module that guides the classification model by the WSD model
according to the class-level predictions for the whole image and the
object-level visual features for object RoIs. The WSD model is the teacher
model and the classification model is the student model. After this cross-task
knowledge distillation, the performance of the classification model is
significantly improved and the efficiency is maintained since the WSD model can
be safely discarded in the test phase. Extensive experiments on two large-scale
datasets (MS-COCO and NUS-WIDE) show that our framework achieves superior
results over the state-of-the-art methods in both performance and efficiency.
Comment: accepted by ACM Multimedia 2018, 9 pages, 4 figures, 5 tables
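A minimal sketch of such a cross-task distillation objective is given below,
assuming the frozen WSD model is the teacher and the classifier is the student.
The loss terms, temperature, and weights are illustrative assumptions rather
than the paper's exact formulation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                      labels, T=2.0, alpha=0.5, beta=0.1):
    # Supervised multi-label loss on ground-truth image-level labels.
    bce = F.binary_cross_entropy_with_logits(student_logits, labels)
    # Class-level distillation: match the teacher's soft image-level predictions.
    soft = F.binary_cross_entropy_with_logits(student_logits / T,
                                              torch.sigmoid(teacher_logits / T))
    # Feature-level distillation: match pooled object-level features of the WSD teacher.
    feat = F.mse_loss(student_feat, teacher_feat.detach())
    return bce + alpha * soft + beta * feat

loss = distillation_loss(torch.randn(4, 80), torch.randn(4, 80),
                         torch.randn(4, 512), torch.randn(4, 512),
                         torch.randint(0, 2, (4, 80)).float())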
Image Deblurring Aided by Low-Resolution Events
Due to the limitations of event sensors, the spatial resolution of event data is relatively low compared with that of conventional frame-based cameras. However, the low-spatial-resolution events recorded by event cameras are rich in temporal information, which is helpful for image deblurring, while the intensity images captured by frame cameras are of high resolution and have the potential to enhance the quality of events. Considering the complementarity between events and intensity images, an alternating model is proposed in this paper to deblur high-resolution images with the help of low-resolution events. The model is composed of two components: a DeblurNet and an EventSRNet. It first uses the DeblurNet to attain a preliminary sharp image aided by the low-resolution events. Then, it enhances the quality of the events with the EventSRNet by extracting structure information from the generated sharp image. Finally, the enhanced events are sent back into the DeblurNet to attain a higher-quality intensity image. Extensive evaluations on the synthetic GoPro dataset and the real RGB-DAVIS dataset have shown the effectiveness of the proposed method.
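The alternating inference loop can be sketched as below with toy stand-in
networks (the actual DeblurNet and EventSRNet architectures are not reproduced
here); the event channel count, resolutions, and number of alternations are
assumptions.

import torch
import torch.nn as nn

class ToyDeblurNet(nn.Module):
    def __init__(self, img_ch=3, ev_ch=2):
        super().__init__()
        self.net = nn.Conv2d(img_ch + ev_ch, img_ch, 3, padding=1)

    def forward(self, blurry, events):
        return self.net(torch.cat([blurry, events], dim=1))  # predict a sharper image

class ToyEventSRNet(nn.Module):
    def __init__(self, img_ch=3, ev_ch=2):
        super().__init__()
        self.net = nn.Conv2d(img_ch + ev_ch, ev_ch, 3, padding=1)

    def forward(self, events_lr, sharp):
        # Upsample the low-resolution events and refine them with image structure.
        events_up = nn.functional.interpolate(events_lr, size=sharp.shape[-2:],
                                              mode='bilinear', align_corners=False)
        return self.net(torch.cat([sharp, events_up], dim=1))

deblur, event_sr = ToyDeblurNet(), ToyEventSRNet()
blurry = torch.randn(1, 3, 128, 128)          # high-resolution blurry frame
events = torch.randn(1, 2, 32, 32)            # low-resolution event representation
events_hr = nn.functional.interpolate(events, size=blurry.shape[-2:],
                                      mode='bilinear', align_corners=False)
for _ in range(2):                            # alternate between the two networks
    sharp = deblur(blurry, events_hr)         # deblur aided by the current events
    events_hr = event_sr(events, sharp)       # enhance events using the sharp image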