Adaptive Temporal Encoding Network for Video Instance-level Human Parsing
Beyond the existing single-person and multiple-person human parsing tasks in
static images, this paper makes the first attempt to investigate a more
realistic video instance-level human parsing that simultaneously segments out
each person instance and parses each instance into more fine-grained parts
(e.g., head, leg, dress). We introduce a novel Adaptive Temporal Encoding
Network (ATEN) that alternately performs temporal encoding among key frames
and flow-guided feature propagation from other consecutive frames between two
key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the
instance-level parsing result for each key frame, which integrates both the
global human parsing and instance-level human segmentation into a unified
model. To balance accuracy and efficiency, flow-guided feature
propagation is used to directly parse consecutive frames according to their
identified temporal consistency with key frames. On the other hand, ATEN
leverages convolutional gated recurrent units (convGRU) to exploit temporal
changes over a series of key frames, which are further used to facilitate
instance-level parsing at the frame level. By alternating between direct
feature propagation across consistent frames and temporal encoding among key
frames, our ATEN achieves a good balance between frame-level accuracy and time
efficiency, a crucial and common problem in video object segmentation
research. To demonstrate the superiority of our ATEN, extensive experiments are
conducted on the most popular video segmentation benchmark (DAVIS) and a newly
collected Video Instance-level Parsing (VIP) dataset, which is the first video
instance-level human parsing dataset comprised of 404 sequences and over 20k
frames with instance-level and pixel-wise annotations.
Comment: To appear in ACM MM 2018. Code link: https://github.com/HCPLab-SYSU/ATEN. Dataset link: http://sysu-hcp.net/li
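The alternation ATEN describes, full parsing plus temporal encoding on key frames and cheap flow-guided propagation in between, can be illustrated with a short sketch. This is a minimal illustration in PyTorch, assuming hypothetical backbone, flow_net, conv_gru, and parsing_head modules; it is not the released ATEN code (see the repository linked above for that).

```python
# Minimal sketch of ATEN-style alternation; module names are hypothetical
# placeholders, not the released ATEN API.
import torch
import torch.nn.functional as F

def warp(features, flow):
    """Bilinearly warp a (N, C, H, W) feature map along a (N, 2, H, W) flow."""
    n, _, h, w = features.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0) + flow.permute(0, 2, 3, 1)
    # Normalize sampling locations to [-1, 1] as grid_sample expects.
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(features, grid, align_corners=True)

def parse_video(frames, key_interval, backbone, flow_net, conv_gru, parsing_head):
    results, hidden = [], None
    for t, frame in enumerate(frames):
        if t % key_interval == 0:
            # Key frame: full feature extraction, then temporal encoding
            # (convGRU) over the sequence of key-frame features.
            hidden = conv_gru(backbone(frame), hidden)  # hidden=None on first call
            results.append(parsing_head(hidden))
        else:
            # Non-key frame: propagate the encoded key-frame features along
            # the estimated optical flow instead of re-running the backbone.
            flow = flow_net(frame, frames[t - t % key_interval])
            results.append(parsing_head(warp(hidden, flow)))
    return results
```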
Towards High Performance Video Object Detection
There has been significant progress in image object detection in recent
years. Nevertheless, video object detection has received little attention,
although it is more challenging and more important in practical scenarios.
Building upon recent works, this work proposes a unified approach based on
the principle of multi-frame end-to-end learning of features and cross-frame
motion. Our approach extends prior works with three new techniques and steadily
pushes forward the performance envelope (speed-accuracy tradeoff) towards
high-performance video object detection.
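The abstract leaves the three techniques unnamed, but the stated principle, learning features jointly with cross-frame motion so that expensive per-frame computation can be amortized over sparse key frames, can be sketched. The motion-magnitude heuristic for refreshing key frames and all module names below are assumptions for illustration, not the paper's techniques; the sketch reuses the warp() helper from the ATEN sketch above.

```python
# Hedged sketch of a speed-accuracy tradeoff via adaptive key frames; the
# threshold heuristic is an illustrative assumption.
def detect_video(frames, backbone, flow_net, det_head, motion_threshold=1.5):
    detections, key_feat, key_frame = [], None, None
    for frame in frames:
        flow = None if key_frame is None else flow_net(frame, key_frame)
        if flow is None or flow.abs().mean() > motion_threshold:
            # Large motion (or first frame): pay for a full backbone pass.
            key_feat, key_frame = backbone(frame), frame
            cur_feat = key_feat
        else:
            # Small motion: cheaply warp key-frame features to this frame.
            cur_feat = warp(key_feat, flow)
        detections.append(det_head(cur_feat))
    return detections
```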
Flow-Guided Feature Aggregation for Video Object Detection
Extending state-of-the-art object detectors from image to video is
challenging. The accuracy of detection suffers from degenerated object
appearances in videos, e.g., motion blur, video defocus, rare poses, etc.
Existing work attempts to exploit temporal information at the box level, but
such methods are not trained end-to-end. We present flow-guided feature
aggregation, an accurate and end-to-end learning framework for video object
detection. It instead leverages temporal coherence at the feature level: it
improves the per-frame features by aggregating nearby features along motion
paths, and thus improves video recognition accuracy. Our method significantly
improves upon strong single-frame baselines on ImageNet VID, especially for
more challenging fast-moving objects. Our framework is principled, and on par
with the best engineered systems winning the ImageNet VID challenges 2016,
without additional bells-and-whistles. The proposed method, together with Deep
Feature Flow, powered the winning entry of ImageNet VID challenges 2017. The
code is available at
https://github.com/msracver/Flow-Guided-Feature-Aggregation
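The feature-level aggregation the abstract describes, warping nearby frames' features onto the reference frame along motion paths and then fusing them adaptively, can be sketched as below. It reuses the warp() helper from the ATEN sketch above; the cosine-similarity weighting is an assumption in the spirit of the method, and backbone/flow_net are hypothetical modules, not the released code.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of flow-guided feature aggregation; the weighting scheme and
# module names are assumptions, not the released implementation.
def aggregate_features(frames, t, radius, backbone, flow_net):
    """Fuse features of frames [t - radius, t + radius] onto frame t."""
    ref_feat = backbone(frames[t])
    warped, weights = [], []
    for i in range(max(0, t - radius), min(len(frames), t + radius + 1)):
        # Warp the neighbor's features onto frame t along the estimated flow.
        feat = warp(backbone(frames[i]), flow_net(frames[t], frames[i]))
        warped.append(feat)
        # Adaptive weight: per-pixel similarity to the reference features.
        weights.append(F.cosine_similarity(feat, ref_feat, dim=1))
    w = torch.softmax(torch.stack(weights), dim=0).unsqueeze(2)  # (K, N, 1, H, W)
    return (torch.stack(warped) * w).sum(dim=0)  # aggregated (N, C, H, W)
```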
Topological Floquet edge states in periodically curved waveguides
We study the Floquet edge states in arrays of periodically curved optical
waveguides described by the modulated Su-Schrieffer-Heeger model. Beyond the
bulk-edge correspondence, our study explores the interplay between band
topology and periodic modulations. By analysing the quasi-energy spectra and
Zak phase, we reveal that, although topological and non-topological edge states
can exist for the same parameters, they cannot appear in the same spectral
gap. In the high-frequency limit, we analytically find all boundaries
between the different phases and study the coexistence of topological and
non-topological edge states. In contrast to unmodulated systems, the edge
states appear due to either band topology or modulation-induced defects. This
means that periodic modulations may not only tune the parametric regions with
nontrivial topology, but may also support novel edge states.
Comment: 11 pages, 5 figures
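For orientation, here is a hedged sketch of the ingredients the abstract names: an SSH chain whose couplings are modulated periodically along the propagation distance z (which plays the role of time in waveguide arrays), and the Zak phase as the topological diagnostic. The cosine modulation profile is an illustrative assumption, not necessarily the paper's exact form.

```latex
% Hedged sketch: SSH chain with z-periodic couplings; the cosine profile
% is an illustrative assumption.
H(z) = \sum_n \left[ t_1(z)\, \hat{a}_n^{\dagger} \hat{b}_n
                   + t_2(z)\, \hat{b}_n^{\dagger} \hat{a}_{n+1} + \mathrm{h.c.} \right],
\qquad t_j(z) = \bar{t}_j + \delta_j \cos(\Omega z) .
% Zak phase of a band with Bloch states |u_k>, over the Brillouin zone:
\gamma_{\mathrm{Zak}} = i \oint_{\mathrm{BZ}} \langle u_k \mid \partial_k u_k \rangle \, dk .
```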
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models
The revolution of artificial intelligence content generation has been rapidly
accelerated with the booming text-to-image (T2I) diffusion models. Within just
two years of development, state-of-the-art models could generate images of
unprecedented quality, diversity, and creativity. However, a prevalent
limitation persists: communicating effectively with these popular T2I models,
such as Stable Diffusion, through natural language descriptions. This
typically makes an engaging image hard to obtain without expertise in prompt
engineering, with its complex word compositions, magic tags, and annotations.
Inspired by the recently released DALLE3, a T2I model built directly into
ChatGPT that speaks human language, we revisit existing T2I systems'
endeavors to align with human intent and introduce a new task: interactive
text to image (iT2I), in which people interact with an LLM, in natural
language, for interleaved high-quality image generation, editing, and
refinement together with question answering, with stronger correspondence
between images and text. In addressing
the iT2I problem, we present a simple approach that augments LLMs for iT2I with
prompting techniques and off-the-shelf T2I models. We evaluate our approach for
iT2I in a variety of commonly used scenarios under different LLMs, e.g., ChatGPT,
LLAMA, Baichuan, and InternLM. We demonstrate that our approach could be a
convenient and low-cost way to endow any existing LLM and any text-to-image
model with the iT2I ability, without any training, while causing little
degradation of the LLMs' inherent capabilities in, e.g., question answering and
code generation. We hope this work could draw broader attention and provide
inspiration for boosting user experience in human-machine interactions
alongside the image quality of next-generation T2I systems.
Comment: Technical report. Project page at https://minidalle3.github.io
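The "prompting plus off-the-shelf T2I" recipe can be illustrated with a minimal sketch: instruct the LLM to wrap any image it wants to show in a tag, then route each tagged prompt to a T2I model. The tag convention and the llm/t2i callables are hypothetical placeholders, not the project's actual interface (see the project page above).

```python
import re

# Hedged sketch of iT2I via prompting; the tag convention and the llm/t2i
# callables are illustrative assumptions, not the project's interface.
SYSTEM_PROMPT = (
    "You are a helpful assistant. When the user asks for an image, reply "
    "normally, but embed a detailed English prompt for a text-to-image model "
    "inside <image>...</image> tags."
)

def chat_with_images(user_message, llm, t2i, history=None):
    history = (history or []) + [{"role": "user", "content": user_message}]
    reply = llm(SYSTEM_PROMPT, history)  # any instruction-following LLM
    # Route every tagged prompt to an off-the-shelf T2I model.
    images = [t2i(p) for p in re.findall(r"<image>(.*?)</image>", reply, re.S)]
    # Show the user the text with tags replaced by the generated images.
    text = re.sub(r"<image>.*?</image>", "[image]", reply, flags=re.S)
    return text, images, history + [{"role": "assistant", "content": reply}]
```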