Text Growing on Leaf
Irregular-shaped texts bring challenges to Scene Text Detection (STD).
Although existing contour point sequence-based approaches achieve comparable
performance, they fail to cover some highly curved, ribbon-like text lines.
This limits both text-fitting ability and the application of STD techniques.
Considering the above problem, we combine text geometric characteristics with
bionics to design a natural leaf vein-based text representation method (LVT).
Concretely, we observe that a leaf vein system is generally a directed graph,
which can easily cover various geometries. Inspired by this, we treat the text
contour as a leaf margin and represent it through main, lateral, and thin
veins. We further
construct a detection framework based on LVT, namely LeafText. In the text
reconstruction stage, LeafText simulates the leaf growth process to rebuild
text contour. It first grows the main vein in Cartesian coordinates to locate
text roughly. Then, lateral and thin veins are generated along the main vein's
growth direction in polar coordinates; they are responsible for generating the
coarse contour and refining it, respectively. Considering the deep dependency
of the lateral and thin veins on the main vein, a Multi-Oriented Smoother (MOS)
is proposed to enhance the robustness of the main vein and ensure reliable
detection results. Additionally, we propose a global incentive loss to accelerate the
prediction of lateral and thin veins. Ablation experiments demonstrate that LVT
can depict arbitrary-shaped texts precisely and verify the effectiveness of
MOS and the global incentive loss. Comparisons show that LeafText is superior to
existing state-of-the-art (SOTA) methods on the MSRA-TD500, CTW1500, Total-Text,
and ICDAR2015 datasets.
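The centerline-plus-lateral-offset idea behind a vein-style text representation can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: `contour_from_main_vein`, its inputs, and the use of perpendicular offsets as stand-ins for lateral veins are all hypothetical.

```python
import numpy as np

def contour_from_main_vein(vein, widths):
    """Rebuild a closed text contour from a main-vein polyline.

    vein   : (N, 2) array of centerline points in Cartesian coordinates
    widths : (N,) array of lateral offsets, one per vein point

    Lateral points are placed perpendicular to the local vein direction,
    loosely mimicking lateral veins growing out from the main vein.
    """
    vein = np.asarray(vein, dtype=float)
    widths = np.asarray(widths, dtype=float)
    # Tangent direction at each vein point (central differences).
    tangents = np.gradient(vein, axis=0)
    tangents /= np.linalg.norm(tangents, axis=1, keepdims=True)
    # Rotate tangents 90 degrees to get unit normals.
    normals = np.stack([-tangents[:, 1], tangents[:, 0]], axis=1)
    upper = vein + normals * widths[:, None]
    lower = vein - normals * widths[:, None]
    # Walk along one side and back down the other to close the contour.
    return np.concatenate([upper, lower[::-1]], axis=0)
```

Because the contour follows the centerline rather than a bounding polygon, the same routine handles straight, curved, and ribbon-like centerlines without changing the representation.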
Zoom Text Detector
To pursue comprehensive performance, recent text detectors improve detection
speed at the expense of accuracy. They adopt shrink-mask-based text
representation strategies, which makes detection accuracy highly dependent on
shrink-masks. Unfortunately, three disadvantages cause unreliable
shrink-masks. Specifically, these methods try to strengthen the discrimination
of shrink-masks from the background via semantic information. However, the
feature-defocusing phenomenon, in which coarse layers are optimized by
fine-grained objectives, limits the extraction of semantic features. Meanwhile,
since both shrink-masks and the margins belong to texts, the detail-loss
phenomenon, in which the margins are ignored, hinders distinguishing
shrink-masks from the margins and causes ambiguous shrink-mask edges. Moreover,
false-positive samples share visual features similar to those of shrink-masks,
which further degrades shrink-mask recognition. To avoid the above problems, we propose a
Zoom Text Detector (ZTD) inspired by the zoom process of the camera.
Specifically, Zoom Out Module (ZOM) is introduced to provide coarse-grained
optimization objectives for coarse layers to avoid feature defocusing.
Meanwhile, Zoom In Module (ZIM) is presented to enhance margin recognition
and prevent detail loss. Furthermore, a Sequential-Visual Discriminator (SVD) is
designed to suppress false-positive samples using sequential and visual features.
Experiments verify the superior comprehensive performance of ZTD.
Propagate And Calibrate: Real-time Passive Non-line-of-sight Tracking
Non-line-of-sight (NLOS) tracking has drawn increasing attention in recent
years, due to its ability to detect object motion out of sight. Most previous
works on NLOS tracking rely on active illumination, e.g., laser, and suffer
from high cost and elaborate experimental conditions. Besides, these techniques
are still far from practical application due to oversimplified settings. In
contrast, we propose a purely passive method to track a person walking in an
invisible room by only observing a relay wall, which is more in line with real
application scenarios, e.g., security. To extract the imperceptible changes in
videos of the relay wall, we introduce difference frames as an essential
carrier of temporal-local motion messages. In addition, we propose PAC-Net,
which consists of alternating propagation and calibration, making it capable of
leveraging both dynamic and static messages on a frame-level granularity. To
evaluate the proposed method, we build and publish the first dynamic passive
NLOS tracking dataset, NLOS-Track, which fills the vacuum of realistic NLOS
datasets. NLOS-Track contains thousands of NLOS video clips and corresponding
trajectories. Both real-shot and synthetic data are included. Our codes and
dataset are available at https://againstentropy.github.io/NLOS-Track/.
Comment: CVPR 2023 camera-ready version.
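Difference frames of the kind described above can be computed with simple frame-to-frame subtraction. This is a minimal sketch of the general idea, not PAC-Net's actual preprocessing; the function name and interface are assumptions.

```python
import numpy as np

def difference_frames(video):
    """Compute difference frames from a relay-wall video.

    video : (T, H, W) array of grayscale frames.

    Returns the (T-1, H, W) frame-to-frame differences, which amplify
    the faint temporal-local intensity changes that a hidden moving
    person casts onto the relay wall, while suppressing the static
    background shared by consecutive frames.
    """
    video = np.asarray(video, dtype=float)
    return np.diff(video, axis=0)
```

Subtracting consecutive frames cancels the dominant static illumination, so the residual signal is carried almost entirely by motion, which is why such frames work as a carrier of temporal-local motion messages.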
Traffic Sign Interpretation in Real Road Scene
Most existing traffic sign-related works are dedicated to detecting and
recognizing individual traffic signs, which fails to analyze the
global semantic logic among signs and may convey inaccurate traffic
instructions. To address the above issues, we propose a traffic sign
interpretation (TSI) task, which aims to interpret globally semantically
interrelated traffic signs (e.g., driving instruction-related texts, symbols,
and guide panels) into natural language that provides accurate instruction
support for autonomous or assisted driving. Meanwhile, we design a multi-task
learning architecture for TSI, which is responsible for detecting and
recognizing various traffic signs and interpreting them into natural language
like a human. Furthermore, the absence of a publicly available TSI dataset prompts us to
build a traffic sign interpretation dataset, namely TSI-CN. The dataset
consists of real road scene images captured from highways and urban roads in
China from a driver's perspective. It contains rich location labels for texts,
symbols, and guide panels, along with the corresponding natural language
description labels. Experiments on TSI-CN demonstrate that the TSI task is
achievable and that the TSI architecture can successfully interpret traffic
signs from scenes even when the semantic logic among signs is complex. The
TSI-CN dataset and the source code of the TSI architecture will be made
publicly available after the revision process.
One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field
Talking head generation aims to generate faces that maintain the identity
information of the source image and imitate the motion of the driving image.
Most pioneering methods rely primarily on 2D representations and thus will
inevitably suffer from face distortion when large head rotations are
encountered. Recent works instead employ explicit 3D structural representations
or implicit neural rendering to improve performance under large pose changes.
Nevertheless, the fidelity of identity and expression is not so desirable,
especially for novel-view synthesis. In this paper, we propose HiDe-NeRF, which
achieves high-fidelity and free-view talking-head synthesis. Drawing on the
recently proposed Deformable Neural Radiance Fields, HiDe-NeRF represents the
dynamic 3D scene with a canonical appearance field and an implicit deformation
field, where the former comprises the canonical source face and the latter
models the driving pose and expression. In particular, we improve fidelity from
two aspects: (i) to enhance identity expressiveness, we design a generalized
appearance module that leverages multi-scale volume features to preserve face
shape and details; (ii) to improve expression preciseness, we propose a
lightweight deformation module that explicitly decouples the pose and
expression to enable precise expression modeling. Extensive experiments
demonstrate that our proposed approach can generate better results than
previous works. Project page: https://www.waytron.net/hidenerf/
Comment: Accepted by CVPR 202
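The canonical-plus-deformation split common to deformable radiance fields can be sketched as a single point query. This is a generic illustration of the pattern, not HiDe-NeRF's architecture; `render_sample` and both field callables are hypothetical stand-ins for learned networks.

```python
import numpy as np

def render_sample(x, canonical_field, deformation_field, driving_code):
    """Query a deformable radiance field at a 3D point x.

    The implicit deformation field, conditioned on a driving pose /
    expression code, predicts a displacement that warps the
    observation-space point into canonical space. The canonical
    appearance field then returns (rgb, sigma) for volume rendering.
    """
    offset = deformation_field(x, driving_code)  # predicted displacement
    x_canonical = x + offset                     # warp into canonical space
    return canonical_field(x_canonical)          # (rgb, sigma)
```

Decoupling the two fields this way is what lets one canonical face be reused across driving poses and expressions: only the deformation field's conditioning changes between frames.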