Human 3D Avatar Modeling with Implicit Neural Representation: A Brief Survey
A human 3D avatar is one of the key elements of the metaverse, and the
quality of its modeling directly affects people's visual experience. However,
the human body has a complex topology and diverse details, so building a
satisfactory model is often expensive, time-consuming, and laborious. Recent
studies have proposed a novel approach, implicit neural representation: a
continuous representation that can describe objects of arbitrary topology at
arbitrary resolution. Researchers have applied implicit neural representation
to human 3D avatar modeling and achieved better results than traditional
methods. This paper comprehensively reviews the application of
implicit neural representation in human body modeling. First, we introduce
three implicit representations, namely the occupancy field, the signed
distance function (SDF), and the neural radiance field (NeRF), and classify
the literature investigated in this paper. Then the applications of implicit
modeling methods to the body, hands, and head are compared and analyzed
respectively. Finally, we point out the shortcomings of current work and
provide suggestions for researchers.
Comment: A Brief Survey
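The defining property of an implicit representation, a continuous function that can be queried at any point and hence at any resolution, can be sketched in a few lines. Below, an analytic sphere SDF stands in for the trained MLP the surveyed papers use; everything here (shape, grid ranges, resolutions) is an illustrative assumption, not taken from any specific paper:

```python
import numpy as np

# A shape is represented as a continuous function f(p) -> value rather than
# a mesh or voxel grid, so it can be sampled at arbitrary resolution.
# Here the "network" is an analytic signed distance function of a unit
# sphere; in the surveyed works this function is a trained MLP.

def sphere_sdf(points, radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points, axis=-1) - radius

def occupancy(points, radius=1.0):
    """Occupancy field derived from the SDF: 1 inside the surface, 0 outside."""
    return (sphere_sdf(points, radius) < 0).astype(np.float32)

def sample_grid(resolution):
    """Regular 3D query grid over the box [-1.5, 1.5]^3."""
    axis = np.linspace(-1.5, 1.5, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return grid.reshape(-1, 3)

# The same shape queried on coarse and fine grids, with no remeshing needed.
coarse = occupancy(sample_grid(8))
fine = occupancy(sample_grid(64))
```

The key point the sketch shows: changing the output resolution is just changing the set of query points, which is what makes these representations attractive for detailed avatar surfaces.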
Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences
Multimodal Sentiment Analysis (MSA) aims to mine sentiment information from
text, visual, and acoustic modalities. Previous works have focused on
representation learning and feature fusion strategies. However, most of these
efforts have ignored the disparity in semantic richness across modalities and
treated each modality in the same manner, which may lead to strong modalities
being neglected and weak modalities being overvalued. Motivated by
these observations, we propose a Text-oriented Modality Reinforcement Network
(TMRN), which focuses on the dominance of the text modality in MSA. More
specifically, we design a Text-Centered Cross-modal Attention (TCCA) module
to enable full interaction between the text/acoustic and text/visual pairs,
and a Text-Gated Self-Attention (TGSA) module to guide the self-reinforcement
of the other two modalities. Furthermore, we present an adaptive fusion
mechanism to decide the
proportion of different modalities involved in the fusion process. Finally, we
combine the feature matrices into vectors to get the final representation for
the downstream tasks. Experimental results show that our TMRN outperforms the
state-of-the-art methods on two MSA benchmarks.
Comment: Accepted by CICAI 2023 (Finalist of Best Student Paper Award).
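The core of a text-centered cross-modal attention can be sketched generically: the text sequence supplies the queries, while another modality supplies keys and values, so text pulls the information it needs from the weaker modality. This is an illustration of the general mechanism, not the authors' TCCA implementation; all dimensions, weights, and the function name are toy assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_centered_attention(text, other, d_k=16, rng=np.random.default_rng(0)):
    """Cross-modal attention with text as the query side.
    text: (T_t, d), other: (T_o, d). Returns (T_t, d): other-modality
    features re-expressed on the text time axis."""
    d = text.shape[-1]
    # Random projections stand in for learned weight matrices.
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d)) / np.sqrt(d)
    q, k, v = text @ W_q, other @ W_k, other @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_k))   # (T_t, T_o) alignment weights
    return attn @ v

rng = np.random.default_rng(1)
text_feats = rng.standard_normal((6, 32))    # 6 text tokens
audio_feats = rng.standard_normal((20, 32))  # 20 acoustic frames
fused = text_centered_attention(text_feats, audio_feats)
```

Note that the output length matches the text sequence, which is what makes the text modality the anchor for the later fusion step.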
SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model
With the development of large language models, many remarkable linguistic
systems like ChatGPT have thrived and achieved astonishing success on many
tasks, showing the incredible power of foundation models. In the spirit of
unleashing the capability of foundation models on vision tasks, the Segment
Anything Model (SAM), a vision foundation model for image segmentation, has
been proposed recently and presents strong zero-shot ability on many downstream
2D tasks. However, whether SAM can be adapted to 3D vision tasks remains
unexplored, especially for 3D object detection. Inspired by this, we explore
adapting the zero-shot ability of SAM to 3D object detection in this paper. We
propose a SAM-powered BEV processing pipeline to detect objects and get
promising results on the large-scale Waymo open dataset. As an early attempt,
our method takes a step toward 3D object detection with vision foundation
models and presents the opportunity to unleash their power on 3D vision tasks.
The code is released at https://github.com/DYZhang09/SAM3D.
Comment: Technical Report. The code is released at
https://github.com/DYZhang09/SAM3D.
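The abstract does not detail the BEV processing, so the sketch below only illustrates the generic pre/post-processing such a pipeline needs: LiDAR points are rasterized into a top-down image that a 2D segmenter like SAM can consume, and a 2D mask is mapped back to a ground-plane box. Grid ranges and resolution are invented values, not the paper's:

```python
import numpy as np

def points_to_bev(points, x_range=(-40, 40), y_range=(-40, 40), res=0.5):
    """Rasterize a LiDAR point cloud (N, 3) into a 2D BEV occupancy image."""
    w = int((x_range[1] - x_range[0]) / res)
    h = int((y_range[1] - y_range[0]) / res)
    ix = ((points[:, 0] - x_range[0]) / res).astype(int)
    iy = ((points[:, 1] - y_range[0]) / res).astype(int)
    keep = (ix >= 0) & (ix < w) & (iy >= 0) & (iy < h)
    bev = np.zeros((h, w), dtype=np.float32)
    bev[iy[keep], ix[keep]] = 1.0
    return bev

def mask_to_bev_box(mask, x_range=(-40, 40), y_range=(-40, 40), res=0.5):
    """Map a 2D boolean segmentation mask back to metric BEV coordinates as
    an axis-aligned ground-plane box (x_min, y_min, x_max, y_max)."""
    ys, xs = np.nonzero(mask)
    return (x_range[0] + xs.min() * res, y_range[0] + ys.min() * res,
            x_range[0] + (xs.max() + 1) * res, y_range[0] + (ys.max() + 1) * res)
```

The appeal of this route is that once the scene is a 2D image, any off-the-shelf 2D foundation model can be applied zero-shot; the 3D-specific work reduces to the rasterization and back-projection above.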
Boosting the Transferability of Adversarial Attacks with Global Momentum Initialization
Deep neural networks are vulnerable to adversarial examples, which attach
human-imperceptible perturbations to benign inputs. Moreover, adversarial
examples exhibit transferability across different models, which makes
practical black-box attacks feasible. However, existing methods are still
incapable of achieving the desired transfer attack performance. In this work,
from the
perspective of gradient optimization and consistency, we analyze and discover
the gradient elimination phenomenon as well as the local momentum optimum
dilemma. To tackle these issues, we propose Global Momentum Initialization (GI)
to suppress gradient elimination and help search for the global optimum.
Specifically, we perform gradient pre-convergence before the attack and carry
out a global search during the pre-convergence stage. Our method can be easily
combined with almost all existing transfer methods, and we improve the success
rate of transfer attacks significantly by an average of 6.4% under various
advanced defense mechanisms compared to state-of-the-art methods. Eventually,
we achieve an attack success rate of 95.4%, fully illustrating the insecurity
of existing defense mechanisms.
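The two-stage structure described above, a pre-convergence phase that warms up the momentum buffer with a coarse global search, followed by the usual momentum-based iterations, can be sketched on a toy loss. This illustrates the idea only, not the paper's algorithm: `loss_grad`, the step sizes, and the iteration counts are all made-up values.

```python
import numpy as np

def loss_grad(x):
    # Gradient of a toy loss L(x) = 0.5 * ||x - target||^2 (stand-in for a
    # real model's loss gradient w.r.t. the input).
    target = np.array([3.0, -2.0])
    return x - target

def momentum_attack(x0, eps=1.0, steps=10, mu=1.0, pre_steps=5):
    alpha = eps / steps
    g = np.zeros_like(x0)
    # Pre-convergence stage: accumulate a global momentum direction with a
    # larger search step, then discard the iterate and keep only the momentum.
    x = x0.copy()
    for _ in range(pre_steps):
        grad = loss_grad(x)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x = x + eps * np.sign(g)           # coarse global search step
    # Attack stage: restart from the clean input with warmed-up momentum,
    # instead of the usual zero initialization.
    x = x0.copy()
    for _ in range(steps):
        grad = loss_grad(x)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x = np.clip(x + alpha * np.sign(g), x0 - eps, x0 + eps)
    return x

adv = momentum_attack(np.array([0.0, 0.0]))
```

The only change relative to a standard momentum iterative attack is that `g` is non-zero when the attack stage begins, which is the "global momentum initialization" the title refers to.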
Improving Generalization in Visual Reinforcement Learning via Conflict-aware Gradient Agreement Augmentation
Learning a policy with great generalization to unseen environments remains
challenging but critical in visual reinforcement learning. Despite the
success of augmentation combination for generalization in supervised
learning, naively applying it to visual RL algorithms can damage training
efficiency and cause severe performance degradation. In this paper, we first
conduct a qualitative analysis and identify the main causes: (i)
high-variance gradient magnitudes and (ii) gradient conflicts among various
augmentation methods.
To alleviate these issues, we propose a general policy gradient optimization
framework, named Conflict-aware Gradient Agreement Augmentation (CG2A), which
better integrates augmentation combination into visual RL algorithms to
address the generalization bias. In particular, CG2A develops a Gradient
Agreement
Solver to adaptively balance the varying gradient magnitudes, and introduces a
Soft Gradient Surgery strategy to alleviate the gradient conflicts. Extensive
experiments demonstrate that CG2A significantly improves the generalization
performance and sample efficiency of visual RL algorithms.
Comment: Accepted by ICCV 2023.
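The two mechanisms named in the abstract, balancing gradient magnitudes and softly resolving gradient conflicts, can be sketched as follows. The function `harmonize` and the coefficient `soft` are invented for illustration; the actual Gradient Agreement Solver and Soft Gradient Surgery are more involved than this:

```python
import numpy as np

def harmonize(gradients, soft=0.5):
    """gradients: list of 1D arrays, one per augmentation.
    Returns a single combined update direction."""
    # (i) Magnitude agreement: normalize each gradient to unit norm so no
    # single augmentation's high-variance gradient dominates the update.
    gs = [g / (np.linalg.norm(g) + 1e-12) for g in gradients]
    # (ii) Soft surgery: when two gradients conflict (negative dot product),
    # shrink, rather than fully remove, the conflicting projection.
    out = []
    for i, gi in enumerate(gs):
        gi = gi.copy()
        for j, gj in enumerate(gs):
            if i == j:
                continue
            dot = gi @ gj
            if dot < 0:  # conflicting directions
                gi = gi - soft * dot * gj / (gj @ gj + 1e-12)
        out.append(gi)
    return np.mean(out, axis=0)

g_weak = np.array([1.0, 0.0])
g_conflict = np.array([-10.0, 1.0])   # large, partly opposing gradient
update = harmonize([g_weak, g_conflict])
```

With `soft=1.0` this degenerates to a full PCGrad-style projection; the soft coefficient keeps some of the conflicting signal, which is the distinction the abstract draws.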
Context De-confounded Emotion Recognition
Context-Aware Emotion Recognition (CAER) is a crucial and challenging task
that aims to perceive the emotional states of the target person with contextual
information. Recent approaches invariably focus on designing sophisticated
architectures or mechanisms to extract seemingly meaningful representations
from subjects and contexts. However, a long-overlooked issue is that a context
bias in existing datasets leads to a significantly unbalanced distribution of
emotional states among different context scenarios. Concretely, the harmful
bias is a confounder that misleads existing models to learn spurious
correlations based on conventional likelihood estimation, significantly
limiting the models' performance. To tackle this issue, this paper provides
a causality-based perspective to disentangle the models from the impact of
such bias and formulates the causalities among variables in the CAER task via
a tailored causal graph. Then, we propose a Contextual Causal Intervention
Module
(CCIM) based on the backdoor adjustment to de-confound the confounder and
exploit the true causal effect for model training. CCIM is plug-and-play and
model-agnostic, improving diverse state-of-the-art approaches by
considerable margins. Extensive experiments on three benchmark datasets
demonstrate the effectiveness of our CCIM and the significance of causal
insight.
Comment: Accepted by CVPR 2023. CCIM is available at
https://github.com/ydk122024/CCIM.
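The causal tool the module is built on, backdoor adjustment, has a short numeric form: instead of conditioning on the input X (which lets the confounder Z, here the context prior, leak in), the context-specific predictions P(Y | X, z) are averaged over the prior P(z). The toy probability tables below are made up purely for illustration:

```python
import numpy as np

contexts = np.array([0.7, 0.3])       # P(z): imbalanced context prior
# P(y = positive | x, z) for one fixed subject x under each context z:
p_y_given_xz = np.array([0.9, 0.2])
# P(z | x): the context actually co-occurring with this subject in the data,
# i.e. the spurious link the bias exploits:
p_z_given_x = np.array([0.1, 0.9])

# Likelihood-based prediction P(y | x): confounded by the observed context.
p_conditional = (p_y_given_xz * p_z_given_x).sum()
# Backdoor-adjusted prediction P(y | do(x)): averages over the context prior.
p_backdoor = (p_y_given_xz * contexts).sum()
```

In this toy setup the confounded estimate (0.27) and the adjusted one (0.69) diverge sharply, which is exactly the spurious-correlation effect the paper argues existing CAER models suffer from.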
Direct field-to-pattern monolithic design of holographic metasurface via residual encoder-decoder convolutional neural network
Complex-amplitude holographic metasurfaces (CAHMs), with their flexibility in modulating phase and amplitude profiles, have been used to manipulate the propagation of wavefronts to an unprecedented degree, leading to higher image-reconstruction quality than their natural counterparts. However, prevailing design methods for CAHMs rely on Huygens-Fresnel theory, meta-atom optimization, numerical simulation, and experimental verification, which consumes substantial computing resources. Here, we apply a residual encoder-decoder convolutional neural network to directly map electric field distributions to input images for monolithic metasurface design. A network is first pretrained on electric field distributions calculated by diffraction theory and is subsequently migrated, in a transfer learning framework, to map simulated electric field distributions to input images. The training results show a normalized mean pixel error of about 3% on the dataset. For verification, metasurface prototypes were fabricated, simulated, and measured. The reconstructed electric field of the reverse-engineered metasurface exhibits high similarity to the target electric field, demonstrating the effectiveness of our design. Encouragingly, this work provides a monolithic field-to-pattern design method for CAHMs, paving a new route for the direct reconstruction of metasurfaces.
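The ~3% figure quoted above is a normalized mean pixel error between predicted and target field images. The abstract does not give the exact normalization, so the definition below, mean absolute pixel difference divided by the target's dynamic range, is one plausible assumption rather than the paper's formula:

```python
import numpy as np

def normalized_mean_pixel_error(pred, target):
    """Mean absolute pixel difference, normalized by the target's dynamic
    range (an assumed definition; the paper may normalize differently)."""
    span = target.max() - target.min()
    return np.abs(pred - target).mean() / span

# Toy example: a uniform 3% offset over a unit dynamic range.
target = np.linspace(0.0, 1.0, 100).reshape(10, 10)
pred = target + 0.03
```

Under this definition the toy prediction above scores exactly 0.03, matching the scale of the error the abstract reports.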
Alighting Stop Determination of Unlinked Trips Based on a Two-Layer Stacking Framework
Smart card data from conventional bus passengers are important basic data for many studies, such as bus network optimization. Because only boarding information is recorded in most cities, alighting stops need to be identified. The classical trip chain method can only detect the destinations of passengers who have trip cycles; the remaining unlinked trips, whose destinations are unknown, are hard to analyze. To improve the accuracy of existing methods for determining the alighting stops of unlinked trips, a method based on a two-layer stacking framework is proposed in this work. In the first layer, five methods are used: the high-frequency stop method, the stop attraction method, the transfer convenience method, the land-use type attraction method, and an improved group historical set method (I-GHSM). The last of these is presented here to cluster records with similar behavior patterns into groups more accurately. In the second layer, a logistic regression model learns an appropriate weight for each first-layer method on different datasets, which provides generalization ability. Taking data from Xiamen BRT Line Kuai 1 as an example, the I-GHSM introduced in the first layer proves necessary and effective. Moreover, the two-layer stacking-based method detects all destinations of unlinked trips with an accuracy of 51.88%, higher than that of the comparison methods, i.e., two-step algorithms with KNN (k-nearest neighbor), Decision Tree, or Random Forest, and a step-by-step method. The results indicate that the proposed framework-based method identifies the alighting stops of unlinked trips with high accuracy.
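The two-layer stacking structure can be sketched generically: each first-layer method scores candidate alighting stops, and a second-layer logistic regression learns how much to trust each method. The five real methods are replaced here by random score columns, and the logistic regression is a minimal gradient-descent version; only the structure mirrors the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trips, n_methods = 200, 5
# Layer 1 stand-in: each column is one method's score for its predicted stop.
scores = rng.random((n_trips, n_methods))
# Synthetic ground truth: methods have unknown, unequal reliability.
true_w = np.array([2.0, 0.5, 1.5, 0.1, 0.8])
signal = scores @ true_w
labels = (signal + rng.normal(0, 0.3, n_trips) > signal.mean()).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=1.0, epochs=2000):
    """Layer 2: learn per-method weights by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

w, b = fit_logistic(scores, labels)
accuracy = ((sigmoid(scores @ w + b) > 0.5) == labels).mean()
```

The point of the second layer is visible in `w`: methods whose scores correlate with the correct stop receive larger weights, so the ensemble adapts to whichever first-layer methods work best on a given dataset.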