73 research outputs found
Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels
A noisy training set usually leads to the degradation of the generalization
and robustness of neural networks. In this paper, we propose a novel,
theoretically guaranteed clean-sample-selection framework for learning with
noisy labels. Specifically, we first present a Scalable Penalized Regression
(SPR) method to model the linear relation between network features and one-hot
labels. In SPR, the clean data are identified by the zero mean-shift parameters
solved in the regression model. We theoretically show that SPR can recover
clean data under certain conditions. In general scenarios, however, these
conditions may no longer hold, and some noisy data are falsely selected as clean data.
To solve this problem, we propose a data-adaptive method, Scalable Penalized
Regression with Knockoff filters (Knockoffs-SPR), which provably controls the
False-Selection-Rate (FSR) among the selected clean data. To improve the
efficiency, we further present a split algorithm that divides the whole
training set into small pieces that can be solved in parallel to make the
framework scalable to large datasets. While Knockoffs-SPR can be regarded as a
sample selection module for a standard supervised training pipeline, we further
combine it with a semi-supervised algorithm to exploit the support of noisy
data as unlabeled data. Experimental results on several benchmark datasets and
real-world noisy datasets show the effectiveness of our framework and validate
the theoretical results of Knockoffs-SPR. Our code and pre-trained models are
available at https://github.com/Yikai-Wang/Knockoffs-SPR.
Comment: update: refined theory and analysis, release code
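To make the mean-shift idea concrete, below is a minimal, illustrative Python sketch of penalized regression with per-sample mean-shift parameters, where samples whose parameters remain exactly zero are treated as clean. This is not the paper's SPR/Knockoffs-SPR algorithm (which adds a solution path, the split-and-parallelize scheme, and knockoff-based FSR control); the group-lasso penalty form and the values of lam and n_iters are assumptions.

import numpy as np

def mean_shift_clean_selection(X, Y, lam=0.5, n_iters=50):
    # X: (n, d) network features; Y: (n, c) one-hot labels.
    # Alternate between regressing the shifted labels on the features and
    # soft-thresholding the residuals to update per-sample mean-shift
    # parameters gamma (a row-wise group-lasso proximal step).
    gamma = np.zeros_like(Y, dtype=float)
    beta = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iters):
        beta, *_ = np.linalg.lstsq(X, Y - gamma, rcond=None)
        R = Y - X @ beta
        norms = np.maximum(np.linalg.norm(R, axis=1, keepdims=True), 1e-12)
        gamma = np.maximum(0.0, 1.0 - lam / norms) * R
    clean = np.linalg.norm(gamma, axis=1) == 0  # zero mean-shift -> clean
    return clean, beta

The soft-thresholding step zeroes gamma exactly when a sample's residual norm falls below lam, which is what makes "zero mean-shift parameters" a usable clean/noisy indicator.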
Root Pose Decomposition Towards Generic Non-rigid 3D Reconstruction with Monocular Videos
This work focuses on the 3D reconstruction of non-rigid objects based on
monocular RGB video sequences. Concretely, we aim at building high-fidelity
models for generic object categories and casually captured scenes. To this end,
we do not assume known root poses of objects, and do not utilize
category-specific templates or dense pose priors. The key idea of our method,
Root Pose Decomposition (RPD), is to maintain a per-frame root pose
transformation while building a dense field of local transformations that
rectify the root pose. The local transformations are optimized
by point registration to the canonical space. We also adapt RPD to multi-object
scenarios with object occlusions and individual differences. As a result, RPD
allows non-rigid 3D reconstruction for complicated scenarios containing objects
with large deformations, complex motion patterns, occlusions, and scale
diversity across individuals. Such a pipeline potentially scales to
diverse sets of objects in the wild. We experimentally show that RPD surpasses
state-of-the-art methods on the challenging DAVIS, OVIS, and AMA datasets.
Comment: ICCV 2023. Project Page: https://rpd-share.github.i
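As a rough illustration of the decomposition, the sketch below warps canonical points with a global root pose plus a per-point local correction field, and scores the registration with a symmetric Chamfer distance. The function names and the loss choice are assumptions for illustration; RPD's actual optimization and occlusion handling are more involved.

import torch

def warp(points, R, t, local_delta):
    # points: (N, 3) canonical coordinates; R: (3, 3) root rotation;
    # t: (3,) root translation; local_delta: (N, 3) dense field of local
    # corrections that rectify the shared root pose per point.
    return points @ R.T + t + local_delta

def chamfer(a, b):
    # Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3);
    # a common objective for point registration to a canonical space.
    d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()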
Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction
Multimodal fusion and multitask learning are two vital topics in machine
learning. Despite fruitful progress, existing methods for both problems remain
brittle to the same challenge: it is difficult to integrate the common
information across modalities (resp. tasks) while preserving the specific
patterns of each modality (resp. task). Moreover, although the two problems are
closely related, multimodal fusion and multitask learning have rarely been
explored within the same methodological framework.
In this paper, we propose the Channel-Exchanging-Network (CEN), which is
self-adaptive, parameter-free, and, more importantly, applicable to both
multimodal fusion and multitask learning. At its core, CEN dynamically
exchanges channels between subnetworks of different modalities. Specifically,
the channel exchanging process is self-guided by individual channel importance
that is measured by the magnitude of the Batch-Normalization (BN) scaling
factor during training. For dense image prediction, the validity of CEN is
tested in four different scenarios: multimodal fusion, cycle multimodal
fusion, multitask learning, and multimodal multitask learning. Extensive
experiments on semantic segmentation via RGB-D data and image translation
through multi-domain input verify the effectiveness of our CEN compared to
current state-of-the-art methods. Detailed ablation studies have also been
carried out, affirming the advantage of each proposed component.
Comment: 18 pages. arXiv admin note: substantial text overlap with arXiv:2011.0500
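The channel-exchanging step can be sketched in a few lines of PyTorch: channels whose BN scaling factor has been driven near zero are deemed uninformative and replaced by the corresponding channels of the other modality. This is a minimal sketch of the mechanism the abstract describes; the threshold value is an assumption, and CEN additionally encourages small scaling factors via a sparsity penalty during training.

import torch

def channel_exchange(x1, x2, bn1, bn2, thresh=2e-2):
    # x1, x2: (B, C, H, W) features from two modality-specific subnetworks;
    # bn1, bn2: the torch.nn.BatchNorm2d layers preceding this step.
    # Channels whose BN scaling factor |gamma| is below `thresh` are deemed
    # uninformative and replaced by the other modality's channels.
    m1 = (bn1.weight.abs() < thresh).view(1, -1, 1, 1)
    m2 = (bn2.weight.abs() < thresh).view(1, -1, 1, 1)
    y1 = torch.where(m1, x2, x1)  # stream 1 receives stream 2's channels
    y2 = torch.where(m2, x1, x2)  # and vice versa
    return y1, y2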
Joint fMRI Decoding and Encoding with Latent Embedding Alignment
The connection between brain activity and corresponding visual stimuli is
crucial to understanding the human brain. While deep generative models have
made progress in recovering visual stimuli from brain recordings by generating
images conditioned on fMRI signals, achieving high-quality generation with
consistent semantics remains challenging. Moreover, predicting brain activity
from visual stimuli remains a formidable task. In this
paper, we introduce a unified framework that addresses both fMRI decoding and
encoding. We first establish two latent spaces that represent and reconstruct
fMRI signals and visual images, respectively, and then align the fMRI signals
and visual images within the latent space, enabling a bidirectional
transformation between the two domains. Our
Latent Embedding Alignment (LEA) model concurrently recovers visual stimuli
from fMRI signals and predicts brain activity from images within a unified
framework. The performance of LEA surpasses that of existing methods on
multiple benchmark fMRI decoding and encoding datasets. By integrating fMRI
decoding and encoding, LEA offers a comprehensive solution for modeling the
intricate relationship between brain activity and visual stimuli.
Comment: 12 pages, 5 figures
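Schematically, the bidirectional setup can be pictured as two autoencoders whose latent spaces are connected by alignment maps, as in the toy sketch below. All layer choices and dimensions are invented for illustration and are far simpler than LEA's actual models.

import torch.nn as nn

class LEASketch(nn.Module):
    # Toy stand-in: two autoencoders (fMRI and image) whose latent spaces
    # are connected by alignment maps, so that fMRI -> image latent ->
    # image gives decoding, and image -> fMRI latent -> fMRI gives encoding.
    def __init__(self, fmri_dim=4000, img_dim=768, z=256):
        super().__init__()
        self.enc_f, self.dec_f = nn.Linear(fmri_dim, z), nn.Linear(z, fmri_dim)
        self.enc_i, self.dec_i = nn.Linear(img_dim, z), nn.Linear(z, img_dim)
        self.f2i, self.i2f = nn.Linear(z, z), nn.Linear(z, z)

    def decode_brain(self, fmri):    # fMRI signal -> visual stimulus estimate
        return self.dec_i(self.f2i(self.enc_f(fmri)))

    def encode_stimulus(self, img):  # image feature -> brain activity estimate
        return self.dec_f(self.i2f(self.enc_i(img)))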
Multimodal Token Fusion for Vision Transformers
Many transformer adaptations have emerged to address single-modal vision
tasks, where self-attention modules are stacked to handle input sources such
as images. Intuitively, feeding multiple modalities of data to vision
transformers could improve the performance, yet the intra-modal attention
weights may also be diluted, which could in turn undermine the final performance.
In this paper, we propose a multimodal token fusion method (TokenFusion),
tailored for transformer-based vision tasks. To effectively fuse multiple
modalities, TokenFusion dynamically detects uninformative tokens and
substitutes these tokens with projected and aggregated inter-modal features.
Residual positional alignment is also adopted to enable explicit utilization of
the inter-modal alignments after fusion. The design of TokenFusion allows the
transformer to learn correlations among multimodal features, while the
single-modal transformer architecture remains largely intact. Extensive
experiments are conducted on a variety of homogeneous and heterogeneous
modalities and demonstrate that TokenFusion surpasses state-of-the-art methods
in three typical vision tasks: multimodal image-to-image translation, RGB-depth
semantic segmentation, and 3D object detection with point clouds and images.
Comment: CVPR 202
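The token-substitution step might look roughly like the sketch below: a learned scorer flags uninformative tokens in one modality, and those positions are replaced by a projection of the aligned tokens from the other modality (residual positional alignment is omitted). The scorer, projection, and threshold are assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class TokenSubstitution(nn.Module):
    # A learned scorer flags uninformative tokens in modality A; those
    # positions are replaced by a projection of the aligned tokens from
    # modality B.
    def __init__(self, dim, thresh=0.02):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)
        self.thresh = thresh

    def forward(self, tok_a, tok_b):
        # tok_a, tok_b: (B, N, D) token sequences, aligned token-for-token.
        s = self.score(tok_a)  # (B, N, 1) per-token importance scores
        return torch.where(s < self.thresh, self.proj(tok_b), tok_a)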
A Stabilized, Intrinsically Safe, 10% Efficient, Solar-Driven Water-Splitting Cell Incorporating Earth-Abundant Electrocatalysts with Steady-State pH Gradients and Product Separation Enabled by a Bipolar Membrane
An efficient, stable, and intrinsically safe solar water-splitting device is demonstrated using a III–V tandem junction photoanode, an acid-stable, earth-abundant hydrogen evolution catalyst, and a bipolar membrane. The integrated photoelectrochemical cell operates under a steady-state pH gradient and achieves ≈10% solar-to-hydrogen conversion efficiency and >100 h of stability, with a photoactive area (>1 cm^2) that is large relative to most previous reports.
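For context on the ≈10% figure, solar-to-hydrogen (STH) efficiency is conventionally computed as the operating photocurrent density times the 1.23 V thermodynamic water-splitting potential, divided by the incident solar power (100 mW/cm^2 under AM1.5G). The quick check below is ours, not from the paper.

def sth_efficiency(j_op_mA_per_cm2, p_in_mW_per_cm2=100.0):
    # STH = (j_op x 1.23 V) / P_in, assuming 100% Faradaic efficiency;
    # mA/cm^2 x V over mW/cm^2 yields a dimensionless fraction.
    return j_op_mA_per_cm2 * 1.23 / p_in_mW_per_cm2

# A ~10% STH efficiency corresponds to an operating current density of
# roughly 8.1 mA/cm^2 under 1-sun illumination:
print(sth_efficiency(8.1))  # ~0.0996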
- …