Place Recognition under Occlusion and Changing Appearance via Disentangled Representations
Place recognition is a critical and challenging task for mobile robots,
aiming to retrieve an image captured at the same place as a query image from a
database. Existing methods tend to fail when robots move autonomously under
occlusion (e.g., by cars, buses, or trucks) and changing appearance (e.g.,
illumination changes, seasonal variation), because they encode the image into
a single code that entangles place features with appearance and occlusion
features. To overcome
this limitation, we propose PROCA, an unsupervised approach to decompose the
image representation into three codes: a place code used as a descriptor to
retrieve images, an appearance code that captures appearance properties, and an
occlusion code that encodes occlusion content. Extensive experiments show that
our model outperforms the state-of-the-art methods. Our code and data are
available at https://github.com/rover-xingyu/PROCA
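As a rough illustration of the three-code factorization described above, the following PyTorch sketch (not the authors' code; the backbone, layer sizes, and retrieval routine are illustrative assumptions) splits a shared image embedding into place, appearance, and occlusion codes and retrieves database images using the place code alone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledEncoder(nn.Module):
    """Shared backbone with three heads: place (retrieval descriptor),
    appearance (illumination/season), and occlusion (dynamic objects)."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for a real CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.place_head = nn.Linear(64, dim)
        self.appearance_head = nn.Linear(64, dim)
        self.occlusion_head = nn.Linear(64, dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.place_head(h), self.appearance_head(h), self.occlusion_head(h)

def retrieve(query_img, db_imgs, encoder):
    """Rank database images by cosine similarity of place codes only."""
    q, _, _ = encoder(query_img)   # (1, dim)
    d, _, _ = encoder(db_imgs)     # (N, dim)
    sims = F.cosine_similarity(q, d, dim=1)
    return sims.argsort(descending=True)
```

Only the place head's output enters the similarity search, so appearance and occlusion variation can be absorbed by the other two codes during training.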
AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion
This paper proposes a simple and robust zero-shot voice conversion system
with a cycle structure and mel-spectrogram pre-processing. Previous works
suffer from information loss and poor synthesis quality due to their reliance
on a carefully designed bottleneck structure. Moreover, models relying solely
on self-reconstruction loss struggle to reproduce different speakers' voices.
To address these issues, we introduce a cycle-consistency loss that considers
conversion back and forth between the source and target speakers.
Additionally, stacked random-shuffled mel-spectrograms and a label smoothing
method are utilized during speaker encoder training to extract a
time-independent global speaker representation from speech, which is key to
zero-shot conversion. Our model outperforms existing state-of-the-art systems
in both subjective and objective evaluations. Furthermore, it facilitates
cross-lingual voice conversion and enhances the quality of the synthesized speech.
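The cycle-consistency idea and the time-shuffled speaker input can be sketched as follows; this is a hedged illustration in PyTorch, where `convert` stands in for the full conversion model and the L1 round-trip penalty is an assumption rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def shuffle_time(mel):
    """Randomly permute time frames so the speaker encoder cannot exploit
    temporal (content) structure, encouraging a time-independent global
    speaker representation."""
    idx = torch.randperm(mel.size(-1))
    return mel[..., idx]

def cycle_consistency_loss(convert, src_mel, src_spk, tgt_spk):
    """convert(mel, spk_emb) -> converted mel-spectrogram.
    Converting to the target voice and back should recover the source."""
    fake_tgt = convert(src_mel, tgt_spk)    # source content, target voice
    recon_src = convert(fake_tgt, src_spk)  # round trip back to the source
    return F.l1_loss(recon_src, src_mel)
```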
Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks
Spurious correlations in the data, where multiple cues are predictive of the
target labels, often lead to shortcut learning phenomena, where a model may
rely on erroneous, easy-to-learn cues while ignoring reliable ones. In this
work, we propose an ensemble diversification framework exploiting the
generation of synthetic counterfactuals using Diffusion Probabilistic Models
(DPMs). We discover that DPMs have the inherent capability to represent
multiple visual cues independently, even when they are largely correlated in
the training data. We leverage this characteristic to encourage model diversity
and empirically show the efficacy of the approach with respect to several
diversification objectives. We show that diffusion-guided diversification can
lead models to avert attention from shortcut cues, achieving ensemble diversity
performance comparable to previous methods requiring additional data
collection.
Comment: Accepted at the Neural Information Processing Systems (NeurIPS) 2023 Workshop on Diffusion Models
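One way to read the diversification objective is sketched below; this is an assumption-laden illustration (the counterfactuals are taken as precomputed DPM samples, and pairwise KL disagreement is only one of several possible objectives), not the paper's exact formulation:

```python
import torch.nn.functional as F

def diversification_loss(models, x_real, y_real, x_cf, weight=1.0):
    """Fit all ensemble members on real data, but push their predictive
    distributions apart on synthetic counterfactuals x_cf."""
    ce = sum(F.cross_entropy(m(x_real), y_real) for m in models)
    probs = [F.softmax(m(x_cf), dim=1) for m in models]
    disagree = 0.0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            disagree = disagree + F.kl_div(
                probs[i].log(), probs[j], reduction="batchmean")
    return ce - weight * disagree  # reward disagreement on counterfactuals
```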
Learning Robust Representation for Joint Grading of Ophthalmic Diseases via Adaptive Curriculum and Feature Disentanglement
Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes
of permanent blindness worldwide. Designing an automatic grading system with
good generalization ability for DR and DME is vital in clinical practice.
However, prior works either grade DR or DME independently, ignoring the
internal correlations between them, or grade them jointly through a shared
feature representation, overlooking potential generalization issues caused by
difficult samples and data bias. To address these problems, we propose a
framework for joint grading with the dynamic difficulty-aware weighted loss
(DAW) and the dual-stream disentangled learning architecture (DETACH). Inspired
by curriculum learning, DAW learns from simple samples to difficult samples
dynamically via measuring difficulty adaptively. DETACH separates features of
grading tasks to avoid potential emphasis on the bias. With the addition of DAW
and DETACH, the model learns robust disentangled feature representations to
explore internal correlations between DR and DME and achieve better grading
performance. Experiments on three benchmarks show the effectiveness and
robustness of our framework under both the intra-dataset and cross-dataset
tests.
Comment: Accepted by MICCAI 2022
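The curriculum intuition behind DAW can be sketched as follows (a hedged approximation: the paper's actual weighting schedule and difficulty measure may differ; here difficulty is proxied by the current per-sample loss):

```python
import torch
import torch.nn.functional as F

def difficulty_aware_loss(logits, targets, epoch, total_epochs):
    """Down-weight difficult samples early in training and relax the
    down-weighting as training progresses (simple -> difficult curriculum)."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    difficulty = per_sample.detach()      # adaptive difficulty proxy
    progress = epoch / total_epochs       # curriculum schedule in [0, 1]
    weights = torch.exp(-(1.0 - progress) * difficulty)
    return (weights * per_sample).mean()
```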
An improved StarGAN for emotional voice conversion: enhancing voice quality and data augmentation
Emotional Voice Conversion (EVC) aims to convert the emotional style of a
source speech signal to a target style while preserving its content and speaker
identity information. Previous emotional conversion studies do not disentangle
emotional information from the emotion-independent information that should be
preserved, and thus transform the signal monolithically, generating
low-quality audio with linguistic distortions. To address this distortion
problem, we propose a novel StarGAN framework along with a two-stage training
process that separates emotional features from those independent of emotion by
using an autoencoder with two encoders as the generator of the Generative
Adversarial Network (GAN). The proposed model achieves favourable results in
both objective and subjective evaluations of distortion, showing that it can
effectively reduce distortion. Furthermore, in data augmentation experiments
for end-to-end speech
emotion recognition, the proposed StarGAN model achieves an increase of 2% in
Micro-F1 and 5% in Macro-F1 compared to the baseline StarGAN model, which
indicates that the proposed model is more valuable for data augmentation.Comment: Accepted by Interspeech 202
BoIR: Box-Supervised Instance Representation for Multi-Person Pose Estimation
Single-stage multi-person human pose estimation (MPPE) methods have shown
great performance improvements, but existing methods fail to disentangle
features of individual instances in crowded scenes. In this paper, we
propose a bounding box-level instance representation learning called BoIR,
which simultaneously solves instance detection, instance disentanglement, and
instance-keypoint association problems. Our new instance embedding loss
provides a learning signal on the entire area of the image with bounding box
annotations, achieving globally consistent and disentangled instance
representation. Our method exploits multi-task learning of bottom-up keypoint
estimation, bounding box regression, and contrastive instance embedding
learning, without additional computational cost during inference. BoIR is
effective for crowded scenes, outperforming the state of the art on COCO val
(+0.8 AP), COCO test-dev (+0.5 AP), CrowdPose (+4.9 AP), and OCHuman (+3.5 AP).
Code will be available at https://github.com/uyoung-jeong/BoIR
Comment: Accepted to BMVC 2023, 19 pages including the appendix, 6 figures, 7 tables
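A hedged sketch of a box-supervised contrastive instance embedding (an illustration of the idea, not BoIR's exact loss): pixel embeddings inside each annotated box are pulled toward that instance's mean embedding, and the instance means are pushed apart:

```python
import torch
import torch.nn.functional as F

def instance_embedding_loss(emb_map, boxes, margin=1.0):
    """emb_map: (C, H, W) per-pixel embeddings;
    boxes: list of (x1, y1, x2, y2) integer pixel coordinates."""
    means, pull = [], 0.0
    for x1, y1, x2, y2 in boxes:
        region = emb_map[:, y1:y2, x1:x2].flatten(1)        # (C, n_pixels)
        mu = region.mean(dim=1)
        means.append(mu)
        pull = pull + (region - mu[:, None]).pow(2).mean()  # pull within box
    means = torch.stack(means)                              # (N, C)
    push = 0.0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            d = (means[i] - means[j]).norm()
            push = push + F.relu(margin - d).pow(2)         # push instances apart
    return pull + push
```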
Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection
Most existing RGB-D salient object detection (SOD) methods follow the
CNN-based paradigm, which is unable to model long-range dependencies across
space and modalities due to the natural locality of CNNs. Here we propose the
Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to
tackle this problem. Unlike previous multi-modal transformers that directly
connect all patches from the two modalities, we explore the cross-modal
complementarity hierarchically to respect the modality gap and spatial
discrepancy in unaligned regions. Specifically, we propose to use intra-modal
self-attention to explore complementary global contexts, and measure
spatial-aligned inter-modal attention locally to capture cross-modal
correlations. In addition, we present a Feature Pyramid module for Transformer
(FPT) to boost informative cross-scale integration as well as a
consistency-complementarity module to disentangle the multi-modal integration
path and improve the fusion adaptivity. Comprehensive experiments on a large
variety of public datasets verify the efficacy of our designs and the
consistent improvement over state-of-the-art models.
Comment: 10 pages, 10 figures
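As a closing illustration, the hierarchical intra-then-inter attention could look roughly like the following (not the authors' implementation; the local spatial alignment of the inter-modal attention is simplified away, and all dimensions are placeholders):

```python
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Intra-modal self-attention per modality, then RGB queries attend
    to depth tokens for cross-modal fusion."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_dep = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, dep):
        rgb, _ = self.self_rgb(rgb, rgb, rgb)  # intra-modal global context
        dep, _ = self.self_dep(dep, dep, dep)
        fused, _ = self.cross(rgb, dep, dep)   # inter-modal attention
        return fused + rgb                     # residual fusion
```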