Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification
Generalizable person re-identification (Re-ID) is an active research topic in machine learning and computer vision, and it plays a significant role in realistic scenarios thanks to its applications in public security and video surveillance. However, previous methods mainly focus on visual representation learning and neglect the potential of semantic features during training, which easily leads to poor generalization when the model is adapted to a new domain. In this paper, we propose a Multi-Modal Equivalent Transformer, called MMET, for more robust visual-semantic embedding learning on visual, textual, and visual-textual tasks. To further enhance robust feature learning in the transformer, a dynamic masking mechanism called the Masked Multimodal Modeling (MMM) strategy is introduced to mask both image patches and text tokens; it works jointly on multimodal or unimodal data and significantly boosts the performance of generalizable person Re-ID. Extensive experiments on benchmark
datasets demonstrate the competitive performance of our method over previous
approaches. We hope this method can advance research on visual-semantic representation learning. Our source code is publicly
available at https://github.com/JeremyXSC/MMET
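As a rough illustration of the MMM idea described above, the sketch below randomly replaces a fraction of image-patch and text-token embeddings with a shared learnable mask embedding. The masking ratio, tensor shapes, and function names are assumptions for illustration, not the released MMET code.

```python
import torch

def masked_multimodal_inputs(patch_tokens, text_tokens, mask_token, mask_ratio=0.15):
    """Randomly replace a fraction of image-patch and text tokens with a
    shared mask embedding (an MMM-style masking sketch).
    patch_tokens: (B, Np, D), text_tokens: (B, Nt, D), mask_token: (D,)."""
    def mask(tokens):
        B, N, D = tokens.shape
        keep = torch.rand(B, N, device=tokens.device) >= mask_ratio
        return torch.where(keep.unsqueeze(-1), tokens, mask_token.expand(B, N, D))
    return mask(patch_tokens), mask(text_tokens)

# Example: mask 15% of 196 image patches and 32 text tokens of width 768.
patches, texts = torch.randn(2, 196, 768), torch.randn(2, 32, 768)
m = torch.zeros(768)  # in practice a learnable nn.Parameter
masked_patches, masked_texts = masked_multimodal_inputs(patches, texts, m)
```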
ADS_UNet: A Nested UNet for Histopathology Image Segmentation
The UNet model consists of fully convolutional network (FCN) layers arranged as contracting encoder and upsampling decoder maps. Nested arrangements of these encoder and decoder maps give rise to extensions of the UNet model, such as UNet^e and UNet++. Other refinements include constraining the outputs of the convolutional layers to discriminate between segment labels when trained end to end, a property called deep supervision. However, deep supervision reduces feature diversity in these nested UNet models despite their large parameter space. Furthermore, for
texture segmentation, pixel correlations at multiple scales contribute to the
classification task; hence, explicit deep supervision of shallower layers is
likely to enhance performance. In this paper, we propose ADS_UNet, a stage-wise
additive training algorithm that incorporates resource-efficient deep
supervision in shallower layers and takes performance-weighted combinations of
the sub-UNets to create the segmentation model. We provide empirical evidence
on three histopathology datasets to support the claim that the proposed ADS_UNet reduces correlations between constituent features and improves performance
while being more resource efficient. We demonstrate that ADS_UNet outperforms
state-of-the-art Transformer-based models by 1.08 and 0.6 points on the CRAG and BCSS datasets, yet requires only 37% of the GPU consumption and 34% of the training time required by the Transformer models.
Comment: To be published in Expert Systems With Applications
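The performance-weighted combination of sub-UNets can be pictured as follows: each stage's validation score determines its weight in the fused prediction. This is a minimal sketch under assumed shapes and a softmax weighting; the paper's exact weighting scheme may differ.

```python
import torch

def fuse_sub_unets(logits_list, val_scores):
    """Fuse per-stage sub-UNet logits with weights derived from each
    sub-UNet's validation performance (sketch; softmax weighting assumed).
    logits_list: list of (B, C, H, W) tensors; val_scores: list of floats."""
    weights = torch.softmax(torch.tensor(val_scores), dim=0)
    return sum(w * logits for w, logits in zip(weights, logits_list))

# Example: three training stages, weighted by their validation Dice scores.
stages = [torch.randn(1, 4, 64, 64) for _ in range(3)]
fused = fuse_sub_unets(stages, val_scores=[0.78, 0.82, 0.85])
```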
Semantic Segmentation Enhanced Transformer Model for Human Attention Prediction
Saliency Prediction aims to predict the attention distribution of human eyes
given an RGB image. Most of the recent state-of-the-art methods are based on
deep image feature representations from traditional CNNs. However, traditional convolutions cannot capture the global features of an image well because of their small kernel sizes. Moreover, the high-level factors that closely correlate with human visual perception, e.g., objects, color, and light, are not considered. Motivated by these observations, we propose a Transformer-based method with
semantic segmentation as another learning objective. More global cues of the
image can be captured by the Transformer. In addition, simultaneously learning object segmentation simulates human visual perception, which we verify against investigations of human gaze control in cognitive science. We build an extra decoder for the subtask, and the multiple tasks share the same Transformer encoder, forcing it to learn from multiple feature spaces. We find in practice that simply adding the subtask can confuse the main-task learning, so a Multi-task Attention Module is proposed to handle the feature interaction between the multiple learning targets. Our method achieves competitive performance compared to other state-of-the-art methods.
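The shared-encoder, two-decoder layout can be sketched as below. The patch size, widths, and plain 1x1-conv heads are placeholders, and the proposed Multi-task Attention Module is omitted for brevity; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class SaliencySegModel(nn.Module):
    """Minimal sketch: one shared Transformer encoder feeding a saliency
    head (main task) and a segmentation head (auxiliary task)."""
    def __init__(self, dim=256, heads=8, depth=4, n_classes=21):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.saliency_head = nn.Conv2d(dim, 1, 1)        # main task
        self.seg_head = nn.Conv2d(dim, n_classes, 1)     # auxiliary task

    def forward(self, x):
        f = self.patch_embed(x)                          # (B, C, H/16, W/16)
        B, C, H, W = f.shape
        tokens = self.encoder(f.flatten(2).transpose(1, 2))
        f = tokens.transpose(1, 2).reshape(B, C, H, W)   # back to a feature map
        return self.saliency_head(f), self.seg_head(f)

sal_map, seg_map = SaliencySegModel()(torch.randn(1, 3, 224, 224))
```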
Identification of Predictive and Prognostic Biomarkers in Different Tumor Compartments of Esophageal Adenocarcinoma
Esophageal adenocarcinoma (EAC) shows a globally increasing incidence and, with a 5-year survival rate below 25%, has a poor prognosis. Personalized therapy approaches are rare, and prognostic/predictive biomarkers of the tumor microenvironment are insufficiently characterized. This cumulative dissertation approaches the problem from three directions. 1. To identify compartment-specific biomarkers, a method was developed that generates expression profiles of individual cell types as a cost-effective alternative to single-cell sequencing (sc-Seq). RNA is extracted not from single cells but from cell compartments separated by flow cytometry. The separation of samples into epithelial cells, immune cells, and fibroblasts was validated by several procedures, and a sufficient RNA yield was demonstrated even for small amounts of tissue. 2. Biomarkers of the immune-cell compartment were examined as therapeutic targets in a cohort of up to 551 EAC patients. Expression of the immune checkpoints LAG3, VISTA, and IDO on tumor-infiltrating lymphocytes (TILs) was detected by IHC and RNA-probe-based assays in a relevant fraction of cases (LAG3: 11.4%, VISTA: 29%, IDO: 52.6%). Expression of VISTA, LAG3, and IDO was shown to be prognostically favorable. By comparing gene-expression profiles of treatment-naive and pretreated tumors, an immunosuppressive effect of neoadjuvant therapy concepts on the EAC tumor microenvironment was also demonstrated, with reduced checkpoint expression and lower TIL counts after (radio-)chemotherapy. 3. In the tumor-cell compartment, the role of amplifications in ErbB-receptor-dependent signaling pathways was evaluated by FISH and immunohistochemistry. KRAS amplifications were found in 17.1%, PIK3CA amplifications in 5%, and HER2/neu overexpression in 14.9% of the tumors examined.
Leveraging Hidden Positives for Unsupervised Semantic Segmentation
The dramatic demand for manpower to produce pixel-level annotations has triggered the advent of unsupervised semantic segmentation. Although recent work
employing the vision transformer (ViT) backbone shows exceptional performance,
there is still a lack of consideration for task-specific training guidance and
local semantic consistency. To tackle these issues, we leverage contrastive
learning by excavating hidden positives to learn rich semantic relationships
and ensure semantic consistency in local regions. Specifically, we first
discover two types of global hidden positives, task-agnostic and task-specific
ones for each anchor based on the feature similarities defined by a fixed
pre-trained backbone and a segmentation head-in-training, respectively. A
gradual increase in the contribution of the latter induces the model to capture
task-specific semantic features. In addition, we introduce a gradient
propagation strategy to learn semantic consistency between adjacent patches,
under the inherent premise that nearby patches are highly likely to possess the
same semantics. Specifically, we add the loss propagating to local hidden
positives, semantically similar nearby patches, in proportion to the predefined
similarity scores. With these training schemes, our proposed method achieves new state-of-the-art (SOTA) results on the COCO-stuff, Cityscapes, and Potsdam-3 datasets. Our code is available at: https://github.com/hynnsk/HP
Comment: Accepted to CVPR 2023
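A simplified picture of the local hidden-positive idea: an anchor's loss is propagated to semantically similar patches in proportion to similarity. The sketch below weights over all patches with a threshold rather than only adjacent ones, and the threshold and names are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def local_positive_loss(feats, anchor_loss, sim_thresh=0.5):
    """Propagate per-patch losses to similar patches, weighted by cosine
    similarity (sketch). feats: (B, N, D); anchor_loss: (B, N)."""
    sims = F.cosine_similarity(feats[:, :, None], feats[:, None, :], dim=-1)
    weights = torch.where(sims > sim_thresh, sims, torch.zeros_like(sims))
    weights = weights / weights.sum(-1, keepdim=True).clamp_min(1e-8)
    return (weights * anchor_loss[:, None, :]).sum(-1).mean()

loss = local_positive_loss(torch.randn(2, 64, 128), torch.rand(2, 64))
```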
Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time
Natural videos captured by consumer cameras often suffer from low frame rates and motion blur due to the combination of dynamic scene complexity, lens and sensor imperfections, and less-than-ideal exposure settings. As a result,
computational methods that jointly perform video frame interpolation and
deblurring begin to emerge with the unrealistic assumption that the exposure
time is known and fixed. In this work, we aim ambitiously for a more realistic
and challenging task - joint video multi-frame interpolation and deblurring
under unknown exposure time. Toward this goal, we first adopt a variant of
supervised contrastive learning to construct an exposure-aware representation
from input blurred frames. We then train two U-Nets for intra-motion and
inter-motion analysis, respectively, adapting to the learned exposure
representation via gain tuning. We finally build our video reconstruction
network upon the exposure and motion representation by progressive
exposure-adaptive convolution and motion refinement. Extensive experiments on
both simulated and real-world datasets show that our optimized method achieves
notable performance gains over the state-of-the-art on the joint video x8
interpolation and deblurring task. Moreover, on the seemingly implausible x16
interpolation task, our method outperforms existing methods by more than 1.5 dB
in terms of PSNR.
Comment: Accepted by CVPR 2023, available at https://github.com/shangwei5/VIDU
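Adapting the two U-Nets to the learned exposure representation via gain tuning might look like the FiLM-style per-channel modulation sketched below; the sigmoid-scaled gain and module names are assumptions, and the paper's exact mechanism may differ.

```python
import torch
import torch.nn as nn

class GainTuning(nn.Module):
    """Modulate feature channels by gains predicted from an exposure code
    (a FiLM-style sketch of exposure-adaptive gain tuning)."""
    def __init__(self, exposure_dim, channels):
        super().__init__()
        self.to_gain = nn.Linear(exposure_dim, channels)

    def forward(self, feat, exposure_code):
        # feat: (B, C, H, W); exposure_code: (B, exposure_dim)
        gain = self.to_gain(exposure_code).sigmoid() * 2  # gains around 1.0
        return feat * gain[:, :, None, None]

gt = GainTuning(exposure_dim=64, channels=128)
out = gt(torch.randn(2, 128, 32, 32), torch.randn(2, 64))
```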
MaPLe: Multi-modal Prompt Learning
Pre-trained vision-language (V-L) models such as CLIP have shown excellent
generalization ability to downstream tasks. However, they are sensitive to the
choice of input text prompts and require careful selection of prompt templates
to perform well. Inspired by the Natural Language Processing (NLP) literature,
recent CLIP adaptation approaches learn prompts as the textual inputs to
fine-tune CLIP for downstream tasks. We note that using prompting to adapt
representations in a single branch of CLIP (language or vision) is sub-optimal
since it does not allow the flexibility to dynamically adjust both
representation spaces on a downstream task. In this work, we propose
Multi-modal Prompt Learning (MaPLe) for both vision and language branches to
improve alignment between the vision and language representations. Our design
promotes strong coupling between the vision-language prompts to ensure mutual
synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model stage-wise feature relationships, allowing rich context learning. We evaluate
the effectiveness of our approach on three representative tasks of
generalization to novel classes, new target datasets and unseen domain shifts.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable
performance and achieves an absolute gain of 3.45% on novel classes and 2.72%
on overall harmonic-mean, averaged over 11 diverse image recognition datasets.
Our code and pre-trained models are available at https://github.com/muzairkhattak/multimodal-prompt-learning.
Comment: Accepted at CVPR 2023
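The coupling between branches amounts to generating vision prompts from the language prompts through a learned projection, roughly as sketched below. Prompt count and widths are illustrative; see the released code for the real coupling function.

```python
import torch
import torch.nn as nn

class CoupledPrompts(nn.Module):
    """Sketch of multi-modal prompt coupling: learnable text prompts are
    projected into vision prompts so the two branches stay tied."""
    def __init__(self, n_prompts=4, text_dim=512, vision_dim=768):
        super().__init__()
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, text_dim) * 0.02)
        self.couple = nn.Linear(text_dim, vision_dim)  # text -> vision link

    def forward(self):
        vision_prompts = self.couple(self.text_prompts)
        return self.text_prompts, vision_prompts

text_p, vision_p = CoupledPrompts()()  # feed into the respective branches
```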
CrossLoc3D: Aerial-Ground Cross-Source 3D Place Recognition
We present CrossLoc3D, a novel 3D place recognition method that solves a
large-scale point matching problem in a cross-source setting. Cross-source
point cloud data corresponds to point sets captured by depth sensors with
different accuracies or from different distances and perspectives. We address the challenges of developing 3D place recognition methods that account for the representation gap between points captured by different sources. Our method handles cross-source data by utilizing multi-grained features and selecting convolution kernel sizes that correspond to the most prominent features. Inspired by diffusion models, our method uses a novel iterative refinement
process that gradually shifts the embedding spaces from different sources to a
single canonical space for better metric learning. In addition, we present
CS-Campus3D, the first 3D aerial-ground cross-source dataset consisting of
point cloud data from both aerial and ground LiDAR scans. The point clouds in
CS-Campus3D exhibit representation gaps as well as differences in viewpoint, point density, and noise patterns. We show that our CrossLoc3D algorithm can
achieve an improvement of 4.74% - 15.37% in terms of the top 1 average recall
on our CS-Campus3D benchmark and achieves performance comparable to state-of-the-art 3D place recognition methods on the Oxford RobotCar dataset. We will release the code and the CS-Campus3D benchmark.
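The diffusion-inspired refinement can be pictured as a small residual network applied repeatedly, conditioned on the step index. The step count, conditioning, and residual form below are assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn as nn

class CanonicalRefiner(nn.Module):
    """Iteratively nudge source-specific embeddings toward a shared
    canonical space, one conditioned residual step at a time (sketch)."""
    def __init__(self, dim=256, steps=4):
        super().__init__()
        self.steps = steps
        self.step_emb = nn.Embedding(steps, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, embed):  # embed: (B, dim)
        for t in range(self.steps):
            cond = self.step_emb(torch.tensor(t, device=embed.device))
            embed = embed + self.mlp(embed + cond)  # residual update
        return embed

refined = CanonicalRefiner()(torch.randn(8, 256))
```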
SC-VAE: Sparse Coding-based Variational Autoencoder
Learning rich data representations from unlabeled data is a key challenge
towards applying deep learning algorithms in downstream supervised tasks.
Several variants of variational autoencoders (VAEs) have been proposed to learn compact data representations by encoding high-dimensional data in a lower-dimensional space. Two main classes of VAE methods may be distinguished
depending on the characteristics of the meta-priors that are enforced in the
representation learning step. The first class of methods derives a continuous
encoding by assuming a static prior distribution in the latent space. The
second class of methods learns instead a discrete latent representation using
vector quantization (VQ) along with a codebook. However, both classes of
methods suffer from certain challenges, which may lead to suboptimal image
reconstruction results. The first class of methods suffers from posterior
collapse, whereas the second class of methods suffers from codebook collapse.
To address these challenges, we introduce a new VAE variant, termed SC-VAE (sparse coding-based VAE), which integrates sparse coding within the variational autoencoder framework. Instead of learning a continuous or discrete latent
representation, the proposed method learns a sparse data representation that
consists of a linear combination of a small number of learned atoms. The sparse
coding problem is solved using a learnable version of the iterative shrinkage
thresholding algorithm (ISTA). Experiments on two image datasets demonstrate
that our model can achieve improved image reconstruction results compared to
state-of-the-art methods. Moreover, the use of learned sparse code vectors allows us to perform downstream tasks such as coarse image segmentation by clustering image patches.
Comment: 15 pages, 11 figures, and 3 tables
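For reference, plain ISTA for the sparse coding problem looks like the sketch below. SC-VAE actually unrolls a learnable variant (LISTA) inside the network, so the fixed dictionary, step size, and iteration count here are only illustrative.

```python
import torch

def soft_threshold(z, t):
    return torch.sign(z) * torch.clamp(z.abs() - t, min=0.0)

def ista(x, D, lam=0.1, n_iter=50):
    """ISTA for min_z 0.5*||x - z @ D.T||^2 + lam*||z||_1 (sketch).
    x: (B, d) signals; D: (d, m) dictionary of m atoms; returns z: (B, m)."""
    L = torch.linalg.matrix_norm(D, ord=2) ** 2   # Lipschitz step-size bound
    z = torch.zeros(x.shape[0], D.shape[1])
    for _ in range(n_iter):
        grad = (z @ D.T - x) @ D                  # gradient of the data term
        z = soft_threshold(z - grad / L, lam / L)
    return z

codes = ista(torch.randn(4, 64), torch.randn(64, 256))
```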
Deep Learning for Scene Flow Estimation on Point Clouds: A Survey and Prospective Trends
Aiming at obtaining the structural information and 3D motion of dynamic scenes, scene flow estimation has long been a research interest in computer vision and computer graphics. It is also a fundamental task for various applications such as autonomous driving. Compared to previous methods that utilize image representations, much recent research builds on the power of deep analysis and focuses on point cloud representations for 3D flow estimation. This paper comprehensively reviews the pioneering literature on scene flow estimation from point clouds, examines the learning paradigms in detail, and presents insightful comparisons between state-of-the-art deep learning methods for scene flow estimation. Furthermore, it investigates higher-level scene understanding tasks, including object tracking and motion segmentation, and concludes with an overview of foreseeable research trends for scene flow estimation.