DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation
As acquiring pixel-wise annotations of real-world images for semantic segmentation is a costly process, a model can instead be trained with more accessible synthetic data and adapted to real images without requiring their annotations. This process is studied in unsupervised domain adaptation (UDA). Even though a large number of methods propose new adaptation strategies, they are mostly based on outdated network architectures. As the influence of recent network architectures has not been systematically studied, we first benchmark different network architectures for UDA and then propose a novel UDA method, DAFormer, based on the benchmark results. The DAFormer network consists of a Transformer encoder and a multi-level context-aware feature fusion decoder. It is enabled by three simple but crucial training strategies to stabilize the training and to avoid overfitting DAFormer to the source domain: While the Rare Class Sampling on the source domain improves the quality of pseudo-labels by mitigating the confirmation bias of self-training towards common classes, the Thing-Class ImageNet Feature Distance and a learning rate warmup promote feature transfer from ImageNet pretraining. DAFormer significantly improves the state-of-the-art performance by 10.8 mIoU for GTA->Cityscapes and 5.4 mIoU for Synthia->Cityscapes and enables learning even difficult classes such as train, bus, and truck well. The implementation is available at https://github.com/lhoyer/DAFormer
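Rare Class Sampling in particular is easy to picture in code. Below is a minimal sketch of the idea rather than the paper's exact formulation: per-class pixel counts over the source domain are turned into sampling probabilities that favour rare classes via a temperature-scaled softmax, and a source image containing the sampled class is then drawn. The function names, temperature value, and toy data are illustrative assumptions.

```python
import random

import numpy as np


def rare_class_sampling_probs(class_pixel_counts, temperature=0.01):
    """Turn per-class pixel counts into sampling probabilities that
    favour rare classes (softmax over inverted class frequencies)."""
    counts = np.asarray(class_pixel_counts, dtype=np.float64)
    freqs = counts / counts.sum()
    logits = (1.0 - freqs) / temperature
    logits -= logits.max()  # numerical stability before exponentiation
    probs = np.exp(logits)
    return probs / probs.sum()


def sample_source_image(images_per_class, probs, rng=random):
    """Sample a class by its rare-class probability, then a source
    image that contains at least one pixel of that class."""
    cls = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return rng.choice(images_per_class[cls]), cls


# Toy example: class 2 is rare, so it gets sampled far more often.
pixel_counts = [9_000_000, 900_000, 10_000]
images_per_class = {0: ["a.png", "b.png"], 1: ["b.png", "c.png"], 2: ["c.png"]}
probs = rare_class_sampling_probs(pixel_counts, temperature=0.1)
print(probs)
print(sample_source_image(images_per_class, probs))
```

With a low temperature the distribution concentrates on the rarest classes; raising it moves the sampler back toward uniform, so the temperature trades off rare-class exposure against source-domain coverage.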
Single-File Diffusion of Externally Driven Particles
We study 1-D diffusion of hard-core interacting Brownian particles driven by a space- and time-dependent external force. We give the exact solution of the N-particle Smoluchowski diffusion equation. In particular, we investigate the nonequilibrium energetics of two interacting particles under time-periodic driving. The hard-core interaction induces an entropic repulsion which differentiates the energetics of the two particles. We present exact time-asymptotic results which describe the mean energy, the accepted work and heat, and the entropy production of the interacting particles, and we contrast these quantities against the corresponding ones for non-interacting particles.
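As a point of reference for the setup, here is the evolution equation as it is commonly written for driven single-file systems; the notation (diffusion constant D, inverse temperature beta) is assumed here rather than taken from the paper:

```latex
% N-particle Smoluchowski equation for identical driven Brownian particles
% (diffusion constant D, inverse temperature \beta = 1/k_B T; notation assumed):
\partial_t P(\mathbf{x},t)
  = D \sum_{i=1}^{N} \partial_{x_i}
    \left[ \partial_{x_i} P(\mathbf{x},t)
           - \beta F(x_i,t)\, P(\mathbf{x},t) \right],
\qquad \mathbf{x} = (x_1,\dots,x_N).
```

The hard-core interaction does not appear as a potential term; it is imposed by restricting the dynamics to the ordered wedge x_1 < x_2 < ... < x_N with no-flux (reflecting) boundary conditions where neighbouring coordinates meet, which is what makes the diffusion "single-file".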
Sound and Visual Representation Learning with Multiple Pretraining Tasks
Different self-supervised learning (SSL) tasks reveal different features from the data, and the learned feature representations can exhibit different performance for each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) such that the resulting representation generalizes well across all downstream tasks. Specifically, for this study, we investigate binaural sounds and image data in isolation. For binaural sounds, we propose three SSL tasks, namely spatial alignment, temporal synchronization of foreground objects and binaural audio, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into the downstream task performance on video retrieval, spatial sound super-resolution, and semantic prediction on the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single-SSL-task models and fully supervised models in downstream task performance. As a check of applicability to another modality, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL and DetCo by 2.06%, 3.27% and 1.19% on VOC07 classification and by +2.83, +1.56 and +1.61 AP on COCO detection. Code will be made publicly available.
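The incremental-learning variant of Multi-SSL amounts to training a shared encoder on one SSL task at a time with task-specific heads. The sketch below illustrates only that training schedule; the module names, sizes, toy tasks, and the plain sequential loop are assumptions for illustration, not the paper's recipe.

```python
# Minimal sketch of Multi-SSL via incremental learning: a shared encoder is
# trained on one self-supervised task after another, each with its own head.
import torch
import torch.nn as nn


class SharedEncoder(nn.Module):
    def __init__(self, dim_in=128, dim_out=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(),
                                 nn.Linear(256, dim_out))

    def forward(self, x):
        return self.net(x)


def train_incrementally(encoder, ssl_tasks, steps_per_task=100, lr=1e-3):
    """ssl_tasks: list of (head, loss_fn, sample_batch) triples, one per SSL
    task; tasks are learned sequentially (incremental learning)."""
    for head, loss_fn, sample_batch in ssl_tasks:
        opt = torch.optim.Adam(list(encoder.parameters()) +
                               list(head.parameters()), lr=lr)
        for _ in range(steps_per_task):
            x, target = sample_batch()
            loss = loss_fn(head(encoder(x)), target)
            opt.zero_grad()
            loss.backward()
            opt.step()


# Toy stand-ins for two SSL tasks (e.g. temporal-gap prediction framed as
# regression), trained on random data purely to show the control flow.
enc = SharedEncoder()

def make_task():
    head = nn.Linear(64, 1)
    loss_fn = nn.MSELoss()
    sample = lambda: (torch.randn(8, 128), torch.randn(8, 1))
    return head, loss_fn, sample

train_incrementally(enc, [make_task(), make_task()], steps_per_task=10)
```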
MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation
In unsupervised domain adaptation (UDA), a model trained on source data (e.g. synthetic) is adapted to target data (e.g. real-world) without access to target annotation. Most previous UDA methods struggle with classes that have a similar visual appearance on the target domain as no ground truth is available to learn the slight appearance differences. To address this problem, we propose a Masked Image Consistency (MIC) module to enhance UDA by learning spatial context relations of the target domain as additional clues for robust visual recognition. MIC enforces the consistency between predictions of masked target images, where random patches are withheld, and pseudo-labels that are generated based on the complete image by an exponential moving average teacher. To minimize the consistency loss, the network has to learn to infer the predictions of the masked regions from their context. Due to its simple and universal concept, MIC can be integrated into various UDA methods across different visual recognition tasks such as image classification, semantic segmentation, and object detection. MIC significantly improves the state-of-the-art performance across the different recognition tasks for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA. For instance, MIC achieves an unprecedented UDA performance of 75.9 mIoU and 92.8% on GTA-to-Cityscapes and VisDA-2017, respectively, which corresponds to an improvement of +2.1 and +3.0 percent points over the previous state of the art. The implementation is available at https://github.com/lhoyer/MIC.
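The core objective is compact enough to sketch: withhold random patches of a target image and train the student to match the EMA teacher's pseudo-labels produced from the complete image. In the minimal sketch below, the patch size, mask ratio, EMA rate, and the toy 1x1-conv "networks" are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F


def random_patch_mask(img, patch=32, mask_ratio=0.5):
    """Zero out a random subset of non-overlapping square patches."""
    b, c, h, w = img.shape
    mh, mw = h // patch, w // patch
    keep = (torch.rand(b, 1, mh, mw, device=img.device) > mask_ratio).float()
    keep = F.interpolate(keep, size=(h, w), mode="nearest")
    return img * keep


@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """Exponential-moving-average update of the teacher weights."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(alpha).add_(ps, alpha=1 - alpha)


def mic_loss(student, teacher, target_img):
    with torch.no_grad():
        pseudo = teacher(target_img).argmax(dim=1)     # labels from the full image
    pred = student(random_patch_mask(target_img))      # predict from the masked image
    return F.cross_entropy(pred, pseudo)


# Toy segmentation "networks": 1x1 convs producing per-pixel class logits.
student = torch.nn.Conv2d(3, 19, kernel_size=1)
teacher = torch.nn.Conv2d(3, 19, kernel_size=1)
teacher.load_state_dict(student.state_dict())

img = torch.randn(2, 3, 128, 128)
loss = mic_loss(student, teacher, img)
loss.backward()
ema_update(teacher, student)
print(float(loss))
```

Because the pseudo-labels come from the unmasked image while the student only sees the masked one, minimizing this loss forces the student to infer the withheld regions from spatial context, which is exactly the mechanism the abstract describes.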
Composite Texture Synthesis
Many textures require complex models to describe their intricate structures. Their modeling can be simplified if they are considered composites of simpler subtextures. After an initial, unsupervised segmentation of the composite texture into its subtextures, it can be described at two levels. One is a label map texture, which captures the layout of the different subtextures. The other consists of the different subtextures themselves. This scheme has to be refined to also include mutual influences between textures, which are mainly found near their boundaries; the proposed composite texture model includes these as well. The paper describes an improved implementation of this idea. Whereas in a previous implementation subtextures and their interactions were synthesized sequentially, this paper proposes a parallel implementation, which yields results of higher quality.
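The two-level scheme can be illustrated with a deliberately naive sketch: a label map decides which subtexture each pixel belongs to, and each region is filled from the corresponding subtexture example. The i.i.d. per-pixel sampling below ignores both spatial coherence within subtextures and the boundary interactions the paper models; all names and array shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def synthesize_composite(label_map, subtextures):
    """label_map: (H, W) int array of subtexture ids.
    subtextures: dict id -> (h, w, 3) uint8 example image per subtexture."""
    h, w = label_map.shape
    out = np.zeros((h, w, 3), dtype=np.uint8)
    for tex_id, tex in subtextures.items():
        mask = label_map == tex_id
        n = int(mask.sum())
        ys = rng.integers(0, tex.shape[0], n)   # sample pixels i.i.d.
        xs = rng.integers(0, tex.shape[1], n)   # from the subtexture example
        out[mask] = tex[ys, xs]
    return out


# Toy example: two subtextures laid out by a half/half label map.
labels = np.zeros((64, 64), dtype=int)
labels[:, 32:] = 1
textures = {0: rng.integers(0, 255, (32, 32, 3), dtype=np.uint8),
            1: rng.integers(100, 200, (32, 32, 3), dtype=np.uint8)}
result = synthesize_composite(labels, textures)
print(result.shape)
```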
HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection
Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera and lidar or camera and radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we focus on 2D object detection, a fundamental high-level task which is defined on the 2D image domain, and propose HRFuser, a multi-resolution sensor fusion architecture that scales straightforwardly to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. Even though cameras alone provide very informative features for 2D detection, we demonstrate via extensive experiments on the nuScenes and Seeing Through Fog datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art fusion methods for 2D detection both in normal and adverse conditions. The source code will be made publicly available.
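The cross-attention fusion of a camera feature map with one extra modality at a single resolution can be sketched compactly. For brevity the sketch below uses plain global cross-attention in place of the paper's multi-window variant: camera features act as queries, the additional modality provides keys and values, and the result is fused residually. Shapes, names, and the residual-plus-norm arrangement are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam_feat, aux_feat):
        """cam_feat, aux_feat: (B, C, H, W) feature maps at one resolution.
        Camera features are queries; the extra modality supplies keys/values."""
        b, c, h, w = cam_feat.shape
        q = cam_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
        kv = aux_feat.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(q + fused)               # residual connection
        return fused.transpose(1, 2).reshape(b, c, h, w)


# Toy usage: fuse camera and "radar" features at a single resolution.
fuse = CrossAttentionFusion(dim=64, heads=4)
cam = torch.randn(2, 64, 16, 16)
radar = torch.randn(2, 64, 16, 16)
print(fuse(cam, radar).shape)  # torch.Size([2, 64, 16, 16])
```

Scaling to more modalities then amounts to stacking one such block per additional sensor, and scaling to multiple resolutions to applying the block on each level of the high-resolution feature pyramid.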