11 research outputs found

    Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation

    Using multiple spatial modalities has been proven helpful in improving semantic segmentation performance. However, several real-world challenges have yet to be addressed: (a) improving label efficiency and (b) enhancing robustness in realistic scenarios where modalities are missing at test time. To address these challenges, we first propose Linear Fusion, a simple yet efficient multi-modal fusion mechanism that performs better than state-of-the-art multi-modal models even with limited supervision. Second, we propose M3L: Multi-modal Teacher for Masked Modality Learning, a semi-supervised framework that not only improves multi-modal performance but also makes the model robust to the realistic missing-modality scenario using unlabeled data. We create the first benchmark for semi-supervised multi-modal semantic segmentation and also report robustness to missing modalities. Our proposal shows an absolute improvement of up to 10% in robust mIoU over the most competitive baselines. Our code is available at https://github.com/harshm121/M3
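    The abstract does not give implementation details; the sketch below only illustrates what a linear fusion of two modality feature streams could look like, assuming RGB and depth encoders that emit feature maps of the same shape. The class name, channel count, and 1x1-convolution formulation are illustrative assumptions, not the authors' code.

    ```python
    # Minimal sketch (assumed, not the authors' implementation) of fusing
    # two modality feature maps with a learned channel-wise linear mix.
    import torch
    import torch.nn as nn

    class LinearFusion(nn.Module):
        """Linearly combine per-modality features before a shared segmentation head."""
        def __init__(self, channels: int):
            super().__init__()
            # A 1x1 convolution over the concatenated modalities acts as a
            # learned linear combination of the two feature maps.
            self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
            # rgb_feat, depth_feat: (B, C, H, W) features from each encoder
            return self.mix(torch.cat([rgb_feat, depth_feat], dim=1))

    # Usage: the fused features would then feed a segmentation decoder.
    fusion = LinearFusion(channels=256)
    fused = fusion(torch.randn(2, 256, 64, 64), torch.randn(2, 256, 64, 64))
    ```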

    We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline

    There has been abundant work in unsupervised domain adaptation for semantic segmentation (DAS) seeking to adapt a model trained on images from a labeled source domain to an unlabeled target domain. While the vast majority of prior work has studied this as a frame-level Image-DAS problem, a few Video-DAS works have sought to additionally leverage the temporal signal present in adjacent frames. However, Video-DAS works have historically studied a distinct set of benchmarks from Image-DAS, with minimal cross-benchmarking. In this work, we address this gap. Surprisingly, we find that (1) even after carefully controlling for data and model architecture, state-of-the-art Image-DAS methods (HRDA and HRDA+MIC) outperform Video-DAS methods on established Video-DAS benchmarks (+14.5 mIoU on Viper→CityscapesSeq, +19.0 mIoU on Synthia→CityscapesSeq), and (2) naive combinations of Image-DAS and Video-DAS techniques only lead to marginal improvements across datasets. To avoid siloed progress between Image-DAS and Video-DAS, we open-source our codebase with support for a comprehensive set of Video-DAS and Image-DAS methods on a common benchmark. Code available at https://github.com/SimarKareer/UnifiedVideoDA
    Comment: TMLR 202

    Exploratory Navigation and Selective Reading

    Navigating a collection of documents can be facilitated by obtaining a human-understandable concept hierarchy with links to the content. This is a non-trivial task for two reasons. First, defining concepts that are understandable by an average consumer and yet meaningful for a large variety of corpora is hard. Second, creating a semantically meaningful yet intuitive hierarchical representation is hard and can be task-dependent. We present our system, Navigation.ai, which automatically processes a document collection, induces a concept hierarchy using Wikipedia, and presents an interactive interface that helps users navigate to individual paragraphs using concepts.
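    As a rough illustration only, a concept hierarchy with links into the corpus could be represented as a tree whose nodes carry a concept label and the paragraph indices reachable from it. The class, field names, and example concepts below are hypothetical, not part of the described system.

    ```python
    # Hypothetical sketch of a concept hierarchy linking concepts to paragraphs.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ConceptNode:
        name: str                                                # human-readable concept label
        paragraph_ids: List[int] = field(default_factory=list)  # links into the document collection
        children: List["ConceptNode"] = field(default_factory=list)

    # Example: a tiny hierarchy induced over a two-paragraph collection.
    root = ConceptNode("Machine Learning", children=[
        ConceptNode("Semantic Segmentation", paragraph_ids=[0]),
        ConceptNode("Domain Adaptation", paragraph_ids=[1]),
    ])

    def navigate(node: ConceptNode, depth: int = 0) -> None:
        """Print the hierarchy and the paragraphs reachable from each concept."""
        print("  " * depth + f"{node.name} -> paragraphs {node.paragraph_ids}")
        for child in node.children:
            navigate(child, depth + 1)

    navigate(root)
    ```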