Linguistic Structure Guided Context Modeling for Referring Image Segmentation
Referring image segmentation aims to predict the foreground mask of the
object referred to by a natural language sentence. Multimodal context of the
sentence is crucial to distinguish the referent from the background. Existing
methods either insufficiently or redundantly model the multimodal context. To
tackle this problem, we propose a "gather-propagate-distribute" scheme to model
multimodal context by cross-modal interaction and implement this scheme as a
novel Linguistic Structure guided Context Modeling (LSCM) module. Our LSCM
module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG) which
guides all the words to include valid multimodal context of the sentence while
excluding disturbing information, through three steps over the multimodal
features: gathering, constrained propagation, and distributing. Extensive
experiments on four benchmarks demonstrate that our method outperforms all
previous state-of-the-art methods.
Comment: Accepted by ECCV 2020. Code is available at
https://github.com/spyflying/LSCM-Refse
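
As a concrete reading of the scheme, here is a minimal PyTorch-style sketch of
one gather-propagate-distribute step. All module names, tensor shapes, and the
soft dependency-tree mask dpt_mask are illustrative assumptions, not the
authors' released implementation (see the repository above for that):

    # Minimal sketch: spatial multimodal features are gathered into word
    # nodes, propagated over a word graph whose edges are suppressed by a
    # dependency-parse mask, then distributed back to spatial positions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatherPropagateDistribute(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.proj_vis = nn.Linear(dim, dim)   # projects pixel features for affinity
            self.proj_word = nn.Linear(dim, dim)  # projects word features for affinity
            self.gcn = nn.Linear(dim, dim)        # one step of graph propagation

        def forward(self, vis, words, dpt_mask):
            # vis:      (B, HW, C) multimodal features flattened over space
            # words:    (B, T, C) word embeddings
            # dpt_mask: (B, T, T) soft word-graph adjacency, low weight on
            #           edges the dependency parse tree suppresses (assumed)
            affinity = torch.bmm(self.proj_word(words),
                                 self.proj_vis(vis).transpose(1, 2))   # (B, T, HW)
            attn = F.softmax(affinity, dim=-1)
            gathered = torch.bmm(attn, vis)                            # gather: (B, T, C)
            adj = F.softmax(dpt_mask, dim=-1)
            propagated = F.relu(self.gcn(torch.bmm(adj, gathered)))    # constrained propagation
            dist = F.softmax(affinity, dim=1)                          # weights over words
            out = torch.bmm(dist.transpose(1, 2), propagated)          # distribute: (B, HW, C)
            return vis + out                                           # residual update

The gather step pools each word's relevant spatial evidence, the masked graph
step lets words exchange only parse-sanctioned context, and the distribute
step writes the refined word features back onto the feature map.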
BiLingUNet: Image Segmentation by Modulating Top-Down and Bottom-Up Visual Processing with Referring Expressions
We present BiLingUNet, a state-of-the-art model for image segmentation using
referring expressions. BiLingUNet uses language to customize visual filters and
outperforms approaches that concatenate a linguistic representation to the
visual input. We find that using language to modulate both bottom-up and
top-down visual processing works better than just making the top-down
processing language-conditional. We argue that common 1x1 language-conditional
filters cannot represent relational concepts and experimentally demonstrate
that wider filters work better. Our model achieves state-of-the-art performance
on four referring expression datasets.
Comment: 18 pages, 3 figures, submitted to ECCV 2020
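
To make the 1x1-versus-wider-filter argument concrete, below is a minimal
PyTorch-style sketch of a language-conditional convolution whose kernel is
predicted from the sentence embedding. The module and parameter names
(LanguageConditionalConv, filter_gen) are hypothetical, not BiLingUNet's
actual layers. With kernel_size=1 the generated filter can only re-weight
channels independently at each pixel; kernel_size=3 lets it relate a position
to its neighbors, which is what relational concepts like "left of" require:

    # Minimal sketch: per-sample depthwise conv kernels generated from a
    # sentence embedding and applied to the visual feature map.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LanguageConditionalConv(nn.Module):
        def __init__(self, lang_dim, channels, kernel_size=3):
            super().__init__()
            self.channels = channels
            self.kernel_size = kernel_size
            # predicts one k x k depthwise kernel per channel from language
            self.filter_gen = nn.Linear(lang_dim,
                                        channels * kernel_size * kernel_size)

        def forward(self, vis, lang):
            # vis:  (B, C, H, W) visual feature map
            # lang: (B, lang_dim) sentence embedding
            B, C, H, W = vis.shape
            k = self.kernel_size
            w = self.filter_gen(lang).view(B * C, 1, k, k)  # per-sample kernels
            # grouped conv applies each sample's own kernels to its own map
            out = F.conv2d(vis.view(1, B * C, H, W), w,
                           padding=k // 2, groups=B * C)
            return F.relu(out.view(B, C, H, W))

Setting kernel_size=1 here reproduces the common pointwise modulation the
abstract argues against; widening the kernel is the one-line change being
advocated.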