5 research outputs found
Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs
We tackle open-world semantic segmentation, which aims at learning to segment
arbitrary visual concepts in images, by using only image-text pairs without
dense annotations. Existing open-world segmentation methods have shown
impressive advances by employing contrastive learning (CL) to learn diverse
visual concepts and transferring the learned image-level understanding to the
segmentation task. However, these CL-based methods suffer from a train-test
discrepancy: they consider only image-text alignment during training, whereas
segmentation requires region-text alignment at test time. In this paper, we
propose a novel Text-grounded Contrastive Learning (TCL) framework
that enables a model to directly learn region-text alignment. Our method
generates a segmentation mask for a given text, extracts text-grounded image
embedding from the masked region, and aligns it with the text embedding via
TCL. By learning region-text alignment directly, our framework encourages the
model to improve the quality of the generated segmentation masks. In addition, for
a rigorous and fair comparison, we present a unified evaluation protocol with
8 widely used semantic segmentation datasets. TCL achieves state-of-the-art
zero-shot segmentation performance by large margins on all datasets. Code is
available at https://github.com/kakaobrain/tcl.
Comment: CVPR 2023 camera-ready
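As a rough illustration of the region-text alignment idea described in this abstract, the sketch below (in numpy; the function names, the sigmoid-based soft mask, and the symmetric InfoNCE loss are our own simplifying assumptions, not the paper's actual architecture) generates a soft mask from patch-text similarities, pools a text-grounded image embedding from the masked region, and scores matched image/text pairs contrastively:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def text_grounded_embedding(patch_embs, text_emb):
    """Build a soft mask from patch-text similarity, then pool the masked
    region into a single text-grounded image embedding."""
    patch_embs = l2_normalize(patch_embs)            # (P, D) patch embeddings
    text_emb = l2_normalize(text_emb)                # (D,) text embedding
    sims = patch_embs @ text_emb                     # (P,) patch-text similarity
    mask = 1.0 / (1.0 + np.exp(-10.0 * sims))        # sigmoid -> soft mask in (0, 1)
    grounded = (mask[:, None] * patch_embs).sum(0) / (mask.sum() + 1e-8)
    return l2_normalize(grounded), mask

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss: matched (diagonal) pairs are positives,
    all other pairs in the batch are negatives."""
    logits = (image_embs @ text_embs.T) / temperature  # (B, B)
    labels = np.arange(len(logits))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)           # stabilize softmax
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (xent(logits) + xent(logits.T))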
Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
Image captioning is one of the straightforward tasks that can take advantage
of large-scale web-crawled data which provides rich knowledge about the visual
world for a captioning model. However, since web-crawled data contains
image-text pairs that are aligned at different levels, the inherent noises
(e.g., misaligned pairs) make it difficult to learn a precise captioning model.
While the filtering strategy can effectively remove noisy data, however, it
leads to a decrease in learnable knowledge and sometimes brings about a new
problem of data deficiency. To take the best of both worlds, we propose a
noise-aware learning framework, which learns rich knowledge from the whole
web-crawled data while being less affected by the noises. This is achieved by
the proposed quality controllable model, which is learned using alignment
levels of the image-text pairs as an additional control signal during training.
The alignment-conditioned training allows the model to generate high-quality
captions of well-aligned by simply setting the control signal to desired
alignment level at inference time. Through in-depth analysis, we show that our
controllable captioning model is effective in handling noise. In addition, with
two tasks of zero-shot captioning and text-to-image retrieval using generated
captions (i.e., self-retrieval), we also demonstrate our model can produce
high-quality captions in terms of descriptiveness and distinctiveness. Code is
available at \url{https://github.com/kakaobrain/noc}
NICE 2023 Zero-shot Image Captioning Challenge
In this report, we introduce the NICE project
(https://nice.lgresearch.ai/) and share the results and
outcomes of the NICE 2023 challenge. This project is designed to challenge the
computer vision community to develop robust image captioning models that
advance the state of the art in both accuracy and fairness. Through
the challenge, the image captioning models were tested using a new evaluation
dataset that includes a large variety of visual concepts from many domains.
There was no specific training data provided for the challenge, and therefore
the challenge entries were required to adapt to new types of image descriptions
that had not been seen during training. This report includes information on the
newly proposed NICE dataset, evaluation methods, challenge results, and
technical details of top-ranking entries. We expect that the outcomes of the
challenge will contribute to the improvement of AI models on various
vision-language tasks.
Comment: Tech report, project page https://nice.lgresearch.ai