Semantic-enriched visual vocabulary construction in a weakly supervised context
One of the prevalent learning tasks involving images is content-based image classification. This task is difficult, in particular because the low-level features used to digitally describe images usually capture little information about their semantics. In this paper, we tackle this difficulty by enriching the semantic content of the image representation with external knowledge. The underlying hypothesis of our work is that a more semantically rich image representation yields higher machine learning performance, without the need to modify the learning algorithms themselves. The external semantic information is provided in the form of non-positional image labels, which places our work in a weakly supervised context. Two approaches are proposed: the first incorporates the labels into the visual vocabulary construction algorithm, producing dedicated visual vocabularies. The second adds a filtering phase as a pre-processing step before vocabulary construction: known-positive and known-negative sets are built, and features that are unlikely to be associated with the objects denoted by the labels are filtered out. We apply our proposal to content-based image classification and show that semantically enriching the image representation yields higher classification performance than the baseline representation.
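A minimal sketch of the second (filtering) approach described above, assuming precomputed local descriptors for the known-positive and known-negative image sets. The filtering criterion used here (distance to the nearest negative descriptor) and all parameter names are illustrative stand-ins, not the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def build_dedicated_vocabulary(pos_descs, neg_descs, k=200, keep_ratio=0.5):
    """Filter descriptors from the known-positive set that look too similar
    to the known-negative set, then cluster the survivors into a
    label-specific (dedicated) visual vocabulary.

    pos_descs: (N_pos, D) local descriptors from images carrying the label
    neg_descs: (N_neg, D) local descriptors from images without the label
    """
    # Distance of each positive descriptor to its nearest negative descriptor.
    # Descriptors very close to the negative set are unlikely to belong to the
    # labelled object and are dropped (illustrative criterion, not the paper's).
    nn = NearestNeighbors(n_neighbors=1).fit(neg_descs)
    dist_to_neg, _ = nn.kneighbors(pos_descs)
    dist_to_neg = dist_to_neg.ravel()

    # Keep the descriptors farthest from the negative set.
    n_keep = max(k, int(keep_ratio * len(pos_descs)))
    kept = pos_descs[np.argsort(-dist_to_neg)[:n_keep]]

    # Cluster the filtered descriptors into the dedicated vocabulary.
    vocab = KMeans(n_clusters=k, n_init=10, random_state=0).fit(kept)
    return vocab.cluster_centers_

def encode_bow(descs, centers):
    """Hard-assignment bag-of-visual-words histogram for one image."""
    assign = np.argmin(
        np.linalg.norm(descs[:, None, :] - centers[None, :, :], axis=2), axis=1)
    hist = np.bincount(assign, minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Images would then be encoded against the dedicated vocabulary (e.g., with encode_bow above) and passed to an otherwise unchanged classifier, consistent with the hypothesis that only the representation needs to change.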
TagBook: A Semantic Video Representation without Supervision for Event Detection
We consider the problem of event detection in video for scenarios where only
few, or even zero examples are available for training. For this challenging
setting, the prevailing solutions in the literature rely on a semantic video
representation obtained from thousands of pre-trained concept detectors.
Different from existing work, we propose a new semantic video representation
that is based on freely available social tagged videos only, without the need
for training any intermediate concept detectors. We introduce a simple
algorithm that propagates tags from a video's nearest neighbors, similar in
spirit to the ones used for image retrieval, but redesign it for video event
detection by including video source set refinement and varying the video tag
assignment. We call our approach TagBook and study its construction,
descriptiveness and detection performance on the TRECVID 2013 and 2014
multimedia event detection datasets and the Columbia Consumer Video dataset.
Despite its simple nature, the proposed TagBook video representation is
remarkably effective for few-example and zero-example event detection, even
outperforming very recent state-of-the-art alternatives building on supervised
representations.
Accepted for publication as a regular paper in the IEEE Transactions on Multimedia.
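A minimal sketch of neighbor-based tag propagation in the spirit of the description above, assuming precomputed video-level features and a binary tag matrix for the socially tagged source videos. The cosine-similarity weighting and top-k selection are illustrative choices; the paper's source-set refinement and tag-assignment variants are not modelled here.

```python
import numpy as np

def tagbook(query_feat, source_feats, source_tags, k=100):
    """Represent a video by the tags of its k nearest socially tagged videos,
    weighted by cosine similarity (a simple tag-propagation sketch).

    query_feat:   (D,) feature of the video to represent
    source_feats: (M, D) features of the socially tagged source videos
    source_tags:  (M, T) binary tag indicator matrix of the source videos
    """
    # Cosine similarity between the query and every source video.
    q = query_feat / (np.linalg.norm(query_feat) + 1e-12)
    s = source_feats / (np.linalg.norm(source_feats, axis=1, keepdims=True) + 1e-12)
    sims = s @ q

    # Accumulate the tags of the top-k neighbours, weighted by similarity.
    top = np.argsort(-sims)[:k]
    weights = sims[top]
    tag_scores = weights @ source_tags[top]          # (T,) tag activation vector
    return tag_scores / (np.abs(tag_scores).max() + 1e-12)
```

For zero-example detection, such a tag vector could be matched directly against the event description projected onto the same tag vocabulary; for few-example detection, an ordinary classifier could be trained on these vectors.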
Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation
Open-vocabulary semantic segmentation is a challenging task that requires
segmenting novel object categories at inference time. Recent works explore
vision-language pre-training to handle this task, but suffer from unrealistic
assumptions in practical scenarios, i.e., low-quality textual category names.
For example, this paradigm assumes that new textual categories will be
accurately and completely provided, and exist in lexicons during pre-training.
However, exceptions often arise: brief or incomplete names can be ambiguous, new
words may be absent from the pre-trained lexicons, and some categories are
difficult for users to describe. To address these issues, this
work proposes a novel decomposition-aggregation framework, inspired by human
cognition in understanding new concepts. Specifically, in the decomposition
stage, we decouple class names into diverse attribute descriptions to enrich
semantic contexts. Two attribute construction strategies are designed: using
large language models for common categories, and involving manually labelling
for human-invented categories. In the aggregation stage, we group diverse
attributes into an integrated global description, to form a discriminative
classifier that distinguishes the target object from others. One hierarchical
aggregation is further designed to achieve multi-level alignment and deep
fusion between vision and text. The final result is obtained by computing the
embedding similarity between aggregated attributes and images. To evaluate the
effectiveness, we annotate three datasets with attribute descriptions, and
conduct extensive experiments and ablation studies. The results show the
superior performance of attribute decomposition-aggregation.
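A minimal sketch of the decomposition-aggregation idea, assuming a `text_encoder` callable (hypothetical) that maps attribute strings to embeddings in the same space as the visual features. Simple mean pooling stands in for the paper's hierarchical aggregation.

```python
import torch
import torch.nn.functional as F

def attribute_classifier(attribute_texts, text_encoder):
    """Embed a class's attribute descriptions and aggregate them into one
    classifier vector. Mean pooling is an illustrative stand-in for the
    hierarchical aggregation described in the abstract.
    """
    attr_emb = text_encoder(attribute_texts)              # (A, D), hypothetical encoder
    attr_emb = F.normalize(attr_emb, dim=-1)
    cls_emb = F.normalize(attr_emb.mean(dim=0), dim=-1)   # (D,)
    return cls_emb

def segment_scores(pixel_emb, class_embs):
    """Per-pixel class scores from embedding similarity.

    pixel_emb:  (H, W, D) visual embeddings aligned to the text space
    class_embs: (C, D) aggregated attribute classifiers, one per class
    """
    pixel_emb = F.normalize(pixel_emb, dim=-1)
    return torch.einsum("hwd,cd->hwc", pixel_emb, class_embs)
```

For this to be meaningful, the pixel embeddings must already be aligned with the text encoder's space (e.g., via a CLIP-style vision-language backbone); the attribute texts themselves would come from a large language model for common categories or manual labelling for human-invented ones, as the abstract describes.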
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation
Recently, contrastive language-image pre-training (e.g., CLIP) has
demonstrated promising results on various downstream tasks. The pre-trained
model can capture rich visual concepts for images by learning from large-scale
text-image data. However, transferring the learned visual knowledge to
open-vocabulary semantic segmentation is still under-explored. In this paper,
we propose a CLIP-based model named SegCLIP for open-vocabulary segmentation
in an annotation-free manner. SegCLIP performs segmentation on top of a ViT;
its main idea is to gather patches into semantic regions around learnable
centers, trained on text-image pairs. The gathering operation
can dynamically capture the semantic groups, which can be used to generate the
final segmentation results. We further propose a reconstruction loss on masked
patches and a superpixel-based KL loss with pseudo-labels to enhance the visual
representation. Experimental results show that our model achieves comparable or
superior segmentation accuracy on the PASCAL VOC 2012 (+1.4% mIoU), PASCAL
Context (+2.4% mIoU), and COCO (+5.6% mIoU) compared with baselines. We release
the code at https://github.com/ArrowLuo/SegCLIP
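A minimal sketch of a patch-gathering module with learnable centers, written against a generic ViT token layout. The layer sizes, single attention head, and soft-assignment readout are illustrative assumptions, not SegCLIP's exact design.

```python
import torch
import torch.nn as nn

class PatchGather(nn.Module):
    """Softly assign ViT patch tokens to a small set of learnable centers,
    producing region-level tokens (an illustrative stand-in for the gathering
    operation described in the abstract).
    """
    def __init__(self, dim=768, num_centers=8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, patch_tokens):                       # (B, N, D)
        B = patch_tokens.shape[0]
        q = self.to_q(self.centers).expand(B, -1, -1)      # (B, C, D)
        k = self.to_k(patch_tokens)                        # (B, N, D)
        v = self.to_v(patch_tokens)
        # Softmax over centers: each patch distributes its mass across centers,
        # so the attention map doubles as a soft patch-to-region assignment.
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=1)
        region_tokens = attn @ v                            # (B, C, D)
        return region_tokens, attn
```

During training the region tokens would be aligned with text embeddings via the contrastive objective, and at inference the soft assignment map could be upsampled to produce segmentation masks; both steps are assumptions drawn from the abstract, not the released implementation.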
Resolving the Semantic Gap in Information Retrieval: Advanced Image Search Using Topic Models
Tohoku University, Takeshi Tokuyama
OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
Open-vocabulary learning has emerged as a cutting-edge research area,
particularly in light of the widespread adoption of vision-based foundational
models. Its primary objective is to comprehend novel concepts that are not
encompassed within a predefined vocabulary. One key facet of this endeavor is
Visual Grounding, which entails locating a specific region within an image
based on a corresponding language description. While current foundational
models excel at various vision-language tasks, there is a noticeable absence of
models specifically tailored for open-vocabulary visual grounding. This work
introduces two novel and challenging open-vocabulary (OV) tasks, namely
Open-Vocabulary Visual Grounding and Open-Vocabulary Phrase Localization. The
overarching aim is to establish connections between language descriptions and
the localization of novel objects. To facilitate this, we have curated a
comprehensive annotated benchmark, encompassing 7,272 OV-VG images and 1,000
OV-PL images. In our pursuit of addressing these challenges, we delved into
various baseline methodologies rooted in existing open-vocabulary object
detection, VG, and phrase localization frameworks. Surprisingly, we discovered
that state-of-the-art methods often falter in diverse scenarios. Consequently,
we developed a novel framework that integrates two critical components:
Text-Image Query Selection and Language-Guided Feature Attention. These modules
are designed to bolster the recognition of novel categories and enhance the
alignment between visual and linguistic information. Extensive experiments
demonstrate the efficacy of our proposed framework, which consistently attains
SOTA performance across the OV-VG task. Additionally, ablation studies provide
further evidence of the effectiveness of our innovative models. Codes and
datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG
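The abstract gives few details of the two modules, so the following is only an illustrative reading of language-guided feature attention, assuming pooled text embeddings and flattened spatial visual features; the dimensions and gating form are assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class LanguageGuidedAttention(nn.Module):
    """Illustrative language-guided feature attention: the sentence embedding
    gates spatial visual features so locations matching the description are
    emphasised (a generic reading of the module named in the abstract).
    """
    def __init__(self, vis_dim=256, txt_dim=512):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.gate = nn.Sequential(nn.Linear(2 * vis_dim, vis_dim), nn.Sigmoid())

    def forward(self, vis_feats, txt_emb):
        # vis_feats: (B, H*W, Dv) flattened spatial features
        # txt_emb:   (B, Dt) pooled embedding of the language description
        t = self.txt_proj(txt_emb).unsqueeze(1).expand_as(vis_feats)
        g = self.gate(torch.cat([vis_feats, t], dim=-1))   # per-location gate
        return vis_feats * g                                # language-modulated features
```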