Binary Classification with Positive Labeling Sources
To create large amounts of training labels for machine learning models
effectively and efficiently, researchers have turned to Weak Supervision (WS),
which uses programmatic labeling sources rather than manual annotation.
Existing WS work on binary classification typically assumes the presence of
labeling sources that can assign both positive and negative labels to data in
roughly balanced proportions. However, for many tasks of interest with a
minority positive class, the negative examples can be too diverse for
developers to write indicative labeling sources. Thus, in this work, we
study the application of WS on binary classification tasks with positive
labeling sources only. We propose WEAPO, a simple yet competitive WS method for
producing training labels without negative labeling sources. On 10 benchmark
datasets, we show WEAPO achieves the highest average performance in terms of
both the quality of synthesized labels and the performance of the final
classifier supervised with these labels. We incorporated the implementation of
WEAPO into WRENCH, an existing benchmarking platform. Comment: CIKM 2022 (short paper)
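The abstract does not describe WEAPO's aggregation rule, so the snippet below is only a minimal sketch of the positive-only setting it targets: labeling sources either vote positive or abstain, and soft training labels are derived from the votes plus an assumed class prior. The function name, the scoring, and the prior-based calibration are illustrative assumptions, not WEAPO itself.

```python
import numpy as np

def aggregate_positive_only_votes(vote_matrix, class_prior=0.1):
    # vote_matrix: (n_examples, n_sources), 1 where a positive-only source fires,
    #              0 where it abstains (no source can vote negative).
    # class_prior: assumed fraction of positive examples (an illustrative assumption).
    votes = np.asarray(vote_matrix, dtype=float)
    # Score each example by the fraction of sources that vote positive on it.
    scores = votes.mean(axis=1)
    # Calibrate against the prior: examples above the (1 - prior) quantile get labels near 1.
    threshold = np.quantile(scores, 1.0 - class_prior)
    eps = 1e-8
    soft_labels = np.clip((scores - scores.min()) / (threshold - scores.min() + eps), 0.0, 1.0)
    return soft_labels

# Toy usage: 5 examples, 3 positive-only labeling sources.
votes = np.array([[1, 1, 0],
                  [0, 0, 0],
                  [1, 0, 1],
                  [0, 1, 0],
                  [0, 0, 0]])
print(aggregate_positive_only_votes(votes, class_prior=0.4))
```

The resulting soft labels could then be used as probabilistic targets when training the final classifier, which is the role the synthesized labels play in the abstract.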
Search for Concepts: Discovering Visual Concepts Using Direct Optimization
Finding an unsupervised decomposition of an image into individual objects is a key step to leverage compositionality and to perform symbolic reasoning. Traditionally, this problem is solved using amortized inference, which does not generalize beyond the scope of the training data, may sometimes miss correct decompositions, and requires large amounts of training data. We propose finding a decomposition using direct, un-amortized optimization, combining gradient-based optimization for differentiable object properties with global search for non-differentiable properties. We show that direct optimization is more generalizable, misses fewer correct decompositions, and typically requires less data than methods based on amortized inference. This highlights a weakness of the current prevalent practice of amortized inference that can potentially be improved by integrating more direct optimization elements.
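As a toy illustration of the split the abstract describes, between gradient-based optimization of differentiable object properties and global search over non-differentiable ones, the sketch below fits Gaussian blobs to an image by gradient descent over positions and intensities, and searches exhaustively over the discrete object count. The rendering model, the complexity penalty, and all names are assumptions made for illustration, not the paper's method.

```python
import torch

def render(params, size=32):
    # params: (K, 3) tensor of (x, y, intensity) for K Gaussian "objects".
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32), indexing="ij")
    canvas = torch.zeros(size, size)
    for k in range(params.shape[0]):
        cx, cy, a = params[k]
        canvas = canvas + a * torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / 20.0)
    return canvas

def fit(target, num_objects, steps=300, lr=0.5):
    # Gradient descent over the continuous, differentiable properties (position, intensity).
    size = target.shape[0]
    params = (torch.rand(num_objects, 3) * torch.tensor([size, size, 1.0])).requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((render(params, size) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item(), params.detach()

def search_decomposition(target, max_objects=4):
    # Global (exhaustive) search over the non-differentiable property: the object count.
    best = None
    for k in range(1, max_objects + 1):
        loss, params = fit(target, k)
        # Small complexity penalty so extra objects are not added for free (an assumption).
        score = loss + 1e-4 * k
        if best is None or score < best[0]:
            best = (score, params, k)
    return best

# Toy target: two blobs; the search should recover roughly two objects.
with torch.no_grad():
    target = render(torch.tensor([[8.0, 8.0, 1.0], [24.0, 20.0, 0.8]]))
best_score, best_params, best_k = search_decomposition(target)
print(best_k, best_score)
```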
Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models
Predicting a scene graph that captures visual entities and their interactions
in an image has been considered a crucial step towards full scene
comprehension. Recent scene graph generation (SGG) models have shown their
capability of capturing the most frequent relations among visual entities.
However, the state-of-the-art results are still far from satisfactory, e.g.
models can obtain 31% in overall recall R@100, whereas the likewise important
mean class-wise recall mR@100 is only around 8% on Visual Genome (VG). The
discrepancy between the R and mR results urges a shift of focus from pursuing a
high R to pursuing a high mR while keeping R competitive. We suspect that the observed
discrepancy stems from both the annotation bias and sparse annotations in VG,
in which many visual entity pairs are either not annotated at all or only with
a single relation when multiple ones could be valid. To address this particular
issue, we propose a novel SGG training scheme that capitalizes on self-learned
knowledge. It involves two relation classifiers, one offering a less biased
setting for the other to build on. The proposed scheme can be applied to most of
the existing SGG models and is straightforward to implement. We observe
significant relative improvements in mR (between +6.6% and +20.4%) and
competitive or better R (between -2.4% and 0.3%) across all standard SGG tasks. Comment: accepted to BMVC202
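The abstract states only that the scheme uses two relation classifiers, with one offering a less biased setting for the other to build on; the exact formulation is not given here. The following is a hypothetical stand-in that mixes standard cross-entropy with a distillation term against a frequency-adjusted teacher distribution (a logit-adjustment-style reweighting), purely to illustrate how a second, less biased classifier could supervise the first.

```python
import torch
import torch.nn.functional as F

def bias_reduced_targets(logits, class_freq):
    # Hypothetical reweighting: subtract log class frequency so the teacher
    # distribution is less dominated by head (frequent) predicates.
    adjusted = logits - torch.log(class_freq + 1e-8)
    return F.softmax(adjusted, dim=-1)

def training_step(student_logits, teacher_logits, labels, class_freq, alpha=0.5):
    # Standard cross-entropy on the annotated relations...
    ce = F.cross_entropy(student_logits, labels)
    # ...plus a distillation term toward the less biased teacher distribution.
    targets = bias_reduced_targets(teacher_logits.detach(), class_freq)
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1), targets, reduction="batchmean")
    return (1 - alpha) * ce + alpha * kd

# Toy usage: 4 relation pairs, 6 predicate classes.
class_freq = torch.tensor([0.5, 0.2, 0.15, 0.1, 0.04, 0.01])
student = torch.randn(4, 6, requires_grad=True)
teacher = torch.randn(4, 6)
labels = torch.tensor([0, 2, 1, 5])
loss = training_step(student, teacher, labels, class_freq)
loss.backward()
print(loss.item())
```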
Rethinking the Evaluation of Unbiased Scene Graph Generation
Owing to the severely imbalanced predicate distributions in common subject-object
relations, current Scene Graph Generation (SGG) methods tend to predict
frequent predicate categories and fail to recognize rare ones. To improve the
robustness of SGG models on different predicate categories, recent research has
focused on unbiased SGG and adopted mean Recall@K (mR@K) as the main evaluation
metric. However, we discovered two overlooked issues with this de facto
standard metric mR@K, which make current unbiased SGG evaluation vulnerable
and unfair: 1) mR@K neglects the correlations among predicates and
unintentionally breaks category independence when ranking all the triplet
predictions together regardless of the predicate categories, leading to the
performance of some predicates being underestimated. 2) mR@K neglects the
compositional diversity of different predicates and assigns excessively high
weights to samples from overly simple categories with few composable relation
triplet types. This conflicts with the goal of the SGG task, which encourages
models to detect more types of visual relationship triplets. In addition, we
investigate the under-explored correlation between objects and predicates,
which can serve as a simple but strong baseline for unbiased SGG. In this
paper, we refine mR@K and propose two complementary evaluation metrics for
unbiased SGG: Independent Mean Recall (IMR) and weighted IMR (wIMR). These two
metrics are designed by considering the category independence and diversity of
composable relation triplets, respectively. We compare the proposed metrics
with the de facto standard metrics through extensive experiments and discuss
the solutions for evaluating unbiased SGG in a more trustworthy way.
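The precise definitions of IMR and wIMR are not given in the abstract, so the sketch below only illustrates the two stated ideas under assumptions: rank predictions within each predicate category instead of across all categories, and weight each predicate by how many distinct relation triplet types it composes. The data structures and the weighting scheme are illustrative.

```python
from collections import defaultdict

def independent_mean_recall(predictions, ground_truth, k=100):
    # predictions : list of (image_id, subj, obj, predicate, score) tuples
    # ground_truth: iterable of (image_id, subj, obj, predicate) tuples
    # Rank predictions within each predicate category, preserving category independence.
    by_predicate = defaultdict(list)
    for img, s, o, p, score in predictions:
        by_predicate[p].append((score, (img, s, o, p)))
    gt_by_predicate = defaultdict(set)
    for img, s, o, p in ground_truth:
        gt_by_predicate[p].add((img, s, o, p))
    recalls = {}
    for p, gts in gt_by_predicate.items():
        top_k = sorted(by_predicate[p], key=lambda t: t[0], reverse=True)[:k]
        hits = sum(1 for _, triplet in top_k if triplet in gts)
        recalls[p] = hits / len(gts)
    return sum(recalls.values()) / len(recalls), recalls

def weighted_imr(recalls, triplet_type_counts):
    # Hypothetical wIMR: weight each predicate's recall by how many distinct
    # relation triplet types it composes in the ground truth.
    weights = {p: triplet_type_counts.get(p, 1) for p in recalls}
    return sum(recalls[p] * weights[p] for p in recalls) / sum(weights.values())

# Toy usage.
preds = [("im1", "man", "horse", "riding", 0.9),
         ("im1", "man", "horse", "on", 0.8),
         ("im1", "man", "hat", "wearing", 0.7)]
gt = [("im1", "man", "horse", "riding"), ("im1", "man", "hat", "wearing")]
imr, per_predicate = independent_mean_recall(preds, gt, k=1)
print(imr, weighted_imr(per_predicate, {"riding": 30, "wearing": 10}))
```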
CREPE: Learnable Prompting With CLIP Improves Visual Relationship Prediction
In this paper, we explore the potential of Vision-Language Models (VLMs),
specifically CLIP, in predicting visual object relationships, which involves
interpreting visual features from images into language-based relations. Current
state-of-the-art methods use complex graphical models that utilize language
cues and visual features to address this challenge. We hypothesize that the
strong language priors in CLIP embeddings can simplify these graphical models,
paving the way for a simpler approach. We adopt the UVTransE relation prediction
framework, which learns the relation as a translational embedding with subject,
object, and union box embeddings from a scene. We systematically explore the
design of CLIP-based subject, object, and union-box representations within the
UVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate
Estimation). CREPE utilizes text-based representations for all three bounding
boxes and introduces a novel contrastive training strategy to automatically
infer the text prompt for union-box. Our approach achieves state-of-the-art
performance in predicate estimation (mR@5 of 27.79 and mR@20 of 31.95) on the
Visual Genome benchmark, a 15.3% gain over the recent state of the art at
mR@20. This work demonstrates CLIP's effectiveness in
object relation prediction and encourages further research on VLMs in this
challenging domain.
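UVTransE models a predicate as a translational embedding, roughly predicate ≈ union − subject − object; the sketch below scores candidate predicates by comparing that translation against text embeddings of predicate prompts, which is approximately the setting CREPE builds on. Random tensors stand in for CLIP features here, and the paper's contrastive prompt-learning strategy for the union box is not reproduced.

```python
import torch
import torch.nn.functional as F

def uvtranse_predicate_embedding(union_feat, subj_feat, obj_feat):
    # UVTransE-style translation in embedding space: predicate = union - subject - object.
    return union_feat - subj_feat - obj_feat

def predict_predicate(union_feat, subj_feat, obj_feat, predicate_text_feats):
    # Score each candidate predicate by cosine similarity between the translational
    # embedding and text embeddings of predicate prompts (e.g. from CLIP's text encoder).
    rel = F.normalize(uvtranse_predicate_embedding(union_feat, subj_feat, obj_feat), dim=-1)
    text = F.normalize(predicate_text_feats, dim=-1)
    return rel @ text.T  # (num_pairs, num_predicates) similarity scores

# Toy usage with random stand-ins for CLIP features (512-d, as in ViT-B/32).
dim, num_pairs, num_predicates = 512, 3, 50
union = torch.randn(num_pairs, dim)
subj = torch.randn(num_pairs, dim)
obj = torch.randn(num_pairs, dim)
pred_text = torch.randn(num_predicates, dim)
scores = predict_predicate(union, subj, obj, pred_text)
print(scores.argmax(dim=-1))
```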