129 research outputs found
Bridging semantic gap: learning and integrating semantics for content-based retrieval
Digital cameras have entered ordinary homes and produced an incredibly large number
of photos. As a typical example of broad image domain, unconstrained consumer
photos vary significantly. Unlike professional or domain-specific images, the objects
in the photos are ill-posed, occluded, and cluttered with poor lighting, focus, and
exposure. Content-based image retrieval research has yet to bridge the semantic gap
between computable low-level information and high-level user interpretation.
In this thesis, we address the issue of semantic gap with a structured learning
framework to allow modular extraction of visual semantics. Semantic image regions
(e.g. face, building, sky, etc.) are learned statistically, detected directly from images
without segmentation, reconciled across multiple scales, and aggregated spatially to
form a compact semantic index. To circumvent the ambiguity and subjectivity in a
query, a new query method that allows spatial arrangement of visual semantics is
proposed. A query is represented as a disjunctive normal form of visual query terms
and processed using fuzzy set operators.
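The query evaluation described above can be illustrated with a minimal sketch. It assumes the standard min/max fuzzy operators (AND as minimum, OR as maximum); the abstract only says "fuzzy set operators", so this choice, the `(label, region)` term encoding, and the name `evaluate_dnf` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: a visual query term is a (semantic label, spatial region) pair.
Term = Tuple[str, str]
# A DNF query is an OR over conjunctions, each conjunction an AND over terms.
Query = List[List[Term]]


def evaluate_dnf(query: Query, memberships: Dict[Term, float]) -> float:
    """Score one image against a DNF query.

    Assumes min for fuzzy AND and max for fuzzy OR. Terms missing from
    `memberships` are treated as membership 0.0.
    """
    return max(
        min(memberships.get(term, 0.0) for term in conjunction)
        for conjunction in query
    )


# Example: (sky at top AND building at center) OR (face at center)
memberships = {
    ("sky", "top"): 0.9,
    ("building", "center"): 0.4,
    ("face", "center"): 0.7,
}
query = [[("sky", "top"), ("building", "center")], [("face", "center")]]
score = evaluate_dnf(query, memberships)
```

Here the first conjunction scores min(0.9, 0.4) = 0.4 and the second scores 0.7, so the image score is their maximum.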
A drawback of supervised learning is the manual labeling of regions as training
samples. In this thesis, a new learning framework to discover local semantic patterns
and to generate their samples for training with minimal human intervention has been
developed. The discovered patterns can be visualized and used in semantic indexing.
In addition, three new class-based indexing schemes are explored. The winner-take-all
scheme supports class-based image retrieval. The class relative scheme and
the local classification scheme compute inter-class memberships and local class patterns
as indexes for similarity matching respectively. A Bayesian formulation is
proposed to unify local and global indexes in image comparison and ranking that
resulted in superior image retrieval performance over those of single indexes.
Query-by-example experiments on 2400 consumer photos with 16 semantic queries
show that the proposed approaches have significantly better (18% to 55%) average
precisions than a high-dimension feature fusion approach. The thesis has paved
two promising research directions, namely the semantics design approach and the
semantics discovery approach. They form elegant dual frameworks that exploit
pattern classifiers in learning and integrating local and global image semantics.
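For readers unfamiliar with the average precision figures quoted in the abstract above, the following is a minimal sketch of one common definition for a single ranked result list. The thesis may use a different variant; this version normalizes by the number of relevant items actually retrieved, and the function name is illustrative.

```python
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """Average precision for one ranked list.

    `relevance` holds 1 (relevant) or 0 (not relevant) in ranked order.
    Precision is sampled at each rank where a relevant item appears and
    averaged over the relevant items found in the list (an assumption;
    some definitions divide by all relevant items in the collection).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Relevant items at ranks 1 and 3: (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0])
```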
Active video summarization: Customized summaries via on-line interaction with the user
To facilitate the browsing of long videos, automatic video summarization provides an excerpt that represents its content. In the case of egocentric and consumer videos, due to their personal nature, adapting the summary to a specific user's preferences is desirable. Current approaches to customizable video summarization obtain the user's preferences prior to the summarization process. As a result, the user needs to manually modify the summary to further meet the preferences. In this paper, we introduce Active Video Summarization (AVS), an interactive approach to gather the user's preferences while creating the summary. AVS asks questions about the summary to update it on-line until the user is satisfied. To minimize the interaction, the best segment to inquire about next is inferred from the previous feedback. We evaluate AVS on the commonly used UTEgo dataset. We also introduce a new dataset for customized video summarization (CSumm) recorded with a Google Glass. The results show that AVS achieves an excellent compromise between usability and quality. In 41% of the videos, AVS is considered the best over all tested baselines, including manually generated summaries. Also, when looking for specific events in the video, AVS provides an average level of satisfaction higher than those of all other baselines after only six questions to the user.
On the Robustness, Generalization, and Forgetting of Shape-Texture Debiased Continual Learning
Tremendous progress has been made in continual learning to maintain good
performance on old tasks when learning new tasks by tackling the catastrophic
forgetting problem of neural networks. This paper advances continual learning
by further considering its out-of-distribution robustness, in response to the
vulnerability of continually trained models to distribution shifts (e.g., due
to data corruptions and domain shifts) in inference. To this end, we propose
shape-texture debiased continual learning. The key idea is to learn
generalizable and robust representations for each task with shape-texture
debiased training. In order to transform standard continual learning to
shape-texture debiased continual learning, we propose shape-texture debiased
data generation and online shape-texture debiased self-distillation.
Experiments on six datasets demonstrate the benefits of our approach in
improving generalization and robustness, as well as reducing forgetting. Our
analysis on the flatness of the loss landscape explains the advantages.
Moreover, our approach can be easily combined with new advanced architectures
such as vision transformers, and applied to more challenging scenarios such as
exemplar-free continual learning.
Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
In this report, we present our approach for EPIC-KITCHENS-100 Multi-Instance
Retrieval Challenge 2022. We first parse sentences into semantic roles
corresponding to verbs and nouns; then utilize self-attentions to exploit
semantic role contextualized video features along with textual features via
triplet losses in multiple embedding spaces. Our method surpasses the strong
baseline in normalized Discounted Cumulative Gain (nDCG), which is more
valuable for semantic similarity. Our submission is ranked 3rd for nDCG and
ranked 4th for mAP.
Comment: Ranked joint 3rd place in the Multi-Instance Retrieval Challenge at
EPIC@CVPR2022. (v2: ref error is corrected
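The nDCG metric mentioned in the report above can be sketched as follows. This is the generic linear-gain form with a log2 rank discount; the challenge defines its own relevance scores between captions and videos, which are not reproduced here, so the graded `gains` values below are purely illustrative.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: gain at 0-based rank i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains: Sequence[float]) -> float:
    """Normalize DCG by the DCG of the same gains sorted in ideal (descending) order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# A ranking that places the most relevant item second instead of first.
value = ndcg([0.0, 1.0])
```

Unlike mAP, which treats relevance as binary, nDCG rewards rankings that put partially relevant items ahead of irrelevant ones.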
Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention
Many studies focus on improving pretraining or developing new backbones in
text-video retrieval. However, existing methods may suffer from the learning
and inference bias issue, as recent research suggests in other
text-video-related tasks. For instance, spatial appearance features on action
recognition or temporal object co-occurrences on video scene graph generation
could induce spurious correlations. In this work, we present a unique and
systematic study of a temporal bias due to frame length discrepancy between
training and test sets of trimmed video clips, which is the first such attempt
for a text-video retrieval task, to the best of our knowledge. We first
hypothesise the bias and verify, with a baseline study, how it affects the model.
Then, we propose a causal debiasing approach and perform
extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2,
and MSR-VTT datasets. Our model surpasses the baseline and SOTA on nDCG, a
semantic-relevancy-focused evaluation metric, which proves the bias is
mitigated, as well as on the other conventional metrics.
Comment: Accepted by the British Machine Vision Conference (BMVC) 2023.
Project Page: https://buraksatar.github.io/FrameLengthBia
Finding any Waldo: zero-shot invariant and efficient visual search
Searching for a target object in a cluttered scene constitutes a fundamental
challenge in daily vision. Visual search must be selective enough to
discriminate the target from distractors, invariant to changes in the
appearance of the target, efficient to avoid exhaustive exploration of the
image, and must generalize to locate novel target objects with zero-shot
training. Previous work has focused on searching for perfect matches of a
target after extensive category-specific training. Here we show for the first
time that humans can efficiently and invariantly search for natural objects in
complex scenes. To gain insight into the mechanisms that guide visual search,
we propose a biologically inspired computational model that can locate targets
without exhaustive sampling and generalize to novel objects. The model provides
an approximation to the mechanisms integrating bottom-up and top-down signals
during search in natural scenes.
Comment: Number of figures: 6 Number of supplementary figures: 1
Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos
A key challenge with procedure planning in instructional videos lies in how
to handle a large decision space consisting of a multitude of action types that
belong to various tasks. To understand real-world video content, an AI agent
must proficiently discern these action types (e.g., pour milk, pour water, open
lid, close lid, etc.) based on brief visual observation. Moreover, it must
adeptly capture the intricate semantic relation of the action types and task
goals, along with the variable action sequences. Recently, notable progress has
been made via the integration of diffusion models and visual representation
learning to address the challenge. However, existing models employ rudimentary
mechanisms to utilize task information to manage the decision space. To
overcome this limitation, we introduce a simple yet effective enhancement - a
masked diffusion model. The introduced mask acts akin to a task-oriented
attention filter, enabling the diffusion/denoising process to concentrate on a
subset of action types. Furthermore, to bolster the accuracy of task
classification, we harness more potent visual representation learning
techniques. In particular, we learn a joint visual-text embedding, where a text
embedding is generated by prompting a pre-trained vision-language model to
focus on human actions. We evaluate the method on three public datasets and
achieve state-of-the-art performance on multiple metrics. Code is available at
https://github.com/ffzzy840304/Masked-PDPP.
Comment: 7 pages (main text excluding references), 3 figures, 7 table
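The idea of a task-oriented mask that confines the diffusion process to a subset of action types can be pictured with a toy sketch. This is not the authors' implementation (see their repository for that); the function names, the binary mask, and the horizon-by-actions noise grid are assumptions made only for illustration.

```python
import random
from typing import List, Set


def task_mask(num_actions: int, allowed: Set[int]) -> List[float]:
    """Binary mask over action types: 1.0 for actions belonging to the task, else 0.0."""
    return [1.0 if a in allowed else 0.0 for a in range(num_actions)]


def masked_noise(horizon: int, mask: List[float], rng: random.Random) -> List[List[float]]:
    """Sample Gaussian noise on a horizon x actions grid, zeroed outside the mask.

    In this toy picture, a denoiser would only ever see (and refine) the
    entries that correspond to the task's own action types.
    """
    return [[rng.gauss(0.0, 1.0) * m for m in mask] for _ in range(horizon)]


# Hypothetical task that involves only action types 1 and 3 out of 5.
mask = task_mask(5, {1, 3})
noise = masked_noise(3, mask, random.Random(0))
```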
Identifying Hard Noise in Long-Tailed Sample Distribution
Conventional de-noising methods rely on the assumption that all samples are
independent and identically distributed, so the resultant classifier, though
disturbed by noise, can still easily identify the noises as the outliers of
training distribution. However, the assumption is unrealistic in large-scale
data that is inevitably long-tailed. Such imbalanced training data makes a
classifier less discriminative for the tail classes, whose previously "easy"
noises are now turned into "hard" ones -- they are almost as outlying as the
clean tail samples. We introduce this new challenge as Noisy Long-Tailed
Classification (NLT). Not surprisingly, we find that most de-noising methods
fail to identify the hard noises, resulting in significant performance drop on
the three proposed NLT benchmarks: ImageNet-NLT, Animal10-NLT, and Food101-NLT.
To this end, we design an iterative noisy learning framework called
Hard-to-Easy (H2E). Our bootstrapping philosophy is to first learn a classifier
as noise identifier invariant to the class and context distributional changes,
reducing "hard" noises to "easy" ones, whose removal further improves the
invariance. Experimental results show that our H2E outperforms state-of-the-art
de-noising methods and their ablations on long-tailed settings while
maintaining a stable performance on the conventional balanced settings.
Datasets and codes are available at https://github.com/yxymessi/H2E-Framework
Comment: Accepted to ECCV 2022 (Oral); datasets and codes are available at
https://github.com/yxymessi/H2E-Framewor