34 research outputs found
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and makes it possible to leverage weakly labeled data.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas is described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight on how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking, and promising avenues for research.
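As a concrete anchor for the bag/instance vocabulary in this abstract, here is a minimal sketch of the standard MIL assumption (a bag is positive iff at least one of its instances is positive); the arrays and names below are illustrative only, not from the survey.

```python
import numpy as np

def bag_label(instance_labels: np.ndarray) -> int:
    """Standard MIL assumption: a bag is positive iff it contains
    at least one positive instance."""
    return int(instance_labels.max() > 0)

# Toy bags: the per-instance labels are hidden at training time;
# only the bag-level label computed below is observed.
bags = [np.array([0, 0, 1]),   # positive bag (one positive instance)
        np.array([0, 0, 0])]   # negative bag
print([bag_label(b) for b in bags])  # [1, 0]
```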
Learning to Summarize Videos by Contrasting Clips
Video summarization aims at choosing the parts of a video that narrate a story as
closely as possible to the original. Most existing video summarization
approaches rely on hand-crafted labels. As the number of videos grows
exponentially, there emerges an increasing need for methods that can learn
meaningful summarizations without labeled annotations. In this paper, we aim to
maximally exploit unsupervised video summarization while concentrating
supervision on a few personalized labels as an add-on. To do so, we formulate
two key requirements for informative video summarization and propose
contrastive learning as the answer to both. To further boost
Contrastive video Summarization (CSUM), we propose to contrast top-k features
instead of a mean video feature, as employed by existing methods, which we
implement with a differentiable top-k feature selector. Our experiments on
several benchmarks demonstrate that our approach yields meaningful and
diverse summaries when no labeled data is provided.
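The top-k pooling idea can be illustrated with a common softmax relaxation; the sketch below is only an assumption-laden stand-in for the paper's differentiable selector (the function names, feature dimensions, and temperature `tau` are invented for illustration).

```python
import torch
import torch.nn.functional as F

def mean_pool(frames: torch.Tensor) -> torch.Tensor:
    """Baseline: a single video feature as the mean over frames."""
    return frames.mean(dim=0)

def soft_topk_pool(frames: torch.Tensor, scores: torch.Tensor,
                   k: int, tau: float = 0.1) -> torch.Tensor:
    """Soft relaxation of top-k pooling: a temperature-scaled softmax
    over the k best frame scores keeps gradients flowing to the scorer
    while concentrating mass on roughly k frames."""
    topv, topi = scores.topk(k)          # the k best-scoring frames
    w = F.softmax(topv / tau, dim=0)     # differentiable weights over them
    return (w.unsqueeze(1) * frames[topi]).sum(dim=0)

frames = torch.randn(64, 512)                 # 64 frames, 512-d features
scores = torch.randn(64, requires_grad=True)  # learned frame scores
video_feat = soft_topk_pool(frames, scores, k=8)
video_feat.sum().backward()                   # gradients reach the scores
```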
Automatic annotation for weakly supervised learning of detectors
Object detection in images and action detection in videos are among the most widely studied
computer vision problems, with applications in consumer photography, surveillance, and automatic
media tagging. Typically, these standard detectors are fully supervised; that is, they require
a large body of training data where the locations of the objects/actions in images/videos have
been manually annotated. With the emergence of digital media, and the rise of high-speed internet,
raw images and video are available for little to no cost. However, the manual annotation
of object and action locations remains tedious, slow, and expensive. As a result, there has been
great interest in training detectors with weak supervision, where only the presence or absence
of object/action in image/video is needed, not the location. This thesis presents approaches for
weakly supervised learning of object/action detectors with a focus on automatically annotating
object and action locations in images/videos using only binary weak labels indicating the presence
or absence of object/action in images/videos.
First, a framework for weakly supervised learning of object detectors in images is presented.
In the proposed approach, a variation of the multiple instance learning (MIL) technique for automatically
annotating object locations in weakly labelled data is presented which, unlike existing
approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial
annotation is then used to start an iterative process in which standard object detectors are used to
refine the location annotation. Finally, to ensure that the iterative training of detectors does not drift
from the object of interest, a scheme for detecting model drift is also presented. Furthermore,
unlike most other methods, our weakly supervised approach is evaluated on data without manual
pose (object orientation) annotation.
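The iterative process just described can be summarised as a control-flow skeleton. Every helper below (`init_localize`, `train_detector`, `localize`, `drift_score`) is a hypothetical callable standing in for a component of the framework; this is a sketch of the loop, not code from the thesis.

```python
def refine_annotations(weak_images, init_localize, train_detector,
                       localize, drift_score, max_rounds=5, drift_tol=0.5):
    """Alternate between training a standard detector on the current
    location estimates and re-localizing, stopping on model drift."""
    # MIL-style initial annotation from weak (image-level) labels only
    boxes = {im: init_localize(im) for im in weak_images}
    for _ in range(max_rounds):
        detector = train_detector(weak_images, boxes)
        new_boxes = {im: localize(detector, im) for im in weak_images}
        # drift check: stop if the detector wanders off the object of interest
        if drift_score(boxes, new_boxes) > drift_tol:
            break
        boxes = new_boxes
    return boxes
```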
Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues,
is carried out. From the analysis, a new method based on negative mining (NegMine) is presented
for the initial annotation of both object and action data. The NegMine-based approach is a
much simpler formulation that uses only an inter-class measure and requires no complex
combinatorial optimisation, yet it can still match or outperform existing approaches, including
the previously presented inter-class and intra-class cue fusion approach. Furthermore, NegMine can be fused with existing
approaches to boost their performance.
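One plausible reading of an inter-class-only measure in the spirit of NegMine: score each candidate window in a positive image by its similarity to windows mined from negative images, and keep the least "negative-like" candidate. The cosine measure and all names below are illustrative assumptions, not the thesis formulation.

```python
import numpy as np

def negmine_select(pos_windows: np.ndarray, neg_windows: np.ndarray) -> int:
    """Pick the candidate window in a positive image that is least
    similar to windows from negative images: a pure inter-class
    measure, with no combinatorial optimisation across positives."""
    # cosine similarity between every positive candidate and every negative window
    p = pos_windows / np.linalg.norm(pos_windows, axis=1, keepdims=True)
    n = neg_windows / np.linalg.norm(neg_windows, axis=1, keepdims=True)
    sim = p @ n.T                  # (num_candidates, num_neg_windows)
    score = sim.max(axis=1)        # worst-case similarity to the negative set
    return int(score.argmin())     # the most "un-negative" candidate

pos = np.random.randn(50, 128)     # candidate window features, one positive image
neg = np.random.randn(200, 128)    # window features pooled from negative images
print(negmine_select(pos, neg))
```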
Finally, the thesis will take a step back and look at the use of generic object detectors as prior
knowledge in weakly supervised learning of object detectors. These generic object detectors are
typically based on sampling saliency maps that indicate if a pixel belongs to the background
or foreground. A new approach to generating saliency maps is presented that, unlike existing
approaches, looks beyond the current image of interest and into images similar to the current
image. We show that our generic object proposal method can be used by itself to annotate the
weakly labelled object data with surprisingly high accuracy.
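A minimal sketch of saliency-map-based proposal sampling as described in this last paragraph: boxes are sampled with probability proportional to foreground saliency. The box-size heuristics are invented for illustration, and the cross-image saliency computation itself is omitted.

```python
import numpy as np

def sample_proposals(saliency: np.ndarray, num: int, rng=np.random) -> list:
    """Sample box proposals centred on pixels drawn with probability
    proportional to saliency (illustrative heuristics only)."""
    h, w = saliency.shape
    probs = saliency.ravel() / saliency.sum()
    boxes = []
    for _ in range(num):
        cy, cx = np.unravel_index(rng.choice(h * w, p=probs), (h, w))
        bh, bw = rng.randint(h // 8, h // 2), rng.randint(w // 8, w // 2)
        boxes.append((max(cx - bw // 2, 0), max(cy - bh // 2, 0),
                      min(cx + bw // 2, w), min(cy + bh // 2, h)))
    return boxes

sal = np.random.rand(64, 64)       # a toy foreground saliency map
print(sample_proposals(sal, num=3))
```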
TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection
Video moment retrieval (MR) and highlight detection (HD) based on natural
language queries are two highly related tasks, which aim to obtain relevant
moments within videos and highlight scores of each video clip. Recently,
several methods have been devoted to building DETR-based networks to solve both
MR and HD jointly. These methods simply add two separate task heads after
multi-modal feature extraction and feature interaction, achieving good
performance. Nevertheless, these approaches underutilize the reciprocal
relationship between two tasks. In this paper, we propose a task-reciprocal
transformer based on DETR (TR-DETR) that focuses on exploring the inherent
reciprocity between MR and HD. Specifically, a local-global multi-modal
alignment module is first built to align features from diverse modalities into
a shared latent space. Subsequently, a visual feature refinement module is designed to
eliminate query-irrelevant information from visual features before cross-modal
interaction. Finally, a task cooperation module is constructed to refine the
retrieval pipeline and the highlight score prediction process by utilizing the
reciprocity between MR and HD. Comprehensive experiments on QVHighlights,
Charades-STA and TVSum datasets demonstrate that TR-DETR outperforms existing
state-of-the-art methods. Code is available at
https://github.com/mingyao1120/TR-DETR.
Comment: Accepted by AAAI 2024.
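The official implementation lives at the URL above; the following is only a toy sketch of what a local-global multi-modal alignment module could look like (projection of both modalities into a shared latent space, plus clip-word and video-sentence similarities), with all dimensions and names assumed rather than taken from TR-DETR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalAlign(nn.Module):
    """Illustrative local-global alignment: project clips and words
    into one latent space, then compare clip-to-word (local) and
    video-to-sentence (global) similarities."""
    def __init__(self, vid_dim=512, txt_dim=300, dim=256):
        super().__init__()
        self.v_proj = nn.Linear(vid_dim, dim)
        self.t_proj = nn.Linear(txt_dim, dim)

    def forward(self, clips, words):
        v = F.normalize(self.v_proj(clips), dim=-1)   # (num_clips, dim)
        t = F.normalize(self.t_proj(words), dim=-1)   # (num_words, dim)
        local_sim = v @ t.T                           # clip-word alignment
        global_sim = v.mean(0) @ t.mean(0)            # video-sentence alignment
        return local_sim, global_sim

model = LocalGlobalAlign()
local_sim, global_sim = model(torch.randn(75, 512), torch.randn(12, 300))
```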
Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies
Movie highlights stand out of the screenplay for efficient browsing and play
a crucial role on social media platforms. Building on existing efforts, this work
makes two observations: (1) Highlight labeling is uncertain across annotators,
which leads to inaccurate and time-consuming annotations. (2)
Besides the previous supervised or unsupervised settings, some existing video
corpora, e.g., trailers, can be useful, but they are often too noisy and incomplete
to cover the full highlights. In this work, we study a more practical and
promising setting, i.e., reformulating highlight detection as "learning with
noisy labels". This setting does not require time-consuming manual annotations
and can fully utilize existing abundant video corpora. First, based on movie
trailers, we leverage scene segmentation to obtain complete shots, which are
regarded as noisy labels. Then, we propose a Collaborative noisy Label Cleaner
(CLC) framework to learn from noisy highlight moments. CLC consists of two
modules: augmented cross-propagation (ACP) and multi-modality cleaning (MMC).
The former aims to exploit the closely related audio-visual signals and fuse
them to learn unified multi-modal representations. The latter aims to achieve
cleaner highlight labels by observing the changes in losses among different
modalities. To verify the effectiveness of CLC, we further collect a
large-scale highlight dataset named MovieLights. Comprehensive experiments on
MovieLights and YouTube Highlights datasets demonstrate the effectiveness of
our approach. Code has been made available at:
https://github.com/TencentYoutuResearch/HighlightDetection-CLC
Comment: Accepted to CVPR 2023.
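The released code is at the URL above; below is only a rough sketch of the multi-modality cleaning idea (trusting moments whose losses stay small in every modality, i.e., a small-loss criterion), with invented names and a fixed keep ratio standing in for the MMC module.

```python
import torch

def clean_labels(loss_audio: torch.Tensor, loss_visual: torch.Tensor,
                 keep_ratio: float = 0.7) -> torch.Tensor:
    """Treat moments whose losses are small in *both* modalities as
    clean; noisy trailer-derived labels tend to show large or
    inconsistent per-modality losses."""
    joint = torch.maximum(loss_audio, loss_visual)   # worst-modality loss
    k = int(keep_ratio * joint.numel())
    keep = torch.zeros_like(joint, dtype=torch.bool)
    keep[joint.topk(k, largest=False).indices] = True
    return keep                                      # mask of presumed-clean moments

la, lv = torch.rand(100), torch.rand(100)            # per-moment losses
mask = clean_labels(la, lv)                          # train only where mask is True
```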
Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling
Weakly-supervised action localization aims to recognize and localize action
instances in untrimmed videos with only video-level labels. Most existing
models rely on multiple instance learning (MIL), where the predictions of
unlabeled instances are supervised by classifying labeled bags. The MIL-based
methods are relatively well studied and achieve cogent performance on
classification, but not on localization. Generally, they locate temporal regions
via video-level classification but overlook the temporal variations of
feature semantics. To address this problem, we propose a novel attention-based
hierarchically-structured latent model to learn the temporal variations of
feature semantics. Specifically, our model entails two components, the first is
an unsupervised change-points detection module that detects change-points by
learning the latent representations of video features in a temporal hierarchy
based on their rates of change, and the second is an attention-based
classification model that selects the change-points of the foreground as the
boundaries. To evaluate the effectiveness of our model, we conduct extensive
experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The
experiments show that our method outperforms current state-of-the-art methods,
and even achieves performance comparable to fully-supervised methods.
Comment: Accepted to ICCV 2023. arXiv admin note: text overlap with
arXiv:2203.15187, arXiv:2003.12424, arXiv:2104.02967 by other authors.
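A toy stand-in for the unsupervised change-point module: flag frames where the rate of change of the feature sequence is unusually large. The standardisation and threshold below are assumptions for illustration, not the paper's hierarchical latent model.

```python
import numpy as np

def detect_change_points(feats: np.ndarray, thresh: float = 1.5) -> np.ndarray:
    """Flag candidate temporal boundaries where the feature sequence
    changes unusually fast (its 'rate of change' is an outlier)."""
    rate = np.linalg.norm(np.diff(feats, axis=0), axis=1)  # per-step feature change
    z = (rate - rate.mean()) / (rate.std() + 1e-8)         # standardised rate
    return np.where(z > thresh)[0] + 1                     # candidate boundaries

feats = np.random.randn(300, 1024)        # frame features of an untrimmed video
boundaries = detect_change_points(feats)  # indices usable as segment boundaries
```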