A Web Service for Video Summarization
This paper presents a Web service that supports the automatic generation of video summaries for user-submitted videos. The developed Web application decomposes the video into segments, evaluates each segment's fitness for inclusion in the summary, and selects appropriate segments until a pre-defined time budget is filled. The integrated deep-learning-based video analysis and summarization technologies exhibit state-of-the-art performance and, by exploiting the processing capabilities of modern GPUs, offer faster-than-real-time processing. Configurations for generating video summaries that fulfill the specifications for posting on the most common video-sharing platforms and social networks are available in the application's user interface, enabling one-click generation of distribution-channel-specific summaries.
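The selection step described above, filling a time budget with the fittest segments, can be sketched roughly as follows; the greedy strategy, the tuple layout, and all scores are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch of summary assembly: score each segment, then
# greedily add the fittest segments until the time budget is filled.
# The deep-learning scoring itself is mocked with precomputed scores.

def select_segments(segments, budget_sec):
    """segments: list of (start, duration, fitness_score) tuples."""
    chosen = []
    used = 0.0
    # Consider the fittest segments first.
    for start, dur, score in sorted(segments, key=lambda s: s[2], reverse=True):
        if used + dur <= budget_sec:
            chosen.append((start, dur, score))
            used += dur
    # Re-order chronologically so the summary preserves narrative flow.
    return sorted(chosen, key=lambda s: s[0])

# Example: build a 10-second summary from pre-scored segments.
segments = [(0, 4, 0.9), (4, 5, 0.2), (9, 3, 0.8), (12, 6, 0.7)]
summary = select_segments(segments, budget_sec=10)
```

A per-platform configuration, as in the paper's one-click presets, would simply supply a different `budget_sec` (and output resolution) per distribution channel.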
Deep Domain-Adversarial Image Generation for Domain Generalisation
Machine learning models typically suffer from the domain shift problem when
trained on a source dataset and evaluated on a target dataset of different
distribution. To overcome this problem, domain generalisation (DG) methods aim
to leverage data from multiple source domains so that a trained model can
generalise to unseen domains. In this paper, we propose a novel DG approach
based on Deep Domain-Adversarial Image Generation (DDAIG). Specifically,
DDAIG consists of three components, namely a label classifier, a domain
classifier and a domain transformation network (DoTNet). The goal for DoTNet is
to map the source training data to unseen domains. This is achieved by having a
learning objective formulated to ensure that the generated data can be
correctly classified by the label classifier while fooling the domain
classifier. By augmenting the source training data with the generated unseen
domain data, we can make the label classifier more robust to unknown domain
changes. Extensive experiments on four DG datasets demonstrate the
effectiveness of our approach.
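The DoTNet objective described above, keeping the label classifier correct while fooling the domain classifier, can be illustrated with a toy numeric sketch; the functions and probabilities below are hypothetical stand-ins, not the paper's implementation:

```python
import math

# Toy sketch of the DDAIG training signal for the transformation network
# (DoTNet): a transformed sample should still be classified with its true
# label (low label loss) while fooling the domain classifier (high domain
# loss, entering the objective with a negative sign).

def cross_entropy(probs, target):
    return -math.log(probs[target] + 1e-12)

def dotnet_objective(label_probs, domain_probs, y, d, lam=1.0):
    """Lower is better for DoTNet: keep the label, confuse the domain."""
    return cross_entropy(label_probs, y) - lam * cross_entropy(domain_probs, d)

# A transformed sample that keeps its class but looks domain-ambiguous...
ambiguous = dotnet_objective([0.9, 0.1], [0.5, 0.5], y=0, d=0)
# ...scores better than one still clearly recognizable as its source domain.
recognizable = dotnet_objective([0.9, 0.1], [0.95, 0.05], y=0, d=0)
```

Samples minimizing this objective are then mixed into the source training data, which is how the label classifier is exposed to "unseen" domains.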
Semi-Supervised and Long-Tailed Object Detection with CascadeMatch
This paper focuses on long-tailed object detection in the semi-supervised
learning setting, which poses realistic challenges, but has rarely been studied
in the literature. We propose a novel pseudo-labeling-based detector called
CascadeMatch. Our detector features a cascade network architecture, which has
multi-stage detection heads with progressive confidence thresholds. To avoid
manually tuning the thresholds, we design a new adaptive pseudo-label mining
mechanism to automatically identify suitable values from data. To mitigate
confirmation bias, where a model is negatively reinforced by incorrect
pseudo-labels produced by itself, each detection head is trained by the
ensemble pseudo-labels of all detection heads. Experiments on two long-tailed
datasets, i.e., LVIS and COCO-LT, demonstrate that CascadeMatch surpasses
existing state-of-the-art semi-supervised approaches -- across a wide range of
detection architectures -- in handling long-tailed object detection. For
instance, CascadeMatch outperforms Unbiased Teacher by 1.9 AP^Fix on LVIS when
using a ResNet50-based Cascade R-CNN structure, and by 1.7 AP^Fix when using
Sparse R-CNN with a Transformer encoder. We also show that CascadeMatch can
even handle the challenging sparsely annotated object detection problem.
Comment: International Journal of Computer Vision (IJCV), 202
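The two mechanisms described above, data-mined confidence thresholds and head-ensembled pseudo-labels, can be sketched as follows; the percentile rule and all scores are illustrative assumptions, not CascadeMatch's actual mining procedure:

```python
import numpy as np

# Sketch of (i) an adaptive threshold derived from the model's own
# confidence distribution instead of a hand-tuned constant, and (ii)
# ensemble pseudo-labels, where the averaged predictions of all cascade
# heads decide which boxes become pseudo-labels.

def adaptive_threshold(confidences, percentile=80):
    # Pick the threshold from the data: keep roughly the top 20% of boxes.
    return np.percentile(confidences, percentile)

def ensemble_pseudo_labels(head_scores, percentile=80):
    """head_scores: (num_heads, num_boxes) per-head confidences."""
    mean_scores = head_scores.mean(axis=0)        # ensemble over cascade heads
    thr = adaptive_threshold(mean_scores, percentile)
    return np.flatnonzero(mean_scores >= thr)     # box indices kept as pseudo-labels

scores = np.array([[0.90, 0.4, 0.7, 0.2],
                   [0.80, 0.5, 0.9, 0.1],
                   [0.95, 0.3, 0.8, 0.3]])
kept = ensemble_pseudo_labels(scores)
```

Averaging across heads before thresholding is one way to realize the paper's idea that no head is trained solely on its own (possibly wrong) pseudo-labels.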
Contextual Object Detection with Multimodal Large Language Models
Recent Multimodal Large Language Models (MLLMs) are remarkable in
vision-language tasks, such as image captioning and question answering, but
lack the essential perception ability, i.e., object detection. In this work, we
address this limitation by introducing a novel research problem of contextual
object detection -- understanding visible objects within different human-AI
interactive contexts. Three representative scenarios are investigated,
including the language cloze test, visual captioning, and question answering.
Moreover, we present ContextDET, a unified multimodal model that is capable of
end-to-end differentiable modeling of visual-language contexts, so as to
locate, identify, and associate visual objects with language inputs for
human-AI interaction. Our ContextDET involves three key submodels: (i) a visual
encoder for extracting visual representations, (ii) a pre-trained LLM for
multimodal context decoding, and (iii) a visual decoder for predicting bounding
boxes given contextual object words. The new generate-then-detect framework
enables us to detect object words within human vocabulary. Extensive
experiments show the advantages of ContextDET on our proposed CODE benchmark,
open-vocabulary detection, and referring image segmentation.
Github: https://github.com/yuhangzang/ContextDET
Project Page: https://www.mmlab-ntu.com/project/contextdet/index.htm
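The generate-then-detect flow, encode the image, let the LLM generate contextual text, then ground the generated object words with boxes, can be sketched with stub components; everything below is a schematic stand-in, not ContextDET's real interfaces:

```python
# Schematic sketch of generate-then-detect. The three stubs stand in for
# ContextDET's submodels: visual encoder, pre-trained LLM, visual decoder.

def visual_encoder(image):
    return {"features": image}            # stand-in for patch embeddings

def llm_decode(features, prompt):
    # The LLM first *generates* contextual text, including object words.
    # A canned completion stands in for real decoding here.
    return "a person walking a dog"

def visual_decoder(features, object_words):
    # Boxes are then predicted only for the generated object words.
    return {w: (0, 0, 10, 10) for w in object_words}

def contextual_detect(image, prompt, vocabulary):
    feats = visual_encoder(image)
    caption = llm_decode(feats["features"], prompt)
    # Any vocabulary word the LLM produced gets grounded with a box.
    words = [w for w in caption.split() if w in vocabulary]
    return caption, visual_decoder(feats["features"], words)

caption, boxes = contextual_detect("img", "Describe:", {"person", "dog"})
```

Because detection is conditioned on generated words rather than a fixed label set, the same loop serves cloze tests, captioning, and question answering.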
Domain Generalization in Vision: A Survey
Generalization to out-of-distribution (OOD) data is a capability natural to
humans yet challenging for machines to reproduce. This is because most learning
algorithms strongly rely on the i.i.d. assumption on source/target data, which
is often violated in practice due to domain shift. Domain generalization (DG)
aims to achieve OOD generalization by using only source data for model
learning. Since it was first introduced in 2011, research in DG has made great
progress. In particular, intensive research on this topic has led to a broad
spectrum of methodologies, e.g., those based on domain alignment,
meta-learning, data augmentation, or ensemble learning; and
has covered various vision applications such as object recognition,
segmentation, action recognition, and person re-identification. In this paper,
we provide, for the first time, a comprehensive literature review that summarizes
the developments in DG for computer vision over the past decade. Specifically,
we first cover the background by formally defining DG and relating it to other
research fields like domain adaptation and transfer learning. Second, we
conduct a thorough review into existing methods and present a categorization
based on their methodologies and motivations. Finally, we conclude this survey
with insights and discussions on future research directions.
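The DG setting the survey formalizes can be stated compactly as follows; the notation is an assumed shorthand, not the survey's own:

```latex
% K labeled source domains with distinct distributions are available for
% training; the target distribution P^T is unseen and may differ from all
% of them. DG seeks a predictor f minimizing risk on that unseen target.
\[
\mathcal{S} \;=\; \bigcup_{k=1}^{K} \bigl\{ (x_i, y_i) \sim P^{(k)}_{XY} \bigr\},
\qquad
\min_{f}\; \mathbb{E}_{(x,y)\sim P^{T}_{XY}}\bigl[\ell(f(x), y)\bigr],
\qquad
P^{T}_{XY} \neq P^{(k)}_{XY}\ \ \forall k .
\]
```

Domain adaptation, by contrast, assumes some access to target-domain data at training time, which is exactly the assumption DG removes.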