Bag-of-visual-words expansion using visual relatedness for video indexing, SIGIR ’08
Bag-of-visual-words (BoW) has been popular for visual classification in recent years. In this paper, we propose a novel BoW expansion method to alleviate the effect of the visual word correlation problem. We achieve this by diffusing the weights of visual words in the BoW representation according to visual word relatedness, which is rigorously defined within a visual ontology. The proposed method is tested on video indexing experiments with the TRECVID-2006 video retrieval benchmark, and an improvement of 7% over the traditional BoW is reported.
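As a rough illustration of the weight-diffusion idea (not the paper's exact ontology-based formulation), the following Python sketch spreads each visual word's weight to related words through an assumed relatedness matrix R and then re-normalizes the histogram.

# Illustrative sketch only: R is a hypothetical precomputed relatedness matrix,
# not the paper's ontology-based definition of visual-word relatedness.
import numpy as np

def expand_bow(bow, relatedness, alpha=0.5):
    """Diffuse BoW weights across related visual words.

    bow:         (V,) histogram of visual-word counts for one keyframe.
    relatedness: (V, V) non-negative word-to-word relatedness, rows sum to 1.
    alpha:       fraction of weight kept on the original word.
    """
    diffused = relatedness.T @ bow              # push weight toward related words
    expanded = alpha * bow + (1 - alpha) * diffused
    return expanded / (expanded.sum() + 1e-12)  # renormalize to a histogram

# toy usage: 5-word vocabulary, uniform relatedness among the first three words
V = 5
R = np.eye(V)
R[:3, :3] = 1.0 / 3.0
print(expand_bow(np.array([4.0, 0.0, 0.0, 1.0, 0.0]), R))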
Structuring lecture videos for distance learning applications. ISMSE
This paper presents a novel automatic approach to structuring and indexing lecture videos for distance learning applications. By structuring video content, we can support both topic indexing and semantic querying of multimedia documents. In this paper, our aim is to link the discussion topics extracted from the electronic slides with their associated video and audio segments. The two major techniques in our proposed approach are video text analysis and speech recognition. Initially, a video is partitioned into shots based on slide transitions. For each shot, the embedded video texts are detected, reconstructed and segmented as high-resolution foreground texts for commercial OCR. The recognized texts can then be matched with their associated slides for video indexing. Meanwhile, both phrases (titles) and keywords (content) are extracted from the electronic slides to spot the speech signals. The spotted phrases and keywords are further utilized as queries to retrieve the most similar slide for speech indexing.
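The slide-matching step can be pictured with a small sketch: assuming we already have recognized OCR text per shot and the plain text of each slide, a TF-IDF cosine match picks the most similar slide for each shot. The vectorizer and matching scheme here are assumptions for illustration, not the paper's exact method.

# Minimal sketch of matching recognized shot text to slides; illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_shots_to_slides(shot_texts, slide_texts):
    """Return, for each shot, the index of the most similar slide."""
    vectorizer = TfidfVectorizer(stop_words="english")
    slide_vecs = vectorizer.fit_transform(slide_texts)
    shot_vecs = vectorizer.transform(shot_texts)
    sims = cosine_similarity(shot_vecs, slide_vecs)   # (num_shots, num_slides)
    return sims.argmax(axis=1)

slides = ["introduction to video indexing", "speech recognition pipeline"]
shots = ["video indexing introduction", "recognizing speech signals"]
print(match_shots_to_slides(shots, slides))   # e.g. [0, 1]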
Exploring Object Relation in Mean Teacher for Cross-Domain Detection
Rendering synthetic data (e.g., 3D CAD-rendered images) to generate
annotations for learning deep models in vision tasks has attracted increasing
attention in recent years. However, simply applying the models learnt on
synthetic images may lead to high generalization error on real images due to
domain shift. To address this issue, recent progress in cross-domain
recognition has featured the Mean Teacher, which directly simulates
unsupervised domain adaptation as semi-supervised learning. The domain gap is
thus naturally bridged with consistency regularization in a teacher-student
scheme. In this work, we advance this Mean Teacher paradigm to be applicable
for cross-domain detection. Specifically, we present Mean Teacher with Object
Relations (MTOR) that novelly remolds Mean Teacher under the backbone of Faster
R-CNN by integrating the object relations into the measure of consistency cost
between teacher and student modules. Technically, MTOR firstly learns
relational graphs that capture similarities between pairs of regions for
teacher and student respectively. The whole architecture is then optimized with
three consistency regularizations: 1) region-level consistency to align the
region-level predictions between teacher and student, 2) inter-graph
consistency for matching the graph structures between teacher and student, and
3) intra-graph consistency to enhance the similarity between regions of same
class within the graph of student. Extensive experiments are conducted on the
transfers across Cityscapes, Foggy Cityscapes, and SIM10k, and superior results
are reported when compared with state-of-the-art approaches. More remarkably, we
obtain a new single-model record of 22.8% mAP on the Syn2Real detection
dataset. Comment: CVPR 2019; the codes and model of our MTOR are publicly available at:
https://github.com/caiqi/mean-teacher-cross-domain-detectio
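A rough PyTorch sketch of the three consistency terms described above follows; tensor shapes, the cosine-similarity graphs and the exact loss forms are assumptions for illustration, not the released MTOR implementation.

# Illustrative sketch of region-level, inter-graph and intra-graph consistency;
# shapes and loss forms are assumed, not taken from the released MTOR code.
import torch
import torch.nn.functional as F

def consistency_losses(t_probs, s_probs, t_feats, s_feats):
    # t_probs, s_probs: (R, C) class probabilities for R matched regions
    #                   from the teacher and the student.
    # t_feats, s_feats: (R, D) region features used to build relational graphs.

    # 1) region-level consistency: align per-region class predictions.
    l_region = F.mse_loss(s_probs, t_probs)

    # relational graphs: cosine similarity between every pair of regions.
    t_graph = F.normalize(t_feats, dim=1) @ F.normalize(t_feats, dim=1).T
    s_graph = F.normalize(s_feats, dim=1) @ F.normalize(s_feats, dim=1).T

    # 2) inter-graph consistency: teacher and student graphs should match.
    l_inter = F.mse_loss(s_graph, t_graph)

    # 3) intra-graph consistency: pull together student regions that the
    #    teacher assigns to the same (pseudo-)class.
    same_class = (t_probs.argmax(1)[:, None] == t_probs.argmax(1)[None, :]).float()
    l_intra = ((1.0 - s_graph) * same_class).sum() / same_class.sum().clamp(min=1.0)

    return l_region, l_inter, l_intra

R, C, D = 6, 9, 128
losses = consistency_losses(torch.rand(R, C).softmax(1), torch.rand(R, C).softmax(1),
                            torch.randn(R, D), torch.randn(R, D))
print([float(l) for l in losses])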
Long-term Leap Attention, Short-term Periodic Shift for Video Classification
A video transformer naturally incurs a heavier computation burden than a static
vision transformer, as the former processes a much longer sequence than the
latter under standard attention of quadratic complexity. The
existing works treat the temporal axis as a simple extension of spatial axes,
focusing on shortening the spatio-temporal sequence by either generic pooling
or local windowing without utilizing temporal redundancy.
However, videos naturally contain redundant information between neighboring
frames; therefore, we could potentially suppress attention on visually similar
frames in a dilated manner. Based on this hypothesis, we propose LAPS, a
long-term "Leap Attention" (LA) and short-term "Periodic Shift" (P-Shift)
module for video transformers with reduced attention complexity. Specifically,
the LA groups long-term frames into pairs, then refactors each discrete pair
via attention. The P-Shift exchanges features between temporal neighbors to
compensate for the loss of short-term dynamics. By replacing vanilla 2D
attention with LAPS, we can adapt a static transformer into a video one with
zero extra parameters and negligible computation overhead (2.6%).
Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS
transformer achieves competitive performance in terms of accuracy, FLOPs, and
parameters compared with CNN and transformer SOTAs. We open-source our project
at https://github.com/VideoNetworks/LAPS-transformer . Comment: Accepted by ACM
Multimedia 2022, 10 pages, 4 figures
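To make the short-term shift concrete, here is a simplified sketch in the spirit of P-Shift: a slice of channels is exchanged with the previous and next frames so each frame sees its temporal neighbors. The channel split ratio and tensor layout are assumed hyper-parameters, and this is not the released LAPS code.

# Illustrative temporal-shift sketch; fold_div and the (B, T, N, C) layout are assumptions.
import torch

def periodic_shift(x, fold_div=8):
    """x: (B, T, N, C) video tokens. Shift a slice of channels to the
    previous/next frame so each frame mixes with short-term neighbors."""
    B, T, N, C = x.shape
    fold = C // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                   # shift forward in time
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]   # shift backward in time
    out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]              # remaining channels untouched
    return out

tokens = torch.randn(2, 8, 196, 768)    # batch, frames, patches, channels
print(periodic_shift(tokens).shape)     # torch.Size([2, 8, 196, 768])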
Fusing semantics, observability, reliability and diversity of concept detectors for video search
Effective utilization of semantic concept detectors for large-scale video search has recently become a topic of intensive study. One of the main challenges is the selection and fusion of appropriate detectors, which considers not only the semantics but also the reliability, observability and diversity of detectors in target video domains. In this paper, we present a novel fusion technique which considers different aspects of detectors for query answering. In addition to utilizing detectors for bridging the semantic gap between user queries and multimedia data, we also address the "observability gap" among detectors, which cannot be directly inferred from semantic reasoning such as using an ontology. To facilitate the selection of detectors, we propose building two vector spaces: a semantic space (SS) and an observability space (OS). We categorize the detectors selected separately from SS and OS into four types: anchor, bridge, positive and negative concepts. A multi-level fusion strategy is proposed to novelly combine detectors, allowing the enhancement of detector reliability while enabling the observability, semantics and diversity of concepts to be utilized for query answering. By experimenting with the proposed approach on TRECVID 2005-2007 datasets and queries, we demonstrate the significance of considering observability, reliability and diversity, in addition to the semantics of detectors with respect to queries.
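A toy sketch of the fusion idea is given below, under the assumption that each detector has an embedding in the semantic space (SS), pairwise similarities in the observability space (OS), a reliability estimate and per-shot scores; the paper's four-type detector taxonomy and multi-level fusion strategy are richer than this simple weighted combination.

# Illustrative detector-fusion sketch; all inputs and the weighting rule are assumptions.
import numpy as np

def fuse_detectors(query_vec, ss_vecs, os_sims, detector_scores, reliability, top_k=3):
    """
    query_vec:       (d,) query embedding in the semantic space.
    ss_vecs:         (K, d) detector embeddings in SS.
    os_sims:         (K, K) detector similarities in OS (e.g., co-occurrence based).
    detector_scores: (K, S) detector scores for S video shots.
    reliability:     (K,) per-detector reliability (e.g., validation AP).
    """
    sem_sim = ss_vecs @ query_vec               # semantic relatedness to the query
    anchors = np.argsort(-sem_sim)[:top_k]      # most semantically related detectors
    obs_sim = os_sims[anchors].max(axis=0)      # detectors observable alongside anchors
    weights = np.clip((sem_sim + obs_sim) * reliability, 0.0, None)
    weights /= weights.sum() + 1e-12
    return weights @ detector_scores            # fused shot-ranking scores

K, S, d = 5, 10, 16
rng = np.random.default_rng(0)
fused = fuse_detectors(rng.normal(size=d), rng.normal(size=(K, d)),
                       rng.random((K, K)), rng.random((K, S)), rng.random(K))
print(fused.shape)   # (10,)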
OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation
In the realm of food computing, segmenting ingredients from images poses
substantial challenges due to the large intra-class variance among the same
ingredients, the emergence of new ingredients, and the high annotation costs
associated with large food segmentation datasets. Existing approaches primarily
rely on a closed-vocabulary setting with static text embeddings. These methods
often fall short in effectively handling ingredients, particularly new and
diverse ones. In response to these limitations, we introduce OVFoodSeg, a
framework that adopts an open-vocabulary setting and enhances text embeddings
with visual context. By integrating vision-language models (VLMs), our approach
enriches text embeddings with image-specific information through two innovative
modules, namely an image-to-text learner (FoodLearner) and an Image-Informed Text
Encoder. The training process of OVFoodSeg is divided into two stages: the
pre-training of FoodLearner and the subsequent learning phase for segmentation.
The pre-training phase equips FoodLearner with the capability to align visual
information with corresponding textual representations that are specifically
related to food, while the second phase adapts both the FoodLearner and the
Image-Informed Text Encoder for the segmentation task. By addressing the
deficiencies of previous models, OVFoodSeg demonstrates a significant
improvement, achieving a 4.9% increase in mean Intersection over Union (mIoU)
on the FoodSeg103 dataset, setting a new milestone for food image segmentation. Comment: CVPR 2024; 12 pages
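A minimal, hypothetical sketch of the image-informed idea: ingredient-name text embeddings are refined with visual tokens through cross-attention before being used for segmentation. The module name, tensor shapes and single attention layer are assumptions, not the OVFoodSeg code.

# Illustrative sketch: text embeddings enriched with visual context; shapes are assumed.
import torch
import torch.nn as nn

class ImageInformedTextEncoder(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_emb, image_tokens):
        """text_emb: (B, K, D) embeddings of K ingredient names.
        image_tokens: (B, N, D) visual tokens of the input image."""
        ctx, _ = self.attn(text_emb, image_tokens, image_tokens)  # cross-attention
        return text_emb + ctx   # text embeddings enriched with image-specific context

enc = ImageInformedTextEncoder()
refined = enc(torch.randn(2, 103, 512), torch.randn(2, 196, 512))
print(refined.shape)   # torch.Size([2, 103, 512])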
On the Selection of Anchors and Targets for Video Hyperlinking
A problem not well understood in video hyperlinking is what qualifies a
fragment as an anchor or target. Ideally, anchors provide good starting points
for navigation, and targets supplement anchors with additional details while
not distracting users with irrelevant, false and redundant information. The
problem is not trivial due to the intertwining relationship between data
characteristics and user expectations. Imagine that in a large dataset, there
are clusters of fragments spreading over the feature space. The nature of each
cluster can be described by its size (implying popularity) and structure
(implying complexity). A principled way of hyperlinking can be carried out by
picking centers of clusters as anchors and from there reach out to targets
within or outside of clusters with consideration of neighborhood complexity.
The question is which fragments should be selected as anchors or targets so as
to reflect the rich content of a dataset while minimizing the risk of a
frustrating user experience. This paper provides some insights into this
question from the perspective of hubness and local intrinsic dimensionality,
two statistical properties for assessing the popularity and complexity of a
data space. Based on these properties, two novel algorithms are proposed for
low-risk automatic selection of anchors and targets. Comment: ACM International
Conference on Multimedia Retrieval (ICMR), 2017. (Oral)
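The two statistics can be sketched as follows: hubness measured by k-occurrence counts, and local intrinsic dimensionality estimated with the common MLE estimator. How the resulting scores are turned into anchor and target selections is left out, as that is the paper's algorithmic contribution; the value of k is an assumption.

# Illustrative computation of k-occurrence (hubness) and MLE-based LID per fragment.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hubness_and_lid(features, k=10):
    """features: (n, d) fragment descriptors. Returns (k-occurrence, LID)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    dist, idx = nn.kneighbors(features)      # first column is the point itself
    dist, idx = dist[:, 1:], idx[:, 1:]

    # hubness: how often each fragment appears in other fragments' k-NN lists
    k_occurrence = np.bincount(idx.ravel(), minlength=len(features))

    # LID (MLE): -(1/k * sum_i log(r_i / r_k))^(-1), r_k = distance to k-th NN
    ratios = dist / dist[:, -1:]
    lid = -1.0 / np.mean(np.log(np.clip(ratios, 1e-12, None)), axis=1)
    return k_occurrence, lid

X = np.random.default_rng(1).normal(size=(500, 32))
hubs, lids = hubness_and_lid(X)
print(hubs.max(), lids.mean())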
Predicting domain adaptivity: Redo or recycle
Over the years, academic researchers have contributed a wide variety of visual concept classifiers. Nevertheless, given a new dataset, most researchers still prefer to develop a large number of classifiers from scratch despite expensive labeling efforts and limited computing resources. A valid question is why the multimedia community does not "embrace the green" and recycle off-the-shelf classifiers for new datasets. The difficulty originates from the domain gap: many different factors govern the development of a classifier and eventually drive its performance to emphasize certain aspects of a dataset. Reapplying a classifier to an unseen dataset may end up as GIGO (garbage in, garbage out), and the performance could be much worse than re-developing a new classifier with very few training examples. In this paper, we explore different parameters, including the shift of data distribution and visual and context diversities, that may hinder or otherwise encourage the recycling of old classifiers for a new dataset. In particular, we give empirical insights into when to recycle available resources and when to redo from scratch by completely forgetting the past and training a brand new classifier. Based on this analysis, we further propose an approach for predicting the negative transfer of a concept classifier to a different domain given the observed parameters. Experimental results show that a prediction accuracy of over 75% can be achieved when transferring concept classifiers learnt from LSCOM (news video domain), ImageNet (Web image domain) and Flickr-SF (weakly tagged Web image domain), respectively, to the TRECVID 2011 dataset (Web video domain).
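As a toy sketch of the redo-or-recycle decision, one could describe each (classifier, target-dataset) pair by a few distribution-shift features and train a predictor of negative transfer; the features, labels and data below are illustrative and not the paper's actual parameterization or experiments.

# Illustrative sketch: predict negative transfer from simple domain-gap features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def domain_gap_features(src_feats, tgt_feats):
    # simple descriptors of distribution shift between source and target features
    mean_shift = np.linalg.norm(src_feats.mean(0) - tgt_feats.mean(0))
    std_ratio = abs(np.log(src_feats.std(0).mean() / tgt_feats.std(0).mean()))
    return np.array([mean_shift, std_ratio])

rng = np.random.default_rng(0)
# one row per (classifier, target-dataset) pair; label 1 = negative transfer
X = np.stack([domain_gap_features(rng.normal(size=(200, 64)),
                                  rng.normal(loc=rng.uniform(0, 2), size=(200, 64)))
              for _ in range(100)])
y = (X[:, 0] > 8.0).astype(int)          # illustrative ground-truth labels
predictor = LogisticRegression(max_iter=1000).fit(X, y)
print(predictor.predict(X[:5]))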