248 research outputs found
Learning to detect video events from zero or very few video examples
In this work we deal with the problem of high-level event detection in video.
Specifically, we study the challenging problems of i) learning to detect video
events from solely a textual description of the event, without using any
positive video examples, and ii) additionally exploiting very few positive
training samples together with a small number of ``related'' videos. For
learning only from an event's textual description, we first identify a general
learning framework and then study the impact of different design choices for
various stages of this framework. For additionally learning from example
videos, when true positive training samples are scarce, we employ an extension
of the Support Vector Machine that allows us to exploit ``related'' event
videos by automatically introducing different weights for subsets of the videos
in the overall training set. Experimental evaluations performed on the
large-scale TRECVID MED 2014 video dataset provide insight on the effectiveness
of the proposed methods.Comment: Image and Vision Computing Journal, Elsevier, 2015, accepted for
publicatio
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval
In this paper we tackle the cross-modal video retrieval problem and, more
specifically, we focus on text-to-video retrieval. We investigate how to
optimally combine multiple diverse textual and visual features into feature
pairs that lead to generating multiple joint feature spaces, which encode
text-video pairs into comparable representations. To learn these
representations our proposed network architecture is trained by following a
multiple space learning procedure. Moreover, at the retrieval stage, we
introduce additional softmax operations for revising the inferred query-video
similarities. Extensive experiments in several setups based on three
large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to
best combine text-visual features and document the performance of the proposed
network. Source code is made publicly available at:
https://github.com/bmezaris/TextToVideoRetrieval-TtimesVComment: Accepted for publication; to be included in Proc. ECCV Workshops
2022. The version posted here is the "submitted manuscript" versio
Combining textual and visual information processing for interactive video retrieval: SCHEMA's participation in TRECVID 2004
In this paper, the two different applications based on the Schema Reference System that were developed by the SCHEMA NoE for participation to the search task of TRECVID 2004 are illustrated. The first application, named ”Schema-Text”, is an interactive retrieval application that employs only textual information while the second one, named ”Schema-XM”, is an extension of the former, employing algorithms and
methods for combining textual, visual and higher level information. Two runs for each application were submitted, I A 2 SCHEMA-Text 3, I A 2 SCHEMA-Text 4 for Schema-Text and I A 2 SCHEMA-XM 1, I A 2 SCHEMA-XM 2 for Schema-XM. The comparison of these two applications in terms of retrieval efficiency revealed that the combination of information from different data sources can provide higher efficiency for retrieval systems. Experimental testing additionally revealed that initially performing a text-based query and subsequently proceeding with visual similarity search using one of the returned relevant keyframes as an example image is a good scheme for combining visual and textual information
Combination of Accumulated Motion and Color Segmentation for Human Activity Analysis
The automated analysis of activity in digital multimedia, and especially video, is gaining more and more importance due to the evolution of higher-level video processing systems and the development of relevant applications such as surveillance and sports. This paper presents a novel algorithm for the recognition and classification of human activities, which employs motion and color characteristics in a complementary manner, so as to extract the most information from both sources, and overcome their individual limitations. The proposed method accumulates the flow estimates in a video, and extracts “regions of activity†by processing their higher-order statistics. The shape of these activity areas can be used for the classification of the human activities and events taking place in a video and the subsequent extraction of higher-level semantics. Color segmentation of the active and static areas of each video frame is performed to complement this information. The color layers in the activity and background areas are compared using the earth mover's distance, in order to achieve accurate object segmentation. Thus, unlike much existing work on human activity analysis, the proposed approach is based on general color and motion processing methods, and not on specific models of the human body and its kinematics. The combined use of color and motion information increases the method robustness to illumination variations and measurement noise. Consequently, the proposed approach can lead to higher-level information about human activities, but its applicability is not limited to specific human actions. We present experiments with various real video sequences, from sports and surveillance domains, to demonstrate the effectiveness of our approach
Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism
In this paper two new learning-based eXplainable AI (XAI) methods for deep
convolutional neural network (DCNN) image classifiers, called L-CAM-Fm and
L-CAM-Img, are proposed. Both methods use an attention mechanism that is
inserted in the original (frozen) DCNN and is trained to derive class
activation maps (CAMs) from the last convolutional layer's feature maps. During
training, CAMs are applied to the feature maps (L-CAM-Fm) or the input image
(L-CAM-Img) forcing the attention mechanism to learn the image regions
explaining the DCNN's outcome. Experimental evaluation on ImageNet shows that
the proposed methods achieve competitive results while requiring a single
forward pass at the inference stage. Moreover, based on the derived
explanations a comprehensive qualitative analysis is performed providing
valuable insight for understanding the reasons behind classification errors,
including possible dataset biases affecting the trained classifier.Comment: Accepted for publication; to be included in Proc. ECCV Workshops
2022. The version posted here is the "submitted manuscript" versio
Masked Feature Modelling: Feature Masking for the Unsupervised Pre-training of a Graph Attention Network Block for Bottom-up Video Event Recognition
In this paper, we introduce Masked Feature Modelling (MFM), a novel approach
for the unsupervised pre-training of a Graph Attention Network (GAT) block. MFM
utilizes a pretrained Visual Tokenizer to reconstruct masked features of
objects within a video, leveraging the MiniKinetics dataset. We then
incorporate the pre-trained GAT block into a state-of-the-art bottom-up
supervised video-event recognition architecture, ViGAT, to improve the model's
starting point and overall accuracy. Experimental evaluations on the YLI-MED
dataset demonstrate the effectiveness of MFM in improving event recognition
performance.Comment: 8 page
Automatic Synchronization of Multi-User Photo Galleries
In this paper we address the issue of photo galleries synchronization, where
pictures related to the same event are collected by different users. Existing
solutions to address the problem are usually based on unrealistic assumptions,
like time consistency across photo galleries, and often heavily rely on
heuristics, limiting therefore the applicability to real-world scenarios. We
propose a solution that achieves better generalization performance for the
synchronization task compared to the available literature. The method is
characterized by three stages: at first, deep convolutional neural network
features are used to assess the visual similarity among the photos; then, pairs
of similar photos are detected across different galleries and used to construct
a graph; eventually, a probabilistic graphical model is used to estimate the
temporal offset of each pair of galleries, by traversing the minimum spanning
tree extracted from this graph. The experimental evaluation is conducted on
four publicly available datasets covering different types of events,
demonstrating the strength of our proposed method. A thorough discussion of the
obtained results is provided for a critical assessment of the quality in
synchronization.Comment: ACCEPTED to IEEE Transactions on Multimedi
Linear Maximum Margin Classifier for Learning from Uncertain Data
In this paper, we propose a maximum margin classifier that deals with
uncertainty in data input. More specifically, we reformulate the SVM framework
such that each training example can be modeled by a multi-dimensional Gaussian
distribution described by its mean vector and its covariance matrix -- the
latter modeling the uncertainty. We address the classification problem and
define a cost function that is the expected value of the classical SVM cost
when data samples are drawn from the multi-dimensional Gaussian distributions
that form the set of the training examples. Our formulation approximates the
classical SVM formulation when the training examples are isotropic Gaussians
with variance tending to zero. We arrive at a convex optimization problem,
which we solve efficiently in the primal form using a stochastic gradient
descent approach. The resulting classifier, which we name SVM with Gaussian
Sample Uncertainty (SVM-GSU), is tested on synthetic data and five publicly
available and popular datasets; namely, the MNIST, WDBC, DEAP, TV News Channel
Commercial Detection, and TRECVID MED datasets. Experimental results verify the
effectiveness of the proposed method.Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence. (c)
2017 IEEE. DOI: 10.1109/TPAMI.2017.2772235 Author's accepted version. The
final publication is available at
http://ieeexplore.ieee.org/document/8103808
- …