Learning to detect video events from zero or very few video examples
In this work we deal with the problem of high-level event detection in video.
Specifically, we study the challenging problems of i) learning to detect video
events from solely a textual description of the event, without using any
positive video examples, and ii) additionally exploiting very few positive
training samples together with a small number of "related" videos. For
learning only from an event's textual description, we first identify a general
learning framework and then study the impact of different design choices for
various stages of this framework. For additionally learning from example
videos, when true positive training samples are scarce, we employ an extension
of the Support Vector Machine that allows us to exploit "related" event
videos by automatically introducing different weights for subsets of the videos
in the overall training set. Experimental evaluations performed on the
large-scale TRECVID MED 2014 video dataset provide insight on the effectiveness
of the proposed methods.
Comment: Image and Vision Computing Journal, Elsevier, 2015, accepted for publication.
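As a rough illustration of the second scenario, the sketch below trains a linear SVM in which the true positive, "related", and negative videos receive different weights. The features here are synthetic, the "related" videos are simply treated as down-weighted positives, and the fixed subset weight 0.3 is an assumption made for illustration; in the paper, the SVM extension introduces the weights for the video subsets automatically during training.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical video-level feature vectors (one row per video).
rng = np.random.default_rng(0)
X_pos = rng.standard_normal((5, 128))    # very few true positive examples
X_rel = rng.standard_normal((20, 128))   # "related" videos, used as weak positives
X_neg = rng.standard_normal((200, 128))  # negative / background videos

X = np.vstack([X_pos, X_rel, X_neg])
y = np.concatenate([np.ones(25), -np.ones(200)])

# Different weights for the subsets of the overall training set:
# full weight for true positives and negatives, reduced weight for
# the "related" videos (0.3 is an illustrative, hand-fixed value).
w = np.concatenate([np.full(5, 1.0), np.full(20, 0.3), np.full(200, 1.0)])

clf = LinearSVC(C=1.0).fit(X, y, sample_weight=w)
scores = clf.decision_function(X)  # ranking scores for event detection
```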
TRECVID 2014 -- An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics
The TREC Video Retrieval Evaluation (TRECVID) 2014 was a TREC-style video analysis and retrieval evaluation, the goal of which remains to promote progress in content-based exploitation of digital video via open, metrics-based evaluation. Over the last dozen years this effort has yielded a better understanding of how systems can effectively accomplish such processing and how one can reliably benchmark their performance. TRECVID is funded by NIST with support from other US government agencies. Many organizations and individuals worldwide contribute significant time and effort.
Deliverable D9.3 Final Project Report
This document comprises the final report of LinkedTV. It includes a publishable summary, a plan for the use and dissemination of foreground, and a report covering the wider societal implications of the project in the form of a questionnaire.
Maximum Margin Learning Under Uncertainty
In this thesis we study the problem of learning under uncertainty using the statistical
learning paradigm. We first propose a linear maximum margin classifier that deals
with uncertainty in the data input. More specifically, we reformulate the standard Support
Vector Machine (SVM) framework such that each training example can be modeled
by a multi-dimensional Gaussian distribution described by its mean vector and its
covariance matrix, with the latter modeling the uncertainty. We address the classification
problem and define a cost function that is the expected value of the classical SVM
cost when data samples are drawn from the multi-dimensional Gaussian distributions
that form the set of training examples. Our formulation approximates the classical
SVM formulation when the training examples are isotropic Gaussians with variance
tending to zero. We arrive at a convex optimization problem, which we solve efficiently
in the primal form using a stochastic gradient descent approach. The resulting
classifier, which we name SVM with Gaussian Sample Uncertainty (SVM-GSU), is
tested on synthetic data and five publicly available and popular datasets, namely the
MNIST, WDBC, DEAP, TV News Channel Commercial Detection, and TRECVID
MED datasets. Experimental results verify the effectiveness of the proposed method.
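The expected cost above has a closed form per training example: with a = 1 - y(w·μ + b) and margin deviation s = sqrt(wᵀΣw), the expected hinge loss equals a·Φ(a/s) + s·φ(a/s), where Φ and φ are the standard normal CDF and PDF. The NumPy sketch below runs primal stochastic gradient descent on this loss plus an L2 regularizer; it is a minimal illustration of the idea, not the thesis implementation, and the function names, regularization weight lam, and learning rate lr are assumptions.

```python
import numpy as np
from scipy.stats import norm

def expected_hinge(w, b, mu, Sigma, y):
    """Closed-form E[max(0, 1 - y*(w.x + b))] for x ~ N(mu, Sigma)."""
    a = 1.0 - y * (w @ mu + b)              # mean of (1 - margin)
    s = np.sqrt(max(w @ Sigma @ w, 1e-12))  # margin std, guarded against 0
    z = a / s
    return a * norm.cdf(z) + s * norm.pdf(z)

def train_svm_gsu(mus, Sigmas, ys, lam=0.01, lr=0.01, epochs=100, seed=0):
    """Primal SGD on (lam/2)*||w||^2 + mean expected hinge loss."""
    rng = np.random.default_rng(seed)
    n, d = mus.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            mu, Sigma, y = mus[i], Sigmas[i], ys[i]
            a = 1.0 - y * (w @ mu + b)
            s = np.sqrt(max(w @ Sigma @ w, 1e-12))
            z = a / s
            # d/da [a*Phi(z) + s*phi(z)] = Phi(z), d/ds [...] = phi(z).
            gw = lam * w - y * norm.cdf(z) * mu + norm.pdf(z) * (Sigma @ w) / s
            gb = -y * norm.cdf(z)
            w -= lr * gw
            b -= lr * gb
    loss = lam / 2 * (w @ w) + np.mean(
        [expected_hinge(w, b, mus[i], Sigmas[i], ys[i]) for i in range(n)])
    return w, b, loss
```

As every Σᵢ shrinks to zero, Φ(a/s) tends to a 0/1 indicator of a margin violation and the second gradient term vanishes, so the update reduces to the classical primal SVM subgradient step, consistent with the limiting behaviour stated in the abstract.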
Next, we extend the aforementioned linear classifier so as to obtain non-linear decision
boundaries, using the RBF kernel. This extension, which uses isotropic input
uncertainty and which we name Kernel SVM with Isotropic Gaussian Sample Uncertainty
(KSVM-iGSU), is applied to the problems of video event detection and video aesthetic
quality assessment. The experimental results show that exploiting input uncertainty,
especially in problems where only a limited number of positive training examples are
provided, can lead to better classification, detection, or retrieval performance. Finally,
we present a preliminary study on how the above ideas can be used under the deep
convolutional neural network learning paradigm so as to exploit inherent sources of
uncertainty, such as the spatial pooling operations that are usually used in deep networks.
Deliverable D1.4 Visual, text and audio information analysis for hypervideo, final release
Having extensively evaluated the performance of the technologies included in the first release of the WP1 multimedia analysis tools, using content from the LinkedTV scenarios and by participating in international benchmarking activities, we made concrete decisions regarding the appropriateness and importance of each individual method or combination of methods. Combined with an updated list of information needs for each scenario, these decisions led to a new set of analysis requirements to be addressed by the final release of the WP1 analysis techniques. To this end, coordinated efforts in three directions, namely (a) improving a number of methods in terms of accuracy and time efficiency, (b) developing new technologies, and (c) defining synergies between methods for obtaining new types of information via multimodal processing, resulted in the final set of multimedia analysis methods for video hyperlinking. Moreover, the different analysis modules have been integrated into a web-based infrastructure, allowing the multitude of WP1 technologies to be linked fully automatically with the overall LinkedTV platform.
Deliverable D7.5 LinkedTV Dissemination and Standardisation Report v2
This deliverable presents the LinkedTV dissemination and standardisation report for the project period of months 19 to 30 (April 2013 to March 2014).
Machine Learning Architectures for Video Annotation and Retrieval
In this thesis we design machine learning methodologies for solving the problem
of video annotation and retrieval using either pre-defined semantic concepts or ad-hoc
queries. Concept-based video annotation refers to the annotation of video fragments
with one or more semantic concepts (e.g. hand, sky, running) chosen from a pre-defined
concept list. Ad-hoc queries refer to textual descriptions that may contain objects,
activities, locations, etc., and combinations thereof. Our contributions are: i) A
thorough analysis of extending and using different local descriptors towards improved
concept-based video annotation, and a stacking architecture whose first layer consists
of concept classifiers trained on local descriptors and whose last layer improves their
prediction accuracy by implicitly capturing concept relations. ii) A cascade architecture
that orders and combines many classifiers, trained on different visual descriptors, for
the same concept. iii) A deep learning architecture that exploits concept relations at
two different levels: at the first level, we build on ideas from multi-task learning and
propose an approach to learn concept-specific representations that are sparse, linear
combinations of representations of latent concepts; at the second level, we build on
ideas from structured output learning and propose the introduction, at training time,
of a new cost term that explicitly models the correlations between the concepts, thereby
explicitly modeling the structure in the output space (i.e., the concept labels). iv) A
fully automatic ad-hoc video search architecture that combines concept-based video
annotation and textual query analysis, and transforms concept-based keyframe and
query representations into a common semantic embedding space. Our architectures
have been extensively evaluated on the TRECVID SIN 2013, the TRECVID AVS 2016,
and other large-scale datasets, demonstrating their effectiveness compared to similar
approaches.
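As a rough sketch of contribution i), the snippet below builds a two-layer stack: independent per-concept classifiers form the first layer, and a second layer re-predicts each concept from the full vector of first-layer scores, so that relations between concepts are captured implicitly. Logistic regression stands in for the actual classifiers, and all data shapes, names, and the number of concepts are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: X holds keyframe feature vectors, Y holds binary
# labels for K concepts (multi-label annotation).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))
Y = (rng.random((500, 10)) < 0.2).astype(int)  # 10 concepts
K = Y.shape[1]

# First layer: one independent classifier per concept.
layer1 = [LogisticRegression(max_iter=1000).fit(X, Y[:, k]) for k in range(K)]
S = np.column_stack([clf.predict_proba(X)[:, 1] for clf in layer1])

# Last layer: each concept is re-predicted from ALL first-layer scores,
# implicitly capturing concept relations (e.g. "sky" co-occurring with "sun").
layer2 = [LogisticRegression(max_iter=1000).fit(S, Y[:, k]) for k in range(K)]
refined = np.column_stack([clf.predict_proba(S)[:, 1] for clf in layer2])
```

In practice the last-layer classifiers would be trained on out-of-fold first-layer scores to avoid overfitting; both layers see the same data here purely for brevity.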
Deliverable D1.2 Visual, text and audio information analysis for hypervideo, first release
Enriching videos by offering continuative and related information via, e.g., audio streams, web pages, and other videos is typically hampered by the demand for massive editorial work. While several automatic and semi-automatic methods for analyzing audio/video content exist, one needs to decide which method offers appropriate information for our intended use-case scenarios. We review the technology options for video analysis that we have access to, and describe which training material we opted for to feed our algorithms. For all methods, we offer extensive qualitative and quantitative results, and give an outlook on the next steps within the project.