Video Content Understanding Using Text

Author: Mazaheri Amir
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/01/2020

The rise of social media and the video streaming industry has provided a plethora of videos together with corresponding descriptive information in the form of concepts (words) and textual video captions. Given the massive amount of available video and textual data, now is an excellent time to study Computer Vision and Machine Learning problems related to videos and text. In this dissertation, we tackle multiple problems associated with the joint understanding of videos and text. We first address the task of multi-concept video retrieval, where the input is a set of words as concepts and the output is a ranked list of full-length videos. This approach deals with multi-concept input and the long duration of videos by incorporating multiple latent variables to tie together the information within each shot (a short clip of a full video) and across shots. Secondly, we address the problem of video question answering, in which the task is to answer a question, in the form of Fill-In-the-Blank (FIB), given a video. Answering a question is a task of retrieving a word from a dictionary (all possible words suitable for an answer) based on the input question and video. Following the FIB problem, we introduce a new problem, called Visual Text Correction (VTC), i.e., detecting and replacing an inaccurate word in the textual description of a video. We propose a deep network that can simultaneously detect an inaccuracy in a sentence, using 1D-CNNs/LSTMs to encode short/long-term dependencies, and fix it by replacing the inaccurate word(s). Finally, as the last part of the dissertation, we tackle the problem of video generation from user-provided natural language sentences. Our proposed video generation method constructs two distributions from the input text, corresponding to the latent representations of the first and last frames. We generate high-fidelity videos by interpolating latent representations and applying a sequence of CNN-based up-pooling blocks.

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Trecvid 2019: an evaluation campaign to benchmark video activity detection, video captioning and matching, and video search & retrieval

Author: Awad George M.
Butt Asad A.
Delgado Andrew
Fiscus Jon
Godil Afzal
Graham Yvette
Lee Yooyoung
Smeaton Alan F.
Publication date: 12/11/2019

DCU Online Research Access Service

The AXES submissions at TrecVid 2013

Author: Aly Robin
Arandjelovic Relja
Chatfield Ken
Douze Matthijs
Fernando Basura
Harchaoui Zaid
McGuinness Kevin
O'Connor Noel E.
Oneata Dan
Parkhi Omkar M.
Potapov Danila
Revaud Jérôme
Schmid Cordelia
Schwenninger Jochen
Scott David
Tuytelaars Tinne
Verbeek Jakob
Wang Heng
Zisserman Andrew
Publication date: 01/11/2013

The AXES project participated in the interactive instance search task (INS), the semantic indexing task (SIN), the multimedia event recounting task (MER), and the multimedia event detection task (MED) for TRECVid 2013. Our interactive INS focused this year on using classifiers trained at query time with positive examples collected from external search engines. Our INS experiments were carried out by students and researchers at Dublin City University. Our best INS runs performed on par with the top-ranked INS runs in terms of P@10 and P@30, and around the median in terms of mAP. For SIN, MED and MER, we use systems based on state-of-the-art local low-level descriptors for motion, image, and sound, as well as high-level features to capture speech and text from the audio and visual streams, respectively. The low-level descriptors are aggregated by means of Fisher vectors into high-dimensional video-level signatures; the high-level features are aggregated into bag-of-words histograms. Using these features we train linear classifiers, and use early and late fusion to combine the different features. Our MED system achieved the best score of all submitted runs in the main track, as well as in the ad-hoc track. This paper describes in detail our INS, MER, and MED systems and the results and findings of our experiments.
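
The abstract above reports results in terms of P@10, P@30 and mAP. As a reading aid, the following is a minimal sketch of the textbook definitions of these ranking metrics, assuming binary relevance judgments for a ranked result list. The function names are illustrative, and this is not necessarily the exact scoring procedure used by TRECVid.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked items that are relevant (P@k)."""
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance):
    """Mean of P@i taken over every rank i that holds a relevant item.

    Assumption: all relevant items appear somewhere in the ranked list,
    so the number of relevant items is the sum of the judgments.
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank
    return score / total_relevant


def mean_average_precision(runs):
    """Average of the per-query average precision values (mAP)."""
    return sum(average_precision(run) for run in runs) / len(runs)


# Hypothetical judgments for one query: 1 = relevant, 0 = not relevant.
ranked = [1, 0, 1, 1, 0]
p_at_3 = precision_at_k(ranked, 3)
ap = average_precision(ranked)
```

P@k only looks at the top of the list, while average precision rewards placing every relevant item early, which is why a system can be on par with the top runs on one metric and near the median on the other.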

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Irish Universities

DCU Online Research Access Service

HAL-Rennes 1

TRECVID 2018: Benchmarking Video Activity Detection, Video Captioning and Matching, Video Storytelling Linking and Video Search

Author: Awad George
Blasi Saverio
Butt Asad
Curtis Keith
Delgado Andrew
Fiscus Jonathan
Godil Afzal
Graham Yvette
Joy David
Kraaij Wessel
Lee Yooyoung
Magalhaes Joao
Quénot Georges
Semedo David
Smeaton Alan
Publication venue: HAL CCSD
Publication date: 13/11/2018

Hal - Université Grenoble Alpes

TRECVID 2014 -- An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics

Author: Awad George
Fiscus Jon
Joy David
Kraaij Wessel
Michel Martial
Over Paul
Quénot Georges
Sanders Greg
Smeaton Alan
Publication venue: HAL CCSD
Publication date: 01/01/2014

The TREC Video Retrieval Evaluation (TRECVID) 2014 was a TREC-style video analysis and retrieval evaluation, the goal of which remains to promote progress in content-based exploitation of digital video via open, metrics-based evaluation. Over the last dozen years this effort has yielded a better understanding of how systems can effectively accomplish such processing and how one can reliably benchmark their performance. TRECVID is funded by NIST with support from other US government agencies. Many organizations and individuals worldwide contribute significant time and effort

Hal - Université Grenoble Alpes

TRECVID 2015 – An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics

Author: Aly Robin
Awad George
Fiscus Jon
Joy David
Kraaij Wessel
Michel Martial
Ordelman Roeland
Over Paul
Quénot Georges
Smeaton Alan
Publication venue: HAL CCSD
Publication date: 16/11/2015

Audiovisual Speaker Clustering for News Broadcast Videos

Author: Kayal Subhradeep
Publication date: 10/06/2015

Aaltodoc Publication Archive

Deep Reinforcement Sequence Learning for Visual Captioning

Author: Laria Mantecón Héctor
Publication date: 19/08/2019

Methods to describe an image or video with natural language, namely image and video captioning, have recently converged into an encoder-decoder architecture. The encoder here is a deep convolutional neural network (CNN) that learns a fixed-length representation of the input image, and the decoder is a recurrent neural network (RNN), initialised with this representation, that generates a description of the scene in natural language. Traditional training mechanisms for this architecture usually optimise models using cross-entropy loss, which suffers from two major problems. First, it inherently presents exposure bias (the model is only exposed to real descriptions, not to its own words), causing errors to accumulate at test time. Second, the ultimate objective is not directly optimised because the scoring metrics cannot be used in the procedure, as they are non-differentiable. New applications of reinforcement learning algorithms, such as self-critical training, overcome the exposure bias while directly optimising non-differentiable sequence-based test metrics. This thesis reviews and analyses the performance of these different optimisation algorithms. Experiments on self-critical loss show the importance of using metrics that are robust against gaming as the reward for the model; otherwise the qualitative performance is completely undermined. With that addressed, the results do not reflect a large quality improvement; rather, the expressiveness worsens and the vocabulary moves closer to what the reference uses. Subsequent experiments with a greatly improved encoder result in a marginal improvement of the overall results, suggesting that the policy obtained is heavily constrained by the decoder language model. The thesis concludes that further analysis with higher-capacity language models needs to be performed.
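
The self-critical training mentioned in the abstract above can be illustrated with a minimal sketch of the commonly used formulation: the reward of the model's own greedy caption serves as a baseline for the reward of a sampled caption. The function name, the token log-probabilities and the reward values are hypothetical, and this is not necessarily how the thesis implements it.

```python
def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE-style loss with the greedy caption's reward as baseline.

    sample_logprobs: log-probabilities of the tokens in a sampled caption.
    sample_reward:   metric score of the sampled caption (any scalar reward).
    greedy_reward:   metric score of the greedily decoded caption.
    """
    # Positive advantage: the sample beat the model's own greedy output,
    # so minimising the loss raises the sample's log-probability.
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


# Hypothetical values: a three-token sampled caption that outscores
# the greedy caption on the reward metric.
loss = self_critical_loss([-0.5, -1.0, -0.5], sample_reward=0.8, greedy_reward=0.6)
```

Because the reward enters only as a scalar weight, the metric itself does not need to be differentiable, which is the property the abstract refers to.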

Aaltodoc Publication Archive