Unsupervised video indexing on audiovisual characterization of persons

Author: El-khoury Elie
Publication date: 03/06/2010

This thesis proposes a method for the unsupervised characterization of the people who appear in audiovisual documents, using data related to their physical appearance and their voice. In general, automatic identification methods, whether in video or in audio, require a large amount of a priori knowledge about the content. The goal of this work is to study the two modalities jointly and to exploit their respective properties in a collaborative and robust way, in order to produce a reliable result that is as independent as possible of any a priori knowledge.
More specifically, we studied the characteristics of the audio stream and proposed several methods for speaker segmentation and clustering, which we evaluated in a French evaluation campaign. We then carried out an in-depth study of visual descriptors (face, clothing), which led us to propose new approaches for detecting, tracking, and clustering people within the document. Finally, the work focused on audiovisual fusion, proposing an approach based on computing a co-occurrence matrix that allowed us to establish an association between the audio index and the video index and to correct them. This makes it possible to produce a dynamic audiovisual model of each speaker.
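The co-occurrence idea in this abstract can be illustrated with a minimal sketch. It assumes each frame already carries a speaker-cluster label from the audio index and a person-cluster label from the video index. The function names, the label format, and the "most frequent pairing" rule are hypothetical choices made for illustration, not the method of the thesis.

```python
from collections import Counter


def cooccurrence_matrix(audio_labels, video_labels):
    """Count how often each (speaker, face) label pair shares a frame."""
    if len(audio_labels) != len(video_labels):
        raise ValueError("label sequences must cover the same frames")
    counts = Counter()
    for speaker, face in zip(audio_labels, video_labels):
        # None marks a frame with no speech or no visible person (assumed convention).
        if speaker is not None and face is not None:
            counts[(speaker, face)] += 1
    return counts


def associate(counts):
    """Map each speaker cluster to the face cluster it co-occurs with most often."""
    best = {}
    for (speaker, face), n in counts.items():
        if speaker not in best or n > best[speaker][1]:
            best[speaker] = (face, n)
    return {speaker: face for speaker, (face, _) in best.items()}


# Toy per-frame labels, invented for this example.
audio = ["S1", "S1", "S1", "S2", "S2", None, "S2", "S1"]
video = ["F1", "F1", "F2", "F2", "F2", "F2", None, "F1"]

counts = cooccurrence_matrix(audio, video)
mapping = associate(counts)
print(mapping)
```

In this toy input, the single frame where S1 overlaps F2 is outvoted by the three frames where S1 overlaps F1, which hints at how an association of this kind could also be used to flag and correct isolated labeling errors in either index.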

Thèses en ligne de l'Université Toulouse III - Paul Sabatier

Approximate Nearest Neighbor Fields in Video

Author: Ben-Zrihem Nir
Zelnik-Manor Lihi
Publication date: 31/08/2015

We introduce RIANN (Ring Intersection Approximate Nearest Neighbor search), an algorithm for matching patches of a video to a set of reference patches in real time. For each query, RIANN finds potential matches by intersecting rings around key points in appearance space. Its search complexity is inversely correlated with the amount of temporal change, making it a good fit for videos, where most patches typically change slowly over time. Experiments show that RIANN is up to two orders of magnitude faster than previous ANN methods, and is the only solution that operates in real time. We further demonstrate how RIANN can be used for real-time video processing and provide examples for a range of real-time video applications, including colorization, denoising, and several artistic effects. Comment: A CVPR 2015 oral paper
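The ring-intersection idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names, the ring half-width `eps`, and the linear scan over stored distances are all assumptions. The sketch relies only on the triangle inequality: if a reference lies within `eps` of the query, then its distance to any key point differs from the query's distance to that key point by at most `eps`, so it falls inside every ring.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index(references, key_points):
    """For each key point, store the distance from every reference to it."""
    return [[dist(r, k) for r in references] for k in key_points]


def ring_candidates(query, key_points, index, eps):
    """Intersect the rings (one per key point) that contain the query."""
    candidates = None
    for k, ref_dists in zip(key_points, index):
        d = dist(query, k)
        # Ring around k: references whose distance to k is within eps of d.
        # A practical index would keep ref_dists sorted and use binary search.
        ring = {i for i, rd in enumerate(ref_dists) if abs(rd - d) <= eps}
        candidates = ring if candidates is None else candidates & ring
    return candidates if candidates is not None else set()


def nearest_in_rings(query, references, key_points, index, eps):
    """Return the index of the closest candidate, or None if no ring overlaps."""
    cands = ring_candidates(query, key_points, index, eps)
    if not cands:
        return None
    return min(cands, key=lambda i: dist(query, references[i]))


# Toy 2-D "appearance space", invented for this example.
references = [(0, 0), (10, 0), (0, 10), (10, 10)]
key_points = [(0, 0), (10, 0)]
index = build_index(references, key_points)

match = nearest_in_rings((9, 1), references, key_points, index, eps=2.0)
print(match)
```

Only the candidates that survive the intersection are compared against the query directly, so the exact distance computation runs on a small subset of the reference set.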

arXiv.org e-Print Archive

Crossref

Interactive Video Annotation Tool

Author: García Jesús
Molina José M.
Patricio Guisado Miguel Ángel
Serrano Miguel Á.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Proceedings of: Fourth International Workshop on User-Centric Technologies and Applications (CONTEXTS 2010), Valencia, 7-10 September 2010. Abstract: The computer vision discipline increasingly needs annotated video databases to carry out assessment tasks. Manually providing ground truth data for multimedia resources is very expensive in terms of effort, time, and economic resources. Automatic and semi-automatic video annotation and labeling is the fastest and most economical way to obtain ground truth for quite large video collections. In this paper, we describe a new automatic and supervised video annotation tool. The annotation tool is a modified version of the ViPER-GT tool. The standard version of ViPER-GT allows manual editing and reviewing of video metadata to generate assessment data. The automatic annotation capability is made possible by an incorporated tracking system that can deal with the visual data association problem in real time. The aim of the research is to offer a system that reduces the time spent producing valid assessment models.

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

Author: Bao Linchao
He Shengfeng
Jiao Jianbo
Liu Wei
Liu Yunhui
Wang Jiangliu
Publication date: 01/01/2019

We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely on a frame-by-frame basis, which are not applicable to many video analytic tasks where spatio-temporal features are prevailing. In this paper we propose a novel self-supervised approach to learn spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along spatial and temporal dimensions, given only the input video data. Specifically, we extract statistical concepts (fast-motion region and the corresponding dominant direction, spatio-temporal color diversity, dominant color, etc.) from simple patterns in both spatial and temporal domains. Unlike prior puzzles that are even hard for humans to solve, the proposed approach is consistent with human inherent visual habits and therefore easy to answer. We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach. The experiments show that our approach can significantly improve the performance of C3D when applied to video classification tasks. Code is available at https://github.com/laura-wang/video_repres_mas. Comment: CVPR 201
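One of the statistical targets named in this abstract, the fast-motion region and its dominant direction, can be illustrated with a rough sketch. It assumes a dense optical-flow field is already available as a grid of `(dx, dy)` vectors. The block layout, the number of direction bins, and the function name are hypothetical, and the authors' actual label definitions may differ (see their repository).

```python
import math


def motion_labels(flow, grid=2, bins=8):
    """Return (block_index, direction_bin) for the block with the most motion.

    flow is an H x W nested list of (dx, dy) vectors. The frame is split into
    grid x grid blocks; directions are quantized into `bins` equal sectors.
    """
    h, w = len(flow), len(flow[0])
    bh, bw = h // grid, w // grid
    best_block, best_total, best_hist = None, -1.0, None
    for by in range(grid):
        for bx in range(grid):
            total = 0.0
            hist = [0.0] * bins
            for y in range(by * bh, (by + 1) * bh):
                for x in range(bx * bw, (bx + 1) * bw):
                    dx, dy = flow[y][x]
                    mag = math.hypot(dx, dy)
                    total += mag
                    # Magnitude-weighted vote for the sector containing the angle.
                    angle = math.atan2(dy, dx) % (2 * math.pi)
                    hist[int(angle / (2 * math.pi) * bins) % bins] += mag
            if total > best_total:
                best_block, best_total, best_hist = by * grid + bx, total, hist
    direction = max(range(bins), key=lambda b: best_hist[b])
    return best_block, direction


# Toy 4 x 4 flow field, invented for this example: only the bottom-right
# 2 x 2 block moves, along the vector (1, 2).
flow = [[(0.0, 0.0)] * 4 for _ in range(4)]
for y in (2, 3):
    for x in (2, 3):
        flow[y][x] = (1.0, 2.0)

labels = motion_labels(flow)
print(labels)
```

A network trained in this style would regress such labels from raw frames, so the supervision signal comes entirely from the video itself rather than from human annotation.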

arXiv.org e-Print Archive

Crossref

University of Birmingham Research Portal

Institutional Knowledge at Singapore Management University