UPC system for the 2016 MediaEval multimodal person discovery in broadcast TV task
The UPC system works by extracting monomodal signal segments (face tracks, speech segments) that overlap with the person names overlaid in the video signal. These segments are directly assigned the name of the person and used as a reference to compare against the non-overlapping (unassigned) signal segments. This process is performed independently on both the speech and video signals. A simple fusion scheme is used to combine both monomodal annotations into a single one.
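A minimal sketch of this name-propagation idea, assuming segments are (start, end) intervals with precomputed embeddings; the function names, segment representation, and threshold are illustrative assumptions, not the published UPC implementation:

```python
# Hypothetical sketch of name propagation from overlaid names to
# unassigned segments. Segment representation, embeddings, and the
# cosine threshold are invented for illustration.
import numpy as np

def overlaps(seg, name_span):
    """True if a (start, end) segment overlaps an overlaid-name span."""
    return seg[0] < name_span[1] and name_span[0] < seg[1]

def propagate_names(segments, embeddings, name_spans, threshold=0.7):
    """Assign overlaid names to overlapping segments, then label the
    remaining segments by cosine similarity to the named references."""
    labels = [None] * len(segments)
    # Step 1: segments overlapping an overlaid name take that name directly.
    for i, seg in enumerate(segments):
        for span, name in name_spans:
            if overlaps(seg, span):
                labels[i] = name
    refs = [(embeddings[i], labels[i])
            for i in range(len(segments)) if labels[i] is not None]
    # Step 2: compare each unassigned segment against the named references.
    for i, emb in enumerate(embeddings):
        if labels[i] is None and refs:
            scores = [(np.dot(emb, r) / (np.linalg.norm(emb) * np.linalg.norm(r)), n)
                      for r, n in refs]
            best_score, best_name = max(scores)
            if best_score > threshold:
                labels[i] = best_name
    return labels
```

The same routine would run separately on face tracks and on speech segments, producing the two monomodal annotations that the fusion step then combines.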
Multimodal system for person recognition in TV recordings
Recognition of people by speech and face in TV shows; participation in the international competition MediaEval 2016. The project described in this document falls within the topic of person recognition in TV
recordings by means of multimodal systems.
It has been developed as a collaboration with the image and audio processing groups of the
Signal Theory Department at UPC. It is thus a project on the development of a person
recognition system for broadcast TV videos, implemented for participation in the
MediaEval 2016 workshop.
The aim of the competition is to find the names of the people who appear and speak
in each shot of the videos in a given database. This discovery must be done in a
totally unsupervised manner, using only the information present in each shot, such as the
image, audio, or text.
For this purpose, the fusion of three monomodal algorithms has been proposed. These
technologies process the information present in the text, image, and audio
independently. The monomodal outputs are then fused with the objective of creating
a multimodal algorithm able to tag the shots in the database, as sketched below.
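One simple form such a late fusion could take is a per-shot majority vote over the monomodal hypotheses; the function and field names here are hypothetical, not the thesis implementation:

```python
# Illustrative late-fusion sketch: each monomodal tagger returns, per shot,
# a hypothesized person name (or None). A majority vote over the text,
# image, and audio hypotheses yields the multimodal label.
from collections import Counter

def fuse_shot_labels(text_name, image_name, audio_name):
    """Majority vote over the three monomodal hypotheses; ties or
    all-None inputs yield no label for the shot."""
    votes = Counter(n for n in (text_name, image_name, audio_name) if n is not None)
    if not votes:
        return None
    (name, count), *rest = votes.most_common(2)
    if rest and rest[0][1] == count:  # tie between two distinct names
        return None
    return name

# Example: image and audio agree, text abstains.
print(fuse_shot_labels(None, "jane_doe", "jane_doe"))  # -> "jane_doe"
```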
Finally, this thesis is centered on the development of the monomodal audio algorithm, for which a tracking system based on i-vectors has been proposed.
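A minimal sketch of i-vector-based speaker tracking via cosine scoring, assuming i-vectors have already been extracted for each speech segment and each named reference; the names and decision threshold are illustrative assumptions:

```python
# Minimal sketch of i-vector speaker tracking via cosine scoring.
# Assumes i-vectors (fixed-length speaker embeddings) are already
# extracted per speech segment; threshold is an illustrative value.
import numpy as np

def cosine_score(w1, w2):
    """Cosine similarity between two i-vectors."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def track_speaker(segment_ivectors, reference_ivectors, threshold=0.5):
    """Label each segment with the best-matching reference speaker,
    or None when no reference scores above the threshold."""
    labels = []
    for w in segment_ivectors:
        scores = {name: cosine_score(w, ref)
                  for name, ref in reference_ivectors.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] > threshold else None)
    return labels
```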