    The crowd as a cameraman: on-stage display of crowdsourced mobile video at large-scale events

    Recording videos with smartphones at large-scale events such as concerts and festivals is now very common. These videos capture the atmosphere of the event as the crowd experiences it and offer a perspective that is hard to obtain with the professional cameras installed throughout the venue. In this article, we present a framework to collect videos from the smartphones of audience members and blend them into a mosaic that can be readily mixed with professional camera footage and shown on displays during the event. Video upload is prioritized by matching requests of the event director with video metadata, while taking into account the available wireless network capacity. The proposed framework's main novelty is its scalability, supporting the real-time transmission, processing and display of videos recorded by hundreds of simultaneous users in ultra-dense Wi-Fi environments, as well as its proven integration in commercial production environments. The framework has been extensively validated in a controlled lab setting with up to 1 000 clients as well as in a field trial where 1 183 videos were collected from 135 participants recruited from an audience of 8 050 people. 90 % of those videos were uploaded within 6.8 minutes.
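
    The abstract does not say how the prioritization is computed. As a minimal sketch only, assuming that video metadata is a set of tags, that the director's request is also a set of tags, and that network capacity is a simple megabyte budget (all hypothetical simplifications, as are the names `Video`, `match_score` and `schedule_uploads`), a greedy scheduler could look like this:

```python
from dataclasses import dataclass


@dataclass
class Video:
    video_id: str
    tags: frozenset      # hypothetical metadata, e.g. {"stage", "wide"}
    size_mb: float       # upload size in megabytes


def match_score(video, requested_tags):
    """Fraction of the director's requested tags found in the video metadata."""
    if not requested_tags:
        return 0.0
    return len(video.tags & requested_tags) / len(requested_tags)


def schedule_uploads(videos, requested_tags, capacity_mb):
    """Greedily admit the best-matching videos until the capacity budget is spent."""
    # Best match first; among equal matches, prefer the smaller upload.
    ranked = sorted(
        videos,
        key=lambda v: (-match_score(v, requested_tags), v.size_mb),
    )
    admitted, remaining = [], capacity_mb
    for video in ranked:
        if video.size_mb <= remaining:
            admitted.append(video.video_id)
            remaining -= video.size_mb
    return admitted


videos = [
    Video("a", frozenset({"stage", "wide"}), 40.0),
    Video("b", frozenset({"crowd"}), 10.0),
    Video("c", frozenset({"stage"}), 30.0),
]
order = schedule_uploads(videos, frozenset({"stage", "wide"}), capacity_mb=60.0)
print(order)
```

    The greedy rule is an assumption chosen for brevity; a real deployment in a dense Wi-Fi setting would likely need to re-estimate capacity continuously rather than use a fixed budget.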

    Visual search for musical performances and endoscopic videos

    This project explores the potential of LIRE, an existing Content-Based Image Retrieval (CBIR) system, when used to retrieve medical videos. These videos are recordings of the live streams used by surgeons during endoscopic procedures, captured from inside the patient. The growth of such video content stored on servers requires search engines capable of assisting surgeons in its management and retrieval. In our tool, queries are formulated by visual example, which allows surgeons to re-find shots taken during a procedure. This thesis presents an extension and adaptation of LIRE for video retrieval based on visual features and late fusion. The results are assessed from two perspectives, one quantitative and one qualitative. While the quantitative assessment follows standard practices and metrics for video retrieval, the qualitative assessment is based on an empirical social study using a semi-interactive web interface. In particular, a thinking-aloud test was applied to analyze whether user expectations and requirements were fulfilled. Due to the scarcity of surgeons available for the qualitative tests, a second domain was also addressed: videos captured at musical performances. This type of video has also experienced exponential growth with the advent of affordable multimedia smartphones, available to a large audience. As with endoscopic videos, searching a large data set of such videos is a challenging topic and a subject of study.
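
    Late fusion, as mentioned in the abstract, combines the rankings produced by several visual features after each feature has been scored separately. The abstract does not give the fusion rule, so the following is only an illustrative sketch: it assumes min-max normalization and a weighted sum, and the feature names (`color`, `edge`), weights and function names are hypothetical rather than part of the LIRE API.

```python
def min_max_normalize(distances):
    """Rescale a {shot_id: distance} mapping to the range [0, 1]."""
    low, high = min(distances.values()), max(distances.values())
    span = high - low
    if span == 0:
        return {shot: 0.0 for shot in distances}
    return {shot: (d - low) / span for shot, d in distances.items()}


def late_fusion(per_feature_distances, weights):
    """Combine per-feature distances into one ranking (smaller is more similar)."""
    fused = {}
    for feature, distances in per_feature_distances.items():
        # Normalize first so features with large raw distances do not dominate.
        for shot, d in min_max_normalize(distances).items():
            fused[shot] = fused.get(shot, 0.0) + weights[feature] * d
    return sorted(fused, key=fused.get)


# Distances from the query image to three candidate shots, one dict per feature.
distances = {
    "color": {"shot1": 0.2, "shot2": 0.8, "shot3": 0.5},
    "edge": {"shot1": 30.0, "shot2": 10.0, "shot3": 20.0},
}
ranking = late_fusion(distances, {"color": 0.75, "edge": 0.25})
print(ranking)
```

    Normalizing before summing is the design choice that matters here: without it, the feature with the largest numeric range would decide the ranking regardless of the weights.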

    Quality-aware Content Adaptation in Digital Video Streaming

    User-generated video has attracted a lot of attention due to the success of video sharing sites such as YouTube and of online social networks. Recently, a shift towards live consumption of these videos has become observable. The content is captured and instantly shared over the Internet using smart mobile devices such as smartphones. Large-scale platforms such as YouTube.Live, YouNow or Facebook.Live have arisen, enabling users' smartphones to livestream to the public. These platforms distribute tens of thousands of low-resolution videos to remote viewers in parallel. Nonetheless, the providers cannot guarantee an efficient collection and distribution of high-quality video streams. As a result, the user experience is often degraded, and the infrastructure required is huge. Efficient methods are required to cope with the increasing demand for these video streams, and an understanding is needed of how to capture, process and distribute the videos to guarantee a high-quality experience for viewers. This thesis addresses the quality awareness of user-generated videos by leveraging the concept of content adaptation. Two types of content adaptation, adaptive video streaming and video composition, are discussed in this thesis. Then, a novel approach is presented for the scenario of a live upload from mobile devices, the processing of the video streams and their distribution. This thesis demonstrates that content adaptation applied to each step of this scenario, ranging from the upload to the consumption, can significantly improve the quality for the viewer. At the same time, if content adaptation is planned wisely, the data traffic can be reduced while keeping the quality for the viewers high. The first contribution of this thesis is a better understanding of the perceived quality in user-generated video and its influencing factors. 
Subjective studies are performed to understand what affects human perception, leading to first-of-their-kind quality models. The developed quality models are used for the second contribution of this work: novel quality assessment algorithms. A unique attribute of these algorithms is the use of multiple features from different sensors. Whereas classical video quality assessment algorithms focus on the visual information, the proposed algorithms reduce the runtime by an order of magnitude when using data from other sensors in video capturing devices. Still, the scalability of quality assessment is limited when the algorithms are executed on a single server. This is solved with the proposed placement and selection component. It allows the distribution of quality assessment tasks to mobile devices and thus increases the scalability of existing approaches by up to 33.71% when using the resources of only 15 mobile devices. These three contributions are required to provide a real-time understanding of the perceived quality of the video streams produced on mobile devices. The upload of video streams is the fourth contribution of this work. It relies on content and mechanism adaptation. The thesis introduces the first prototypically evaluated adaptive video upload protocol (LiViU), which transcodes multiple video representations in real time and copes with changing network conditions. In addition, a mechanism adaptation is integrated into LiViU to react to changing application scenarios such as streaming high-quality videos to remote viewers or distributing video with minimal delay to close-by recipients. A second type of content adaptation is discussed in the fifth contribution of this work. An automatic video composition application is presented which enables live composition from multiple user-generated video streams. 
The proposed application is the first of its kind, allowing the in-time composition of high-quality video streams by inspecting the quality of individual video streams, recording locations and cinematographic rules. As a last contribution, the content-aware adaptive distribution of video streams to mobile devices is introduced by the Video Adaptation Service (VAS). The VAS analyzes the streamed video content to understand which adaptations are most beneficial for a viewer. It maximizes the perceived quality for each video stream individually and at the same time tries to produce as little data traffic as possible, achieving a data traffic reduction of more than 80%.

    Combining content analysis with usage analysis to better understand visual contents

    This thesis focuses on the problem of understanding visual contents, which can be images, videos or 3D contents. Understanding means that we aim at inferring semantic information about the visual content. The goal of our work is to study methods that combine two types of approaches: 1) automatic content analysis and 2) an analysis of how humans interact with the content (in other words, usage analysis). We start by reviewing the state of the art from both the Computer Vision and Multimedia communities. Twenty years ago, the main approach aimed at a fully automatic understanding of images. Today this approach gives way to different forms of human intervention, whether through the constitution of annotated datasets, by solving problems interactively (e.g. detection or segmentation), or by the implicit collection of information gathered from content usage. These different types of human intervention are at the heart of modern research questions: How to motivate human contributors? How to design interactive scenarios that will generate interactions that contribute to content understanding? How to check or ensure the quality of human contributions? How to aggregate human contributions? How to fuse inputs obtained from usage analysis with traditional outputs from content analysis? Our literature review addresses these questions and allows us to position the contributions of this thesis. In our first set of contributions we revisit the detection of important (or salient) regions through implicit feedback from users that either consume or produce visual contents. In 2D, we develop several interfaces for interactive video (e.g. zoomable video) in order to coordinate content analysis and usage analysis. We also generalize these results to 3D by introducing a new detector of salient regions that builds upon simultaneous video recordings of the same public artistic performance (dance show, chant, etc.) by multiple users. 
The second contribution of our work aims at a semantic understanding of still images. With this goal in mind, we use data gathered through a game, Ask’nSeek, that we created. Elementary interactions (such as clicks) together with textual input data from players are, as before, combined with automatic analysis of images. In particular, we show the usefulness of interactions that help reveal spatial relations between different objects in a scene. After studying the problem of detecting objects in a scene, we also address the more ambitious problem of segmentation.

    Understanding visual contents through joint analysis of content and usage (original title: Compréhension de contenus visuels par analyse conjointe du contenu et des usages)

    Semi-Automation in Video Editing

    How can we use artificial intelligence (AI) and machine learning (ML) to make video editing as easy as "editing text"? In this thesis, the problem of using AI to support video editing is explored from the human-AI interaction perspective, with the emphasis on using AI to support users. Video is a dual-track medium with audio and visual tracks. Editing videos requires synchronizing these two tracks and performing precise operations at the millisecond level. Making it as easy as editing text might not currently be possible. How, then, should we support users with AI, and what are the current challenges in doing so? Five key questions drove the research in this thesis. What is the state of the art in using AI to support video editing? What are the needs and expectations of video professionals regarding AI? What is the impact on the efficiency and accuracy of subtitles when AI is used to support subtitling? What changes in user experience does AI-assisted subtitling bring? How can multiple AI methods be used to support cropping and panning tasks? We employed a user-experience-focused, task-based approach to address semi-automation in video editing. The first paper of this thesis provided a synthesis and critical review of the existing work on AI-based tools for video editing and, through a survey of 14 video professionals, provided some answers on how AI should be used and what more it can do to support users. The second paper presented a prototype of AI-assisted subtitling built on production-grade video editing software. It is the first comparative evaluation of both performance and user experience of AI-assisted subtitling with 24 novice users. The third paper described an idiom-based tool for converting widescreen videos made for television to narrower aspect ratios for mobile and social media platforms. It explores a new method to perform cropping and panning using five AI models, and presents an evaluation with 5 users and a review with a professional video editor.
    Doctoral thesis
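
    The geometric core of converting a widescreen frame to a narrower aspect ratio can be sketched independently of the AI models, which the abstract does not describe. The example below is an assumption-laden illustration: it supposes that some upstream detector supplies a horizontal point of interest per frame (`center_x`), and the function names and the smoothing factor `alpha` are hypothetical.

```python
def crop_window(frame_w, frame_h, target_ratio, center_x):
    """Return (left, top, width, height) of a full-height crop with the target ratio.

    The window is centered on center_x where possible and clamped to the frame.
    """
    crop_w = min(frame_w, round(frame_h * target_ratio))
    left = round(center_x - crop_w / 2)
    left = max(0, min(left, frame_w - crop_w))
    return left, 0, crop_w, frame_h


def smooth_centers(centers, alpha=0.5):
    """Exponentially smooth per-frame centers so the virtual pan does not jitter."""
    smoothed, current = [], centers[0]
    for c in centers:
        current = alpha * c + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# 1920x1080 widescreen frame converted to a 9:16 vertical crop.
# The point of interest sits near the right edge, so the window is clamped.
window = crop_window(1920, 1080, 9 / 16, center_x=1800)
print(window)
```

    Smoothing the centers before cropping approximates a slow camera pan; how the thesis actually selects and moves the crop region is not stated in the abstract.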