    Automatic indexing of video content via the detection of semantic events

    The number and size of digital video databases are continuously growing. Unfortunately, most, if not all, of the video content in these databases is stored without any sort of indexing or analysis and without any associated metadata. Where videos do have metadata, it is usually the result of a manual annotation process rather than any automatic indexing. Thus, locating clips and browsing content is difficult, time-consuming and generally inefficient. The task of automatically indexing movies is particularly difficult given their innovative creation process and the individual style of many filmmakers. However, a number of underlying film grammar conventions are universally followed, from a Hollywood blockbuster to an underground movie with a limited budget. These conventions dictate many elements of film making, such as camera placement and editing. By examining the use of these conventions it is possible to extract information about the events in a movie. This research aims to provide an approach that creates an indexed version of a movie to facilitate easy browsing and efficient retrieval. To achieve this aim, all of the relevant events contained within a movie are detected and classified into a predefined index. The event detection process examines the underlying structure of a movie and utilises audiovisual analysis techniques, supported by machine learning algorithms, to extract information based on this structure. The result is an indexed movie that can be presented to users for browsing and retrieval of relevant events, as well as supporting user-specified searching. Extensive evaluation of the indexing approach indicates efficient performance of the event detection and retrieval system, and also highlights the subjective nature of video content.
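
    As a rough illustration of the kind of pipeline this abstract describes (and not the thesis's actual implementation), the sketch below maps hypothetical per-shot audiovisual features to event labels with a generic supervised classifier; the feature set, labels and classifier choice are all assumptions.

        # Illustrative sketch: classify shots into event categories from
        # assumed per-shot features [motion activity, audio energy,
        # shot length (s), face count].
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        X_train = np.array([
            [0.8, 0.9, 1.2, 0],   # short, fast, loud shot -> "exciting"
            [0.1, 0.2, 6.0, 2],   # long, quiet shot with faces -> "dialogue"
            [0.7, 0.8, 1.5, 0],
            [0.2, 0.1, 5.0, 2],
        ])
        y_train = ["exciting", "dialogue", "exciting", "dialogue"]

        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X_train, y_train)

        # Index a new shot by predicting its event class.
        print(clf.predict(np.array([[0.75, 0.85, 1.3, 0]])))  # -> ['exciting']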

    Interactive Fiction in Cinematic Virtual Reality: Epistemology, Creation and Evaluation

    This dissertation presents Interactive Fiction in Cinematic Virtual Reality (IFcVR), an interactive digital narrative (IDN) that brings together cinematic virtual reality (cVR) and the creation of virtual environments through 360° video within an interactive fiction (IF) structure. The work is structured in three components: an epistemological approach to this kind of narrative and media hybrid; the creation process of IFcVR, from development to postproduction; and user evaluation of IFcVR. In order to set the foundations for the creation of interactive VR fiction films, I dissect the IFcVR by investigating the aesthetic, narratological and interactive notions that converge and diverge in it, proposing a medium-conscious narratology for this kind of artefact. This analysis led to the production of a functional IFcVR prototype: “ZENA”, the first interactive VR film shot in Genoa. ZENA’s creation process is reported, proposing guidelines for interactive and immersive film-makers. In order to evaluate the effectiveness of IFcVR as an entertaining narrative form and a vehicle for diverse types of messages, this study also proposes a methodology to measure User Experience (UX) in IFcVR. The full evaluation protocol gathers both qualitative and quantitative data through ad hoc instruments, and is illustrated through its pilot application on ZENA. Findings show interactors' positive acceptance of IFcVR as an entertaining experience.
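
    As a sketch of what an interactive fiction structure over 360° video can look like in practice, the following hypothetical data structure models scenes as nodes holding a clip path, and viewer choices as edges; none of these names come from the dissertation.

        # Hypothetical IFcVR story graph: nodes are 360° clips, edges are
        # viewer choices; playback walks the graph until a leaf is reached.
        from dataclasses import dataclass, field

        @dataclass
        class SceneNode:
            clip: str                                     # path to a 360° video file
            choices: dict = field(default_factory=dict)   # choice label -> next node id

        story = {
            "intro":    SceneNode("intro.mp4", {"follow her": "alley", "stay": "plaza"}),
            "alley":    SceneNode("alley.mp4", {"run": "ending_a"}),
            "plaza":    SceneNode("plaza.mp4", {"wait": "ending_b"}),
            "ending_a": SceneNode("ending_a.mp4"),
            "ending_b": SceneNode("ending_b.mp4"),
        }

        node = story["intro"]
        print("playing", node.clip)
        while node.choices:
            label, next_id = next(iter(node.choices.items()))  # stand-in for user input
            print("chose:", label)
            node = story[next_id]
            print("playing", node.clip)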

    Human-machine cooperation in large-scale multimedia retrieval: a survey

    Large-Scale Multimedia Retrieval (LSMR) is the task of quickly analysing a large amount of multimedia data, such as images or videos, and accurately finding the items relevant to a certain semantic meaning. Although LSMR has been investigated for more than two decades in the fields of multimedia processing and computer vision, a more interdisciplinary approach is necessary to develop an LSMR system that is really meaningful for humans. To this end, this paper aims to draw attention to the LSMR problem from diverse research fields. After explaining basic terminology in LSMR, we first survey several representative methods in chronological order. This survey reveals that, by prioritising generality and scalability for large-scale data, recent methods interpret semantic meanings with a completely different mechanism from humans, although human-like mechanisms were used in classical heuristic-based methods. Based on this, we discuss human-machine cooperation, which incorporates knowledge about human interpretation into LSMR without sacrificing generality or scalability. In particular, we present three approaches to human-machine cooperation (cognitive, ontological and adaptive), which draw on cognitive science, ontology engineering and metacognition, respectively. We hope that this paper will create a bridge that enables researchers in different fields to communicate about the LSMR problem and lead to a ground-breaking next generation of LSMR systems.
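
    At its core, the retrieval step the survey discusses reduces to ranking feature vectors by similarity to a query vector. The sketch below shows that operation with random stand-in vectors in place of real learned features; it is a minimal illustration, not a method from the paper.

        # Minimal LSMR-style ranking: cosine similarity between a query
        # vector and a database of (stand-in) media feature vectors.
        import numpy as np

        rng = np.random.default_rng(0)
        database = rng.standard_normal((10_000, 128)).astype(np.float32)
        database /= np.linalg.norm(database, axis=1, keepdims=True)

        query = rng.standard_normal(128).astype(np.float32)
        query /= np.linalg.norm(query)

        scores = database @ query        # cosine similarity per item
        top5 = np.argsort(-scores)[:5]   # indices of the most relevant items
        print(top5, scores[top5])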

    Balancing automation and user control in a home video editing system

    The context of this PhD project is the area of multimedia content management, in particular interaction with home videos. Nowadays, more and more home videos are produced, shared and edited. Home videos are captured by amateur users, mainly to document their lives. People frequently edit home videos to select and keep the best parts of their visual memories and to add a touch of personal creativity. However, most users find current products for video editing time-consuming and sometimes too technical and difficult. One reason for the large amount of time required for editing is the slow accessibility caused by the temporal dimension of video: a video needs to be played back in order to be watched or edited. Another limitation of current video editing tools is that they are modelled too closely on professional video editing systems, including technical details such as frame-by-frame browsing. This thesis aims at making home video editing more efficient and easier for the non-technical, amateur user. To accomplish this goal, we designed a semi-automatic tool and adopted a user-centred approach. To gain insights into user behaviour and needs related to home video editing, we designed an Internet-based survey, which was answered by 180 home video users. The results revealed that video editing is done frequently and is seen as a very time-consuming activity. We also found that users with little PC experience often consider video editing programs too complex. Although nearly all commercial editing tools are designed for a PC, many of our respondents said they were interested in doing video editing on a TV. We created a novel concept, Edit While Watching, designed to be user-friendly: it requires only a TV set and a remote control instead of a PC. The video that the user inputs to the system is automatically analysed and structured into small video segments. Editing operations happen on the basis of these segments: the user is no longer aware of single video frames. After the input video has been analysed and structured, a first edited version is automatically prepared. Subsequently, Edit While Watching allows the user to modify and enrich the automatically edited video while watching it. When the user is satisfied, the video can be saved to a DVD or another storage medium. We performed two iterations of system implementation and use testing to refine our concept. After the first iteration, we discovered that two requirements were insufficiently addressed: having an overview of the video and precisely controlling which video content to keep or discard. The second version of Edit While Watching was designed to address these points. It allows the user to visualise the video at three levels of detail: the chapters (or scenes) of the video, the shots inside one chapter, and the timeline representation of a single shot. The second version also allows users to edit the video at different levels of automation. For example, the user can choose an event in the video (e.g. a child playing with a toy) and simply ask the system to automatically include more content related to it. Alternatively, if the user wants more control, he or she can precisely select which content to add to the video. We evaluated the second version of our tool by inviting nine users to edit their own home videos with it.
    The users judged Edit While Watching an easy-to-use and fast application, although some of them missed the possibility of enriching the video with transitions, music, text and pictures. Our test showed that the requirements of video overview and control over the selection of the edited material were better addressed than in the first version. Moreover, the participants were able to select which video portions to keep or discard in a time close to the playback time of the video. The second version of Edit While Watching exploits different levels of automation. In some editing functions the user only gives an indication about editing a clip, and the system automatically decides the start and end points of the part of the video to be cut. However, there are also editing functions in which the user has complete control over the start and end points of a cut. We wanted to investigate how to balance automation and user control to optimise perceived ease of use, perceived control, objective editing efficiency and mental effort. To this end, we implemented three types of editing functions, each representing a different balance between automation and user control. To compare them, we invited 25 users to perform predefined tasks with the three function types. The results showed that the functions with the highest level of automation performed worse than the other two types, according to both subjective and objective measurements. The other two types were equally liked; however, some users clearly preferred the functions that allowed faster editing, while others preferred the functions that gave full control and a more complete overview. In conclusion, on the basis of this research some design guidelines can be offered for building an easy and efficient video editing application: it should automatically structure the video, eliminate detail about single frames, support a scalable video overview, implement a rich set of editing functionalities, and preferably be TV-based.
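
    The automation trade-off described above can be made concrete with a small sketch: given segment boundaries produced by automatic analysis, a fully automatic function snaps a single user indication to the enclosing segment, while a manual function takes explicit in and out points. The boundary values and function names are invented for illustration.

        # Hypothetical shot boundaries from automatic analysis, in seconds.
        segments = [(0.0, 4.2), (4.2, 9.8), (9.8, 15.0)]

        def cut_automatic(indication_s):
            """High automation: the user points at a moment; the system
            cuts the whole segment that contains it."""
            for start, end in segments:
                if start <= indication_s < end:
                    return (start, end)
            raise ValueError("indication outside the video")

        def cut_manual(in_s, out_s):
            """Full control: the user chooses both cut points exactly."""
            return (in_s, out_s)

        print(cut_automatic(6.0))    # -> (4.2, 9.8)
        print(cut_manual(5.0, 7.5))  # -> (5.0, 7.5)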

    Identification, synchronisation and composition of user-generated videos

    Cotutela Universitat Politècnica de Catalunya i Queen Mary University of London. The increasing availability of smartphones makes it easy for people to capture videos of their experience when attending events such as concerts, sports competitions and public rallies. Smartphones are equipped with inertial sensors, which can be beneficial for event understanding. The captured User-Generated Videos (UGVs) are made available on media-sharing websites. Searching and mining UGVs of the same event is challenging due to inconsistent tags or incorrect timestamps. Moreover, a UGV recorded from a fixed location contains monotonous content and unintentional camera motions, which may make it less interesting to play back. In this thesis, we propose identification, synchronisation and video composition frameworks for UGVs. First, we propose a framework for the automatic identification and synchronisation of unedited multi-camera UGVs within a database. The framework analyses the sound to match and cluster UGVs that capture the same spatio-temporal event, and estimates their relative time shift to align them temporally. We design a novel descriptor derived from the pairwise matching of audio chroma features of UGVs. The descriptor facilitates the definition of a classification threshold for automatic query-by-example event identification. We contribute a database of 263 multi-camera UGVs of 48 real-world events. We evaluate the proposed framework on this database and compare it with state-of-the-art methods. Experimental results show the effectiveness of the proposed approach in the presence of audio degradations (channel noise, ambient noise, reverberation). Moreover, we present an automatic audio- and visual-based camera selection framework for composing an uninterrupted recording from synchronised multi-camera UGVs of the same event. We design an automatic audio-based cut-point selection method that provides a common reference for audio and video segmentation. To filter out low-quality video segments, spatial and spatio-temporal assessments are computed. The framework combines segments of UGVs using a rank-based camera selection strategy that considers visual quality scores and view diversity. The proposed framework is validated on a dataset of 13 events (93 UGVs) through subjective tests and compared with state-of-the-art methods. Suitable cut-point selection, specific visual quality assessments and rank-based camera selection contribute to the superiority of the proposed framework over existing methods. Finally, we contribute a method for camera motion detection using the gyroscope for UGVs captured on smartphones, and design a gyro-based quality score for video composition. The gyroscope measures the angular velocity of the smartphone, which can be used for camera motion analysis. We evaluate the proposed camera motion detection method on a dataset of 24 multi-modal UGVs captured by us, and compare it with existing visual and inertial sensor-based methods. By designing a gyro-based score to quantify the goodness of multi-camera UGVs, we develop a gyro-based video composition framework. The gyro-based score substitutes the spatial and spatio-temporal scores and reduces the computational complexity. We contribute a multi-modal dataset of 3 events (12 UGVs), which is used to validate the proposed gyro-based video composition framework.
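
    A simplified version of the audio-based synchronisation step can be sketched as follows: project the chroma features of two recordings to 1-D and cross-correlate them to estimate the relative time shift. This assumes the librosa library and is far cruder than the pairwise chroma descriptor the thesis designs; the file names are placeholders.

        # Estimate the time shift between two recordings of the same event
        # by cross-correlating 1-D projections of their chroma features.
        import librosa
        from scipy.signal import correlate

        def estimate_shift_seconds(path_a, path_b, sr=22050, hop=512):
            y_a, _ = librosa.load(path_a, sr=sr)
            y_b, _ = librosa.load(path_b, sr=sr)
            c_a = librosa.feature.chroma_stft(y=y_a, sr=sr, hop_length=hop).mean(axis=0)
            c_b = librosa.feature.chroma_stft(y=y_b, sr=sr, hop_length=hop).mean(axis=0)
            xc = correlate(c_a - c_a.mean(), c_b - c_b.mean(), mode="full")
            lag_frames = xc.argmax() - (len(c_b) - 1)  # approximate frame lag of a vs b
            return lag_frames * hop / sr

        # print(estimate_shift_seconds("ugv_a.wav", "ugv_b.wav"))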

    Semantic segmentation of audiovisual content

    In this work, we developed a method for semantic segmentation of audiovisual content applicable to consumer electronics storage devices. For this specific solution, we first researched a service-oriented, distributed multimedia content analysis framework composed of individual content analysis modules, i.e. Service Units. One of these was dedicated to identifying non-content-related inserts, i.e. commercial blocks, and reached high performance. In a subsequent step we researched and benchmarked various Shot Boundary Detectors and implemented the best-performing one as a Service Unit. Thereafter, our study of production rules, i.e. film grammar, provided insights into Parallel Shot sequences, i.e. Cross-Cuttings and Shot-Reverse-Shots. We researched and benchmarked four similarity-based clustering methods, two colour-based and two feature-point-based, in order to retain the best one for our final solution. Finally, we researched several audiovisual Scene Boundary Detector methods and achieved the best results by combining a colour-based method with a shot-length criterion. This Scene Boundary Detector identified semantic scene boundaries with a robustness of 66% for movies and 80% for series, which proved sufficient for our envisioned application, Advanced Content Navigation.
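
    As an illustration of the colour-based family of shot boundary detectors the abstract benchmarks (not the thesis implementation itself), the sketch below declares a cut whenever the colour histograms of consecutive frames stop correlating; it assumes OpenCV, and the threshold is an arbitrary choice.

        # Colour-histogram shot-cut detection: a sharp drop in histogram
        # correlation between consecutive frames is treated as a cut.
        import cv2

        def detect_cuts(path, threshold=0.5):
            cap = cv2.VideoCapture(path)
            cuts, prev_hist, idx = [], None, 0
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
                hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
                cv2.normalize(hist, hist)
                if prev_hist is not None:
                    sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
                    if sim < threshold:   # correlation near 1 = similar frames
                        cuts.append(idx)
                prev_hist, idx = hist, idx + 1
            cap.release()
            return cuts

        # print(detect_cuts("movie.mp4"))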

    Commonplace Exchanges: New Documentary Networks and International Students

    International students have to overcome language barriers, adapt to different cultures and lifestyles, and grapple with the loneliness of living far from home. This documentary is about the “typical day” of four international students living and studying in Toronto. Comprising on-location shots and interviews, the footage was edited into different video formats, which were combined into a nonlinear interactive user interface. The project conveys some of the cultural complexities involved in going abroad; the documentary profiles are imbued with affective power and contain subtle details about environmental and cultural contexts. Because it is built as a website, the project allows audiences to add personal experiences via comments or video responses and so become documentary subjects themselves. My thesis investigates how participatory online documentary can assist current and potential international students in gathering information about studying overseas. It also helps international student service professionals, including administrators at universities and study-abroad intermediaries, to better understand the unique challenges these students face.

    Specialised Languages and Multimedia. Linguistic and Cross-cultural Issues

    This book collects academic works focusing on scientific and technical discourse and on the ways in which this type of discourse appears in or is shaped by multimedia products. The originality of this book lies in the variety of approaches used and of the specialised languages investigated in relation to multimodal and multimedia genres. Contributions focus in particular on new multimodal or multimedia forms of specialised discourse (in institutional, academic, technical, scientific, social or popular settings); linguistic features of specialised discourse in multimodal or multimedia genres; the popularisation of specialised knowledge in multimodal or multimedia genres; the impact of multimodality and multimediality on the construction of scientific and technical discourse; the impact of multimodality/multimediality on the practice and teaching of language and of translation; new multimedia modes of knowledge dissemination; and the translation/adaptation of scientific discourse in multimedia products. This volume contributes to the theory and practice of multimodal studies and translation, with a specific focus on specialised discourse.

    Wearable computing and contextual awareness

    Thesis (Ph.D.), Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 1999. Includes bibliographical references (leaves 231-248). By Thad Eugene Starner.
    Computer hardware continues to shrink in size and increase in capability. This trend has allowed the prevailing concept of a computer to evolve from the mainframe to the minicomputer to the desktop. Just as the physical hardware changes, so does the use of the technology, tending towards more interactive and personal systems. Currently, another physical change is underway, placing computational power on the user's body. These wearable machines encourage new applications that were formerly infeasible and, correspondingly, will result in new usage patterns. This thesis suggests that the fundamental improvement offered by wearable computing is an increased sense of user context. I hypothesize that on-body systems can sense the user's context with little or no assistance from environmental infrastructure. These body-centered systems, which "see" as the user sees and "hear" as the user hears, provide a unique "first-person" viewpoint of the user's environment. By exploiting models recovered by these systems, interfaces are created which require minimal directed action or attention by the user. In addition, more traditional applications are augmented by the contextual information recovered by these systems. To investigate these issues, I provide perceptually sensible tools for recovering and modeling user context in a mobile, everyday environment. These tools include a downward-facing, camera-based system for establishing the location of the user; a tag-based object recognition system for augmented reality; and several on-body gesture recognition systems to identify various user tasks in constrained environments. To address the practicality of contextually-aware wearable computers, issues of power recovery, heat dissipation, and weight distribution are examined. In addition, I have encouraged a community of wearable computer users at the Media Lab through design, management, and support of hardware and software infrastructure. This unique community provides a heightened awareness of the use and social issues of wearable computing. As much as possible, the lessons from this experience will be conveyed in the thesis.
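
    The abstract does not give algorithms, but the flavor of on-body context sensing can be conveyed with a deliberately simple, hypothetical sketch: inferring a coarse user state from the short-term variance of a body-worn accelerometer signal. The thresholds and two-class labeling are invented for illustration.

        # Toy context sensor: label fixed-length windows of an accelerometer
        # magnitude trace as "moving" or "stationary" by their variance.
        import numpy as np

        def classify_windows(accel_mag, fs=50, window_s=2.0, threshold=0.5):
            win = int(fs * window_s)
            labels = []
            for i in range(0, len(accel_mag) - win + 1, win):
                var = np.var(accel_mag[i:i + win])
                labels.append("moving" if var > threshold else "stationary")
            return labels

        rng = np.random.default_rng(1)
        still = 9.81 + 0.05 * rng.standard_normal(200)   # sensor noise only
        moving = 9.81 + 2.0 * rng.standard_normal(200)   # large swings
        print(classify_windows(np.concatenate([still, moving])))
        # -> ['stationary', 'stationary', 'moving', 'moving']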