2,166 research outputs found

    Audio Fingerprinting to Identify Multiple Videos of an Event

    Get PDF
    The proliferation of consumer recording devices and video sharing websites makes the possibility of having access to multiple recordings of the same occurrence increasingly likely. These co-synchronous recordings can be identified via their audio tracks, despite local noise and channel variations. We explore a robust fingerprinting strategy to do this. Matching pursuit is used to obtain a sparse set of the most prominent elements in a video soundtrack. Pairs of these elements are hashed and stored, to be efficiently compared with one another. This fingerprinting is tested on a corpus of over 700 YouTube videos related to the 2009 U.S. presidential inauguration. Reliable matching of identical events in different recordings is demonstrated, even under difficult conditions

    A Review of Audio Features and Statistical Models Exploited for Voice Pattern Design

    Full text link
    Audio fingerprinting, also named as audio hashing, has been well-known as a powerful technique to perform audio identification and synchronization. It basically involves two major steps: fingerprint (voice pattern) design and matching search. While the first step concerns the derivation of a robust and compact audio signature, the second step usually requires knowledge about database and quick-search algorithms. Though this technique offers a wide range of real-world applications, to the best of the authors' knowledge, a comprehensive survey of existing algorithms appeared more than eight years ago. Thus, in this paper, we present a more up-to-date review and, for emphasizing on the audio signal processing aspect, we focus our state-of-the-art survey on the fingerprint design step for which various audio features and their tractable statistical models are discussed.Comment: http://www.iaria.org/conferences2015/PATTERNS15.html ; Seventh International Conferences on Pervasive Patterns and Applications (PATTERNS 2015), Mar 2015, Nice, Franc

    The problems and challenges of managing crowd sourced audio-visual evidence

    Get PDF
    A number of recent incidents, such as the Stanley Cup Riots, the uprisings in the Middle East and the London riots have demonstrated the value of crowd sourced audio-visual evidence wherein citizens submit audio-visual footage captured on mobile phones and other devices to aid governmental institutions, responder agencies and law enforcement authorities to confirm the authenticity of incidents and, in the case of criminal activity, to identify perpetrators. The use of such evidence can present a significant logistical challenge to investigators, particularly because of the potential size of data gathered through such mechanisms and the added problems of time-lining disparate sources of evidence and, subsequently, investigating the incident(s). In this paper we explore this problem and, in particular, outline the pressure points for an investigator. We identify and explore a number of particular problems related to the secure receipt of the evidence, imaging, tagging and then time-lining the evidence, and the problem of identifying duplicate and near duplicate items of audio-visual evidence

    Identification, synchronisation and composition of user-generated videos

    Get PDF
    Cotutela Universitat Politècnica de Catalunya i Queen Mary University of LondonThe increasing availability of smartphones is facilitating people to capture videos of their experience when attending events such as concerts, sports competitions and public rallies. Smartphones are equipped with inertial sensors which could be beneficial for event understanding. The captured User-Generated Videos (UGVs) are made available on media sharing websites. Searching and mining of UGVs of the same event are challenging due to inconsistent tags or incorrect timestamps. A UGV recorded from a fixed location contains monotonic content and unintentional camera motions, which may make it less interesting to playback. In this thesis, we propose the following identification, synchronisation and video composition frameworks for UGVs. We propose a framework for the automatic identification and synchronisation of unedited multi-camera UGVs within a database. The proposed framework analyses the sound to match and cluster UGVs that capture the same spatio-temporal event, and estimate their relative time-shift to temporally align them. We design a novel descriptor derived from the pairwise matching of audio chroma features of UGVs. The descriptor facilitates the definition of a classification threshold for automatic query-by-example event identification. We contribute a database of 263 multi-camera UGVs of 48 real-world events. We evaluate the proposed framework on this database and compare it with state-of-the-art methods. Experimental results show the effectiveness of the proposed approach in the presence of audio degradations (channel noise, ambient noise, reverberations). Moreover, we present an automatic audio and visual-based camera selection framework for composing uninterrupted recording from synchronised multi-camera UGVs of the same event. We design an automatic audio-based cut-point selection method that provides a common reference for audio and video segmentation. To filter low quality video segments, spatial and spatio-temporal assessments are computed. The framework combines segments of UGVs using a rank-based camera selection strategy by considering visual quality scores and view diversity. The proposed framework is validated on a dataset of 13 events (93~UGVs) through subjective tests and compared with state-of-the-art methods. Suitable cut-point selection, specific visual quality assessments and rank-based camera selection contribute to the superiority of the proposed framework over the existing methods. Finally, we contribute a method for Camera Motion Detection using Gyroscope for UGVs captured from smartphones and design a gyro-based quality score for video composition. The gyroscope measures the angular velocity of the smartphone that can be use for camera motion analysis. We evaluate the proposed camera motion detection method on a dataset of 24 multi-modal UGVs captured by us, and compare it with existing visual and inertial sensor-based methods. By designing a gyro-based score to quantify the goodness of the multi-camera UGVs, we develop a gyro-based video composition framework. A gyro-based score substitutes the spatial and spatio-temporal scores and reduces the computational complexity. We contribute a multi-modal dataset of 3 events (12~UGVs), which is used to validate the proposed gyro-based video composition framework.El incremento de la disponibilidad de teléfonos inteligentes o smartphones posibilita a la gente capturar videos de sus experiencias cuando asisten a eventos así como como conciertos, competiciones deportivas o mítines públicos. Los Videos Generados por Usuarios (UGVs) pueden estar disponibles en sitios web públicos especializados en compartir archivos. La búsqueda y la minería de datos de los UGVs del mismo evento son un reto debido a que los etiquetajes son incoherentes o las marcas de tiempo erróneas. Por otra parte, un UGV grabado desde una ubicación fija, contiene información monótona y movimientos de cámara no intencionados haciendo menos interesante su reproducción. En esta tesis, se propone una identificación, sincronización y composición de tramas de vídeo para UGVs. Se ha propuesto un sistema para la identificación y sincronización automática de UGVs no editados provenientes de diferentes cámaras dentro de una base de datos. El sistema propuesto analiza el sonido con el fin de hacerlo coincidir e integrar UGVs que capturan el mismo evento en el espacio y en el tiempo, estimando sus respectivos desfases temporales y alinearlos en el tiempo. Se ha diseñado un nuevo descriptor a partir de la coincidencia por parejas de características de la croma del audio de los UGVs. Este descriptor facilita la determinación de una clasificación por umbral para una identificación de eventos automática basada en búsqueda mediante ejemplo (en inglés, query by example). Se ha contribuido con una base de datos de 263 multi-cámaras UGVs de un total de 48 eventos reales. Se ha evaluado la trama propuesta en esta base de datos y se ha comparado con los métodos elaborados en el estado del arte. Los resultados experimentales muestran la efectividad del enfoque propuesto con la presencia alteraciones en el audio. Además, se ha presentado una selección automática de tramas en base a la reproducción de video y audio componiendo una grabación ininterrumpida de multi-cámaras UGVs sincronizadas en el mismo evento. También se ha diseñado un método de selección de puntos de corte automático basado en audio que proporciona una referencia común para la segmentación de audio y video. Con el fin de filtrar segmentos de videos de baja calidad, se han calculado algunas medidas espaciales y espacio-temporales. El sistema combina segmentos de UGVs empleando una estrategia de selección de cámaras basadas en la evaluación a través de un ranking considerando puntuaciones de calidad visuales y diversidad de visión. El sistema propuesto se ha validado con un conjunto de datos de 13 eventos (93 UGVs) a través de pruebas subjetivas y se han comparado con los métodos elaborados en el estado del arte. La selección de puntos de corte adecuados, evaluaciones de calidad visual específicas y la selección de cámara basada en ranking contribuyen en la mejoría de calidad del sistema propuesto respecto a otros métodos existentes. Finalmente, se ha realizado un método para la Detección de Movimiento de Cámara usando giróscopos para las UGVs capturadas desde smartphones y se ha diseñado un método de puntuación de calidad basada en el giro. El método de detección de movimiento de la cámara con una base de datos de 24 UGVs multi-modales y se ha comparado con los métodos actuales basados en visión y sistemas inerciales. A través del diseño de puntuación para cuantificar con el giróscopo cuán bien funcionan los sistemas de UGVs con multi-cámara, se ha desarrollado un sistema de composición de video basada en el movimiento del giroscopio. Este sistema basado en la puntuación a través del giróscopo sustituye a los sistemas de puntuaciones basados en parámetros espacio-temporales reduciendo la complejidad computacional. Además, se ha contribuido con un conjunto de datos de 3 eventos (12 UGVs), que se han empleado para validar los sistemas de composición de video basados en giróscopo.Postprint (published version

    MediaSync: Handbook on Multimedia Synchronization

    Get PDF
    This book provides an approachable overview of the most recent advances in the fascinating field of media synchronization (mediasync), gathering contributions from the most representative and influential experts. Understanding the challenges of this field in the current multi-sensory, multi-device, and multi-protocol world is not an easy task. The book revisits the foundations of mediasync, including theoretical frameworks and models, highlights ongoing research efforts, like hybrid broadband broadcast (HBB) delivery and users' perception modeling (i.e., Quality of Experience or QoE), and paves the way for the future (e.g., towards the deployment of multi-sensory and ultra-realistic experiences). Although many advances around mediasync have been devised and deployed, this area of research is getting renewed attention to overcome remaining challenges in the next-generation (heterogeneous and ubiquitous) media ecosystem. Given the significant advances in this research area, its current relevance and the multiple disciplines it involves, the availability of a reference book on mediasync becomes necessary. This book fills the gap in this context. In particular, it addresses key aspects and reviews the most relevant contributions within the mediasync research space, from different perspectives. Mediasync: Handbook on Multimedia Synchronization is the perfect companion for scholars and practitioners that want to acquire strong knowledge about this research area, and also approach the challenges behind ensuring the best mediated experiences, by providing the adequate synchronization between the media elements that constitute these experiences

    Automated Organisation and Quality Analysis of User-Generated Audio Content

    Get PDF
    The abundance and ubiquity of user-generated content has opened horizons when it comes to the organization and analysis of vast and heterogeneous data, especially with the increase of quality of the recording devices witnessed nowadays. Most of the activity experienced in social networks today contains audio excerpts, either by belonging to a certain video file or an actual audio clip, therefore the analysis of the audio features present in such content is of extreme importance in order to better understand it. Such understanding would lead to a better handling of ubiquity data and would ultimately provide a better experience to the end-user. The work discussed in this thesis revolves around using audio features to organize and retrieve meaningful insights from user-generated content crawled from social media websites, more particularly data related to concert clips. From its redundancy and abundance (i.e., for the existence of several recordings of a given event), recordings from musical shows represent a very good use case to derive useful and practical conclusions around the scope of this thesis. Mechanisms that provide a better understanding of such content are presented and already partly implemented, such as audio clustering based on the existence of overlapping audio segments between different audio clips, audio segmentation that synchronizes and relates the different cluster’s clips in time, and techniques to infer audio quality of such clips. All the proposed methods use information retrieved from an audio fingerprinting algorithm, used for the synchronization of the different audio files, with methods for filtering possible false positives of the algorithm being also presented. For the evaluation and validation of the proposed methods, we used one dataset made of several audio recordings regarding different concert clips manually crawled from YouTube

    Making a Scene: Alignment of Complete Sets of Clips Based on Pairwise Audio Match

    Get PDF
    As the amount of social video content captured at physical-world events, and shared online, is rapidly increasing, there is a growing need for robust methods for organization and presentation of the captured content. In this work, we significantly extend prior work that examined automatic detection of videos from events that were captured at the same time, i.e. "overlapping". We go beyond finding pairwise matches between video clips and describe the construction of scenes, or sets of multiple overlapping videos, each scene presenting a coherent moment in the event. We test multiple strategies for scene construction, using a greedy algorithm to create a mapping of videos into scenes, and a clustering refinement step to increase the precision of each scene. We evaluate the strategies in multiple settings and show that a greedy and clustering approach results in best possible balance between recall and precision for all settings
    corecore