Characterizing Audio Events for Video Soundtrack Analysis
There is an entire emerging ecosystem of amateur video recordings on the internet today, in addition to the abundance of more professionally produced content. The ability to automatically scan and evaluate the content of these recordings would be very useful for search and indexing, especially as amateur content tends to be more poorly labeled and tagged than professional content. Although the visual content is often considered to be of primary importance, the audio modality contains rich information which may be very helpful in the context of video search and understanding. Any technology that could help to interpret video soundtrack data would also be applicable in a number of other scenarios, such as mobile device audio awareness, surveillance, and robotics. In this thesis we approach the problem of extracting information from these kinds of unconstrained audio recordings. Specifically we focus on techniques for characterizing discrete audio events within the soundtrack (e.g. a dog bark or door slam), since we expect events to be particularly informative about content. Our task is made more complicated by the extremely variable recording quality and noise present in this type of audio. Initially we explore the idea of using the matching pursuit algorithm to decompose and isolate components of audio events. Using these components we develop an approach for non-exact (approximate) fingerprinting as a way to search audio data for similar recurring events. We demonstrate a proof of concept for this idea. Subsequently we extend the use of matching pursuit to build an actual audio fingerprinting system, with the goal of identifying simultaneously recorded amateur videos (i.e. videos taken in the same place at the same time by different people, which contain overlapping audio). Automatic discovery of these simultaneous recordings is one particularly interesting facet of general video indexing. We evaluate this fingerprinting system on a database of 733 internet videos. 
Next we return to searching for features to directly characterize soundtrack events. We develop a system to detect transient sounds and represent audio clips as histograms of the transients they contain. We use this representation for video classification over a database of 1873 internet videos. When we combine these features with a spectral feature baseline system, we achieve a relative improvement of 7.5% in mean average precision over the baseline. In another attempt to devise features to better describe and compare events, we investigate decomposing audio using a convolutional form of non-negative matrix factorization, resulting in event-like spectro-temporal patches. We use the resulting representation to build an event detection system that is more robust to additive noise than a comparative baseline system. Lastly we investigate a promising feature representation that has been used by others previously to describe event-like sound effect clips. These features derive from an auditory model and are meant to capture fine time structure in sound events. We compare these features and a related but simpler feature set on the task of video classification over 9317 internet videos. We find that combinations of these features with baseline spectral features produce a significant improvement in mean average precision over the baseline.
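The component-isolation idea above rests on greedy sparse decomposition. As a minimal illustrative sketch, not the thesis's actual dictionary or implementation (all names and the toy dictionary here are hypothetical), matching pursuit over a fixed dictionary of unit-norm atoms looks like:

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_iters):
    """Greedy matching pursuit: repeatedly pick the unit-norm dictionary
    atom (a column) most correlated with the residual and subtract it."""
    residual = np.asarray(signal, dtype=float).copy()
    atoms, coeffs = [], []
    for _ in range(n_iters):
        corr = dictionary.T @ residual       # correlation with every atom
        k = int(np.argmax(np.abs(corr)))     # best-matching atom
        atoms.append(k)
        coeffs.append(float(corr[k]))
        residual = residual - corr[k] * dictionary[:, k]
    return atoms, coeffs, residual

# toy example: an orthonormal (identity) dictionary in 4 dimensions
D = np.eye(4)
x = np.array([3.0, 0.0, -1.0, 0.0])
atoms, coeffs, r = matching_pursuit(x, D, n_iters=2)
```

In practice the dictionary would hold time-frequency atoms (e.g. Gabor functions) rather than basis vectors, and iteration would stop when the residual energy falls below a threshold.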
Robust digital watermarking techniques for multimedia protection
The growing problem of the unauthorized reproduction of digital multimedia data such as movies, television broadcasts, and similar digital products has triggered worldwide efforts to identify and protect multimedia contents. Digital watermarking technology provides law enforcement officials with a forensic tool for tracing and catching pirates. Watermarking refers to the process of adding a structure called a watermark to an original data object, which includes digital images, video, audio, maps, text messages, and 3D graphics. Such a watermark can be used for several purposes including copyright protection, fingerprinting, copy protection, broadcast monitoring, data authentication, indexing, and medical safety. The proposed thesis addresses the problem of multimedia protection and consists of three parts. In the first part, we propose new image watermarking algorithms that are robust against a wide range of intentional and geometric attacks, flexible in data embedding, and computationally fast. The core idea behind our proposed watermarking schemes is to use transforms that have different properties which can effectively match various aspects of the signal's frequencies. We embed the watermark many times in all the frequencies to provide better robustness against attacks and increase the difficulty of destroying the watermark. The second part of the thesis is devoted to a joint exploitation of the geometry and topology of 3D objects and its subsequent application to 3D watermarking. The key idea consists of capturing the geometric structure of a 3D mesh in the spectral domain by computing the eigen-decomposition of the mesh Laplacian matrix. We also use the fact that the global shape features of a 3D model may be reconstructed using small low-frequency spectral coefficients. The eigen-analysis of the mesh Laplacian matrix is, however, prohibitively expensive. 
To lift this limitation, we first partition the 3D mesh into smaller 3D sub-meshes, and then we repeat the watermark embedding process as many times as possible in the spectral coefficients of the compressed 3D sub-meshes. The visual error of the watermarked 3D model is evaluated by computing a nonlinear visual error metric between the original 3D model and the watermarked model obtained by our proposed algorithm. The third part of the thesis is devoted to video watermarking. We propose robust, hybrid scene-based MPEG video watermarking techniques based on a high-order tensor singular value decomposition of the video image sequences. The key idea behind our approaches is to use scene change analysis to embed the watermark repeatedly in a fixed number of the intra-frames. These intra-frames are represented as 3D tensors with two dimensions in space and one dimension in time. We embed the watermark information in the singular values of these high-order tensors, which have good stability and represent the video properties. Numerical experiments with synthetic and real data demonstrate the potential and the much improved performance of the proposed algorithms in multimedia watermarking.
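The principle of embedding information in singular values can be illustrated on a single 2-D block. The thesis operates on high-order tensors of intra-frames; this simplified, non-blind sketch (illustrative names and parameters throughout) shows only the core embed/extract step:

```python
import numpy as np

def embed_watermark(block, watermark, alpha=0.1):
    """Non-blind SVD watermarking sketch: perturb the singular values of a
    2-D block by alpha * watermark. Returns the watermarked block and the
    original singular values, which are kept for extraction."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    return U @ np.diag(s + alpha * watermark) @ Vt, s

def extract_watermark(marked, s_orig, alpha=0.1):
    """Recover the watermark by comparing singular values to the originals."""
    s_marked = np.linalg.svd(marked, compute_uv=False)
    return (s_marked - s_orig) / alpha

# toy block with well-separated singular values, so the small perturbation
# does not change their ordering
block = np.diag([8.0, 6.0, 4.0, 2.0])
w = np.array([0.5, 0.3, 0.2, 0.1])
marked, s0 = embed_watermark(block, w)
w_hat = extract_watermark(marked, s0)
```

The stability of singular values under common signal processing operations is what makes this embedding domain attractive; the perturbation must stay small enough to preserve both the ordering of the singular values and visual quality.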
Algorithms for propagation-aware underwater ranging and localization
While oceans occupy most of our planet, their exploration and conservation remain among the crucial research problems of modern times. Underwater localization stands among the key issues on the way to the proper inspection and monitoring of this significant part of our world. In this thesis, we investigate and tackle different challenges related to underwater ranging and localization. In particular, we focus on algorithms that consider underwater acoustic channel properties. This group of algorithms utilizes additional information about the environment and its impact on acoustic signal propagation in order to improve the accuracy of location estimates, to reduce complexity, or to reduce the amount of resources (e.g., anchor nodes) required compared to traditional algorithms.
First, we tackle the problem of passive range estimation using the differences in the times of arrival of multipath replicas of a transmitted acoustic signal. This is a cost- and energy-effective algorithm that can be used for the localization of autonomous underwater vehicles (AUVs), and utilizes information about signal propagation. We study the accuracy of this method in the simplified case of a constant sound speed profile (SSP) and compare it to more realistic cases with various non-constant SSPs. We also propose an auxiliary quantity called the effective sound speed. When modeling acoustic propagation via ray models, this quantity takes into account the difference between rectilinear and non-rectilinear sound ray paths. According to our evaluation, this offers improved range estimation results with respect to standard algorithms that consider the actual value of the speed of sound.
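Under the simplified constant-SSP assumption discussed above, the delay between the direct and surface-reflected arrivals pins down the horizontal range via the flat-surface image-source model. The following is only a hedged sketch of that geometric relationship, with illustrative names and parameters, not the thesis's actual algorithm:

```python
import math

def path_delay_diff(r, zs, zr, c=1500.0):
    """Arrival-time difference between the surface-reflected and direct
    paths under a constant sound speed c (flat-surface image-source model)."""
    direct = math.hypot(r, zs - zr)
    reflected = math.hypot(r, zs + zr)  # reflection = image source at depth -zs
    return (reflected - direct) / c

def range_from_delay(dt, zs, zr, c=1500.0, r_max=1e4, tol=1e-6):
    """Invert path_delay_diff for the horizontal range r by bisection;
    the delay difference decreases monotonically with range."""
    lo, hi = 0.0, r_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if path_delay_diff(mid, zs, zr, c) > dt:
            lo = mid  # model delay at mid exceeds the observation: range is farther
        else:
            hi = mid
    return 0.5 * (lo + hi)

# round trip: source at 30 m depth, receiver at 80 m, 500 m horizontal range
dt = path_delay_diff(500.0, 30.0, 80.0)
r_hat = range_from_delay(dt, 30.0, 80.0)
```

With a non-constant SSP the rays bend, which is exactly where the effective sound speed correction proposed in the thesis comes into play.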
We then propose an algorithm suitable for the non-invasive tracking of AUVs or vocalizing marine animals, using only a single receiver. This algorithm evaluates the underwater acoustic channel impulse response differences induced by a diverse sea bottom profile, and provides a computationally and energy-efficient solution for passive localization.
Finally, we propose another algorithm to solve the issue of 3D acoustic localization and tracking of marine fauna. To reach the expected degree of accuracy, more sensors are often required than are available in typical commercial off-the-shelf (COTS) phased arrays found, e.g., in ultra-short baseline (USBL) systems. Direct combination of multiple COTS arrays may be constrained by the array body elements and may break the optimal array element spacing or the desired array layout. Thus, the application of state-of-the-art direction of arrival (DoA) estimation algorithms may not be possible. We propose a solution for passive 3D localization and tracking using a wideband acoustic array of arbitrary shape, and validate the algorithm in multiple experiments involving both active and passive targets.
Part of the research in this thesis has been supported by the EU H2020 program under project SYMBIOSIS (G.A. no. 773753). This work has also been supported by IMDEA Networks Institute. Doctoral Program in Telematic Engineering, Universidad Carlos III de Madrid. Committee: Chair: Paul Daniel Mitchell; Secretary: Antonio Fernández Anta; Member: Santiago Zazo Bell.
End-to-end non-negative auto-encoders: a deep neural alternative to non-negative audio modeling
Over the last decade, non-negative matrix factorization (NMF) has emerged as one of the most popular approaches to modeling audio signals. NMF allows us to factorize the magnitude spectrogram to learn representative spectral bases that can be used for a wide range of applications. With the recent advances in deep learning, neural networks (NNs) have surpassed NMF in terms of performance. However, these NNs are trained discriminatively and lack several key characteristics like re-usability and robustness, compared to NMF.
In this dissertation, we develop and investigate the idea of end-to-end non-negative autoencoders (NAEs) as an updated deep learning based alternative framework to non-negative audio modeling. We show that end-to-end NAEs combine the modeling advantages of non-negative matrix factorization and the generalizability of neural networks while delivering significant improvements in performance.
To this end, we first interpret NMF as a NAE and show that the two approaches are equivalent semantically and in terms of source separation performance. We exploit the availability of sophisticated neural network architectures to propose several extensions to NAEs. We also demonstrate that these modeling improvements significantly boost the performance of NAEs.
In audio processing applications, the short-time Fourier transform (STFT) is used as a universal first step, and we design algorithms and neural networks to operate on the magnitude spectrograms. We interpret the sequence of steps involved in computing the STFT as additional neural network layers. This enables us to propose end-to-end processing pipelines that operate directly on the raw waveforms. In the context of source separation, we show that end-to-end processing gives a significant improvement in performance compared to existing spectrogram based methods. Furthermore, to train these end-to-end models, we investigate the use of cost functions that are derived from objective evaluation metrics as measured on waveforms. We present subjective listening test results that reveal insights into the performance of these cost functions for end-to-end source separation.
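The observation that the STFT is just additional network layers can be made concrete: framing is a strided slice (equivalently, a strided convolution), and the DFT is a fixed linear transform. A minimal sketch with illustrative parameter choices:

```python
import numpy as np

def stft_as_layers(x, n_fft=64, hop=32):
    """STFT written as two 'layers': strided framing (a strided slice,
    i.e. a stride-hop convolution) followed by a fixed linear transform
    (multiplication by the DFT matrix), kept one-sided for real input."""
    frames = np.stack([x[i:i + n_fft]
                       for i in range(0, len(x) - n_fft + 1, hop)])
    window = np.hanning(n_fft)
    n = np.arange(n_fft)
    dft = np.exp(-2j * np.pi * np.outer(n, n) / n_fft)
    return (frames * window) @ dft[:, :n_fft // 2 + 1]

# a sinusoid landing exactly on bin 8 of a 64-point DFT
x = np.sin(2 * np.pi * 8 * np.arange(256) / 64)
S = np.abs(stft_as_layers(x))
peak_bins = S.argmax(axis=1)
```

Because every step is differentiable (and the DFT weights can even be made learnable), the same structure can be trained end-to-end from raw waveforms.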
Combining the adaptive front-end layers with NAEs, we propose end-to-end NAEs and show how they can be used for end-to-end generative source separation. Our experiments indicate that these models deliver separation performance comparable to that of discriminative NNs, while retaining the modularity of NMF and the modeling flexibility of neural networks. Finally, we present an approach to train these end-to-end NAEs using mixtures only, without access to clean training examples.
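For readers unfamiliar with the baseline that the NAE view reinterprets as an encoder/decoder pair, plain NMF with Lee-Seung multiplicative updates can be sketched as follows (toy data and names are illustrative, not the dissertation's models):

```python
import numpy as np

def nmf(V, rank, n_iters=500, eps=1e-9):
    """Plain NMF with Lee-Seung multiplicative updates (Frobenius cost):
    V ~ W @ H with all factors non-negative. In the NAE reading, H plays
    the role of the non-negative code and W the decoder weights."""
    rng = np.random.default_rng(0)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update code
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update bases
    return W, H

# toy "spectrogram": 50 frames mixed from two non-negative spectral bases
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A = np.random.default_rng(1).random((2, 50))
V = B @ A
W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The multiplicative form guarantees the factors stay non-negative; the NAE framework replaces these fixed update rules with trainable non-negative network layers.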
Automatic removal of music tracks from TV programmes
This work pertains to the research area of sound source separation. It deals with the problem of automatically removing musical segments from TV programmes. The dissertation proposes the utilisation of a pre-existing music recording, easily obtainable from officially published CDs related to the audiovisual piece, as a reference for the undesired signal. The method is able to automatically detect small segments of the specific music track spread among the whole audio signal of the programme, even if they appear with time-variable gain, or after having suffered linear distortions, such as being processed by equalization filters, or non-linear distortions, such as dynamic range compression. The project developed a quick-search algorithm using audio fingerprint techniques and hash-token data types to lower the algorithm complexity. The work also proposes the utilisation of a Wiener filtering technique to estimate potential equalization filter coefficients, and uses a template matching algorithm to estimate time-variable gains to properly scale the musical segments to the correct amplitude at which they appear in the mixture. The key components of the separation system are presented, and a detailed description of all the algorithms involved is reported. Simulations with artificial and real TV programme soundtracks are analysed and considerations about future work are made. Furthermore, given the unique nature of this project, the dissertation is a pioneering work on the subject, making it an ideal source of reference for other researchers who want to work in the area.
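The quick-search idea, hash tokens looked up in an inverted index with offset voting, can be sketched as follows. This toy version uses a single strongest-bin token per frame rather than the thesis's actual fingerprint scheme, so every name and parameter here is illustrative:

```python
import numpy as np
from collections import defaultdict, Counter

def frame_tokens(x, frame=256):
    """Crude per-frame hash token: the index of the strongest FFT bin.
    (A stand-in for the peak-based hash tokens real fingerprinters use.)"""
    tokens = []
    for i in range(len(x) // frame):
        spec = np.abs(np.fft.rfft(x[i * frame:(i + 1) * frame]))
        tokens.append(int(np.argmax(spec[1:])) + 1)  # skip the DC bin
    return tokens

def build_index(reference):
    """Inverted index: token -> list of frame positions in the reference."""
    index = defaultdict(list)
    for t, tok in enumerate(frame_tokens(reference)):
        index[tok].append(t)
    return index

def locate(query, index):
    """Offset voting: each token match votes for a reference-minus-query
    frame offset; the true alignment accumulates the most votes."""
    votes = Counter()
    for q, tok in enumerate(frame_tokens(query)):
        for t in index.get(tok, []):
            votes[t - q] += 1
    return votes.most_common(1)[0][0] if votes else None

# reference: white noise; query: an exact excerpt starting at frame 20
rng = np.random.default_rng(0)
ref = rng.standard_normal(256 * 60)
query = ref[256 * 20: 256 * 30]
offset = locate(query, build_index(ref))
```

Hashing turns the search from a sliding cross-correlation into a handful of dictionary lookups, which is where the complexity reduction claimed above comes from.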
Multi-channel approaches for musical audio content analysis
The goal of this research project is to undertake a critical evaluation of signal representations for musical audio content analysis. In particular it will contrast three different means for undertaking the analysis of micro-rhythmic content in Afro-Latin American music, namely through the use of: i) stereo or mono mixed recordings; ii) separated sources obtained via state of the art musical audio source separation techniques; and iii) the use of perfectly separated multi-track stems.
In total the project comprises the following five objectives: i) To compile a dataset of mixed and multi-channel recordings of Brazilian Maracatu musicians; ii) To conceive methods for the analysis of rhythmic micro-variations and for pattern recognition; iii) To explore diverse music source separation approaches that preserve micro-rhythmic content; iv) To evaluate the performance of several automatic onset estimation approaches; and v) To compare the rhythmic analysis obtained from the original multi-channel sources versus the separated ones, to evaluate separation quality with regard to microtiming identification.
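One common baseline relevant to objective iv, automatic onset estimation, is half-wave-rectified spectral flux. A minimal sketch on a synthetic click train, with illustrative parameters rather than the project's chosen method:

```python
import numpy as np

def spectral_flux_onsets(x, frame=512, hop=256, threshold=0.5):
    """Onset detection via half-wave-rectified spectral flux: frame the
    signal, take magnitude spectra, sum the positive spectral increases
    between consecutive frames, and threshold the normalized curve."""
    n = (len(x) - frame) // hop + 1
    win = np.hanning(frame)
    specs = np.array([np.abs(np.fft.rfft(win * x[i * hop:i * hop + frame]))
                      for i in range(n)])
    flux = np.maximum(specs[1:] - specs[:-1], 0.0).sum(axis=1)
    flux /= flux.max() + 1e-12
    return np.flatnonzero(flux > threshold) + 1  # frame indices of onsets

# synthetic click train: impulses every 2048 samples
x = np.zeros(8192)
x[2048::2048] = 1.0
onsets = spectral_flux_onsets(x)
```

For microtiming work, frame-level onsets like these would then be refined to sample-level precision, since micro-rhythmic deviations are far smaller than a hop size.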
Data-Driven Sound Track Generation
Background music is often used to generate a specific atmosphere or to draw our attention to specific events. For example in movies or computer games it is often the accompanying music that conveys the emotional state of a scene and plays an important role for immersing the viewer or player into the virtual environment. In view of home-made videos, slide shows, and other consumer-generated visual media streams, there is a need for computer-assisted tools that allow users to generate aesthetically appealing music tracks in an easy and intuitive way. In this contribution, we consider a data-driven scenario where the musical raw material is given in form of a database containing a variety of audio recordings. Then, for a given visual media stream, the task consists in identifying, manipulating, overlaying, concatenating, and blending suitable music clips to generate a music stream that satisfies certain constraints imposed by the visual data stream and by user specifications. It is our main goal to give an overview of various content-based music processing and retrieval techniques that become important in data-driven sound track generation. In particular, we sketch a general pipeline that highlights how the various techniques act together and come into play when generating musically plausible transitions between subsequent music clips.
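The concatenating-and-blending step can be illustrated with its simplest primitive, an equal-power crossfade between two clips. This is only a toy sketch with hypothetical names; the techniques surveyed in the contribution go well beyond it (e.g. content-aware selection and alignment):

```python
import numpy as np

def crossfade(a, b, overlap):
    """Equal-power crossfade: fade clip a out and clip b in over
    `overlap` samples, then concatenate the three pieces."""
    t = np.linspace(0.0, np.pi / 2, overlap)
    mixed = a[-overlap:] * np.cos(t) + b[:overlap] * np.sin(t)
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

# two constant-level "clips" joined with a 200-sample crossfade
a = np.ones(1000)
b = 0.5 * np.ones(1000)
out = crossfade(a, b, 200)
```

The cosine/sine pair keeps the summed power roughly constant through the transition, which avoids the audible dip a linear fade produces on uncorrelated material.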
Mining Audio
We have a new program in "Data to Solutions", and this was my introduction to the way that "big data" problems appear in audio: looking first at managing large music collections, then discussing the issues of video classification and retrieval by soundtrack features.