10 research outputs found
A Review of Audio Features and Statistical Models Exploited for Voice Pattern Design
Audio fingerprinting, also named as audio hashing, has been well-known as a
powerful technique to perform audio identification and synchronization. It
basically involves two major steps: fingerprint (voice pattern) design and
matching search. While the first step concerns the derivation of a robust and
compact audio signature, the second step usually requires knowledge about
database and quick-search algorithms. Though this technique offers a wide range
of real-world applications, to the best of the authors' knowledge, a
comprehensive survey of existing algorithms appeared more than eight years ago.
Thus, in this paper, we present a more up-to-date review and, for emphasizing
on the audio signal processing aspect, we focus our state-of-the-art survey on
the fingerprint design step for which various audio features and their
tractable statistical models are discussed.Comment: http://www.iaria.org/conferences2015/PATTERNS15.html ; Seventh
International Conferences on Pervasive Patterns and Applications (PATTERNS
2015), Mar 2015, Nice, Franc
Función resumen perceptual para verificación de integridad en audio forense
In this work we propose a function that allows to calculate a summary code from the parameters of a voice signal. This function is based on ordering of spectral coefficients obtained by means of the application of the Fast Fourier Transform (FFT), using a locally generated reference function (Gaussian random noise). The proposed method is oriented to the verification of integrity in forensic voice signals. The proposed methodology has a perceptual approach, which implies that the resulting code is maintained, even when modifications are made, particularly those that do not affect the sensitive content of the signal, such as re-quantization processes.En este artículos se propone una función que permite calcular un código resumen a partir de los parámetros de una señal de voz. Esta función está basada en el ordenamiento de los coeficientes espectrales en un proceso de imitación entre el espectro de la señal de voz y el espectro de una señal de ruido gaussiano generada localmente. El método de función resumen está orientado a la verificación de integridad forense en señales de voz, con un enfoque perceptual, que implica que la función resumen no cambia si la señal sufre modificaciones que no alteran el contenido (como re-cuantización), pero que si cambia ante modificaciones como recorte y adición de ruido. Se realizaron diversas pruebas para verificar el enfoque perceptual del método resumen propuesto y se compararon los resultados frente a modificaciones utilizando métodos tradicionales
Recommended from our members
Characterizing Audio Events for Video Soundtrack Analysis
There is an entire emerging ecosystem of amateur video recordings on the internet today, in addition to the abundance of more professionally produced content. The ability to automatically scan and evaluate the content of these recordings would be very useful for search and indexing, especially as amateur content tends to be more poorly labeled and tagged than professional content. Although the visual content is often considered to be of primary importance, the audio modality contains rich information which may be very helpful in the context of video search and understanding. Any technology that could help to interpret video soundtrack data would also be applicable in a number of other scenarios, such as mobile device audio awareness, surveillance, and robotics. In this thesis we approach the problem of extracting information from these kinds of unconstrained audio recordings. Specifically we focus on techniques for characterizing discrete audio events within the soundtrack (e.g. a dog bark or door slam), since we expect events to be particularly informative about content. Our task is made more complicated by the extremely variable recording quality and noise present in this type of audio. Initially we explore the idea of using the matching pursuit algorithm to decompose and isolate components of audio events. Using these components we develop an approach for non-exact (approximate) fingerprinting as a way to search audio data for similar recurring events. We demonstrate a proof of concept for this idea. Subsequently we extend the use of matching pursuit to build an actual audio fingerprinting system, with the goal of identifying simultaneously recorded amateur videos (i.e. videos taken in the same place at the same time by different people, which contain overlapping audio). Automatic discovery of these simultaneous recordings is one particularly interesting facet of general video indexing. We evaluate this fingerprinting system on a database of 733 internet videos. Next we return to searching for features to directly characterize soundtrack events. We develop a system to detect transient sounds and represent audio clips as a histogram of the transients it contains. We use this representation for video classification over a database of 1873 internet videos. When we combine these features with a spectral feature baseline system we achieve a relative improvement of 7.5% in mean average precision over the baseline. In another attempt to devise features to better describe and compare events, we investigate decomposing audio using a convolutional form of non-negative matrix factorization, resulting in event-like spectro-temporal patches. We use the resulting representation to build an event detection system that is more robust to additive noise than a comparative baseline system. Lastly we investigate a promising feature representation that has been used by others previously to describe event-like sound effect clips. These features derive from an auditory model and are meant to capture fine time structure in sound events. We compare these features and a related but simpler feature set on the task of video classification over 9317 internet videos. We find that combinations of these features with baseline spectral features produce a significant improvement in mean average precision over the baseline
Neural Networks for Analysing Music and Environmental Audio
PhDIn this thesis, we consider the analysis of music and environmental audio
recordings with neural networks. Recently, neural networks have been
shown to be an effective family of models for speech recognition, computer
vision, natural language processing and a number of other statistical modelling
problems. The composite layer-wise structure of neural networks
allows for flexible model design, where prior knowledge about the domain
of application can be used to inform the design and architecture of the
neural network models. Additionally, it has been shown that when trained
on sufficient quantities of data, neural networks can be directly applied to
low-level features to learn mappings to high level concepts like phonemes
in speech and object classes in computer vision. In this thesis we investigate
whether neural network models can be usefully applied to processing
music and environmental audio.
With regards to music signal analysis, we investigate 2 different problems.
The fi rst problem, automatic music transcription, aims to identify the
score or the sequence of musical notes that comprise an audio recording.
We also consider the problem of automatic chord transcription, where the
aim is to identify the sequence of chords in a given audio recording. For
both problems, we design neural network acoustic models which are applied
to low-level time-frequency features in order to detect the presence of
notes or chords. Our results demonstrate that the neural network acoustic
models perform similarly to state-of-the-art acoustic models, without the
need for any feature engineering. The networks are able to learn complex
transformations from time-frequency features to the desired outputs, given
sufficient amounts of training data. Additionally, we use recurrent neural
networks to model the temporal structure of sequences of notes or chords,
similar to language modelling in speech. Our results demonstrate that
the combination of the acoustic and language model predictions yields
improved performance over the acoustic models alone. We also observe
that convolutional neural networks yield better performance compared to
other neural network architectures for acoustic modelling.
For the analysis of environmental audio recordings, we consider the problem
of acoustic event detection. Acoustic event detection has a similar
structure to automatic music and chord transcription, where the system
is required to output the correct sequence of semantic labels along with
onset and offset times. We compare the performance of neural network
architectures against Gaussian mixture models and support vector machines.
In order to account for the fact that such systems are typically
deployed on embedded devices, we compare performance as a function of
the computational cost of each model. We evaluate the models on 2 large
datasets of real-world recordings of baby cries and smoke alarms. Our results
demonstrate that the neural networks clearly outperform the other
models and they are able to do so without incurring a heavy computation
cost
Modeling High-Dimensional Audio Sequences with Recurrent Neural Networks
Cette thèse étudie des modèles de séquences de haute dimension basés sur des réseaux de neurones récurrents (RNN) et leur application à la musique et à la parole. Bien qu'en principe les RNN puissent représenter les dépendances à long terme et la dynamique temporelle complexe propres aux séquences d'intérêt comme la vidéo, l'audio et la langue naturelle, ceux-ci n'ont pas été utilisés à leur plein potentiel depuis leur introduction par Rumelhart et al. (1986a) en raison de la difficulté de les entraîner efficacement par descente de gradient. Récemment, l'application fructueuse de l'optimisation Hessian-free et d'autres techniques d'entraînement avancées ont entraîné la recrudescence de leur utilisation dans plusieurs systèmes de l'état de l'art. Le travail de cette thèse prend part à ce développement.
L'idée centrale consiste à exploiter la flexibilité des RNN pour apprendre une description probabiliste de séquences de symboles, c'est-à-dire une information de haut niveau associée aux signaux observés, qui en retour pourra servir d'à priori pour améliorer la précision de la recherche d'information. Par exemple, en modélisant l'évolution de groupes de notes dans la musique polyphonique, d'accords dans une progression harmonique, de phonèmes dans un énoncé oral ou encore de sources individuelles dans un mélange audio, nous pouvons améliorer significativement les méthodes de transcription polyphonique, de reconnaissance d'accords, de reconnaissance de la parole et de séparation de sources audio respectivement. L'application pratique de nos modèles à ces tâches est détaillée dans les quatre derniers articles présentés dans cette thèse.
Dans le premier article, nous remplaçons la couche de sortie d'un RNN par des machines de Boltzmann restreintes conditionnelles pour décrire des distributions de sortie multimodales beaucoup plus riches. Dans le deuxième article, nous évaluons et proposons des méthodes avancées pour entraîner les RNN. Dans les quatre derniers articles, nous examinons différentes façons de combiner nos modèles symboliques à des réseaux profonds et à la factorisation matricielle non-négative, notamment par des produits d'experts, des architectures entrée/sortie et des cadres génératifs généralisant les modèles de Markov cachés. Nous proposons et analysons également des méthodes d'inférence efficaces pour ces modèles, telles la recherche vorace chronologique, la recherche en faisceau à haute dimension, la recherche en faisceau élagué et la descente de gradient. Finalement, nous abordons les questions de l'étiquette biaisée, du maître imposant, du lissage temporel, de la régularisation et du pré-entraînement.This thesis studies models of high-dimensional sequences based on recurrent neural networks (RNNs) and their application to music and speech. While in principle RNNs can represent the long-term dependencies and complex temporal dynamics present in real-world sequences such as video, audio and natural language, they have not been used to their full potential since their introduction by Rumelhart et al. (1986a) due to the difficulty to train them efficiently by gradient-based optimization. In recent years, the successful application of Hessian-free optimization and other advanced training techniques motivated an increase of their use in many state-of-the-art systems. The work of this thesis is part of this development.
The main idea is to exploit the power of RNNs to learn a probabilistic description of sequences of symbols, i.e. high-level information associated with observed signals, that in turn can be used as a prior to improve the accuracy of information retrieval. For example, by modeling the evolution of note patterns in polyphonic music, chords in a harmonic progression, phones in a spoken utterance, or individual sources in an audio mixture, we can improve significantly the accuracy of polyphonic transcription, chord recognition, speech recognition and audio source separation respectively. The practical application of our models to these tasks is detailed in the last four articles presented in this thesis.
In the first article, we replace the output layer of an RNN with conditional restricted Boltzmann machines to describe much richer multimodal output distributions. In the second article, we review and develop advanced techniques to train RNNs. In the last four articles, we explore various ways to combine our symbolic models with deep networks and non-negative matrix factorization algorithms, namely using products of experts, input/output architectures, and generative frameworks that generalize hidden Markov models. We also propose and analyze efficient inference procedures for those models, such as greedy chronological search, high-dimensional beam search, dynamic programming-like pruned beam search and gradient descent. Finally, we explore issues such as label bias, teacher forcing, temporal smoothing, regularization and pre-training
Proceedings of the 7th Sound and Music Computing Conference
Proceedings of the SMC2010 - 7th Sound and Music Computing Conference, July 21st - July 24th 2010
Recommender systems in industrial contexts
This thesis consists of four parts: - An analysis of the core functions and
the prerequisites for recommender systems in an industrial context: we identify
four core functions for recommendation systems: Help do Decide, Help to
Compare, Help to Explore, Help to Discover. The implementation of these
functions has implications for the choices at the heart of algorithmic
recommender systems. - A state of the art, which deals with the main techniques
used in automated recommendation system: the two most commonly used algorithmic
methods, the K-Nearest-Neighbor methods (KNN) and the fast factorization
methods are detailed. The state of the art presents also purely content-based
methods, hybridization techniques, and the classical performance metrics used
to evaluate the recommender systems. This state of the art then gives an
overview of several systems, both from academia and industry (Amazon, Google
...). - An analysis of the performances and implications of a recommendation
system developed during this thesis: this system, Reperio, is a hybrid
recommender engine using KNN methods. We study the performance of the KNN
methods, including the impact of similarity functions used. Then we study the
performance of the KNN method in critical uses cases in cold start situation. -
A methodology for analyzing the performance of recommender systems in
industrial context: this methodology assesses the added value of algorithmic
strategies and recommendation systems according to its core functions.Comment: version 3.30, May 201