10,174 research outputs found
Resynthesis of Acoustic Scenes Combining Sound Source Separation and WaveField Synthesis Techniques
Source separation has been a subject of intense research in many signal processing applications, ranging
from speech processing to medical image analysis. Applied to spatial audio systems, it can be used to
overcome one fundamental limitation in 3D scene resynthesis: the need to have the independent
signals for each source available. Wave-field Synthesis (WFS) is a spatial sound reproduction system that can
synthesize an acoustic field by means of loudspeaker arrays and can also position several
sources in space. However, the individual signals corresponding to these sources must be available, and
obtaining them is often difficult. In this work, we propose to use sound source separation techniques
to obtain different tracks from stereo and mono mixtures. Several separation methods have
been implemented and tested, one of them developed by the author. Although existing
algorithms are far from achieving hi-fi quality, subjective tests show that an optimal
separation is not necessary to obtain acceptable results in 3D scene reproduction.
Cobos Serrano, M. (2007). Resynthesis of Acoustic Scenes Combining Sound Source Separation and WaveField Synthesis Techniques. http://hdl.handle.net/10251/12515
Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice Separation
Monaural Singing Voice Separation (MSVS) is a challenging task and has been
studied for decades. Deep neural networks (DNNs) are the current
state-of-the-art methods for MSVS. However, the existing DNNs are often
designed manually, which is time-consuming and error-prone. In addition, the
network architectures are usually pre-defined, and not adapted to the training
data. To address these issues, we introduce a Neural Architecture Search (NAS)
method to the structure design of DNNs for MSVS. Specifically, we propose a new
multi-resolution Convolutional Neural Network (CNN) framework for MSVS, namely
Multi-Resolution Pooling CNN (MRP-CNN), which uses pooling operators of
various sizes to extract multi-resolution features. Based on the NAS, we then
develop an evolving framework, namely Evolving MRP-CNN (E-MRP-CNN), by
automatically searching the effective MRP-CNN structures using genetic
algorithms, optimized in terms of a single-objective considering only
separation performance, or multi-objective considering both the separation
performance and the model complexity. The multi-objective E-MRP-CNN gives a set
of Pareto-optimal solutions, each providing a trade-off between separation
performance and model complexity. Quantitative and qualitative evaluations on
the MIR-1K and DSD100 datasets are used to demonstrate the advantages of the
proposed framework over several recent baselines
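The Pareto-optimal set mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' search procedure: the candidate names, scores, and costs below are made-up values, and the sketch only shows how non-dominated trade-offs between separation performance (higher is better) and model complexity (lower is better) might be filtered from a population.

```python
from typing import List, Tuple

# Each candidate is (name, separation_score, complexity).
# All values are hypothetical and chosen only for illustration.
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[str]:
    """Return the names of candidates that no other candidate dominates.

    A candidate is dominated when another one is at least as good on both
    objectives (score >= and complexity <=) and strictly better on one.
    """
    front = []
    for name, score, cost in candidates:
        dominated = any(
            (s >= score and c <= cost) and (s > score or c < cost)
            for _, s, c in candidates
        )
        if not dominated:
            front.append(name)
    return front


candidates = [
    ("A", 10.0, 5.0),
    ("B", 9.0, 2.0),
    ("C", 8.5, 3.0),   # worse than B on both objectives
    ("D", 11.0, 9.0),
    ("E", 10.0, 6.0),  # same score as A but more complex
]
print(pareto_front(candidates))
```

Each surviving candidate represents a different trade-off; choosing among them is left to the user.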
Dictionary Learning for Sparse Representations With Applications to Blind Source Separation.
During the past decade, sparse representation has attracted much attention in the signal processing community. It aims to represent a signal as a linear combination of a small number of elementary signals called atoms. These atoms constitute a dictionary so that a signal can be expressed by the multiplication of the dictionary and a sparse coefficient vector. This leads to two main challenges that are studied in the literature, i.e. sparse coding (find the coding coefficients based on a given dictionary) and dictionary design (find an appropriate dictionary to fit the data). Dictionary design is the focus of this thesis. Traditionally, the signals can be decomposed by a predefined mathematical transform, such as the discrete cosine transform (DCT), which forms the so-called analytical approach. In recent years, learning-based methods have been introduced to adapt the dictionary from a set of training data, leading to the technique of dictionary learning. Although this may involve a higher computational complexity, learned dictionaries have the potential to offer improved performance as compared with predefined dictionaries. Dictionary learning is often achieved by iteratively executing two operations: sparse approximation and dictionary update. We focus on the dictionary update step, where the dictionary is optimized with a given sparsity pattern. A novel framework is proposed to generalize benchmark mechanisms such as the method of optimal directions (MOD) and K-SVD where an arbitrary set of codewords and the corresponding sparse coefficients are simultaneously updated, hence the term simultaneous codeword optimization (SimCO). Moreover, its extended formulation ‘regularized SimCO’ mitigates the major bottleneck of dictionary update caused by singular points. First and second order optimization procedures are designed to solve the primitive and regularized SimCO.
In addition, a tree-structured multi-level representation of the dictionary based on clustering is used to speed up the optimization process in the sparse coding stage. This novel dictionary learning algorithm is also applied to the underdetermined blind speech separation problem, leading to a multi-stage method, where the separation problem is reformulated as a sparse coding problem, with the dictionary being learned by an adaptive algorithm. Using mutual coherence and sparsity index, the performance of a variety of dictionaries for underdetermined speech separation is compared and analyzed, such as the dictionaries learned from speech mixtures and ground truth speech sources, as well as those predefined by mathematical transforms. Finally, we propose a new method for joint dictionary learning and source separation. Unlike the multi-stage method, the proposed method can simultaneously estimate the mixing matrix, the dictionary and the sources in an alternating and blind manner. The advantages of all the proposed methods are demonstrated over the state-of-the-art methods using extensive numerical tests
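The two alternating operations described in this abstract can be written out in standard notation. The symbols are assumptions introduced here, not taken from the thesis: X is the matrix of training signals (one per column), D the dictionary (one atom per column), A the coefficient matrix with columns a_j, and k the sparsity level. The update shown is the classical MOD least-squares step that SimCO is said to generalize, not SimCO itself.

```latex
\begin{align}
  % Dictionary learning objective
  \min_{D,\,A}\ & \|X - DA\|_F^2
      \quad \text{s.t.} \quad \|a_j\|_0 \le k \ \ \text{for every column } a_j \text{ of } A, \\
  % Sparse approximation: D fixed, solve for each column of A
  a_j \leftarrow\ & \arg\min_{a}\ \|x_j - D a\|_2^2
      \quad \text{s.t.} \quad \|a\|_0 \le k, \\
  % MOD dictionary update: A fixed, least-squares solution for D
  D \leftarrow\ & X A^{\top} \left( A A^{\top} \right)^{-1}.
\end{align}
```

The MOD step assumes that A A^T is invertible; in practice the atoms are usually renormalized to unit length after each update.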
Biologically-Plausible Determinant Maximization Neural Networks for Blind Separation of Correlated Sources
Extraction of latent sources of complex stimuli is critical for making sense
of the world. While the brain solves this blind source separation (BSS) problem
continuously, its algorithms remain unknown. Previous work on
biologically-plausible BSS algorithms assumed that observed signals are linear
mixtures of statistically independent or uncorrelated sources, limiting the
domain of applicability of these algorithms. To overcome this limitation, we
propose novel biologically-plausible neural networks for the blind separation
of potentially dependent/correlated sources. Differing from previous work, we
assume some general geometric, not statistical, conditions on the source
vectors allowing separation of potentially dependent/correlated sources.
Concretely, we assume that the source vectors are sufficiently scattered in
their domains which can be described by certain polytopes. Then, we consider
recovery of these sources by the Det-Max criterion, which maximizes the
determinant of the output correlation matrix to enforce a similar spread for
the source estimates. Starting from this normative principle, and using a
weighted similarity matching approach that enables arbitrary linear
transformations adaptable by local learning rules, we derive two-layer
biologically-plausible neural network algorithms that can separate mixtures
into sources coming from a variety of source domains. We demonstrate that our
algorithms outperform other biologically-plausible BSS algorithms on correlated
source separation problems.
Comment: NeurIPS 2022, 37 pages
Pitch-Informed Solo and Accompaniment Separation
This thesis addresses the development of a system for pitch-informed solo
and accompaniment separation capable of separating main instruments from
music accompaniment regardless of the musical genre of the track, or type
of music accompaniment. For the solo instrument, only pitched monophonic
instruments were considered in a single-channel scenario where no panning
or spatial location information is available.
In the proposed method, pitch information is used as an initial stage of a
sinusoidal modeling approach that attempts to estimate the spectral
information of the solo instrument from a given audio mixture. Instead of
estimating the solo instrument on a frame by frame basis, the proposed
method gathers information of tone objects to perform separation.
Tone-based processing allowed the inclusion of novel processing stages for
attack refinement, transient interference reduction, common amplitude
modulation (CAM) of tone objects, and for better estimation of non-harmonic
elements that can occur in musical instrument tones. The proposed solo and
accompaniment algorithm is an efficient method suitable for real-world
applications.
A study was conducted to better model magnitude, frequency, and phase of
isolated musical instrument tones. As a result of this study, temporal
envelope smoothness, inharmonicity of musical instruments, and phase
expectation were exploited in the proposed separation method. Additionally,
an algorithm for harmonic/percussive separation based on phase expectation
was proposed. The algorithm shows improved perceptual quality with respect
to state-of-the-art methods for harmonic/percussive separation.
The proposed solo and accompaniment method obtained perceptual quality
scores comparable to other state-of-the-art algorithms under the SiSEC 2011
and SiSEC 2013 campaigns, and outperformed the comparison algorithm on the
instrumental dataset described in this thesis.
As a use-case of solo and
accompaniment separation, a listening test procedure was conducted to
assess separation quality requirements in the context of music education.
Results from the listening test showed that solo and accompaniment tracks
should be optimized differently to suit quality requirements of music
education. The Songs2See application was presented as commercial music
learning software which includes the proposed solo and accompaniment
separation method
Martian time-series unraveled: A multi-scale nested approach with factorial variational autoencoders
Unsupervised source separation involves unraveling an unknown set of source
signals recorded through a mixing operator, with limited prior knowledge about
the sources, and only access to a dataset of signal mixtures. This problem is
inherently ill-posed and is further challenged by the variety of time-scales
exhibited by sources in time series data. Existing methods typically rely on a
preselected window size that limits their capacity to handle multi-scale
sources. To address this issue, instead of operating in the time domain, we
propose an unsupervised multi-scale clustering and source separation framework
by leveraging wavelet scattering covariances that provide a low-dimensional
representation of stochastic processes, capable of distinguishing between
different non-Gaussian stochastic processes. Nested within this representation
space, we develop a factorial Gaussian-mixture variational autoencoder that is
trained to (1) probabilistically cluster sources at different time-scales and
(2) independently sample scattering covariance representations associated with
each cluster. Using samples from each cluster as prior information, we
formulate source separation as an optimization problem in the wavelet
scattering covariance representation space, resulting in separated sources in
the time domain. When applied to seismic data recorded during the NASA InSight
mission on Mars, our multi-scale nested approach proves to be a powerful tool
for discriminating between sources varying greatly in time-scale, e.g.,
minute-long transient one-sided pulses (known as "glitches") and structured
ambient noises resulting from atmospheric activities that typically last for
tens of minutes. These results provide an opportunity to conduct further
investigations into the isolated sources related to atmospheric-surface
interactions, thermal relaxations, and other complex phenomena
Generative rhythmic models
A system for generative rhythmic modeling is presented. The work aims to explore computational models of creativity, realizing them in a system designed for realtime generation of semi-improvisational music. This is envisioned as an attempt to develop musical intelligence in the context of structured improvisation, and by doing so to enable and encourage new forms of musical control and performance; the systems described in this work, already capable of realtime creation, have been designed with the explicit intention of embedding them in a variety of performance-based systems. A model of qaida, a solo tabla form, is presented, along with the results of an online survey comparing it to a professional tabla player's recording on dimensions of musicality, creativity, and novelty. The qaida model generates a bank of rhythmic variations by reordering subphrases. Selections from this bank are sequenced using a feature-based approach. An experimental extension into modeling layer- and loop-based forms of electronic music is presented, in which the initial modeling approach is generalized. Starting from a seed track, the layer-based model utilizes audio analysis techniques such as blind source separation and onset-based segmentation to generate layers which are shuffled and recombined to generate novel music in a manner analogous to the qaida model.
M.S. Committee Chair: Chordia, Parag; Committee Member: Freeman, Jason; Committee Member: Weinberg, Gi
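The idea of building a bank of variations by reordering subphrases and then selecting from it by a feature can be sketched as follows. This is a toy illustration, not the model from the thesis: the stroke strings are placeholder tokens, and leading_dha_count is a hypothetical feature invented here to show the selection step.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Variation = Tuple[str, ...]


def variation_bank(subphrases: Sequence[str]) -> List[Variation]:
    """All distinct reorderings of the subphrases, excluding the original order."""
    original = tuple(subphrases)
    return sorted({p for p in permutations(original) if p != original})


def leading_dha_count(variation: Variation) -> int:
    """Hypothetical feature: how many 'dha' tokens open the variation."""
    return variation[0].split().count("dha")


# Placeholder theme made of three subphrases (illustrative tokens only).
theme = ["dha dha te te", "dha ge tu na", "ta ta te te"]

bank = variation_bank(theme)
# Sequence the bank from the lowest to the highest feature value.
sequence = sorted(bank, key=leading_dha_count)

for variation in sequence:
    print(" | ".join(variation))
```

With three subphrases the bank holds five variations; a real system would need to constrain the reorderings, since the number of permutations grows factorially.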
VenoMave: Targeted Poisoning Against Speech Recognition
The wide adoption of Automatic Speech Recognition (ASR) has remarkably enhanced
human-machine interaction. Prior research has demonstrated that modern ASR
systems are susceptible to adversarial examples, i.e., malicious audio inputs
that lead to misclassification by the victim's model at run time. The research
question of whether ASR systems are also vulnerable to data-poisoning attacks
is still unanswered. In such an attack, a manipulation happens during the
training phase: an adversary injects malicious inputs into the training set to
compromise the neural network's integrity and performance. Prior work in the
image domain demonstrated several types of data-poisoning attacks, but these
results cannot directly be applied to the audio domain. In this paper, we
present the first data-poisoning attack against ASR, called VenoMave. We
evaluate our attack on an ASR system that detects sequences of digits. When
poisoning only 0.17% of the dataset on average, we achieve an attack success
rate of 86.67%. To demonstrate the practical feasibility of our attack, we also
evaluate if the target audio waveform can be played over the air via simulated
room transmissions. In this more realistic threat model, VenoMave still
maintains a success rate up to 73.33%. We further extend our evaluation to the
Speech Commands corpus and demonstrate the scalability of VenoMave to a larger
vocabulary. During a transcription test with human listeners, we verify that
more than 85% of the original text of poisons can be correctly transcribed. We
conclude that data-poisoning attacks against ASR represent a real threat, and
we are able to perform poisoning for arbitrary target input files while the
crafted poison samples remain inconspicuous
Emotion Recognition from EEG Signal Focusing on Deep Learning and Shallow Learning Techniques
Recently, electroencephalogram-based emotion recognition has become crucial in enabling the Human-Computer Interaction (HCI) system to become more intelligent. Due to the outstanding applications of emotion recognition, e.g., person-based decision making, mind-machine interfacing, cognitive interaction, affect detection, feeling detection, etc., emotion recognition has attracted considerable attention in recent AI-empowered research. Numerous studies have therefore been conducted using a range of approaches, which calls for a systematic review of the methodologies used for this task, with their feature sets and techniques. Such a review can guide beginners in composing an effective emotion recognition system. In this article, we have conducted a rigorous review of the state-of-the-art emotion recognition systems published in recent literature, and summarized some of the common emotion recognition steps with relevant definitions, theories, and analyses to provide key knowledge for developing a proper framework. Moreover, the studies included here were divided into two categories: i) deep learning-based, and ii) shallow machine learning-based emotion recognition systems. The reviewed systems were compared based on methods, classifier, the number of classified emotions, accuracy, and dataset used. An informative comparison, recent research trends, and some recommendations are also provided for future research directions