Model Order Estimation in the Presence of Multipath Interference Using Residual Convolutional Neural Networks
Model order estimation (MOE) is often a prerequisite for Direction of
Arrival (DoA) estimation. Due to limits imposed by array geometry, it is
typically not possible to estimate spatial parameters for an arbitrary number
of sources; an estimate of the signal model is usually required. MOE is the
process of selecting the most likely signal model from several candidates.
While classic methods fail at MOE in the presence of coherent multipath
interference, data-driven supervised learning models can solve this problem.
Instead of the classic MLP (Multilayer Perceptron) or CNN (Convolutional
Neural Network) architectures, we propose the application of Residual
Convolutional Neural Networks (RCNNs), with grouped symmetric kernel filters to
deliver state-of-the-art estimation accuracy of up to 95.2% in the presence of
coherent multipath, and a weighted loss function to eliminate underestimation
error of the model order. We show the benefit of the approach by demonstrating
its impact on an overall signal processing flow that determines the number of
total signals received by the array, the number of independent sources, and the
association of each of the paths with those sources. Moreover, we show that
the proposed estimator provides accurate performance over a variety of array
types, can identify the overloaded scenario, and ultimately provides strong DoA
estimation and signal association performance.
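The abstract above mentions a weighted loss function that eliminates underestimation of the model order. As an illustration of the general idea (not the paper's actual loss), a cross-entropy over candidate model orders can be augmented with a penalty on the probability mass assigned to orders below the truth; the `penalty` knob and the toy probability vectors here are hypothetical:

```python
import numpy as np

def underestimation_weighted_loss(probs, true_order, penalty=5.0):
    """Cross-entropy over candidate model orders plus a penalty on the
    probability mass placed below the true order. The `penalty` value
    is a hypothetical knob, not a number from the paper."""
    eps = 1e-12
    ce = -np.log(probs[true_order] + eps)
    under_mass = probs[:true_order].sum()   # mass on orders < true order
    return ce + penalty * under_mass

# Toy check: 4 candidate orders, true order index = 2
p_good = np.array([0.05, 0.05, 0.85, 0.05])
p_under = np.array([0.60, 0.20, 0.15, 0.05])  # tends to underestimate
assert underestimation_weighted_loss(p_under, 2) > underestimation_weighted_loss(p_good, 2)
```

Because the penalty term only counts mass below the true order, overestimation is left to the plain cross-entropy while underestimation is charged extra, which is the asymmetry the abstract describes.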
Neural Network-Based DOA Estimation in the Presence of Non-Gaussian Interference
This work addresses the problem of direction-of-arrival (DOA) estimation in
the presence of non-Gaussian, heavy-tailed, and spatially-colored interference.
Conventionally, the interference is considered to be Gaussian-distributed and
spatially white. However, in practice, this assumption is not guaranteed, which
results in degraded DOA estimation performance. Maximum likelihood DOA
estimation in the presence of non-Gaussian and spatially colored interference
is computationally complex and not practical. Therefore, this work proposes a
neural network (NN) based DOA estimation approach for spatial spectrum
estimation in multi-source scenarios with a-priori unknown number of sources in
the presence of non-Gaussian spatially-colored interference. The proposed
approach utilizes a single NN instance for simultaneous source enumeration and
DOA estimation. It is shown via simulations that the proposed approach
significantly outperforms conventional and NN-based approaches in terms of
probability of resolution, estimation accuracy, and source enumeration accuracy
in conditions of low SIR, small sample support, and when the angular separation
between the source DOAs and the spatially-colored interference is small.
Comment: Submitted to IEEE Transactions on Aerospace and Electronic Systems
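To illustrate what heavy-tailed interference looks like relative to the conventional Gaussian assumption, the sketch below draws spatially-colored compound-Gaussian (complex t-like) snapshots by dividing colored Gaussian noise by a gamma-distributed texture; the coloring matrix and degrees-of-freedom value are arbitrary choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 100_000                       # sensors, snapshots

# Arbitrary spatial coloring: L @ L^H plays the role of the interference
# covariance (not a matrix from the paper)
L = np.tril(rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)))

# Spatially-colored Gaussian interference
g = L @ ((rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2))

# Heavy-tailed, complex t-like interference: the same Gaussian snapshots
# divided by the square root of a gamma "texture" (chi-squared over nu)
nu = 2.5
tau = rng.gamma(nu / 2, 2 / nu, size=N)
t = g / np.sqrt(tau)

# Sample kurtosis on one sensor's real part: about 3 for the Gaussian case,
# much larger for the heavy-tailed case
def kurt(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2

assert kurt(t[0].real) > kurt(g[0].real)
```

The spiky snapshots produced by the small texture values are exactly what degrades estimators derived under a Gaussian likelihood.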
Array signal processing for source localization and enhancement
"A common approach to the wide-band microphone array problem is to assume a certain array geometry and then design optimal weights (often in subbands) to meet a set of desired criteria. In addition to the weights, we consider the geometry of the microphone arrangement to be part of the optimization problem. Our approach is to use particle swarm optimization (PSO) to search for the optimal geometry while using an optimal weight design to compute the weights for each particle's geometry. The resulting directivity indices (DIs) and white noise SNR gains (WNGs) form the basis of the PSO's fitness function. Another important consideration in the optimal weight design is the set of regularization parameters. By including those parameters in the particles, we optimize their values as well during the operation of the PSO. The proposed method allows the user great flexibility in specifying desired DIs and WNGs over frequency by virtue of the PSO fitness function.
Although the above method addresses beam and null steering for fixed locations, real-time scenarios require us to estimate the source positions in order to steer the beam adaptively. We also investigate source localization of sound and RF sources using machine learning techniques. For RF source localization, we consider radio frequency identification (RFID) antenna tags. Using a planar RFID antenna array with beam-steering capability and the received signal strength indicator (RSSI) value captured for each beam position, the position of each RFID antenna tag is estimated. The proposed approach is also shown to perform well under various challenging scenarios." --Abstract, page iv
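The geometry-plus-weights search described above rests on a standard PSO loop. The following minimal sketch shows the mechanics with a toy fitness that merely spreads four microphones along a line, standing in for the DI/WNG fitness of the thesis; all constants are generic textbook values, not the thesis's settings:

```python
import numpy as np

def pso(fitness, dim, n_particles=30, iters=200, seed=0):
    """Minimal particle swarm optimizer (maximization) with standard
    inertia/cognitive/social constants. Illustration only."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n_particles, dim))      # particle positions
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([fitness(p) for p in x])
    g = pbest[pbest_f.argmax()].copy()              # global best
    w, c1, c2 = 0.7, 1.5, 1.5
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, -1, 1)
        f = np.array([fitness(p) for p in x])
        imp = f > pbest_f
        pbest[imp], pbest_f[imp] = x[imp], f[imp]
        g = pbest[pbest_f.argmax()].copy()
    return g, pbest_f.max()

# Toy stand-in for the DI/WNG fitness: maximize the smallest spacing
# between four microphones placed on the segment [-1, 1]
def toy_fitness(positions):
    d = np.abs(positions[:, None] - positions[None, :])
    return d[np.triu_indices_from(d, 1)].min()

best, best_f = pso(toy_fitness, dim=4)
assert best_f > 0.4          # a well-spread layout (optimum is 2/3)
```

In the thesis the particle would also carry the regularization parameters of the weight design, so the same loop jointly optimizes geometry and weight-design settings.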
Trennung und Schätzung der Anzahl von Audiosignalquellen mit Zeit- und Frequenzüberlappung (Separation and Count Estimation of Audio Signal Sources with Temporal and Frequency Overlap)
Everyday audio recordings involve mixture signals: music contains a mixture of instruments; in a meeting or conference, there is a mixture of human voices. For these mixtures, automatically separating or estimating the number of sources is a challenging task. A common assumption when processing mixtures in the time-frequency domain is that sources are not fully overlapped. However, in this work we consider some cases where the overlap is severe, for instance when instruments play the same note (unison) or when many people speak concurrently ("cocktail party"), highlighting the need for new representations and more powerful models.
To address the problems of source separation and count estimation, we use conventional signal processing techniques as well as deep neural networks (DNNs). We first address the source separation problem for unison instrument mixtures, studying the distinct spectro-temporal modulations caused by vibrato. To exploit these modulations, we developed a method based on time warping, informed by an estimate of the fundamental frequency. For cases where such estimates are not available, we present an unsupervised model, inspired by the way humans group time-varying sources (common fate). This contribution comes with a novel representation that improves separation for overlapped and modulated sources on unison mixtures, but also improves vocal and accompaniment separation when used as an input for a DNN model.
Then, we focus on estimating the number of sources in a mixture, which is important for real-world scenarios. Our work on count estimation was motivated by a study on how humans address this task, which led us to conduct listening experiments, confirming that humans are only able to correctly estimate the number of sources up to four. To answer the question of whether machines can perform similarly, we present a DNN architecture trained to estimate the number of concurrent speakers. Our results show improvements compared to other methods, and the model even outperformed humans on the same task.
In both the source separation and source count estimation tasks, the key contribution of this thesis is the concept of "modulation", which is important to computationally mimic human performance. Our proposed Common Fate Transform is an adequate representation to disentangle overlapping signals for separation, and an inspection of our DNN count estimation model revealed that it proceeds to find modulation-like intermediate features.
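The modulation idea behind the Common Fate Transform can be illustrated with a second Fourier transform taken along the time axis of a magnitude spectrogram: two sources sharing one carrier frequency (a unison) but carrying different amplitude-modulation rates become separable in the modulation domain. The rates and signal parameters below are invented for illustration:

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
# Two sources on the SAME carrier (a unison), distinguished only by their
# amplitude-modulation rates (4 Hz and 7 Hz); values invented for illustration
x1 = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 440 * t)
x2 = (1 + 0.5 * np.sin(2 * np.pi * 7 * t)) * np.sin(2 * np.pi * 440 * t)
mix = x1 + x2

# First transform: STFT magnitude (windowed frames, FFT along each frame)
frame, hop = 256, 128
frames = np.stack([mix[i:i + frame] * np.hanning(frame)
                   for i in range(0, len(mix) - frame, hop)])
spec = np.abs(np.fft.rfft(frames, axis=1))           # shape (time, freq)

# Second transform along time: modulation spectrum of the carrier bin
bin440 = round(440 * frame / fs)
env = spec[:, bin440] - spec[:, bin440].mean()
mod = np.abs(np.fft.rfft(env))
mod_freqs = np.fft.rfftfreq(len(env), d=hop / fs)    # frame rate = fs / hop

# The two AM rates show up as separate peaks in the modulation domain
peaks = mod_freqs[np.argsort(mod)[-2:]]
assert any(abs(p - 4) < 1 for p in peaks) and any(abs(p - 7) < 1 for p in peaks)
```

In an ordinary spectrogram the two sources occupy the same time-frequency bins, but their different modulation rates separate them along the added modulation axis, which is the intuition behind the common-fate representation.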
Design, implementation and evaluation of an acoustic source localization system using Deep Learning techniques
This Master Thesis presents a novel approach for indoor acoustic source localization using microphone
arrays, based on a Convolutional Neural Network (CNN) that we call the ASLNet. It directly estimates
the three-dimensional position of a single acoustic source using as inputs the raw audio signals from a set
of microphones. We use supervised learning methods to train our network end-to-end. The amount of
labeled training data available for this problem is however small. This Thesis presents a training strategy
based on two steps that mitigates this problem. We first train our network using semi-synthetic data
generated from close talk speech recordings and a mathematical model for signal propagation from the
source to the microphones. The amount of semi-synthetic data can be virtually as large as needed. We
then fine-tune the resulting network using a small amount of real data. Our experimental results, evaluated
on a publicly available dataset recorded in a real room, show that this approach is able to outperform existing
localization methods based on SRP-PHAT strategies, as well as those presented in very recent proposals
based on Convolutional Recurrent Neural Networks (CRNNs). In addition, our experiments show that the
performance of the ASLNet does not show a relevant dependency on the speaker's gender, nor on the
size of the signal window being used. This work also investigates methods to improve the generalization
properties of our network using only semi-synthetic data for training. This is a highly important objective
due to the cost of labelling localization data. We proceed by including specific effects in the input signals
to force the network to be insensitive to multipath, high noise and distortion likely to be present in real
scenarios. We obtain promising results with this strategy, although they still lag behind strategies based
on fine-tuning.
Máster Universitario en Ingeniería de Telecomunicación (M125
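The SRP-PHAT baselines referenced above build on the GCC-PHAT time-delay estimate between microphone pairs. A minimal sketch of that building block, with a synthetic integer-sample delay (the delay value and test signal are arbitrary, not from the thesis):

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """GCC-PHAT time-delay estimate between two microphone signals.
    The PHAT weighting whitens the cross-spectrum so only phase
    (i.e., delay) information remains."""
    n = len(sig) + len(ref)                     # zero-pad to avoid wrap-around
    X = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    X /= np.abs(X) + 1e-12                      # PHAT weighting
    cc = np.fft.irfft(X, n)
    shift = np.argmax(np.abs(cc))
    if shift > n // 2:                          # map to signed lag
        shift -= n
    return shift / fs

fs = 16000
rng = np.random.default_rng(1)
src = rng.standard_normal(4096)
delay = 25                                      # samples (~1.56 ms)
mic1 = src
mic2 = np.concatenate([np.zeros(delay), src[:-delay]])

tau = gcc_phat(mic2, mic1, fs)
assert abs(tau - delay / fs) < 1e-4
```

SRP-PHAT then sums such PHAT-weighted correlations over all microphone pairs for each candidate source position and picks the position with the highest steered response power.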
Spatial dissection of a soundfield using spherical harmonic decomposition
A real-world soundfield is often contributed to by multiple desired and undesired sound sources. The performance of many acoustic systems, such as automatic speech recognition, audio surveillance, and teleconferencing, relies on their ability to extract the desired sound components in such a mixed environment. The existing solutions to the above problem are constrained by various fundamental limitations and require enforcing different priors depending on the acoustic conditions, such as reverberation and the spatial distribution of sound sources. With the growing emphasis on and integration of audio applications in diverse technologies such as smart home and virtual reality appliances, it is imperative to advance source separation technology in order to overcome the limitations of the traditional approaches.
To that end, we exploit the harmonic decomposition model to dissect a mixed soundfield into its underlying desired and undesired components based on source and signal characteristics. By analysing the spatial projection of a soundfield, we achieve multiple outcomes such as (i) soundfield separation with respect to distinct source regions, (ii) source separation in a mixed soundfield using modal coherence model, and (iii) direction of arrival (DOA) estimation of multiple overlapping sound sources through pattern recognition of the modal coherence of a soundfield.
We first employ an array of higher order microphones for soundfield separation in order to reduce hardware requirements and implementation complexity. Subsequently, we develop novel mathematical models for the modal coherence of noisy and reverberant soundfields that facilitate convenient ways of estimating DOAs and power spectral densities, leading to robust source separation algorithms. The modal domain approach to soundfield/source separation allows us to circumvent several practical limitations of the existing techniques and enhance the performance and robustness of the system. The proposed methods are presented with several practical applications and performance evaluations using simulated and real-life datasets.
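The spatial projection underlying this modal analysis can be illustrated with an order-1 real spherical harmonic decomposition: projecting a cardioid-like directivity pattern onto the order-0/1 basis recovers the source direction from the first-order coefficients. The source direction and sampling density below are arbitrary illustration choices, not values from the thesis:

```python
import numpy as np

# Real spherical harmonics up to order 1 (theta: colatitude, phi: azimuth)
def sh_basis(theta, phi):
    return np.stack([
        np.full_like(theta, 0.5 / np.sqrt(np.pi)),               # Y_0^0
        np.sqrt(3 / (4 * np.pi)) * np.sin(theta) * np.sin(phi),  # Y_1^-1 (y)
        np.sqrt(3 / (4 * np.pi)) * np.cos(theta),                # Y_1^0  (z)
        np.sqrt(3 / (4 * np.pi)) * np.sin(theta) * np.cos(phi),  # Y_1^1  (x)
    ])

# Cardioid-like directivity peaked toward a source at azimuth 60 degrees
rng = np.random.default_rng(0)
theta = np.arccos(rng.uniform(-1, 1, 20000))     # uniform points on the sphere
phi = rng.uniform(0, 2 * np.pi, 20000)
src = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3), 0.0])
dirs = np.stack([np.sin(theta) * np.cos(phi),
                 np.sin(theta) * np.sin(phi),
                 np.cos(theta)])
f = 1 + dirs.T @ src

# Least-squares projection onto the basis gives the modal coefficients
coef, *_ = np.linalg.lstsq(sh_basis(theta, phi).T, f, rcond=None)

# The order-1 coefficients point back at the source direction
est = np.array([coef[3], coef[1], coef[2]])      # reorder to (x, y, z)
est /= np.linalg.norm(est)
assert np.allclose(est, src, atol=0.05)
```

Higher orders refine the spatial resolution in the same way; the thesis works with the coherence of such modal coefficients rather than the raw projection shown here.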
Sound Source Localization and Modeling: Spherical Harmonics Domain Approaches
Sound source localization has been an important research topic in the acoustic signal processing community because of its wide use in many acoustic applications, including speech separation, speech enhancement, sound event detection, automatic speech recognition, automated camera steering, and virtual reality. Over the recent decade, there has been growing interest in research on sound source localization using higher-order microphone arrays, which are capable of recording and analyzing the soundfield over a target spatial area. This thesis studies a novel source feature called the relative harmonic coefficient, which is easily estimated from the higher-order microphone measurements. This source feature has direct applications for sound source localization due to its sole dependence on the source position.
This thesis proposes two novel sound source localization algorithms using the relative harmonic coefficients: (i) a low-complexity single-source localization approach that localizes the source's elevation and azimuth separately. This approach is also applicable to acoustic enhancement for higher-order microphone array recordings; (ii) a semi-supervised multi-source localization algorithm for noisy and reverberant environments. Although this approach uses a learning scheme, it still has strong potential to be implemented in practice because only a limited number of labeled measurements are required. However, this algorithm has an inherent limitation, as it requires the availability of single-source components. Thus, it is unusable in scenarios where the original recordings have limited single-source components (e.g., multiple sources simultaneously active). To address this issue, we develop a novel MUSIC framework based approach that directly uses simultaneous multi-source recordings. This developed MUSIC approach uses robust measurements of relative sound pressure from the higher-order microphone and is shown to be more suitable in noisy environments than the traditional MUSIC method.
While the proposed approaches address the source localization problems, in practice the broader problem of source localization involves some more common challenges that have received less attention. One such challenge is the common assumption that the sound sources are omnidirectional, which is hardly the case with a typical commercial loudspeaker. Therefore, in this thesis, we address the broader problem of analyzing the directional characteristics of commercial loudspeakers by deriving equivalent theoretical acoustic models. Several acoustic models are investigated, including plane wave decomposition, point source decomposition, and mixed source decomposition. We finally conduct extensive experimental examinations to see which acoustic model shares the most similar characteristics with commercial loudspeakers.
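For context on the MUSIC framework mentioned above, the sketch below implements standard narrowband MUSIC on a uniform linear array (not the spherical-harmonic variant developed in the thesis); the array size, source DOAs, and noise level are invented for illustration:

```python
import numpy as np

def music_spectrum(X, n_src, angles_deg, d=0.5):
    """Standard narrowband MUSIC pseudo-spectrum for a ULA with element
    spacing d (in wavelengths). X is (sensors, snapshots), complex."""
    M = X.shape[0]
    R = X @ X.conj().T / X.shape[1]                  # sample covariance
    _, vecs = np.linalg.eigh(R)                      # eigenvalues ascending
    En = vecs[:, : M - n_src]                        # noise subspace
    th = np.deg2rad(np.asarray(angles_deg))
    A = np.exp(-2j * np.pi * d * np.arange(M)[:, None] * np.sin(th))
    return 1.0 / np.linalg.norm(En.conj().T @ A, axis=0) ** 2

# Synthetic scene: 8-element ULA, two uncorrelated sources, light noise
rng = np.random.default_rng(0)
M, N, d = 8, 500, 0.5
doas = np.array([-20.0, 30.0])                       # ground truth (degrees)
A = np.exp(-2j * np.pi * d * np.arange(M)[:, None] * np.sin(np.deg2rad(doas)))
S = (rng.standard_normal((2, N)) + 1j * rng.standard_normal((2, N))) / np.sqrt(2)
noise = 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
X = A @ S + noise

grid = np.arange(-90.0, 90.5, 0.5)
P = music_spectrum(X, 2, grid)

# Pick the two strongest local maxima of the pseudo-spectrum
locmax = (P[1:-1] > P[:-2]) & (P[1:-1] > P[2:])
cand, cand_P = grid[1:-1][locmax], P[1:-1][locmax]
best = cand[np.argsort(cand_P)[-2:]]
assert np.allclose(np.sort(best), doas, atol=1.0)
```

The thesis replaces the ULA steering vectors with relative-sound-pressure measurements from a higher-order microphone, but the subspace machinery (noise-subspace projection, pseudo-spectrum peak search) is the same.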