92 research outputs found

    Multi-source TDOA estimation in reverberant audio using angular spectra and clustering

    In this article, we consider the problem of estimating the time differences of arrival (TDOAs) of multiple sources from two-channel reverberant audio mixtures. This is commonly achieved using clustering or angular spectrum-based methods. These methods are limited in that they typically assign the same weight to the spatial information provided by all time-frequency bins and rely on a binary activation model of the sources. Moreover, few experimental comparisons of different methods have been carried out so far. We introduce two new groups of TDOA estimation methods. First, we propose a time-frequency weighting procedure based on a form of signal-to-noise ratio (SNR) that was shown to be efficient for instantaneous mixtures. Second, we introduce new clustering algorithms based on the assumption that all sources can be active in each time-frequency bin. We also study a two-step procedure combining angular spectra and clustering and conduct a large-scale experimental evaluation of the proposed and existing methods. The best average localization performance is achieved by a variant of the generalized cross-correlation with phase transform (GCC-PHAT) method without subsequent clustering. Moreover, one of the SNR-based methods we propose outperforms this method for small microphone spacing.
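For readers unfamiliar with GCC-PHAT, the following is a minimal sketch of the textbook single-source form, not the specific variant evaluated in the article. The function names `dft` and `gcc_phat_delay` are hypothetical, and a naive O(n^2) DFT is used only to keep the example dependency-free; a real implementation would use an FFT library.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(x1, x2, max_lag):
    """Return the lag in samples by which x2 is delayed relative to x1."""
    n = len(x1) + len(x2)  # zero-pad so the circular correlation does not wrap
    spec1 = dft(list(x1) + [0.0] * (n - len(x1)))
    spec2 = dft(list(x2) + [0.0] * (n - len(x2)))

    # Cross-spectrum with PHAT weighting: discard magnitude, keep phase only.
    weighted = []
    for a, b in zip(spec1, spec2):
        c = b * a.conjugate()
        mag = abs(c)
        weighted.append(c / mag if mag > 1e-12 else 0j)

    cc = dft(weighted, inverse=True)

    # Search lags in [-max_lag, max_lag]; negative lags sit at the array end.
    best_lag, best_val = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        val = cc[lag % n].real
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag
```

Dividing the returned lag by the sampling rate gives the TDOA in seconds. Handling several simultaneous sources, as the article does, would require picking multiple peaks or pooling the weighted spectrum over time-frequency bins, which this sketch does not attempt.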

    Acoustic sensor network geometry calibration and applications

    In the modern world, we are increasingly surrounded by computation devices with communication links and one or more microphones. Such devices are, for example, smartphones, tablets, laptops or hearing aids. These devices can work together as nodes in an acoustic sensor network (ASN). Such networks are a growing platform that opens the possibility for many practical applications. ASN-based speech enhancement, source localization, and event detection can be applied for teleconferencing, camera control, automation, or assisted living. For applications of this kind, the awareness of auditory objects and their spatial positioning are key properties. In order to provide these two kinds of information, novel methods have been developed in this thesis. Information on the type of auditory objects is provided by a novel real-time sound classification method. Information on the position of human speakers is provided by a novel localization and tracking method. In order to localize with respect to the ASN, the relative arrangement of the sensor nodes has to be known. Therefore, different novel geometry calibration methods were developed.

    Sound classification. The first method addresses the task of identification of auditory objects. A novel application of the bag-of-features (BoF) paradigm on acoustic event classification and detection was introduced. It can be used for event and speech detection as well as for speaker identification. The use of both mel frequency cepstral coefficient (MFCC) and Gammatone frequency cepstral coefficient (GFCC) features improves the classification accuracy. By using soft quantization and introducing supervised training for the BoF model, superior accuracy is achieved. The method generalizes well from limited training data. It works online and can be computed in a fraction of real time. By a dedicated training strategy based on a hierarchy of stationarity, the detection of speech in mixtures with noise was realized. This makes the method robust against severe noise levels corrupting the speech signal. Thus it is possible to provide control information to a beamformer in order to realize blind speech enhancement. A reliable improvement is achieved in the presence of one or more stationary noise sources.

    Speaker localization. The localization method enables each node to determine the direction of arrival (DoA) of concurrent sound sources. The author's neuro-biologically inspired speaker localization method for microphone arrays was refined for use in ASNs. By implementing a dedicated cochlear and midbrain model, it is robust against the reverberation found in indoor rooms. In order to better model the unknown number of concurrent speakers, an application of the EM algorithm that realizes probabilistic clustering according to auditory scene analysis (ASA) principles was introduced. Based on this approach, a system for Euclidean tracking in ASNs was designed. Each node applies the node-wise localization method and shares probabilistic DoA estimates together with an estimate of the spectral distribution with the network. As this information is relatively sparse, it can be transmitted with low bandwidth. The system is robust against jitter and transmission errors. The information from all nodes is integrated according to spectral similarity to correctly associate concurrent speakers. By incorporating the intersection angle in the triangulation, the precision of the Euclidean localization is improved. Tracks of concurrent speakers are computed over time, as is shown with recordings in a reverberant room.

    Geometry calibration. The central task of geometry calibration has been solved with special focus on sensor nodes equipped with multiple microphones. Novel methods were developed for different scenarios. An audio-visual method was introduced for the calibration of ASNs in video conferencing scenarios. The DoA estimates are fused with visual speaker tracking in order to provide sensor positions in a common coordinate system. A novel acoustic calibration method determines the relative positioning of the nodes from ambient sounds alone. Unlike previous methods that only infer the positioning of distributed microphones, the DoA is incorporated and thus it becomes possible to calibrate the orientation of the nodes with high accuracy. This is very important for all applications using the spatial information, as the triangulation error increases dramatically with bad orientation estimates. As speech events can be used, the calibration becomes possible without the requirement of playing dedicated calibration sounds. Based on this, an online method employing a genetic algorithm with incremental measurements was introduced. By using the robust speech localization method, the calibration is computed in parallel to the tracking. The online method is able to calibrate ASNs in real time, as is shown with recordings of natural speakers in a reverberant room.

    The informed acoustic sensor network. All new methods are important building blocks for the use of ASNs. The online methods for localization and calibration both make use of the neuro-biologically inspired processing in the nodes, which leads to state-of-the-art results, even in reverberant enclosures. The high robustness and reliability can be improved even more by including the event detection method in order to exclude non-speech events. When all methods are combined, both semantic information on what is happening in the acoustic scene and spatial information on the positioning of the speakers and sensor nodes are automatically acquired in real time. This realizes truly informed audio processing in ASNs. Practical applicability is shown by application to recordings in reverberant rooms. The contribution of this thesis is thus not only to advance the state of the art in automatically acquiring information on the acoustic scene, but also to push the practical applicability of such methods.
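The triangulation step mentioned in the abstract can be illustrated with a small sketch under simplifying assumptions: two nodes with known 2-D positions and absolute bearings in radians. The function name `triangulate` is hypothetical. The abstract only says that the intersection angle is incorporated, so how the thesis actually uses it is not shown here; the sketch merely returns the angle so that a caller could, for instance, down-weight near-parallel bearings.

```python
import math


def triangulate(p1, theta1, p2, theta2):
    """Intersect two bearing lines.

    Returns ((x, y), intersection_angle) or None if the lines are
    nearly parallel. The lines are treated as infinite, so a caller
    may still want to reject intersections that lie behind a node.
    """
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))

    # 2-D cross product of the directions, equal to sin(theta2 - theta1).
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None

    # Solve p1 + t1 * d1 = p2 + t2 * d2 for t1.
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom
    point = (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])

    # Angle between the two lines, in [0, pi/2]; small values mean
    # the position estimate is sensitive to bearing errors.
    angle = math.asin(min(1.0, abs(denom)))
    return point, angle
```

For example, nodes at (0, 0) and (2, 0) with bearings of 45 and 135 degrees intersect at (1, 1) with a right-angle intersection, which is the best-conditioned case.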

    Model-based Sparse Component Analysis for Reverberant Speech Localization

    In this paper, the problem of multiple speaker localization via speech separation based on model-based sparse recovery is studied. We compare and contrast computational sparse optimization methods incorporating harmonicity and block structures as well as autoregressive dependencies underlying the spectrographic representation of speech signals. The results demonstrate the effectiveness of the block sparse Bayesian learning framework incorporating autoregressive correlations in achieving highly accurate localization performance. Furthermore, a significant improvement is obtained when ad hoc microphones are used for the data acquisition set-up instead of a compact microphone array.

    An Iterative Approach to Source Counting and Localization Using Two Distant Microphones


    Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments

    We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. The paper makes the following contributions. We use the direct-path relative transfer function (DP-RTF), an inter-channel feature that encodes acoustic information robust against reverberation, and we propose an online algorithm well suited for estimating DP-RTFs associated with moving audio sources. Another crucial ingredient of the proposed method is its ability to properly assign DP-RTFs to audio-source directions. Towards this goal, we adopt a maximum-likelihood formulation and we propose to use an exponentiated gradient (EG) to efficiently update source-direction estimates starting from their currently available values. The problem of multiple speaker tracking is computationally intractable because the number of possible associations between observed source directions and physical speakers grows exponentially with time. We adopt a Bayesian framework and we propose a variational approximation of the posterior filtering distribution associated with multiple speaker tracking, as well as an efficient variational expectation-maximization (VEM) solver. The proposed online localization and tracking method is thoroughly evaluated using two datasets that contain recordings performed in real environments. Comment: IEEE Journal of Selected Topics in Signal Processing, 201
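As background for the exponentiated gradient (EG) mentioned above, here is a minimal sketch of a generic EG step on a probability vector. It assumes, for illustration only, that the quantity being updated is a set of non-negative weights over candidate directions that sum to one; the abstract does not spell out the paper's objective or parameterization, and the name `eg_step` is hypothetical.

```python
import math


def eg_step(weights, grad, step_size):
    """One exponentiated-gradient update for minimizing a loss.

    Each weight is scaled by exp(-step_size * gradient) and the result
    is renormalized, so the output stays positive and sums to one.
    """
    unnormalized = [w * math.exp(-step_size * g) for w, g in zip(weights, grad)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Because the update is multiplicative, it starts from the currently available weights and never leaves the probability simplex, which is one reason EG suits online settings where estimates are refined frame by frame.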

    Source counting in real-time sound source localization using a circular microphone array

    Recently, we proposed an approach inspired by Sparse Component Analysis for real-time localization of multiple sound sources using a circular microphone array. The method was based on identifying time-frequency zones where only one source is active, reducing the problem to single-source localization for these zones. A histogram of estimated Directions of Arrival (DOAs) was formed and then processed to obtain improved DOA estimates, assuming that the number of sources was known. In this paper, we extend our previous work by proposing three different methods for counting the number of sources by looking for prominent peaks in the derived histogram based on: (a) performing a peak search, (b) processing an LPC-smoothed version of the histogram, (c) employing a matching pursuit-based approach. The third approach is shown to perform very accurately in simulated reverberant conditions and additive noise, and its computational requirements are very small.
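The simplest of the three counting strategies, the peak search in (a), can be sketched as follows. This is an illustrative version only: the bin width, the relative threshold, and the function name `count_sources` are assumptions rather than values or names from the paper, and the matching pursuit variant that the abstract reports as most accurate is not shown.

```python
def count_sources(doas_deg, bin_width=10, rel_threshold=0.3):
    """Estimate source DOAs as prominent peaks of a circular histogram.

    doas_deg: per-zone DOA estimates in degrees (any real values).
    Returns the bin centers of the detected peaks; their number is the
    estimated source count.
    """
    if not doas_deg:
        return []

    n_bins = 360 // bin_width
    hist = [0] * n_bins
    for d in doas_deg:
        # The final modulo guards against d % 360 rounding up to 360.0.
        hist[int((d % 360) // bin_width) % n_bins] += 1

    floor = rel_threshold * max(hist)
    peaks = []
    for i, count in enumerate(hist):
        left = hist[(i - 1) % n_bins]   # neighbors wrap around the circle
        right = hist[(i + 1) % n_bins]
        # Strictly above the left neighbor and not below the right one,
        # so a flat two-bin plateau is reported once.
        if count >= floor and count > left and count >= right:
            peaks.append(i * bin_width + bin_width / 2)
    return peaks
```

With estimates clustered around 45 and 205 degrees plus a few isolated outliers, the function returns the two bin centers and ignores the outliers, because they fall below the relative threshold.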

    A non-intrusive method for estimating binaural speech intelligibility from noise-corrupted signals captured by a pair of microphones

    A non-intrusive method is introduced to predict binaural speech intelligibility in noise directly from signals captured using a pair of microphones. The approach combines signal processing techniques in blind source separation and localisation with an intrusive objective intelligibility measure (OIM). Therefore, unlike classic intrusive OIMs, this method requires neither a clean reference speech signal nor knowledge of the source locations to operate. The proposed approach is able to estimate intelligibility in stationary and fluctuating noises, when the noise masker is presented as a point or diffused source and is spatially separated from the target speech source on a horizontal plane. The performance of the proposed method was evaluated in two rooms. When predicting subjective intelligibility measured as word recognition rate, this method showed reasonable predictive accuracy with correlation coefficients above 0.82, which is comparable to that of a reference intrusive OIM in most of the conditions. The proposed approach offers a solution for fast binaural intelligibility prediction, and therefore has practical potential to be deployed in situations where on-site speech intelligibility is a concern.