6 research outputs found
Localization of sound sources: a systematic review
Sound localization is a broad field of research with many applications, including communication, radar, medical aids, and speech enhancement, to name but a few. Many different methods have been proposed in this field in recent years. Various types of microphone arrays serve the purpose of sensing the incoming sound. This paper presents an overview of the importance of sound localization in different applications, along with the use and limitations of ad-hoc microphones compared with other microphones. Approaches to overcome these limitations are also presented. Some of the existing methods for sound localization using microphone arrays in the recent literature are explained in detail. Existing methods are compared, along with the factors that influence the choice of one method over the others. This review is done in order to form a basis for choosing the best-fit method for our use
Semi-supervised source localization in reverberant environments with deep generative modeling
We propose a semi-supervised approach to acoustic source localization in
reverberant environments based on deep generative modeling. Localization in
reverberant environments remains an open challenge. Even with large data
volumes, the number of labels available for supervised learning in reverberant
environments is usually small. We address this issue by performing
semi-supervised learning (SSL) with convolutional variational autoencoders
(VAEs) on reverberant speech signals recorded with microphone arrays. The VAE
is trained to generate the phase of relative transfer functions (RTFs) between
microphones, in parallel with a direction of arrival (DOA) classifier based on
RTF-phase. These models are trained using both labeled and unlabeled RTF-phase
sequences. In learning to perform these tasks, the VAE-SSL explicitly learns to
separate the physical causes of the RTF-phase (i.e., source location) from
distracting signal characteristics such as noise and speech activity. Relative
to existing semi-supervised localization methods in acoustics, VAE-SSL is
effectively an end-to-end processing approach which relies on minimal
preprocessing of RTF-phase features. As far as we are aware, our paper presents
the first approach to modeling the physics of acoustic propagation using deep
generative modeling. The VAE-SSL approach is compared with two signal
processing-based approaches, steered response power with phase transform
(SRP-PHAT) and MUltiple SIgnal Classification (MUSIC), as well as fully
supervised CNNs. We find that VAE-SSL can outperform the conventional
approaches and the CNN in label-limited scenarios. Further, the trained VAE-SSL
system can generate new RTF-phase samples, which shows the VAE-SSL approach
learns the physics of the acoustic environment. The generative modeling in
VAE-SSL thus provides a means of interpreting the learned representations.
Comment: Revision, submitted to IEEE Acces
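For readers unfamiliar with the phase-transform baseline named in the abstract above, the following is a minimal sketch of GCC-PHAT delay estimation for a single microphone pair, which is the building block that SRP-PHAT accumulates over microphone pairs and candidate directions. It is not the paper's implementation: the function names `gcc_phat` and `tdoa_to_doa`, the far-field two-microphone geometry, and the default speed of sound are assumptions made for illustration only.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref`, in seconds, via GCC-PHAT.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(sig) + len(ref)  # zero-pad so the correlation is linear, not circular
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    # Phase transform: normalise away the magnitude and keep only the phase.
    cross /= np.maximum(np.abs(cross), 1e-12)
    cc = np.fft.irfft(cross, n=n)

    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so negative lags come first and lag 0 sits at index max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs


def tdoa_to_doa(tau, mic_distance, speed_of_sound=343.0):
    """Convert a time delay to a far-field arrival angle in degrees.

    Assumes two microphones `mic_distance` metres apart and a plane wave;
    0 degrees is broadside to the pair.
    """
    s = np.clip(speed_of_sound * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))


if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(1024)
    # Synthetic second channel: the same noise delayed by 5 samples.
    sig = np.concatenate((np.zeros(5), ref[:-5]))
    tau = gcc_phat(sig, ref, fs)
    print("delay in samples:", round(tau * fs))
    print("angle in degrees:", tdoa_to_doa(tau, mic_distance=0.2))
```

The phase-only weighting is what links this baseline to the RTF-phase features described in the abstract: both discard spectral magnitude and rely on inter-microphone phase, which carries the geometric information about the source.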
The Head Turning Modulation System: An Active Multimodal Paradigm for Intrinsically Motivated Exploration of Unknown Environments
Over the last 20 years, a significant part of the research in exploratory robotics has partially switched from looking for the most efficient way of exploring an unknown environment to finding what could motivate a robot to autonomously explore it. Moreover, a growing literature focuses not only on the topological description of a space (dimensions, obstacles, usable paths, etc.) but also on more semantic components, such as the multimodal objects present in it. In the quest to design robots that behave autonomously by embedding life-long learning abilities, the inclusion of attention mechanisms is important. Indeed, be it endogenous or exogenous, attention constitutes a form of intrinsic motivation, for it can trigger motor commands toward specific stimuli, thus leading to an exploration of the space. The Head Turning Modulation model presented in this paper is composed of two modules providing a robot with two different forms of intrinsic motivation that trigger head movements toward audiovisual sources appearing in unknown environments. First, the Dynamic Weighting module implements a motivation based on the concept of Congruence, defined as an adaptive form of semantic saliency specific to each explored environment. Then, the Multimodal Fusion and Inference module implements a motivation based on the reduction of Uncertainty through a self-supervised online learning algorithm that can autonomously determine local consistencies. One of the novelties of the proposed model is that it relies solely on semantic inputs (namely the audio and visual labels the sources belong to), as opposed to the traditional analysis of the low-level characteristics of the perceived data. Another contribution is found in the way the exploration is exploited to actively learn the relationship between the visual and auditory modalities.
Importantly, the robot, endowed with binocular vision, binaural audition and a rotating head, does not have access to prior information about the different environments it will explore. Consequently, it has to learn in real time which audiovisual objects are of “importance” in order to rotate its head toward them. Results presented in this paper have been obtained in simulated environments as well as with a real robot in realistic experimental conditions