
    Towards End-to-End Acoustic Localization using Deep Learning: from Audio Signal to Source Position Coordinates

    This paper presents a novel approach for indoor acoustic source localization using microphone arrays, based on a Convolutional Neural Network (CNN). The proposed solution is, to the best of our knowledge, the first published work in which the CNN is designed to directly estimate the three-dimensional position of an acoustic source, using the raw audio signal as the input and avoiding the use of hand-crafted audio features. Given the limited amount of available localization data, we propose a two-step training strategy. We first train our network using semi-synthetic data generated from close-talk speech recordings, in which we simulate the time delays and distortion suffered by the signal as it propagates from the source to the microphone array. We then fine-tune this network using a small amount of real data. Our experimental results show that this strategy produces networks that significantly improve on existing localization methods based on SRP-PHAT strategies. In addition, our experiments show that our CNN method is more robust to the gender of the speaker and to different window sizes than the other methods. Comment: 18 pages, 3 figures, 8 tables
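As a rough illustration of the PHAT weighting that underlies the SRP-PHAT baselines mentioned above, the following sketch estimates the time delay between two microphone signals with GCC-PHAT. It is not the paper's method: the function name `gcc_phat`, the sampling rate, and the white-noise test signal are assumptions made for this example, and the integer-sample delay loosely mimics how a semi-synthetic signal could be produced by shifting a clean recording.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau):
    """Estimate the delay of `sig` relative to `ref` in seconds (hypothetical helper)."""
    n = len(sig) + len(ref)               # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)            # cross-power spectrum
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)         # generalized cross-correlation
    max_shift = int(fs * max_tau)         # only search physically plausible lags
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(cc) - max_shift) / fs

# Assumed test setup: white noise delayed by a known number of samples.
fs = 16000
true_delay = 5
rng = np.random.default_rng(0)
ref = rng.standard_normal(4096)
sig = np.concatenate((np.zeros(true_delay), ref[:-true_delay]))
tau = gcc_phat(sig, ref, fs, max_tau=0.001)
```

SRP-PHAT combines correlations of this kind over all microphone pairs and candidate source positions; the sketch covers only the single-pair building block.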

    Block-Online Multi-Channel Speech Enhancement Using DNN-Supported Relative Transfer Function Estimates

    This work addresses the problem of block-online processing for multi-channel speech enhancement. Such processing is vital in scenarios with moving speakers and/or when very short utterances are processed, e.g., in voice assistant scenarios. We consider several variants of a system that performs beamforming supported by DNN-based voice activity detection (VAD) followed by post-filtering. The speaker is targeted by estimating relative transfer functions between microphones. Each block of the input signals is processed independently in order to make the method applicable in highly dynamic environments. Owing to the short length of the processed block, the statistics required by the beamformer are estimated less precisely. The influence of this inaccuracy is studied and compared to the processing regime in which recordings are treated as one block (batch processing). The experimental evaluation of the proposed method is performed on large datasets of CHiME-4 and on another dataset featuring a moving target speaker. The experiments are evaluated in terms of objective and perceptual criteria (such as signal-to-interference ratio (SIR) or perceptual evaluation of speech quality (PESQ), respectively). Moreover, the word error rate (WER) achieved by a baseline automatic speech recognition system, for which the enhancement method serves as a front-end solution, is evaluated. The results indicate that the proposed method is robust with respect to the short length of the processed block. Significant improvements in terms of the criteria and WER are observed even for a block length of 250 ms. Comment: 10 pages, 8 figures, 4 tables. Modified version of the article accepted for publication in IET Signal Processing journal. Original results unchanged, additional experiments presented, refined discussion and conclusion
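The abstract does not say which beamformer is used. As one common possibility, the sketch below computes MVDR weights for a single frequency bin, using a relative transfer function (RTF) vector as the steering vector. The choice of MVDR, the name `mvdr_weights`, and the random covariance and RTF values are all assumptions for illustration.

```python
import numpy as np

def mvdr_weights(noise_cov, rtf):
    """MVDR weights w = R^-1 h / (h^H R^-1 h) for one frequency bin (hypothetical helper)."""
    num = np.linalg.solve(noise_cov, rtf)   # R^-1 h without forming the inverse
    return num / (rtf.conj() @ num)         # normalize for unit gain towards the target

# Assumed toy data: 4 microphones, random Hermitian positive-definite noise covariance.
rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
noise_cov = A @ A.conj().T + 1e-3 * np.eye(M)

# The RTF is defined relative to a reference microphone, so its first entry is 1.
rtf = np.concatenate(([1.0], rng.standard_normal(M - 1) + 1j * rng.standard_normal(M - 1)))

w = mvdr_weights(noise_cov, rtf)
response = w.conj() @ rtf                   # distortionless constraint: should equal 1
```

With short blocks, `noise_cov` would be estimated from few frames, which is the source of the estimation inaccuracy the abstract refers to.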

    Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments

    We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. The paper makes the following contributions. We use the direct-path relative transfer function (DP-RTF), an inter-channel feature that encodes acoustic information robust against reverberation, and we propose an online algorithm well suited for estimating DP-RTFs associated with moving audio sources. Another crucial ingredient of the proposed method is its ability to properly assign DP-RTFs to audio-source directions. Towards this goal, we adopt a maximum-likelihood formulation and we propose to use an exponentiated gradient (EG) to efficiently update source-direction estimates starting from their currently available values. The problem of multiple speaker tracking is computationally intractable because the number of possible associations between observed source directions and physical speakers grows exponentially with time. We adopt a Bayesian framework and we propose a variational approximation of the posterior filtering distribution associated with multiple speaker tracking, as well as an efficient variational expectation-maximization (VEM) solver. The proposed online localization and tracking method is thoroughly evaluated using two datasets that contain recordings performed in real environments. Comment: IEEE Journal of Selected Topics in Signal Processing, 201
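To illustrate the generic exponentiated gradient update named above, the sketch below applies multiplicative updates to a weight vector over candidate directions while keeping it on the probability simplex. This is not the paper's likelihood: the linear cost, the step size `eta`, and the function name `eg_step` are assumptions for this example.

```python
import numpy as np

def eg_step(w, grad, eta):
    """One exponentiated-gradient step on the probability simplex (hypothetical helper)."""
    # Subtracting grad.min() leaves the normalized result unchanged but avoids overflow.
    z = w * np.exp(-eta * (grad - grad.min()))
    return z / z.sum()

# Assumed toy objective: minimize the linear cost c @ w over the simplex.
cost = np.array([0.9, 0.2, 0.5])
w = np.full(3, 1.0 / 3.0)          # start from the currently available (uniform) estimate
for _ in range(200):
    w = eg_step(w, cost, eta=0.5)  # the gradient of c @ w is c itself
```

Because each step only rescales the current weights, the update starts from the previous estimate and never leaves the simplex, which suits online tracking.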

    Influence of microphone housing on the directional response of piezoelectric MEMS microphones inspired by Ormia ochracea

    The influence of custom microphone housings on the acoustic directionality and frequency response of a multiband bio-inspired MEMS microphone is presented. The 3.2 mm by 1.7 mm piezoelectric MEMS microphone, fabricated by a cost-effective multi-user process, has four frequency bands of operation below 10 kHz, with a desired first-order directionality for all four bands. Bespoke 3-D-printed housings measuring 7 × 7 × 2.5 mm³, with varying acoustic access to the backside of the microphone membrane, are investigated through simulation and experiment with respect to their influence on the directionality and frequency response to sound stimulus. Results show a clear link between directionality and acoustic access to the back cavity of the microphone. Furthermore, the direction of the first-order directionality changed when the height of this back cavity acoustic access was reduced. The required configuration for creating an identical directionality for all four frequency bands is investigated, along with the influence of reducing the symmetry of the acoustic back cavity access. This paper highlights the overall requirement of considering housing geometries and their influence on acoustic behavior for bio-inspired directional microphones.

    Size constancy in bat biosonar?

    Perception and encoding of object size is an important feature of sensory systems. In the visual system, object size is encoded by the visual angle (visual aperture) on the retina, but the aperture depends on the distance of the object. As object distance is not unambiguously encoded in the visual system, higher computational mechanisms are needed. This phenomenon is termed "size constancy". It is assumed to reflect an automatic re-scaling of visual aperture with perceived object distance. Recently, it was found that in echolocating bats, the "sonar aperture", i.e., the range of angles from which sound is reflected from an object back to the bat, is unambiguously perceived and neurally encoded. Moreover, it is well known that object distance is accurately perceived and explicitly encoded in bat sonar. Here, we addressed size constancy in bat biosonar using virtual-object techniques. Bats of the species Phyllostomus discolor learned to discriminate two simple virtual objects that differed only in sonar aperture. Upon successful discrimination, test trials were randomly interspersed using virtual objects that differed in both aperture and distance. It was tested whether the bats spontaneously assigned absolute width information to these objects by combining distance and aperture. The results showed that while the isolated perceptual cues encoding object width, aperture, and distance were all perceptually well resolved by the bats, the animals did not assign absolute width information to the test objects. This lack of sonar size constancy may result from the bats relying on different modalities to extract size information at different distances. Alternatively, it is conceivable that familiarity with a behaviorally relevant, conspicuous object is required for sonar size constancy, as has been argued for visual size constancy. Based on the current data, it appears that size constancy is not necessarily an essential feature of sonar perception in bats.
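The geometric relation behind "combining distance and aperture" can be sketched as follows, under the simplifying assumption of a flat object centered on and perpendicular to the line of sight. The function names and the numeric values are illustrative and are not taken from the study.

```python
import math

def object_width(distance_m, aperture_deg):
    """Absolute width subtended by an aperture angle at a given distance."""
    return 2.0 * distance_m * math.tan(math.radians(aperture_deg) / 2.0)

def aperture_for_width(width_m, distance_m):
    """Aperture angle (degrees) that an object of a given width subtends."""
    return math.degrees(2.0 * math.atan(width_m / (2.0 * distance_m)))

# Assumed example values: the same aperture at twice the distance implies twice the width.
near = object_width(0.5, 30.0)
far = object_width(1.0, 30.0)

# An object of fixed width subtends a smaller aperture when it is farther away.
aperture_near = aperture_for_width(near, 0.5)
aperture_far = aperture_for_width(near, 1.0)
```

Size constancy would mean judging `near` and `far` as different widths despite equal apertures, which is the combination the bats in the study did not spontaneously perform.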