330 research outputs found

    Real-time Microphone Array Processing for Sound-field Analysis and Perceptually Motivated Reproduction

    This thesis details real-time implementations of sound-field analysis and perceptually motivated reproduction methods for visualisation and auralisation purposes. For visualisation, various methods for displaying the relative distribution of sound energy from one point in space are investigated and contrasted, including a novel reformulation of the cross-pattern coherence (CroPaC) algorithm, which integrates a new side-lobe suppression technique. For auralisation, listening tests were conducted to compare ambisonics reproduction with a novel headphone formulation of the directional audio coding (DirAC) method. The results indicate that the side-lobe suppressed CroPaC method offers greater spatial selectivity in reverberant conditions than other popular approaches, and that the new DirAC formulation yields higher perceived spatial accuracy than the ambisonics method.

    Blind identification of Ambisonic reduced room impulse response

    The recently proposed Generalized Time-domain Velocity Vector (GTVV) is a generalization of the relative room impulse response in the spherical harmonic (aka Ambisonic) domain that allows for blind estimation of early-echo parameters: the directions and relative delays of individual reflections. However, the derived closed-form expression of the GTVV requires a few assumptions to hold, the most important being that the impulse response of the reference signal is a minimum-phase filter. In practice, the reference is obtained by spatial filtering towards the Direction-of-Arrival of the source, and the aforementioned condition is bounded by the performance of the applied beamformer (and thus by the Ambisonic array order). In the present work, we propose to circumvent this problem by properly modelling the GTVV time series, which makes it possible not only to relax the initial assumptions but also to extract the information therein in a more consistent and efficient manner, entering the realm of blind system identification. Experiments using measured room impulse responses confirm the effectiveness of the proposed approach. Comment: Submitte


    筑波大学 (University of Tsukuba)201

    Raking the Cocktail Party

    We present the concept of an acoustic rake receiver, a microphone beamformer that uses echoes to improve noise and interference suppression. The rake idea is well-known in wireless communications; it involves constructively combining different multipath components that arrive at the receiver antennas. Unlike spread-spectrum signals used in wireless communications, speech signals are not orthogonal to their shifts. Therefore, we focus on the spatial structure, rather than the temporal one. Instead of explicitly estimating the channel, we create correspondences between early echoes in time and image sources in space. These multiple sources of the desired and the interfering signal offer additional spatial diversity that we can exploit in the beamformer design. We present several "intuitive" and optimal formulations of acoustic rake receivers, and show theoretically and numerically that the rake formulation of the maximum signal-to-interference-and-noise beamformer offers significant performance boosts in terms of noise and interference suppression. Beyond signal-to-noise ratio, we observe gains in terms of the perceptual evaluation of speech quality (PESQ) metric for the speech quality. We accompany the paper by the complete simulation and processing chain written in Python. The code and the sound samples are available online at http://lcav.github.io/AcousticRakeReceiver/. Comment: 12 pages, 11 figures, Accepted for publication in IEEE Journal on Selected Topics in Signal Processing (Special Issue on Spatial Audio
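The idea of treating early echoes as image sources can be illustrated with a small narrowband sketch. This is not the code released with the paper and not its maximum-SINR formulation; it is a simplified delay-and-sum-style rake, assuming a 2-D rectangular room, perfectly reflecting walls, first-order images only, free-field point-source propagation, and spatially white noise. All names (`steering_vector`, `first_order_images`, the room and array geometry) are hypothetical choices for illustration.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s


def steering_vector(src, mics, freq):
    """Free-field transfer function from a point source to each microphone."""
    dist = np.linalg.norm(mics - src, axis=1)
    return np.exp(-2j * np.pi * freq * dist / C) / (4 * np.pi * dist)


def first_order_images(src, room):
    """Mirror a 2-D source across the four walls of the room [0, lx] x [0, ly]."""
    x, y = src
    lx, ly = room
    return np.array([[-x, y], [2 * lx - x, y], [x, -y], [x, 2 * ly - y]])


def unit_norm(w):
    """Scale weights to unit norm so white-noise output power is identical."""
    return w / np.linalg.norm(w)


# Hypothetical geometry: 5 m x 4 m room, one source, 6-element linear array.
room = (5.0, 4.0)
source = np.array([1.5, 2.5])
mics = np.array([[3.0 + 0.05 * m, 1.0] for m in range(6)])
freq = 1000.0  # analysis frequency in Hz

a_direct = steering_vector(source, mics, freq)
a_echoes = sum(steering_vector(img, mics, freq)
               for img in first_order_images(source, room))
a_total = a_direct + a_echoes  # direct path plus early echoes at the array

w_ds = unit_norm(a_direct)    # conventional beamformer: direct path only
w_rake = unit_norm(a_total)   # rake: also matched to the image sources


def output_power(w):
    """Power of the desired signal at the beamformer output, |w^H a_total|^2."""
    return abs(np.vdot(w, a_total)) ** 2


gain_ds = output_power(w_ds)
gain_rake = output_power(w_rake)
print(f"rake vs. direct-only gain: {10 * np.log10(gain_rake / gain_ds):.2f} dB")
```

Because both weight vectors have unit norm, they pass the same amount of white noise, so the printed ratio is the signal-to-noise improvement under these assumptions. By the Cauchy-Schwarz inequality the rake weights can never do worse than the direct-path weights in this setting.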

    Nonlinear Least Squares Methods for Joint DOA and Pitch Estimation


    Adaptive time-frequency analysis for cognitive source separation

    This thesis introduces a framework for separating two speech sources in non-ideal, reverberant environments. The source separation architecture tries to mimic the extraordinary abilities of the human auditory system when performing source separation. A movable human dummy head residing in a normal office room is used to model the conditions humans experience when listening to complex auditory scenes. This thesis first investigates how the orthogonality of speech sources in the time-frequency domain drops with different reverberation times of the environment and shows that separation schemes based on ideal binary time-frequency masks are suitable for source separation even under human-like reverberant conditions. Prior to separating the sources, the movable human dummy head analyzes the auditory scene and estimates the positions of the sources and the fundamental frequency tracks. The source localization is implemented using an iterative approach based on the interaural time differences between the two ears and achieves a localization blur of less than three degrees in the azimuth plane. The source separation architecture implemented in this thesis extracts the orthogonal time-frequency points of the speech mixtures. It combines the positive features of the STFT with the positive features of the cochleagram representation. The overall goal of the source separation is to find the ideal STFT mask. The core source separation process, however, is based on the analysis of the corresponding region in an additionally computed cochleagram, which yields more reliable Interaural Time Difference (ITD) estimates that are used for separation. Several algorithms based on the ITD and the fundamental frequency of the target source are evaluated for their source separation capabilities. To enhance the separation capabilities of the individual algorithms, their results are combined to compute a final estimate. In this way, SIR gains of approximately 30 dB are achieved for two-source scenarios. For three-source scenarios, SIR gains of up to 16 dB are attained. Compared to standard binaural signal processing approaches such as DUET and Fixed Beamforming, the presented approach achieves up to 29 dB SIR gain.
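As a rough illustration of what an interaural time difference (ITD) estimate involves, the sketch below picks the peak of the broadband cross-correlation between the two ear signals. This is a plain textbook estimator, not the iterative, cochleagram-based method of the thesis; the function name `estimate_itd`, the 1 ms search range, and the synthetic test signal are assumptions made for the example.

```python
import numpy as np


def estimate_itd(left, right, fs, max_itd=1e-3):
    """Estimate the ITD in seconds from the cross-correlation peak.

    A positive result means the right-ear signal lags the left-ear signal,
    i.e. the sound reached the left ear first.
    """
    max_lag = int(round(max_itd * fs))
    # c[k] = sum_n right[n + k] * left[n], for k = -(len(left)-1) .. len(right)-1
    xcorr = np.correlate(right, left, mode="full")
    lags = np.arange(-(len(left) - 1), len(right))
    # Only search lags that are physically plausible for a human-sized head.
    plausible = np.abs(lags) <= max_lag
    best = np.argmax(xcorr[plausible])
    return lags[plausible][best] / fs


# Synthetic check: white noise that reaches the right ear 5 samples late.
fs = 16000
delay = 5
rng = np.random.default_rng(0)
left = rng.standard_normal(2048)
right = np.concatenate([np.zeros(delay), left[:-delay]])

itd = estimate_itd(left, right, fs)
print(f"estimated ITD: {itd * 1e6:.1f} microseconds")
```

In a time-frequency separation scheme, an estimate like this would be computed per band or per time-frequency region, and points whose ITD matches the target direction would be kept in the binary mask.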

    Statistics of natural reverberation enable perceptual separation of sound and space

    In everyday listening, sound reaches our ears directly from a source as well as indirectly via reflections known as reverberation. Reverberation profoundly distorts the sound from a source, yet humans can both identify sound sources and distinguish environments from the resulting sound, via mechanisms that remain unclear. The core computational challenge is that the acoustic signatures of the source and environment are combined in a single signal received by the ear. Here we ask whether our recognition of sound sources and spaces reflects an ability to separate their effects and whether any such separation is enabled by statistical regularities of real-world reverberation. To first determine whether such statistical regularities exist, we measured impulse responses (IRs) of 271 spaces sampled from the distribution encountered by humans during daily life. The sampled spaces were diverse, but their IRs were tightly constrained, exhibiting exponential decay at frequency-dependent rates: mid frequencies reverberated longest, whereas higher and lower frequencies decayed more rapidly, presumably due to absorptive properties of materials and air. To test whether humans leverage these regularities, we manipulated IR decay characteristics in simulated reverberant audio. Listeners could discriminate sound sources and environments from these signals, but their abilities degraded when reverberation characteristics deviated from those of real-world environments. Subjectively, atypical IRs were mistaken for sound sources. The results suggest the brain separates sound into contributions from the source and the environment, constrained by a prior on natural reverberation. This separation process may contribute to robust recognition while providing information about the spaces around us.
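The exponential-decay regularity can be made concrete with a short sketch that synthesises an impulse response as exponentially decaying noise and recovers its decay time by Schroeder backward integration. This is a generic textbook procedure, not the measurement or analysis pipeline of the study; it is broadband rather than per frequency band, and the function names, the -5 to -25 dB fitting range, and the 0.5 s target are assumptions for illustration.

```python
import numpy as np


def synth_ir(rt60, fs, duration, rng):
    """Gaussian noise with an exponential envelope that drops 60 dB at t = rt60."""
    t = np.arange(int(duration * fs)) / fs
    envelope = np.exp(-3.0 * np.log(10.0) * t / rt60)
    return rng.standard_normal(t.size) * envelope


def estimate_rt60(ir, fs):
    """Estimate RT60 from the slope of the Schroeder energy decay curve."""
    # Backward integration: energy remaining after each sample.
    energy = np.cumsum(ir[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(energy / energy[0])
    # Fit a line to the -5 dB .. -25 dB portion and extrapolate to -60 dB.
    fit = (edc_db <= -5.0) & (edc_db >= -25.0)
    t = np.arange(ir.size) / fs
    slope, _ = np.polyfit(t[fit], edc_db[fit], 1)
    return -60.0 / slope


fs = 16000
rng = np.random.default_rng(1)
ir = synth_ir(rt60=0.5, fs=fs, duration=1.5, rng=rng)
rt60_est = estimate_rt60(ir, fs)
print(f"estimated RT60: {rt60_est:.3f} s")
```

To approximate the frequency-dependent decay described in the abstract, the same estimate would be applied after band-pass filtering the impulse response into sub-bands, giving one decay time per band.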