102 research outputs found

    Acoustic DOA estimation using space alternating sparse Bayesian learning

    Adaptive time-frequency analysis for cognitive source separation

    This thesis introduces a framework for separating two speech sources in non-ideal, reverberant environments. The source separation architecture mimics the human auditory system's extraordinary ability to separate sources. A movable human dummy head placed in a normal office room models the conditions humans experience when listening to complex auditory scenes. The thesis first investigates how the orthogonality of speech sources in the time-frequency domain decreases for different reverberation times, and shows that separation schemes based on ideal binary time-frequency masks remain suitable under human-like reverberant conditions. Before separating the sources, the dummy head analyzes the auditory scene and estimates the source positions and the fundamental frequency tracks of the speakers. Source localization uses an iterative approach based on the interaural time differences (ITDs) between the two ears and achieves a localization blur of less than three degrees in the azimuth plane. The separation architecture extracts the orthogonal time-frequency points of the speech mixtures by combining the strengths of the short-time Fourier transform (STFT) with those of the cochleagram representation. The overall goal is to find the ideal STFT mask. The core separation process, however, analyzes the corresponding region of an additionally computed cochleagram, which yields more reliable ITD estimates for separation. Several algorithms based on the ITD and the fundamental frequency of the target source are evaluated for their separation capabilities. To improve on the individual algorithms, their results are combined into a final estimate. In this way, signal-to-interference ratio (SIR) gains of approximately 30 dB are achieved for two-source scenarios, and SIR gains of up to 16 dB for three-source scenarios. Compared with standard binaural signal processing approaches such as DUET and fixed beamforming, the presented approach achieves up to 29 dB SIR gain.
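To make the terms "ideal binary time-frequency mask" and "SIR gain" in the abstract above concrete, here is a minimal sketch in Python. It is not the thesis implementation: the function names `ideal_binary_mask` and `sir_db`, the random toy spectrograms, and the array shape are all assumptions chosen for illustration. The sketch also assumes that the separate target and interferer time-frequency representations are known, which is exactly what makes the mask "ideal".

```python
import numpy as np


def ideal_binary_mask(target_tf, interferer_tf):
    """Return 1.0 in bins where the target magnitude exceeds the interferer, else 0.0."""
    return (np.abs(target_tf) > np.abs(interferer_tf)).astype(float)


def sir_db(target_tf, interferer_tf, mask=None):
    """Signal-to-interference ratio in dB, optionally after applying a mask."""
    if mask is None:
        mask = np.ones(np.shape(target_tf))
    p_target = np.sum(np.abs(mask * target_tf) ** 2)
    p_interf = np.sum(np.abs(mask * interferer_tf) ** 2)
    return 10.0 * np.log10(p_target / p_interf)


# Hypothetical toy data: complex "spectrograms" with 64 frequency bins and 100 frames.
rng = np.random.default_rng(0)
shape = (64, 100)
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
interferer = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

mask = ideal_binary_mask(target, interferer)
sir_in = sir_db(target, interferer)         # SIR of the unprocessed mixture
sir_out = sir_db(target, interferer, mask)  # SIR after keeping target-dominated bins
sir_gain = sir_out - sir_in
print(f"SIR gain: {sir_gain:.1f} dB")
```

Random noise overlaps heavily in the time-frequency plane, so the gain here is small. Speech sources are far sparser and closer to orthogonal, which is why binary masking can reach much larger gains on real mixtures.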

    Acoustic Echo Estimation using the model-based approach with Application to Spatial Map Construction in Robotics

    Informed Sound Source Localization for Hearing Aid Applications

    On Improved Training of CNN for Acoustic Source Localisation

    Detecting, locating and recognising human touches in social robots with contact microphones

    Touch gestures occur in many everyday situations of natural human–human interaction: meeting people (shaking hands), personal relationships (caresses), moments of celebration or sadness (hugs), and so on. Since robots are expected to become part of our daily life, they should be able to recognise these touch gestures and the part of their body that has been touched, because the meaning of a gesture may differ with its location. This work therefore presents a learning system for both purposes: detecting and recognising the type of touch gesture (stroke, tickle, tap and slap) and localising it. Interpreting the meaning of the gesture is outside the scope of this paper. Various technologies have been applied to touch perception in social robots, commonly using a large number of sensors. Our approach instead uses three contact microphones installed inside some parts of the robot. The audio signals generated when the user touches the robot are sensed by the contact microphones and processed using machine learning techniques. We acquired data from sensors installed in two social robots, Maggie and Mini (both developed by the RoboticsLab at the Carlos III University of Madrid), and a real-time version of the whole system has been deployed in the robot Mini. The system allows the robot to sense whether it has been touched, to recognise the kind of touch gesture, and to estimate its approximate location. The main advantage of contact microphones as touch sensors is that a single one can "cover" a whole solid part of the robot. In addition, the sensors are unaffected by ambient noise such as human voices, TV or music. However, with several contact microphones a touch gesture may be detected by all of them, and each may recognise a different gesture at the same time. The results show that the system is robust against this phenomenon. Moreover, the accuracy obtained for both robots is about 86%.

    The research leading to these results has received funding from the project "Robots Sociales para Estimulación Física, Cognitiva y Afectiva de Mayores (ROSES)", funded by the Spanish "Ministerio de Ciencia, Innovación y Universidades", and from RoboCity2030-DIH-CM, Madrid Robotics Digital Innovation Hub, S2018/NMT-4331, funded by "Programas de Actividades I+D en la Comunidad de Madrid" and cofunded by Structural Funds of the EU, Slovak Republic.