98 research outputs found

    Model-based Sparse Component Analysis for Reverberant Speech Localization

    In this paper, the problem of multiple speaker localization via speech separation based on model-based sparse recovery is studied. We compare and contrast computational sparse optimization methods incorporating harmonicity and block structures as well as autoregressive dependencies underlying the spectrographic representation of speech signals. The results demonstrate the effectiveness of a block-sparse Bayesian learning framework incorporating autoregressive correlations in achieving highly accurate localization. Furthermore, a significant improvement is obtained using an ad-hoc microphone set-up for data acquisition compared to a compact microphone array.
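    The block-sparsity idea above can be sketched with a group-lasso proximal-gradient loop. Note this is an illustrative stand-in, not the block-sparse Bayesian learning framework the paper actually evaluates, and the dictionary, block layout, and dimensions below are invented toy data:

```python
import numpy as np

def block_soft_threshold(h, tau):
    """Shrink a whole block toward zero in l2 norm (the block-sparsity prior)."""
    norm = np.linalg.norm(h)
    return np.zeros_like(h) if norm <= tau else (1.0 - tau / norm) * h

def block_sparse_recover(A, y, blocks, lam=0.05, n_iter=200):
    """Recover a block-sparse x from y = A @ x by proximal gradient (group lasso)."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz constant of the gradient
    for _ in range(n_iter):
        x = x - step * A.T @ (A @ x - y)     # gradient step on the least-squares term
        for b in blocks:                     # proximal step, one block at a time
            x[b] = block_soft_threshold(x[b], step * lam)
    return x

# Toy demo: 2 active blocks out of 5
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))
blocks = [range(i, i + 4) for i in range(0, 20, 4)]
x_true = np.zeros(20)
x_true[0:4] = 1.0
x_true[8:12] = -1.0
x_hat = block_sparse_recover(A, A @ x_true, blocks)
```

The block-wise shrinkage is what distinguishes this from plain lasso: coefficients are kept or discarded in groups, matching the harmonic/block structure of speech spectra.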

    Studies on noise robust automatic speech recognition

    Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both classic and novel approaches proposed for noise-robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK.

    Sound Source Separation

    This is the author's accepted pre-print of the article, first published as: G. Evangelista, S. Marchand, M. D. Plumbley and E. Vincent. Sound source separation. In U. Zölzer (ed.), DAFX: Digital Audio Effects, 2nd edition, Chapter 14, pp. 551-588. John Wiley & Sons, March 2011. ISBN 9781119991298. DOI: 10.1002/9781119991298.ch14

    A computational framework for sound segregation in music signals

    Doctoral thesis. Electrical and Computer Engineering. Faculdade de Engenharia, Universidade do Porto. 200

    Audio source separation for music in low-latency and high-latency scenarios

    This thesis proposes specific methods to address the limitations of current music source separation methods in low-latency and high-latency scenarios. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch estimation and tracking tasks, crucial steps in many separation methods. We then use the proposed spectrum decomposition method in low-latency separation tasks targeting singing voice, bass and drums. Second, we propose several high-latency methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore using temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
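    What makes Tikhonov regularization attractive at low latency is its closed-form solution: for a fixed dictionary of spectral templates W and an observed magnitude spectrum v, the activations minimizing ||v − Wh||² + λ||h||² are h = (WᵀW + λI)⁻¹Wᵀv, so each frame costs one linear solve rather than an iterative optimization. A minimal numpy sketch, where the templates and mixture are invented toy data rather than anything from the thesis:

```python
import numpy as np

def tikhonov_decompose(W, v, lam=1e-3):
    """Closed-form activations h minimizing ||v - W h||^2 + lam * ||h||^2."""
    K = W.shape[1]
    return np.linalg.solve(W.T @ W + lam * np.eye(K), W.T @ v)

# Toy example: 3 spectral templates (columns), mixture of the first two
W = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [1.0, 1.0, 0.0]])
v = 2.0 * W[:, 0] + 1.0 * W[:, 1]   # template 0 at gain 2, template 1 at gain 1
h = tikhonov_decompose(W, v)
```

For small λ the recovered activations approach the true gains; λ trades fidelity against sensitivity to noise and ill-conditioning of the dictionary.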

    Exploiting primitive grouping constraints for noise robust automatic speech recognition : studies with simultaneous speech.

    Significant strides have been made in the field of automatic speech recognition over the past three decades. However, the systems are not robust; their performance degrades in the presence of even moderate amounts of noise. This thesis presents an approach to developing a speech recognition system that takes inspiration from human speech recognition.

    Adaptive time-frequency analysis for cognitive source separation

    This thesis introduces a framework for separating two speech sources in non-ideal, reverberant environments. The source separation architecture aims to mimic the extraordinary abilities of the human auditory system when performing source separation. A movable human dummy head residing in a normal office room is used to model the conditions humans experience when listening to complex auditory scenes. The thesis first investigates how the orthogonality of speech sources in the time-frequency domain decreases with increasing reverberation time, and shows that separation schemes based on ideal binary time-frequency masks remain suitable for source separation under such humanoid reverberant conditions. Prior to separating the sources, the movable dummy head analyzes the auditory scene and estimates the positions of the sources and their fundamental frequency tracks. Source localization is implemented using an iterative approach based on the interaural time differences between the two ears and achieves a localization blur of less than three degrees in the azimuth plane. The source separation architecture implemented in this thesis extracts the orthogonal time-frequency points of the speech mixtures, combining the positive features of the STFT with those of the cochleagram representation. The overall goal of the source separation is to find the ideal STFT mask. The core separation process, however, is based on the analysis of the corresponding region in an additionally computed cochleagram, which yields more reliable Interaural Time Difference (ITD) estimates for separation. Several algorithms based on the ITD and the fundamental frequency of the target source are evaluated for their source separation capabilities. To enhance the separation capabilities of the individual algorithms, their results are combined to compute a final estimate.
    In this way, SIR gains of approximately 30 dB are achieved for two-source scenarios; for three-source scenarios, SIR gains of up to 16 dB are attained. Compared to standard binaural signal processing approaches such as DUET and fixed beamforming, the presented approach achieves up to 29 dB of SIR gain.
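    The ideal binary mask at the heart of the abstract above can be sketched in a few lines, assuming oracle access to the clean source spectrograms. The 0 dB local criterion and the toy spectrogram values are assumptions for illustration, not parameters taken from the thesis:

```python
import numpy as np

def ideal_binary_mask(S_target, S_interf, lc_db=0.0):
    """IBM: 1 in each T-F cell where the target exceeds the interferer by lc_db."""
    eps = 1e-12  # avoid log of zero in silent cells
    snr_db = 20.0 * np.log10((np.abs(S_target) + eps) / (np.abs(S_interf) + eps))
    return (snr_db > lc_db).astype(float)

# Toy STFT magnitudes: 2 frequency bins x 3 frames
S_t = np.array([[1.0, 0.1, 3.0],
                [0.2, 2.0, 0.5]])
S_i = np.array([[0.5, 1.0, 1.0],
                [1.0, 0.5, 0.5]])
mask = ideal_binary_mask(S_t, S_i)
est = mask * (S_t + S_i)  # keep only mixture cells dominated by the target
```

The practical systems described above estimate such a mask from binaural cues (ITD, fundamental frequency) since the clean sources are not available at run time; the oracle mask serves as the upper bound they aim for.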

    A psychoacoustic engineering approach to machine sound source separation in reverberant environments

    Reverberation continues to present a major problem for sound source separation algorithms, due to its corruption of many of the acoustical cues on which these algorithms rely. However, humans demonstrate a remarkable robustness to reverberation, and many of the underlying psychophysical and perceptual mechanisms are well documented. This thesis therefore considers the research question: can the reverberation performance of existing psychoacoustic engineering approaches to machine source separation be improved? The precedence effect is a perceptual mechanism that aids our ability to localise sounds in reverberant environments. Despite this, relatively little work has been done on incorporating the precedence effect into automated sound source separation. Consequently, a study was conducted that compared several computational precedence models and their impact on the performance of a baseline separation algorithm. The algorithm included a precedence model, which was replaced with the other precedence models during the investigation. The models were tested using a novel metric in a range of reverberant rooms and with a range of other mixture parameters. The metric, termed Ideal Binary Mask Ratio, is shown to be robust to the effects of reverberation and facilitates meaningful and direct comparison between algorithms across different acoustic conditions. Large differences between the performances of the models were observed. The results showed that a separation algorithm incorporating a model based on interaural coherence produces the greatest performance gain over the baseline algorithm. The results from the study also indicated that it may be necessary to adapt the precedence model to the acoustic conditions in which it is utilised. This effect is analogous to the perceptual Clifton effect, a dynamic component of the precedence effect that appears to adapt precedence to a given acoustic environment in order to maximise its effectiveness.
    However, no work had been carried out on adapting a precedence model to the acoustic conditions under test. Specifically, although the necessity for such a component has been suggested in the literature, neither its necessity nor its benefit has been formally validated. Consequently, a further study was conducted in which the parameters of each of the previously compared precedence models were varied in each room, in order to identify whether, and to what extent, separation performance varied with these parameters. The results showed that the reverberation performance of existing psychoacoustic engineering approaches to machine source separation can indeed be improved, yielding significant gains in separation performance.