30 research outputs found

    Musical instrument source separation in unison and monaural mixtures

    Get PDF
    Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2014.Thesis (Master's) -- Bilkent University, 2014.Includes bibliographical references leaves 52-55.Musical Instrument Source Separation aims to separate the individual instruments from a mixture. We work on mixtures where there are two instruments, playing the same constant pitch at the same time. When musical instruments play the same note, they overlap with each other and act as a single source. We use statistical source separation algorithms which perform separation by maximizing the statistical independence between the sources. A mixture can be recorded with more than one microphone; this enables us to extract spatial information of the instruments by comparing the recordings. However we work with the monaural case where there is one microphone. Some musical instruments have amplitude modulation and this modulation can be seen in the mixtures. We also aim to detect amplitude modulations to support the source separation success. We use NTF (Non-negative tensor factorization) to perform the separation. NTF separates the mixture into many components. These components should be clustered in order to synthesize the individual sources. We use k-means as well as manual clustering by comparing the SDR (Signal to Distortion Ratio) values of the components with the original sources.Ercan, Melik BerkanM.S

    Robust speech recognition with spectrogram factorisation

    Get PDF
    Communication by speech is intrinsic for humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly distribution and storage of audio and video data has increased rapidly. However, despite being technically capable to record and process audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle for practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interferences, which regularly corrupt signals recorded in real-world environments. Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. Especially the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview is provided to the complete framework starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view. The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations

    Coupled dictionary training for exemplar-based speech enhancement

    Full text link

    An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony

    Get PDF
    In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the users signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short Time Fourier Transform domain. One of these bases is for the spectral energy of the acoustic echo signal, and is formed from the in- coming far-end user’s speech, while the other basis is for the spectral energy of the near-end speaker, and is trained with speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes for similar computational requirements. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC. Subsequently, we address Double-Talk Detection (DTD) for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the estimate of the echo signal that is produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on doubletalk. Using a standard evaluation technique, the proposed algorithm is shown to have comparable detection performance to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room change insensitivity of speaker extraction, with the proposed DTD algorithm generating minimal false doubletalk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum. Another focus of this thesis is the problem of inverting a single measurement of a non- minimum phase Room Impulse Response (RIR). We describe the process by which percep- tually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other approach modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique

    An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony

    Get PDF
    In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the users signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short Time Fourier Transform domain. One of these bases is for the spectral energy of the acoustic echo signal, and is formed from the in- coming far-end user’s speech, while the other basis is for the spectral energy of the near-end speaker, and is trained with speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes for similar computational requirements. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC. Subsequently, we address Double-Talk Detection (DTD) for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the estimate of the echo signal that is produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on doubletalk. Using a standard evaluation technique, the proposed algorithm is shown to have comparable detection performance to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room change insensitivity of speaker extraction, with the proposed DTD algorithm generating minimal false doubletalk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum. Another focus of this thesis is the problem of inverting a single measurement of a non- minimum phase Room Impulse Response (RIR). We describe the process by which percep- tually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other approach modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique

    Pitch-Informed Solo and Accompaniment Separation

    Get PDF
    Das Thema dieser Dissertation ist die Entwicklung eines Systems zur Tonhöhen-informierten Quellentrennung von Musiksignalen in Soloinstrument und Begleitung. Dieses ist geeignet, die dominanten Instrumente aus einem Musikstück zu isolieren, unabhängig von der Art des Instruments, der Begleitung und Stilrichtung. Dabei werden nur einstimmige Melodieinstrumente in Betracht gezogen. Die Musikaufnahmen liegen monaural vor, es kann also keine zusätzliche Information aus der Verteilung der Instrumente im Stereo-Panorama gewonnen werden. Die entwickelte Methode nutzt Tonhöhen-Information als Basis für eine sinusoidale Modellierung der spektralen Eigenschaften des Soloinstruments aus dem Musikmischsignal. Anstatt die spektralen Informationen pro Frame zu bestimmen, werden in der vorgeschlagenen Methode Tonobjekte für die Separation genutzt. Tonobjekt-basierte Verarbeitung ermöglicht es, zusätzlich die Notenanfänge zu verfeinern, transiente Artefakte zu reduzieren, gemeinsame Amplitudenmodulation (Common Amplitude Modulation CAM) einzubeziehen und besser nichtharmonische Elemente der Töne abzuschätzen. Der vorgestellte Algorithmus zur Quellentrennung von Soloinstrument und Begleitung ermöglicht eine Echtzeitverarbeitung und ist somit relevant für den praktischen Einsatz. Ein Experiment zur besseren Modellierung der Zusammenhänge zwischen Magnitude, Phase und Feinfrequenz von isolierten Instrumententönen wurde durchgeführt. Als Ergebnis konnte die Kontinuität der zeitlichen Einhüllenden, die Inharmonizität bestimmter Musikinstrumente und die Auswertung des Phasenfortschritts für die vorgestellte Methode ausgenutzt werden. Zusätzlich wurde ein Algorithmus für die Quellentrennung in perkussive und harmonische Signalanteile auf Basis des Phasenfortschritts entwickelt. Dieser erreicht ein verbesserte perzeptuelle Qualität der harmonischen und perkussiven Signale gegenüber vergleichbaren Methoden nach dem Stand der Technik. Die vorgestellte Methode zur Klangquellentrennung in Soloinstrument und Begleitung wurde zu den Evaluationskampagnen SiSEC 2011 und SiSEC 2013 eingereicht. Dort konnten vergleichbare Ergebnisse im Hinblick auf perzeptuelle Bewertungsmaße erzielt werden. Die Qualität eines Referenzalgorithmus im Hinblick auf den in dieser Dissertation beschriebenen Instrumentaldatensatz übertroffen werden. Als ein Anwendungsszenario für die Klangquellentrennung in Solo und Begleitung wurde ein Hörtest durchgeführt, der die Qualitätsanforderungen an Quellentrennung im Kontext von Musiklernsoftware bewerten sollte. Die Ergebnisse dieses Hörtests zeigen, dass die Solo- und Begleitspur gemäß unterschiedlicher Qualitätskriterien getrennt werden sollten. Die Musiklernsoftware Songs2See integriert die vorgestellte Klangquellentrennung bereits in einer kommerziell erhältlichen Anwendung.This thesis addresses the development of a system for pitch-informed solo and accompaniment separation capable of separating main instruments from music accompaniment regardless of the musical genre of the track, or type of music accompaniment. For the solo instrument, only pitched monophonic instruments were considered in a single-channel scenario where no panning or spatial location information is available. In the proposed method, pitch information is used as an initial stage of a sinusoidal modeling approach that attempts to estimate the spectral information of the solo instrument from a given audio mixture. Instead of estimating the solo instrument on a frame by frame basis, the proposed method gathers information of tone objects to perform separation. Tone-based processing allowed the inclusion of novel processing stages for attack refinement, transient interference reduction, common amplitude modulation (CAM) of tone objects, and for better estimation of non-harmonic elements that can occur in musical instrument tones. The proposed solo and accompaniment algorithm is an efficient method suitable for real-world applications. A study was conducted to better model magnitude, frequency, and phase of isolated musical instrument tones. As a result of this study, temporal envelope smoothness, inharmonicty of musical instruments, and phase expectation were exploited in the proposed separation method. Additionally, an algorithm for harmonic/percussive separation based on phase expectation was proposed. The algorithm shows improved perceptual quality with respect to state-of-the-art methods for harmonic/percussive separation. The proposed solo and accompaniment method obtained perceptual quality scores comparable to other state-of-the-art algorithms under the SiSEC 2011 and SiSEC 2013 campaigns, and outperformed the comparison algorithm on the instrumental dataset described in this thesis.As a use-case of solo and accompaniment separation, a listening test procedure was conducted to assess separation quality requirements in the context of music education. Results from the listening test showed that solo and accompaniment tracks should be optimized differently to suit quality requirements of music education. The Songs2See application was presented as commercial music learning software which includes the proposed solo and accompaniment separation method

    調波音打楽器音分離による歌声のスペクトルゆらぎに基づく音楽信号処理の研究

    Get PDF
    学位の種別:課程博士University of Tokyo(東京大学

    End-to-end non-negative auto-encoders: a deep neural alternative to non-negative audio modeling

    Get PDF
    Over the last decade, non-negative matrix factorization (NMF) has emerged as one of the most popular approaches to modeling audio signals. NMF allows us to factorize the magnitude spectrogram to learn representative spectral bases that can be used for a wide range of applications. With the recent advances in deep learning, neural networks (NNs) have surpassed NMF in terms of performance. However, these NNs are trained discriminatively and lack several key characteristics like re-usability and robustness, compared to NMF. In this dissertation, we develop and investigate the idea of end-to-end non-negative autoencoders (NAEs) as an updated deep learning based alternative framework to non-negative audio modeling. We show that end-to-end NAEs combine the modeling advantages of non-negative matrix factorization and the generalizability of neural networks while delivering significant improvements in performance. To this end, we first interpret NMF as a NAE and show that the two approaches are equivalent semantically and in terms of source separation performance. We exploit the availability of sophisticated neural network architectures to propose several extensions to NAEs. We also demonstrate that these modeling improvements significantly boost the performance of NAEs. In audio processing applications, the short-time fourier transform~(STFT) is used as a universal first step and we design algorithms and neural networks to operate on the magnitude spectrograms. We interpret the sequence of steps involved in computing the STFT as additional neural network layers. This enables us to propose end-to-end processing pipelines that operate directly on the raw waveforms. In the context of source separation, we show that end-to-end processing gives a significant improvement in performance compared to existing spectrogram based methods. Furthermore, to train these end-to-end models, we investigate the use of cost functions that are derived from objective evaluation metrics as measured on waveforms. We present subjective listening test results that reveal insights into the performance of these cost functions for end-to-end source separation. Combining the adaptive front-end layers with NAEs, we propose end-to-end NAEs and show how they can be used for end-to-end generative source separation. Our experiments indicate that these models deliver separation performance comparable to that of discriminative NNs, while retaining the modularity of NMF and the modeling flexibility of neural networks. Finally, we present an approach to train these end-to-end NAEs using mixtures only, without access to clean training examples
    corecore