26 research outputs found

    Multichannel Speech Separation and Enhancement Using the Convolutive Transfer Function

    This paper addresses the problem of speech separation and enhancement from multichannel convolutive and noisy mixtures, assuming known mixing filters. We propose to perform the speech separation and enhancement task in the short-time Fourier transform (STFT) domain, using the convolutive transfer function (CTF) approximation. Compared to time-domain filters, the CTF has far fewer taps; consequently, it yields fewer near-common zeros among channels and lower computational complexity. The work proposes three speech-source recovery methods: i) a multichannel inverse filtering method in which the multiple-input/output inverse theorem (MINT) is applied in the CTF domain; ii) for the multi-source case, a beamforming-like multichannel inverse filtering method that applies single-source MINT together with power minimization, suitable whenever the source CTFs are not all known; and iii) a constrained Lasso method, in which the sources are recovered by minimizing their ℓ1-norm to impose spectral sparsity, under the constraint that the ℓ2-norm fitting cost between the microphone signals and the mixing model involving the unknown source signals remains below a tolerance. Noise can be reduced by setting this tolerance according to the noise power. Experiments under various acoustic conditions are carried out to evaluate the three proposed methods, and comparisons among them as well as with baseline methods are presented.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
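
    As a rough illustration of method iii), the following sketch solves a real-valued toy instance of the constrained Lasso recovery with CVXPY: the ℓ1-norm of the source coefficients is minimized subject to an ℓ2-norm fitting constraint whose tolerance reflects the noise power. The dimensions, the random stand-in for the CTF mixing model and the tolerance setting are illustrative assumptions, not the paper's configuration (which works with complex STFT coefficients).

        import numpy as np
        import cvxpy as cp

        rng = np.random.default_rng(0)
        n_obs, n_coef = 120, 300                    # illustrative sizes (assumption)
        A = rng.standard_normal((n_obs, n_coef))    # stands in for the known CTF mixing model
        s_true = np.zeros(n_coef)
        idx = rng.choice(n_coef, 15, replace=False)
        s_true[idx] = rng.standard_normal(15)       # spectrally sparse source coefficients
        noise = 0.01 * rng.standard_normal(n_obs)
        x = A @ s_true + noise                      # microphone-side observations

        eps = 1.1 * np.linalg.norm(noise)           # tolerance set from the (assumed known) noise power
        s = cp.Variable(n_coef)
        problem = cp.Problem(cp.Minimize(cp.norm(s, 1)),
                             [cp.norm(A @ s - x, 2) <= eps])
        problem.solve()
        print("relative recovery error:",
              np.linalg.norm(s.value - s_true) / np.linalg.norm(s_true))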

    Blind MultiChannel Identification and Equalization for Dereverberation and Noise Reduction based on Convolutive Transfer Function

    This paper addresses the problems of blind channel identification and multichannel equalization for speech dereverberation and noise reduction. The time-domain cross-relation method is not suitable for blind room impulse response identification, due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the short-time Fourier transform (STFT) domain, in which the time-domain impulse responses are approximately represented by convolutive transfer functions (CTFs) with far fewer coefficients. The CTFs suffer from common zeros caused by the oversampled STFT. We therefore propose to identify the CTFs using oversampled signals together with critically sampled CTFs, which is a good compromise between the frequency aliasing of the signals and the common-zeros problem of the CTFs. In addition, a normalization of the CTFs is proposed to remove the gain ambiguity across sub-bands. In the STFT domain, the identified CTFs are used for multichannel equalization, in which the sparsity of speech signals is exploited. We propose to perform inverse filtering by minimizing the ℓ1-norm of the source signal, with the relaxed ℓ2-norm fitting error between the microphone signals and the convolution of the estimated source signal with the CTFs used as a constraint. This method is advantageous in that the noise can be reduced by relaxing the ℓ2-norm to a tolerance corresponding to the noise power, and this tolerance can be set automatically. The experiments confirm the efficiency of the proposed method even under conditions with high reverberation levels and intense noise.
Comment: 13 pages, 5 figures, 5 tables
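
    The cross-relation idea that the paper builds on can be sketched in a few lines. The toy example below identifies two channel impulse responses from a noiseless two-channel convolutive mixture in the time domain (the paper applies the same principle to CTFs in the STFT domain); the filter length is assumed known and the signals are synthetic.

        import numpy as np
        from scipy.linalg import convolution_matrix

        rng = np.random.default_rng(0)
        L = 8                                   # filter length (assumed known)
        s = rng.standard_normal(400)            # unknown source signal
        h1 = rng.standard_normal(L)             # channel 1 impulse response
        h2 = rng.standard_normal(L)             # channel 2 impulse response
        x1 = np.convolve(s, h1)                 # microphone 1 signal
        x2 = np.convolve(s, h2)                 # microphone 2 signal

        # Cross-relation: x1 * h2 - x2 * h1 = 0, so [X1, -X2] @ [h2; h1] = 0,
        # i.e. the stacked filters lie in the null space of A.
        A = np.hstack([convolution_matrix(x1, L), -convolution_matrix(x2, L)])
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        h_est = Vt[-1]                          # right singular vector of the smallest singular value
        h2_est, h1_est = h_est[:L], h_est[L:]

        # The channels are only identified up to a common scale factor.
        scale = (h1 @ h1_est) / (h1_est @ h1_est)
        print(np.allclose(h1, scale * h1_est), np.allclose(h2, scale * h2_est))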

    An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony

    In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the user's signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short Time Fourier Transform (STFT) domain. One of these bases covers the spectral energy of the acoustic echo signal and is formed from the incoming far-end user's speech, while the other covers the spectral energy of the near-end speaker and is trained on speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes, at similar computational requirements. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC.

Subsequently, we address DTD for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the estimate of the echo signal produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on double-talk. Using a standard evaluation technique, the proposed algorithm is shown to have detection performance comparable to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room-change insensitivity of speaker extraction, generating minimal false double-talk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum.

Another focus of this thesis is the problem of inverting a single measurement of a non-minimum phase Room Impulse Response (RIR). We describe the process by which perceptually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique.
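
    The core decomposition step of the speaker extraction approach can be sketched as follows: the magnitude STFT of the near-end microphone signal is factorized over a fixed basis formed by concatenating an echo basis and a pre-trained near-end speaker basis, and only the activations are updated. The random matrices, dimensions, Euclidean-cost multiplicative updates and mask-based reconstruction below are illustrative assumptions rather than the thesis's exact configuration.

        import numpy as np

        rng = np.random.default_rng(0)
        n_freq, n_frames, k_echo, k_near = 257, 100, 20, 20
        W_echo = rng.random((n_freq, k_echo))       # would be built from the far-end speech
        W_near = rng.random((n_freq, k_near))       # would be trained on speech data a priori
        W = np.hstack([W_echo, W_near])             # fixed, concatenated nonnegative basis
        V = rng.random((n_freq, n_frames)) + 1e-6   # magnitude STFT of the near-end mic signal

        H = rng.random((W.shape[1], n_frames))      # nonnegative activations, to be estimated
        for _ in range(200):                        # multiplicative updates with W held fixed
            H *= (W.T @ V) / (W.T @ (W @ H) + 1e-12)

        V_near = W_near @ H[k_echo:]                # estimated near-end speaker spectrogram
        V_echo = W_echo @ H[:k_echo]                # estimated acoustic echo spectrogram
        mask = V_near / (V_near + V_echo + 1e-12)   # soft mask applied to the mixture magnitude
        near_end_estimate = mask * V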

    Efficient Acquisition and Denoising of Full-Range Event-Related Potentials Following Transient Stimulation of the Auditory Pathway

    This body of work relates to recent advances in the field of human auditory event-related potentials (ERP), specifically fast, deconvolution-based ERP acquisition as well as single-response-based preprocessing, denoising and subsequent analysis methods. Its goal is the contribution of a cohesive set of methods facilitating the fast, reliable acquisition of the whole electrophysiological response generated by the auditory pathway, from the brainstem to the cortex, following transient acoustical stimulation. The present manuscript is divided into three sequential areas of investigation: First, the general feasibility of simultaneously acquiring auditory brainstem, middle-latency and late ERP single responses is demonstrated using recordings from 15 normal-hearing subjects. Favourable acquisition parameters (i.e., sampling rate, bandpass filter settings and interstimulus intervals) are established, followed by signal analysis of the resulting ERP in terms of their dominant intrinsic scales to determine the properties of an optimal signal representation with maximally reduced sample count by means of nonlinear resampling on a logarithmic timebase. This way, a compression ratio of 16.59 is achieved. Time-scale analysis of the linear-time and logarithmic-time ERP single responses is employed to demonstrate that no important information is lost during compressive resampling, which is additionally supported by a comparative evaluation of the resulting average waveforms: all prominent waves remain visible, with their characteristic latencies and amplitudes remaining essentially unaffected by the resampling process. The linear-time and resampled logarithmic-time signal representations are comparatively investigated regarding their susceptibility to the types of physiological and technical noise frequently contaminating ERP recordings.

While in principle there already exists a plethora of well-investigated approaches to the denoising of ERP single-response representations, aiming to improve signal quality and/or reduce the necessary acquisition times, the substantially altered noise characteristics of the resampled logarithmic-time single-response representations, as opposed to their linear-time equivalents, necessitate a reevaluation of the available methods on this type of data. Additionally, two novel, efficient denoising algorithms based on transform-coefficient manipulation in the sinogram domain and on an analytic, discrete wavelet filterbank are proposed and subjected to a comparative performance evaluation together with two established denoising methods. To facilitate a thorough comparison, the real-world ERP dataset obtained in the first part of this work is employed alongside synthetic data generated using a phenomenological ERP model evaluated at different signal-to-noise ratios (SNR), with individual gains in multiple outcome metrics being used to objectively assess algorithm performance. Results suggest that the proposed denoising algorithms substantially outperform the state-of-the-art methods in terms of the employed outcome metrics as well as their respective processing times. Furthermore, an efficient stimulus sequence optimization method for use with deconvolution-based ERP acquisition methods is introduced, which achieves consistent noise attenuation within a broad designated frequency range.
A novel stimulus presentation paradigm for the fast, interleaved acquisition of auditory brainstem, middle-latency and late responses, featuring alternating periods of optimized, high-rate deconvolution sequences and subsequent low-rate stimulation, is proposed and investigated in 20 normal-hearing subjects. Deconvolved sequence responses containing early and middle-latency ERP components are fused with the subsequent late responses using a time-frequency-resolved weighted averaging method based on cross-trial regularity, yielding a uniform SNR of the full-range auditory ERP across the investigated timescales. The obtained average ERP waveforms exhibit morphologies consistent with both literature values and the reference recordings obtained in the first part of this manuscript, with all prominent waves being visible in the grand average waveforms. The novel stimulation approach cuts acquisition time by a factor of 3.4 while at the same time yielding a substantial gain in the SNR of the obtained ERP data. Results suggest that the proposed interleaved stimulus presentation and associated postprocessing methodology are suitable for the fast, reliable extraction of full-range neural correlates of auditory processing in future studies.
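
    The compressive representation described above, i.e., nonlinear resampling of a single response onto a logarithmic timebase, can be illustrated with a minimal sketch; the sampling rate, time span, interpolation scheme and number of log-spaced samples below are assumptions chosen only to show the sample-count reduction.

        import numpy as np

        fs = 16000.0                                 # assumed sampling rate
        t_lin = np.arange(1, int(0.5 * fs)) / fs     # linear time axis up to ~500 ms post-stimulus
        rng = np.random.default_rng(0)
        x = rng.standard_normal(t_lin.size)          # stands in for one ERP single response

        n_log = t_lin.size // 16                     # roughly the compression ratio reported above
        t_log = np.geomspace(t_lin[0], t_lin[-1], n_log)   # logarithmically spaced time points
        x_log = np.interp(t_log, t_lin, x)           # log-time representation of the response

        print(t_lin.size, "->", x_log.size)          # reduced sample count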

    Replay detection in voice biometrics: an investigation of adaptive and non-adaptive front-ends

    Among various physiological and behavioural traits, speech has gained popularity as an effective mode of biometric authentication. Despite this popularity, automatic speaker verification systems remain vulnerable to malicious attacks known as spoofing attacks. Among the various types of spoofing attacks, replay attacks pose the biggest threat due to their simplicity and effectiveness. This thesis investigates the importance of 1) improving front-end feature extraction via novel feature extraction techniques and 2) enhancing spectral components via adaptive front-end frameworks to improve replay attack detection.

This thesis initially focuses on AM-FM modelling techniques and their use in replay attack detection. A novel method to extract the sub-band frequency modulation (FM) component using the spectral centroid of a signal is proposed, and its use as a potential acoustic feature is also discussed. Frequency Domain Linear Prediction (FDLP) is explored as a method to obtain the temporal envelope of a speech signal, which carries the amplitude modulation (AM) information of the speech resonances. Several features are extracted from the temporal envelope and the FDLP residual signal. These features are then evaluated for replay attack detection and shown to have significant capability in discriminating genuine and spoofed signals. Fusion of the AM- and FM-based features shows that AM and FM carry complementary information that helps distinguish replayed signals from genuine ones. The importance of frequency band allocation when creating filter banks is also studied to further advance the understanding of front-ends for replay attack detection.

Mechanisms inspired by the human auditory system, which make the human ear an excellent spectrum analyser, are then investigated and integrated into front-ends. Spatial differentiation, a mechanism that provides additional sharpening of the auditory filters, is used in this work to improve the selectivity of the sub-band decomposition filters. Two features are extracted using the improved filter bank front-end: spectral envelope centroid magnitude (SECM) and spectral envelope centroid frequency (SECF). These are used to establish the positive effect of spatial differentiation on discriminating spoofed signals. Level-dependent filter tuning, which allows the ear to handle a large dynamic range, is integrated into the filter bank to further improve the front-end. This mechanism converts the filter bank into an adaptive one, in which the selectivity of the filters is varied based on the input signal energy. Experimental results show that this leads to improved spoofing detection performance.

Finally, deep neural network (DNN) mechanisms are integrated into sub-band feature extraction to develop an adaptive front-end that adjusts its characteristics based on the sub-band signals. A DNN-based controller, which takes sub-band FM components as input, is developed to adaptively control the selectivity and sensitivity of a parallel filter bank so as to enhance the artifacts that differentiate a replayed signal from a genuine one. This work illustrates gradient-based optimization of the DNN-based controller using feedback from a spoofing detection back-end classifier, thus training it to reduce the spoofing detection error. The proposed framework displays a superior ability to identify high-quality replayed signals compared to conventional non-adaptive frameworks.
All techniques proposed in this thesis are evaluated on well-established replay attack detection databases and compared with state-of-the-art baseline systems.
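
    A minimal sketch of the spectral-centroid idea behind the FM-related front-end features: the per-frame spectral centroid is computed inside a few coarse sub-bands of an STFT, yielding centroid trajectories that track frequency modulation. The test signal, frame settings and sub-band edges are assumptions, not the thesis's actual filter bank.

        import numpy as np
        from scipy.signal import stft

        fs = 16000
        t = np.arange(fs) / fs
        # Tone whose instantaneous frequency sweeps 440 +/- 40 Hz at a 3 Hz rate
        x = np.sin(2 * np.pi * 440 * t + (40 / 3) * np.sin(2 * np.pi * 3 * t))

        f, _, X = stft(x, fs=fs, nperseg=512, noverlap=384)
        mag = np.abs(X)

        edges = [0, 64, 128, 192, 257]               # STFT-bin edges of the coarse sub-bands (assumption)
        centroids = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            band = mag[lo:hi]
            centroids.append((f[lo:hi, None] * band).sum(0) / (band.sum(0) + 1e-12))
        centroids = np.array(centroids)              # (n_subbands, n_frames) centroid trajectories
        print(centroids.shape)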

    Rethinking Wireless: Building Next-Generation Networks

    We face a growing challenge in the design, deployment and management of wireless networks, stemming largely from the need to operate in an increasingly spectrum-sparse environment, the need for greater concurrency among devices, and the need for greater coordination between heterogeneous wireless protocols. Unfortunately, our current wireless networks lack interoperability, are deployed with fixed functions, and omit easy programmability and extensibility from their key design requirements. In this dissertation, we study the design of next-generation wireless networks and analyze the individual components required to build such an infrastructure. Redesigning a wireless architecture must be undertaken carefully, balancing new coordinated multipoint (CoMP) techniques against the backward compatibility necessary to support the large number of existing devices. These next-generation wireless networks will be predominantly software-defined and will have three components: (a) a wireless component consisting of software-defined radio resource units (RRUs) or access points (APs); (b) a software-defined backhaul control plane that manages the transfer of RF data between the RRUs and the centralized processing resource; and (c) a centralized datacenter/cloud compute resource that processes the RF signal data from all attached RRUs. The dissertation addresses the following key problems in next-generation networks: (1) Making Existing Wireless Devices Spectrum-Agile, (2) Cooperative Compression of the Wireless Backhaul, and (3) Spectrum Coordination.
PhD thesis, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/102341/1/zontar_1.pd

    Speech Enhancement with Improved Deep Learning Methods

    In real-world environments, speech signals are often corrupted by ambient noise during acquisition, degrading the quality and intelligibility of the speech for a listener. As one of the central topics in speech processing, speech enhancement aims to recover clean speech from such a noisy mixture. Many traditional speech enhancement methods based on statistical signal processing have been proposed and widely used in the past. However, the performance of these methods was limited, and they failed in sophisticated acoustic scenarios. Over the last decade, deep learning, as a primary tool for developing data-driven information systems, has led to revolutionary advances in speech enhancement. In this context, speech enhancement is treated as a supervised learning problem, which does not suffer from the issues faced by traditional methods. This supervised learning problem has three main components: input features, learning machine, and training target. In this thesis, various deep learning architectures and methods are developed to deal with the current limitations of these three components.

First, we propose a serial hybrid neural network model integrating a new low-complexity fully convolutional neural network (CNN) and a long short-term memory (LSTM) network to estimate a phase-sensitive mask for speech enhancement. Instead of using traditional acoustic features as the input of the model, a CNN is employed to automatically extract sophisticated speech features that can maximize the performance of the model. An LSTM network is then chosen as the learning machine to model the strong temporal dynamics of speech. The model is designed to take full advantage of the temporal dependencies and spectral correlations present in the input speech signal while keeping the model complexity low. In addition, an attention technique is embedded to adaptively recalibrate the useful CNN-extracted features. Through extensive comparative experiments, we show that the proposed model significantly outperforms some known neural-network-based speech enhancement methods in the presence of highly non-stationary noise, while exhibiting a relatively small number of model parameters compared to some commonly employed DNN-based methods.

Most of the available approaches to speech enhancement using deep neural networks face a number of limitations: they do not exploit the information contained in the phase spectrum, while their high computational complexity and memory requirements make them unsuited for real-time applications. Hence, a new phase-aware composite deep neural network (PACDNN) is proposed to address these challenges. Specifically, magnitude processing with a spectral mask and phase reconstruction using the phase derivative are proposed as key subtasks of the new network, to simultaneously enhance the magnitude and phase spectra. Moreover, the network is carefully designed to take advantage of the strong temporal and spectral dependencies of speech, while its components operate independently and in parallel to speed up the computation. The advantages of the proposed PACDNN model over some well-known DNN-based speech enhancement methods are demonstrated through extensive comparative experiments.
Considering that some acoustic scenarios could be better handled by a number of low-complexity sub-DNNs, each specifically designed to perform a particular task, we propose another very low-complexity, fully convolutional framework that performs speech enhancement in the short-time modified discrete cosine transform (STMDCT) domain. This framework is made up of two main stages: classification and mapping. In the former stage, a CNN-based network is proposed to classify the input speech based on its utterance-level attributes, i.e., signal-to-noise ratio and gender. In the latter stage, four well-trained CNNs, each specialized for a different, simple task, map the STMDCT of the noisy input speech to that of the clean speech. Since this framework operates in the STMDCT domain, there is no need to deal with the phase information, i.e., no phase-related computation is required. Moreover, the training target length is only one half of those in the previous chapters, leading to lower computational complexity and lower demands on the mapping CNNs. Although there are multiple branches in the model, only one of the expert CNNs is active at a time, so the computational burden at any moment is that of a single branch. Also, the mapping CNNs are fully convolutional and their computations are performed in parallel, thus reducing the computational time. Moreover, the proposed framework reduces the latency by 55% compared to the models in the previous chapters. Through extensive experimental studies, it is shown that the MBSE framework not only gives superior speech enhancement performance but also has lower complexity compared to some existing deep-learning-based methods.
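
    The training target of the first model, a phase-sensitive mask, can be written down compactly: it scales the magnitude ratio |S|/|Y| by the cosine of the phase difference between the clean and noisy STFTs. The sketch below uses synthetic signals; the signals, frame settings and clipping to [0, 1] are assumptions for illustration only.

        import numpy as np
        from scipy.signal import stft, istft

        fs = 16000
        rng = np.random.default_rng(0)
        s = rng.standard_normal(fs)                  # stands in for one second of clean speech
        y = s + 0.3 * rng.standard_normal(fs)        # noisy mixture

        _, _, S = stft(s, fs=fs, nperseg=512)
        _, _, Y = stft(y, fs=fs, nperseg=512)

        # Phase-sensitive mask: |S| / |Y| * cos(angle(S) - angle(Y))
        psm = np.abs(S) / (np.abs(Y) + 1e-12) * np.cos(np.angle(S) - np.angle(Y))
        psm = np.clip(psm, 0.0, 1.0)                 # common practice during training (assumption)

        S_hat = psm * Y                              # mask applied to the noisy STFT
        _, s_hat = istft(S_hat, fs=fs, nperseg=512)
        print(s_hat.shape)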

    Acoustic anomaly detection using robust statistical energy processing

    An anomaly is a specific event that violates a process observer's expectations about the process under observation. In this work, the problem of spatially locating an acoustic anomaly is addressed. Once the problem is reduced to one of robust statistics, an automated observer is designed to detect when high-energy sources are introduced into an acoustic scene. Accounting for potential energy from signal amplitude and kinetic energy from signal frequency in wavelet-filtered sub-bands, an outlier-robust statistical characterization scheme is developed using the Teager energy operator. Given a statistical expectation of the energy content in the sub-bands, a methodology is designed to detect signal energies that violate this expectation. These minor anomalies provide some indication that a fundamental change in energy has occurred in the sub-band. By examining how the signal is changing across all sub-bands, a detector is designed that is able to determine when a fundamental change occurs in the sub-band signal trends. Minor anomalies occurring during such changes are labeled as major anomalies. Using established localization methods, position estimates are obtained for the major anomalies in each sub-band. Accounting for the possibility of a source with spatiotemporal properties, the median of the sub-band position estimates provides the final spatial information about the source.
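
    The Teager energy operator named above has a simple discrete form, Psi[x](n) = x(n)^2 - x(n-1) x(n+1), which for a pure tone A*cos(omega*n) equals A^2*sin(omega)^2 and thus reflects both amplitude and frequency. A minimal sketch (the tone parameters are arbitrary):

        import numpy as np

        def teager_energy(x):
            """Discrete Teager-Kaiser energy of a 1-D signal (valid samples only)."""
            x = np.asarray(x, dtype=float)
            return x[1:-1] ** 2 - x[:-2] * x[2:]

        fs, f0, A = 8000, 200, 0.5
        n = np.arange(4000)
        tone = A * np.cos(2 * np.pi * f0 / fs * n)

        # For a pure tone, the operator is constant and equals A^2 * sin(omega)^2.
        print(teager_energy(tone).mean(), (A * np.sin(2 * np.pi * f0 / fs)) ** 2)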

    Trennung und SchĂ€tzung der Anzahl von Audiosignalquellen mit Zeit- und FrequenzĂŒberlappung (Separation and Estimation of the Number of Audio Signal Sources with Time and Frequency Overlap)

    Everyday audio recordings involve mixture signals: music contains a mixture of instruments; in a meeting or conference, there is a mixture of human voices. For these mixtures, automatically separating or estimating the number of sources is a challenging task. A common assumption when processing mixtures in the time-frequency domain is that sources are not fully overlapped. However, in this work we consider some cases where the overlap is severe, for instance when instruments play the same note (unison) or when many people speak concurrently ("cocktail party"), highlighting the need for new representations and more powerful models. To address the problems of source separation and count estimation, we use conventional signal processing techniques as well as deep neural networks (DNN). We first address the source separation problem for unison instrument mixtures, studying the distinct spectro-temporal modulations caused by vibrato. To exploit these modulations, we developed a method based on time warping, informed by an estimate of the fundamental frequency. For cases where such estimates are not available, we present an unsupervised model, inspired by the way humans group time-varying sources (common fate). This contribution comes with a novel representation that improves separation for overlapped and modulated sources on unison mixtures, but also improves vocal and accompaniment separation when used as an input for a DNN model. We then focus on estimating the number of sources in a mixture, which is important for real-world scenarios. Our work on count estimation was motivated by a study on how humans address this task, which led us to conduct listening experiments confirming that humans are only able to correctly estimate the number of up to four sources. To answer the question of whether machines can perform similarly, we present a DNN architecture trained to estimate the number of concurrent speakers. Our results show improvements compared to other methods, and the model even outperformed humans on the same task. In both the source separation and source count estimation tasks, the key contribution of this thesis is the concept of "modulation", which is important for computationally mimicking human performance. Our proposed Common Fate Transform is an adequate representation to disentangle overlapping signals for separation, and an inspection of our DNN count estimation model revealed that it proceeds to find modulation-like intermediate features.
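
    A rough way to see the modulation idea in code is a simplified modulation spectrogram: a second Fourier analysis across the time frames of the magnitude spectrogram, so that components sharing a common temporal modulation stand out. This is only in the spirit of the Common Fate Transform described above, not its exact definition; the signal and all parameters are assumptions.

        import numpy as np
        from scipy.signal import stft

        fs = 16000
        t = np.arange(2 * fs) / fs
        x = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 5 * t))   # tone with 5 Hz amplitude modulation

        _, _, X = stft(x, fs=fs, nperseg=1024, noverlap=768)
        mag = np.abs(X)                                    # (n_freq, n_frames) magnitude spectrogram

        # Modulation analysis: FFT of each frequency bin's magnitude envelope over time
        mod = np.abs(np.fft.rfft(mag - mag.mean(axis=1, keepdims=True), axis=1))
        frame_rate = fs / (1024 - 768)                     # STFT frames per second
        mod_freqs = np.fft.rfftfreq(mag.shape[1], d=1 / frame_rate)
        print(mod_freqs[mod.sum(axis=0).argmax()])         # dominant modulation rate, close to 5 Hz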