Reverberation: models, estimation and application
Reverberation models are required in many applications such as acoustic measurements,
speech dereverberation and robust automatic speech recognition. The aim of this thesis is to
investigate different models and propose a perceptually-relevant reverberation model with suitable
parameter estimation techniques for different applications.
Reverberation can be modelled in both the time and frequency domains. The model parameters
give direct information about both physical and perceptual characteristics. These characteristics
create a multidimensional parameter space of reverberation, which can to a large extent be captured
by a time-frequency domain model. In this thesis, the relationship between physical and perceptual
model parameters is discussed. In the first application, an intrusive technique is proposed to
measure reverberance, the perception of reverberation, and colouration. The
room decay rate parameter is of particular interest.
In practical applications, a blind estimate of the decay rate of acoustic energy in a room
is required. A statistical model for the distribution of the decay rate of the reverberant signal
named the eagleMax distribution is proposed. The eagleMax distribution describes the reverberant
speech decay rates as a random variable that is the maximum of the room decay rates and anechoic
speech decay rates. Three methods were developed to estimate the mean room decay rate from
the eagleMax distributions alone. The estimated room decay rates form a reverberation model that
will be discussed in the context of room acoustic measurements, speech dereverberation and robust
automatic speech recognition individually.
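The maximum-of-two-rates idea behind the eagleMax model can be illustrated with a small Monte-Carlo sketch. The decay-rate distributions and their parameter values below are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decay-rate distributions in dB/s (illustrative values only):
# the room decay rate varies little, anechoic speech decay rates vary widely.
room_decay = rng.normal(loc=-60.0, scale=5.0, size=100_000)
speech_decay = rng.normal(loc=-90.0, scale=25.0, size=100_000)

# Core of the eagleMax model: the observed reverberant-speech decay rate
# is the maximum (i.e. the slowest decay) of the two underlying rates.
reverberant_decay = np.maximum(room_decay, speech_decay)

# The sample mean of the maxima is biased above the room decay rate,
# which is why dedicated estimators are needed to recover it blindly.
print(reverberant_decay.mean())
```

Because the maximum is never below the room decay rate, simply averaging the observed decay rates overestimates it; this is the bias the three proposed estimators are designed to remove.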
System Identification with Applications in Speech Enhancement
With the increasing popularity of hands-free telephony on portable mobile devices
and the rapid development of voice over internet protocol, the identification of acoustic
systems has become desirable for compensating for distortions introduced to speech signals
during transmission, and hence for enhancing speech quality. The objective of this research
is to develop system identification algorithms for speech enhancement applications,
including network echo cancellation and speech dereverberation.
A supervised adaptive algorithm for sparse system identification is developed for
network echo cancellation. Based on the framework of selective-tap updating scheme
on the normalized least mean squares algorithm, the MMax and sparse partial update
tap-selection strategies are exploited in the frequency domain to achieve fast convergence
performance with low computational complexity. After demonstrating how
the sparseness of the network impulse response varies in the transformed domain, the
multidelay filtering structure is incorporated to reduce the algorithmic delay.
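A time-domain sketch of the MMax partial-update idea is shown below; the thesis applies the tap selection within a frequency-domain, multidelay structure, so this simplified NLMS version only illustrates the selection step itself:

```python
import numpy as np

def mmax_nlms(x, d, L=64, M=16, mu=0.5, eps=1e-8):
    """NLMS where only the M taps paired with the largest-magnitude
    regressor entries are updated each iteration (MMax selection)."""
    w = np.zeros(L)
    e = np.zeros(len(x))
    for n in range(L, len(x)):
        xv = x[n - L + 1:n + 1][::-1]                # regressor [x[n] ... x[n-L+1]]
        e[n] = d[n] - w @ xv                         # a priori error
        idx = np.argpartition(np.abs(xv), -M)[-M:]   # M largest |x| entries
        w[idx] += mu * e[n] * xv[idx] / (xv @ xv + eps)
    return w, e

# Identify a sparse "echo path": mostly zero, with a few active taps.
rng = np.random.default_rng(1)
h = np.zeros(64)
h[[5, 9, 20]] = [0.7, -0.4, 0.2]
x = rng.standard_normal(20_000)
d = np.convolve(x, h)[:len(x)]
w, e = mmax_nlms(x, d)
```

Updating only a quarter of the taps per iteration cuts the update cost while, for sparse responses like network echo paths, sacrificing little convergence speed.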
Blind identification of SIMO acoustic systems for speech dereverberation in the
presence of common zeros is then investigated. First, the problem of common zeros is
defined and extended to include the presence of near-common zeros. Two clustering algorithms
are developed to quantify the number of these zeros so as to facilitate the study
of their effect on blind system identification and speech dereverberation. To mitigate this
effect, two algorithms are developed: a two-stage algorithm based on channel
decomposition identifies common and non-common zeros sequentially, while the forced
spectral diversity approach combines spectral shaping filters and channel undermodelling
to derive a modified system that leads to improved dereverberation performance.
Additionally, a solution to the scale factor ambiguity problem in subband-based blind system identification is developed, which motivates further research on subband-based
dereverberation techniques. Comprehensive simulations and discussions demonstrate
the effectiveness of the aforementioned algorithms. A discussion of possible directions
for prospective research on system identification techniques concludes this thesis.
Speech Dereverberation Based on Multi-Channel Linear Prediction
Room reverberation can severely degrade the auditory quality and intelligibility of the speech signals received by distant microphones in an enclosed environment. In recent years, various dereverberation algorithms have been developed to tackle this problem, such as beamforming and inverse filtering of the room transfer function. However, these methods rely heavily on precise estimation of either the direction of arrival (DOA) or the room acoustic characteristics, so their performance is very much limited. A more promising category of dereverberation algorithms has been developed based on the multi-channel linear predictor (MCLP). This idea was first proposed in the time domain, where the speech signal is highly correlated over short periods of time. To ensure good suppression of the reverberation, the prediction filter is required to be longer than the reverberation time. As a result, the complexity of this algorithm is often unacceptable because of the large covariance matrix calculations involved. To overcome this disadvantage, this thesis focuses on MCLP dereverberation methods performed in the short-time Fourier transform (STFT) domain.
Recently, the weighted prediction error (WPE) algorithm has been developed and widely applied to speech dereverberation. In the WPE algorithm, MCLP is used in the STFT domain to estimate the late reverberation components from previous frames of the reverberant speech. The enhanced speech is obtained by subtracting the late reverberation from the reverberant speech. Each STFT coefficient is assumed to be independent and to obey a Gaussian distribution. A maximum likelihood (ML) problem is formulated in each frequency bin to calculate the predictor coefficients. In this thesis, the original WPE algorithm is improved in two aspects. First, two advanced statistical models, the generalized Gaussian distribution (GGD) and the Laplacian distribution, are employed instead of the classic Gaussian distribution. Both are shown to model the histogram of clean speech better. Second, we focus on improving the estimation of the variances of the STFT coefficients of the desired signal. In the original WPE algorithm, the variances are estimated in each frequency bin independently, without considering the cross-frequency correlation. Thus, we integrate nonnegative matrix factorization (NMF) into the WPE algorithm to refine the estimation of the variances and hence obtain a better dereverberation performance.
Another category of MCLP-based dereverberation algorithms has been proposed in the literature by exploiting the sparsity of the STFT coefficients of the desired signal for calculating the predictor coefficients. In this thesis, we also investigate an efficient algorithm based on maximizing the group sparsity of the desired signal using mixed norms. Inspired by the idea of the sparse linear predictor (SLP), we propose to include a sparsity constraint on the predictor coefficients in order to further improve the dereverberation performance. A weighting parameter is also introduced to achieve a trade-off between the sparsity of the desired signal and that of the predictor coefficients.
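The group-sparsity objective with the proposed predictor penalty can be written down in a few lines. This is a schematic sketch: the exact mixed norms and the handling of the weighting parameter in the thesis may differ.

```python
import numpy as np

def mixed_l21(D):
    """l2,1 mixed norm of an STFT matrix D (frequencies x frames):
    l2 across frequencies within each frame, l1 across frames.
    Smaller values indicate stronger frame-wise group sparsity."""
    return np.sum(np.sqrt(np.sum(np.abs(D) ** 2, axis=0)))

def sparse_mclp_objective(D, g, weight=0.1):
    """Hypothetical combined cost: group sparsity of the desired signal
    plus a weighted l1 penalty on the predictor coefficients g, with
    `weight` realising the trade-off described above."""
    return mixed_l21(D) + weight * np.sum(np.abs(g))
```

Minimising such a cost over the predictor coefficients favours desired-signal estimates whose energy concentrates in few frames while keeping the predictor itself sparse.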
Computer simulations of the proposed dereverberation algorithms are conducted. Our experimental results show that the proposed algorithms can significantly improve the quality of reverberant speech signals under different reverberation times. Subjective evaluation also gives a more intuitive demonstration of the enhanced speech intelligibility. Performance comparisons show that our algorithms outperform some of the state-of-the-art dereverberation techniques.
Spatial dissection of a soundfield using spherical harmonic decomposition
A real-world soundfield often comprises multiple desired and undesired sound sources. The performance of many acoustic systems, such as automatic speech recognition, audio surveillance, and teleconferencing, relies on their ability to extract the desired sound components in such a mixed environment. The existing solutions to this problem are constrained by various fundamental limitations and require different priors to be enforced depending on the acoustic conditions, such as reverberation and the spatial distribution of sound sources. With the growing emphasis on and integration of audio applications in diverse technologies such as smart home and virtual reality appliances, it is imperative to advance source separation technology in order to overcome the limitations of the traditional approaches.
To that end, we exploit the harmonic decomposition model to dissect a mixed soundfield into its underlying desired and undesired components based on source and signal characteristics. By analysing the spatial projection of a soundfield, we achieve multiple outcomes such as (i) soundfield separation with respect to distinct source regions, (ii) source separation in a mixed soundfield using modal coherence model, and (iii) direction of arrival (DOA) estimation of multiple overlapping sound sources through pattern recognition of the modal coherence of a soundfield.
We first employ an array of higher order microphones for soundfield separation in order to reduce hardware requirements and implementation complexity. Subsequently, we develop novel mathematical models for the modal coherence of noisy and reverberant soundfields that facilitate convenient ways of estimating DOA and power spectral densities, leading to robust source separation algorithms. The modal domain approach to soundfield/source separation allows us to circumvent several practical limitations of the existing techniques and enhance the performance and robustness of the system. The proposed methods are presented with several practical applications and performance evaluations using simulated and real-life datasets.
Deep neural networks for monaural source separation
In monaural source separation (MSS) only one recording is available and the
spatial information, generally, cannot be extracted. It is also an underdetermined inverse problem. Recently, the development of the deep neural network
(DNN) provides the framework to address this problem. How to select the
types of neural network models and training targets is the research question.
Moreover, in real room environments, the reverberation from the floor, walls,
ceiling and furniture in a room is challenging, as it distorts the received
mixture and degrades the separation performance. In many real-world applications, due to hardware size constraints, multiple microphones cannot
always be deployed. Hence, deep learning based MSS is the focus of this
thesis.
The first contribution is to improve separation performance by enhancing
the generalization ability of deep learning-based MSS methods. According
to the no free lunch (NFL) theorem, it is impossible to find a neural
network model which can estimate the training target perfectly in all
cases. From the acquired speech mixture, the information of the clean
speech signal could be over- or underestimated. Besides, a discriminative
objective function can be used to address the ambiguous information
problem in the training stage of deep learning. Based on this, an
adaptive discriminative criterion is proposed and better separation
performance is obtained. In addition, an alternative method uses
sequentially trained neural network models with different training
targets to further estimate the clean speech signal. By using different
training targets, the generalization ability of the neural network
models is improved, and thereby better separation performance is
achieved.
The second contribution addresses the MSS problem in reverberant room
environments. To achieve this goal, a novel time-frequency (T-F) mask,
the dereverberation mask (DM), is proposed to estimate the relationship
between the reverberant noisy speech mixture and the dereverberated
mixture. Then, a separation mask is exploited to extract the desired
clean speech signal from the noisy speech mixture. The DM can be
integrated with the ideal ratio mask (IRM) to generate the ideal
enhanced mask (IEM), addressing both the dereverberation and separation
problems. Based on the DM and the IEM, a two-stage approach is proposed
with different system structures.
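The masks above can be sketched as follows. The IRM formula is the standard one; the DM and IEM definitions below are plausible magnitude-ratio formulations, not necessarily the exact ones used in the thesis:

```python
import numpy as np

def ideal_ratio_mask(S, N, eps=1e-8):
    """IRM from clean-speech and noise magnitude spectrograms."""
    return np.sqrt(S ** 2 / (S ** 2 + N ** 2 + eps))

def dereverberation_mask(M_derev, M_rev, eps=1e-8):
    """DM as the ratio of dereverberated to reverberant mixture
    magnitudes (assumed definition), clipped to [0, 1]."""
    return np.clip(M_derev / (M_rev + eps), 0.0, 1.0)

def ideal_enhanced_mask(dm, irm):
    """IEM as the element-wise combination of DM and IRM (assumed),
    handling dereverberation and separation in a single mask."""
    return dm * irm
```

Applying the IEM to the reverberant noisy mixture spectrogram then performs both steps at once, which is what makes a single-stage alternative to the two-stage structure possible.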
In the final contribution, both the phase information of the clean
speech signal and the long short-term memory (LSTM) recurrent neural
network (RNN) are introduced. A novel complex signal approximation
(SA)-based method is proposed in the complex domain of signals. By
utilizing the LSTM RNN as the neural network model, the temporal
information is better used, and the desired speech signal can be
estimated more accurately. Besides, the phase information of the clean
speech signal is applied to mitigate the negative influence of the noisy
phase information.
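The use of clean-speech phase can be sketched as a reconstruction step: an estimated magnitude spectrogram is combined with the phase of a reference STFT (the clean phase in oracle experiments, the noisy phase otherwise). The function and variable names here are illustrative:

```python
import numpy as np

def apply_phase(mag_est, X_ref):
    """Combine an estimated magnitude with the phase of a reference
    STFT; the result can then be passed to an inverse STFT."""
    return mag_est * np.exp(1j * np.angle(X_ref))
```

Swapping the noisy phase for the clean phase in this step isolates how much of the residual distortion is due to phase alone.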
The proposed MSS algorithms are evaluated on various challenging
datasets such as the TIMIT and IEEE corpora and the NOISEX database. The
algorithms are assessed against state-of-the-art techniques and
performance measures to confirm that the proposed MSS algorithms provide
novel solutions.
Deep Learning for Distant Speech Recognition
Deep learning is an emerging technology that is considered one of the most
promising directions for reaching higher levels of artificial intelligence.
Among the other achievements, building computers that understand speech
represents a crucial leap towards intelligent machines. Despite the great
efforts of the past decades, however, a natural and robust human-machine speech
interaction still appears to be out of reach, especially when users interact
with a distant microphone in noisy and reverberant environments. The latter
disturbances severely hamper the intelligibility of a speech signal, making
Distant Speech Recognition (DSR) one of the major open challenges in the field.
This thesis addresses the latter scenario and proposes some novel techniques,
architectures, and algorithms to improve the robustness of distant-talking
acoustic models. We first elaborate on methodologies for realistic data
contamination, with a particular emphasis on DNN training with simulated data.
We then investigate approaches for better exploiting speech contexts,
proposing some original methodologies for both feed-forward and recurrent
neural networks. Lastly, inspired by the idea that cooperation across different
DNNs could be the key for counteracting the harmful effects of noise and
reverberation, we propose a novel deep learning paradigm called network of deep
neural networks. The analysis of the original concepts was based on extensive
experimental validations conducted on both real and simulated data, considering
different corpora, microphone configurations, environments, noisy conditions,
and ASR tasks.
Comment: PhD thesis, Unitn, 201
Speech processing using digital MEMS microphones
The last few years have seen the start of a unique change in microphones for consumer
devices such as smartphones or tablets. Almost all analogue capacitive microphones
are being replaced by digital silicon microphones or MEMS microphones.
MEMS microphones perform differently to conventional analogue microphones. Their
greatest disadvantage is significantly increased self-noise or decreased SNR, while
their most significant benefits are ease of design and manufacturing and improved sensitivity
matching.
This thesis presents research on speech processing, comparing conventional analogue
microphones with the newly available digital MEMS microphones. Specifically, voice
activity detection, speaker diarisation (who spoke when), speech separation and speech
recognition are looked at in detail.
In order to carry out this research different microphone arrays were built using digital
MEMS microphones and corpora were recorded to test existing algorithms and devise
new ones. Some corpora that were created for the purpose of this research will be
released to the public in 2013.
It was found that the most commonly used VAD algorithm in current state-of-the-art
diarisation systems is not the best-performing one, i.e. MLP-based voice activity
detection consistently outperforms the more frequently used GMM-HMM-based VAD
schemes. In addition, an algorithm was derived that can determine the number of active
speakers in a meeting recording given audio data from a microphone array of known
geometry, leading to improved diarisation results.
Finally, speech separation experiments were carried out using different post-filtering
algorithms, matching or exceeding current state-of-the-art results.
The performance of the algorithms and methods presented in this thesis was verified
by comparing their output using speech recognition tools and simple MLLR adaptation
and the results are presented as word error rates, an easily comprehensible scale.
To summarise, using speech recognition and speech separation experiments, this thesis
demonstrates that the significantly reduced SNR of the MEMS microphone can be
compensated for with well established adaptation techniques such as MLLR. MEMS
microphones do not affect voice activity detection and speaker diarisation performance.
Modelling the nonstationarity of speech in the maximum negentropy beamformer
State-of-the-art automatic speech recognition (ASR) systems can achieve very low word error rates (WERs) of below 5% on data recorded with headsets. However, in many situations such as ASR at meetings or in the car, far field microphones on the table, walls or devices such as laptops are preferable to microphones that have to be worn close to the user's mouth. Unfortunately, the distance between speakers and microphones introduces significant noise and reverberation, and as a consequence the WERs of current ASR systems on this data tend to be unacceptably high (30-50% upwards). The use of a microphone array, i.e. several microphones, can alleviate the problem somewhat by performing spatial filtering: beamforming techniques combine the sensors' output in a way that focuses the processing on a particular direction. Assuming that the signal of interest comes from a different direction than the noise, this can improve the signal quality and reduce the WER by filtering out sounds coming from non-relevant directions. Historically, array processing techniques developed from research on non-speech data, e.g. in the fields of sonar and radar, and as a consequence most techniques were not created to specifically address beamforming in the context of ASR. While this generality can be seen as an advantage in theory, it also means that these methods ignore characteristics which could be used to improve the process in a way that benefits ASR. An example of beamforming adapted to speech processing is the recently proposed maximum negentropy beamformer (MNB), which exploits the statistical characteristics of speech as follows. "Clean" headset speech differs from noisy or reverberant speech in its statistical distribution, which is much less Gaussian in the clean case.
Since negentropy is a measure of non-Gaussianity, choosing beamformer weights that maximise the negentropy of the output leads to speech that is closer to clean speech in its distribution, and this in turn has been shown to lead to improved WERs [Kumatani et al., 2009]. In this thesis several refinements of the MNB algorithm are proposed and evaluated. Firstly, a number of modifications to the original MNB configuration are proposed based on theoretical or practical concerns. These changes concern the probability density function (pdf) used to model speech, the estimation of the pdf parameters, and the method of calculating the negentropy. Secondly, a further step is taken to reflect the characteristics of speech by introducing time-varying pdf parameters. The original MNB uses fixed estimates per utterance, which do not account for the nonstationarity of speech. Several time-dependent variance estimates are therefore proposed, beginning with a simple moving average window and including the HMM-MNB, which derives the variance estimate from a set of auxiliary hidden Markov models. All beamformer algorithms presented in this thesis are evaluated through far-field ASR experiments on the Multi-Channel Wall Street Journal Audio-Visual Corpus, a database of utterances captured with real far-field sensors, in a realistic acoustic environment, and spoken by real speakers. While the proposed methods do not lead to an improvement in ASR performance, a more efficient MNB algorithm is developed, and it is shown that comparable results can be achieved with significantly less data than all frames of the utterance, a result which is of particular relevance for real-time implementations.
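The negentropy criterion can be approximated empirically with the log-cosh contrast known from the ICA literature (Hyvärinen's one-unit approximation, not necessarily the exact estimator used in the MNB). The sketch below only illustrates that a super-Gaussian, speech-like signal scores higher than a Gaussian one:

```python
import numpy as np

# Monte-Carlo reference value E[G(nu)] for a standard normal nu.
_NU = np.random.default_rng(0).standard_normal(1_000_000)
EG_GAUSS = np.mean(np.log(np.cosh(_NU)))

def negentropy_logcosh(y):
    """One-unit negentropy approximation J(y) ~ (E[G(y)] - E[G(nu)])^2
    with G(u) = log cosh u; y is standardised to zero mean and unit
    variance first, so only the shape of its distribution matters."""
    y = (y - y.mean()) / y.std()
    return (np.mean(np.log(np.cosh(y))) - EG_GAUSS) ** 2

rng = np.random.default_rng(2)
gauss = rng.standard_normal(200_000)     # reverberant/noisy proxy: near-Gaussian
laplace = rng.laplace(size=200_000)      # clean-speech proxy: super-Gaussian
```

Maximising such a score over the beamformer weights pushes the output distribution away from Gaussianity, i.e. towards the distribution of clean headset speech.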