20 research outputs found
Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification
In this paper, we study the usefulness of various matrix factorization methods for learning features for the Acoustic Scene Classification (ASC) problem. A common way of addressing ASC has been to engineer features capable of capturing the specificities of acoustic environments. Instead, we show that better representations of the scenes can be learned automatically from time-frequency representations using matrix factorization techniques. We mainly focus on extensions of Principal Component Analysis and Nonnegative Matrix Factorization, including sparse, kernel-based and convolutive variants, as well as a novel supervised dictionary learning variant. An experimental evaluation is performed on two of the largest available ASC datasets in order to compare and discuss the usefulness of these methods for the task. We show that the unsupervised learning methods provide better representations of acoustic scenes than the best conventional hand-crafted features on both datasets. Furthermore, the introduction of a novel nonnegative supervised matrix factorization model and of deep neural networks trained on spectrograms allows us to reach further improvements.
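As a rough illustration of the dictionary-learning idea, the following sketch runs PCA on toy spectrogram frames and pools the projections into a fixed-length scene feature. All shapes, data and the pooling choice are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np

# Toy time-frequency representation: magnitude spectrogram frames as rows,
# shape (n_frames, n_freq_bins). Sizes here are illustrative only.
rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((200, 64)))

# PCA via SVD of the mean-centred data: the right singular vectors act as a
# learned spectral dictionary; projections onto them are per-frame features.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 16
features = Xc @ Vt[:k].T              # (200, 16) per-frame features

# Pooling over time (here: per-component standard deviation) gives one
# fixed-length vector per scene, usable by a downstream classifier.
scene_vector = features.std(axis=0)   # (16,)
```

A supervised or nonnegative variant would change how the dictionary is fitted, but the project-then-pool pattern stays the same.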
Binaural Source Separation with Convolutional Neural Networks
This work is a study of source separation techniques for binaural music mixtures. The chosen framework uses a Convolutional Neural Network (CNN) to estimate time-frequency soft masks. These masks are used to extract the different sources from the original two-channel mixture signal. The baseline single-channel architecture achieved state-of-the-art results on monaural music mixtures under low-latency conditions. It has been extended to perform separation on two-channel signals, making it the first two-channel CNN joint estimation architecture: filters are learned for each source by taking into account the information from both channels. Furthermore, a specific binaural condition is included during the training stage, which uses Interaural Level Difference (ILD) information to improve the spatial images of the extracted sources. Concurrently, we present a novel tool to create binaural scenes for testing purposes. Multiple binaural scenes are rendered from a music dataset of four instruments (voice, drums, bass and others). The CNN framework has been tested on these binaural scenes and compared with monaural and stereo results. The system showed great adaptability and good separation results in all scenarios. These results are used to evaluate the impact of spatial information on separation performance.
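A minimal sketch of the soft-masking step and the ILD cue, assuming a two-channel STFT and a placeholder mask standing in for the CNN output; all data and shapes are illustrative, not the trained model.

```python
import numpy as np

# Illustrative two-channel mixture STFT, shape (2, F, T), and a soft mask
# in [0, 1] for one source. In the paper the mask comes from a CNN; here
# a random placeholder stands in for the network output.
rng = np.random.default_rng(1)
mix_stft = rng.standard_normal((2, 64, 50)) + 1j * rng.standard_normal((2, 64, 50))
mask = rng.uniform(0.0, 1.0, size=(64, 50))      # placeholder CNN output

# Joint two-channel estimation: the mask (conditioned on both channels) is
# applied to each channel's complex STFT to extract the source image.
source_stft = mask[None, :, :] * mix_stft        # (2, 64, 50)

# Interaural Level Difference (ILD) in dB per T-F bin: the spatial cue the
# binaural training condition exploits.
eps = 1e-8
ild = 20 * np.log10((np.abs(mix_stft[0]) + eps) / (np.abs(mix_stft[1]) + eps))
```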
Single channel audio separation using deep neural networks and matrix factorizations
PhD Thesis. Source separation has become a significant research topic in the signal processing and machine learning communities. Due to numerous applications, such as automatic speech recognition and speech communication, separation of target speech from a mixed signal is of great importance. In many practical applications, speech separation from a single recording is the most desirable from an application standpoint. In this thesis, two novel approaches are proposed to address this single-channel audio separation problem. The thesis first reviews traditional approaches for single-channel source separation, and later elicits a generic approach that is more capable of feature learning, i.e. deep graphical models.
In the first part of this thesis, a novel approach based on matrix factorization and a hierarchical model is proposed. In this work, an artificial stereo mixture is formulated to provide extra information. In addition, a hybrid framework that combines the generalized Expectation-Maximization algorithm with a multiplicative update rule is proposed to optimize the parameters of a matrix factorization based approach that approximately separates the mixture. Furthermore, a hierarchical model based on an extreme learning machine is developed to check the validity of the approximately separated sources, followed by an energy minimization method that further improves the quality of the separated sources by generating a time-frequency mask. Various experiments have been conducted, and the obtained results show that the proposed approach outperforms conventional approaches not only in computational complexity but also in separation performance.
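The multiplicative update rule mentioned above can be sketched for plain Euclidean-cost NMF as follows. The data and sizes are illustrative, and this omits the EM half of the hybrid framework and the artificial stereo formulation.

```python
import numpy as np

# Toy nonnegative mixture spectrogram V, factorized as V ~ W @ H with
# multiplicative updates that keep W and H nonnegative by construction.
rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((40, 60)))
K, eps = 8, 1e-9
W = rng.uniform(size=(40, K))
H = rng.uniform(size=(K, 60))

err0 = np.linalg.norm(V - W @ H)     # initial reconstruction error
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
    W *= (V @ H.T) / (W @ H @ H.T + eps)   # update dictionary
err = np.linalg.norm(V - W @ H)      # error is nonincreasing under these updates
```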
In the second part, a deep neural network based ensemble system is proposed. In this work, the complementary properties of different features are fully explored by a "wide" and "forward" ensemble system. In addition, instead of using the features learned from the output layer, the features learned from the penultimate layer are investigated. The final embedded features are classified with an extreme learning machine to generate a binary mask that separates the mixed signal. The experiments focus on speech in the presence of music, and the obtained results demonstrate that the proposed ensemble system is able to explore the complementary properties of various features thoroughly under various conditions, with promising separation performance.
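The binary-mask step can be illustrated as follows, with placeholder magnitude estimates standing in for the learned embedded features and the extreme learning machine's per-bin decision.

```python
import numpy as np

# Placeholder per-bin magnitude estimates for the two sources; in the
# thesis these decisions come from an extreme learning machine applied to
# penultimate-layer features of the DNN ensemble.
rng = np.random.default_rng(2)
speech_mag = np.abs(rng.standard_normal((64, 50)))   # estimated |speech|
music_mag = np.abs(rng.standard_normal((64, 50)))    # estimated |music|
mix_stft = (speech_mag + music_mag) * np.exp(1j * rng.uniform(0, 2 * np.pi, (64, 50)))

# Binary mask: assign each T-F bin to speech when speech energy dominates,
# then apply the mask to the complex mixture STFT.
ibm = (speech_mag > music_mag).astype(float)
speech_est = ibm * mix_stft
```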
Single-Microphone Speech Enhancement and Separation Using Deep Learning
The cocktail party problem comprises the challenging task of understanding a speech signal in a complex acoustic environment, where multiple speakers and background noise signals simultaneously interfere with the speech signal of interest. A signal processing algorithm that can effectively increase the speech intelligibility and quality of speech signals in such complicated acoustic situations is highly desirable, especially for applications involving mobile communication devices and hearing assistive devices. Due to the re-emergence of machine learning techniques, today known as deep learning, the challenges involved with such algorithms might be overcome. In this PhD thesis, we study and develop deep learning-based techniques for two sub-disciplines of the cocktail party problem: single-microphone speech enhancement and single-microphone multi-talker speech separation. Specifically, we conduct an in-depth empirical analysis of the generalization capability of modern deep learning-based single-microphone speech enhancement algorithms. We show that the performance of such algorithms is closely linked to the training data, and that good generalization can be achieved with carefully designed training data. Furthermore, we propose uPIT, a deep learning-based algorithm for single-microphone speech separation, and we report state-of-the-art results on a speaker-independent multi-talker speech separation task. Additionally, we show that uPIT works well for joint speech separation and enhancement without explicit prior knowledge about the noise type or number of speakers. Finally, we show that deep learning-based speech enhancement algorithms designed to minimize the classical short-time spectral amplitude mean squared error lead to enhanced speech signals which are essentially optimal in terms of STOI, a state-of-the-art speech intelligibility estimator. (PhD thesis, 233 pages.)
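The core uPIT idea, scoring the network outputs under the best speaker permutation over the whole utterance, can be sketched as below; shapes and data are illustrative, and mean-squared error stands in for the actual training loss.

```python
import numpy as np
from itertools import permutations

# Toy setup: S clean source spectrograms and network estimates that happen
# to come out in swapped order, a situation a fixed output-to-speaker
# assignment would penalize but permutation invariant training forgives.
rng = np.random.default_rng(0)
S = 2
targets = rng.standard_normal((S, 64, 50))
estimates = targets[::-1].copy()              # outputs in swapped order

def pit_loss(est, ref):
    """Utterance-level PIT: minimum loss over all speaker permutations."""
    losses = []
    for perm in permutations(range(len(ref))):
        losses.append(np.mean((est[list(perm)] - ref) ** 2))
    return min(losses)

naive_loss = np.mean((estimates - targets) ** 2)   # penalizes the swap
loss = pit_loss(estimates, targets)                # finds the matching order
```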
Deep neural networks for monaural source separation
PhD Thesis. In monaural source separation (MSS) only one recording is available and spatial information, generally, cannot be extracted. It is also an underdetermined inverse problem. Recently, the development of the deep neural network (DNN) has provided a framework to address this problem. How to select the types of neural network models and training targets is the research question. Moreover, in real room environments, the reverberations from the floor, walls, ceiling and furniture are challenging: they distort the received mixture and degrade the separation performance. In many real-world applications, due to hardware size constraints, multiple microphones are not always available. Hence, deep learning based MSS is the focus of this thesis.
The first contribution is improving the separation performance by enhancing the generalization ability of deep learning-based MSS methods. According to the no free lunch (NFL) theorem, it is impossible to find a neural network model which can estimate the training target perfectly in all cases. From the acquired speech mixture, the information of the clean speech signal could be over- or underestimated. A discriminative criterion in the objective function can be used to address this ambiguous information problem in the training stage of deep learning. Based on this, an adaptive discriminative criterion is proposed, and better separation performance is obtained. An alternative method is to use sequentially trained neural network models with different training targets to further estimate
the clean speech signal. By using different training targets, the generalization ability of the neural network models is improved, and thereby better separation performance is achieved.
The second contribution addresses the MSS problem in reverberant room environments. To achieve this goal, a novel time-frequency (T-F) mask, the dereverberation mask (DM), is proposed to estimate the relationship between the reverberant noisy speech mixture and the dereverberated mixture. Then, a separation mask is exploited to extract the desired clean speech signal from the noisy speech mixture. The DM can be integrated with the ideal ratio mask (IRM) to generate an ideal enhanced mask (IEM) that addresses both the dereverberation and separation problems. Based on the DM and the IEM, a two-stage approach is proposed with different system structures.
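The relationship between the three masks can be sketched on toy magnitude spectrograms. The additive mixture model and the simplified magnitude-ratio IRM below are illustrative assumptions (the IRM is often defined on squared magnitudes), not the thesis's exact definitions.

```python
import numpy as np

# Toy magnitudes under a simple additive model:
#   DM  : reverberant noisy mixture -> dereverberated mixture
#   IRM : dereverberated mixture    -> clean speech
#   IEM : their composition, reverberant noisy mixture -> clean speech
rng = np.random.default_rng(3)
clean = np.abs(rng.standard_normal((64, 50)))
noise = np.abs(rng.standard_normal((64, 50)))
derev_mix = clean + noise                       # dereverberated mixture
reverb = np.abs(rng.standard_normal((64, 50)))  # reverberant energy
reverb_mix = derev_mix + reverb                 # received mixture

eps = 1e-9
dm = derev_mix / (reverb_mix + eps)             # dereverberation mask
irm = clean / (derev_mix + eps)                 # simplified ratio mask
iem = dm * irm                                  # ideal enhanced mask

# Applying the IEM to the received mixture recovers the clean magnitude
# in one step, combining dereverberation and separation.
est = iem * reverb_mix
```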
In the final contribution, both the phase information of the clean speech signal and the long short-term memory (LSTM) recurrent neural network (RNN) are introduced. A novel complex signal approximation (SA)-based method is proposed, operating in the complex domain of the signals. By utilizing the LSTM RNN as the neural network model, the temporal information is better used, and the desired speech signal can be estimated more accurately. In addition, the phase information of the clean speech signal is applied to mitigate the negative influence of the noisy phase information.
The proposed MSS algorithms are evaluated on various challenging datasets, such as the TIMIT and IEEE corpora and the NOISEX database. The algorithms are assessed against state-of-the-art techniques with standard performance measures, confirming that the proposed MSS algorithms provide a novel solution.
Single channel overlapped-speech detection and separation of spontaneous conversations
PhD Thesis. In this thesis, spontaneous conversation containing both speech mixture and speech dialogue is considered. The speech mixture refers to speakers speaking simultaneously (i.e. overlapped speech). The speech dialogue refers to segments in which only one speaker is actively speaking and the other is silent. The input conversation is first processed by overlapped-speech detection, and two output signals are then segregated into dialogue and mixture formats. The dialogue is processed by speaker diarization, whose outputs are the individual speech of each speaker. The mixture is processed by speech separation, whose outputs are independent separated speech signals, one per speaker. When the separation input contains only the mixture, a blind speech separation approach is used. When the separation is assisted by the outputs of the speaker diarization, it is informed speech separation. The research presents a novel overlapped-speech detection algorithm and two speech separation algorithms.
The proposed overlapped-speech detection algorithm estimates the switching instants of the input. An optimization loop is adapted to adopt the best encapsulated audio features and to avoid the worst. The optimization relies on principles of pattern recognition and k-means clustering. Over 300 simulated conversations, the average False-Alarm Error is 1.9%, the average Missed-Speech Error is 0.4%, and the average Overlap-Speaker Error is 1%. These errors are approximately equal to those reported for the best recent speaker diarization systems on reliable corpora.
The proposed blind speech separation algorithm consists of four sequential techniques: filter-bank analysis, Non-negative Matrix Factorization (NMF), speaker clustering and filter-bank synthesis. Instead of the otherwise required speaker segmentation, an effective standard framing is contributed. The average objective scores (SAR, SDR and SIR) over 51 simulated conversations are 5.06 dB, 4.87 dB and 12.47 dB respectively.
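A rough sketch of the NMF-plus-clustering stage: a toy spectrogram is factorized with multiplicative updates, then the components are grouped per speaker and resynthesized by soft masking. The data, sizes and the fixed two-way component split (standing in for the actual speaker clustering) are illustrative assumptions.

```python
import numpy as np

# Toy mixture spectrogram and plain Euclidean-cost NMF.
rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((40, 60)))
K, eps = 6, 1e-9
W = rng.uniform(size=(40, K))
H = rng.uniform(size=(K, 60))
for _ in range(100):                              # multiplicative updates
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

# Placeholder speaker clustering: a fixed split of the components into two
# groups (a real system would cluster the basis spectra in W's columns).
groups = [list(range(K // 2)), list(range(K // 2, K))]

# Wiener-style soft mask per speaker, then masked resynthesis.
total = W @ H + eps
masks = [(W[:, g] @ H[g]) / total for g in groups]
sources = [m * V for m in masks]                  # per-speaker magnitudes
```

By construction the two soft masks partition the mixture energy, so the per-speaker estimates sum back to the mixture.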
For the proposed informed speech separation algorithm, the outputs of the speaker diarization form a generated database. The database assists the speech separation by creating virtual targeted speech and mixture signals. The contributed virtual signals are trained to facilitate the separation by homogenising them with the NMF matrix elements of the real mixture. A contributed masking step optimizes the resulting speech. The average SAR, SDR and SIR over 341 simulated conversations are 9.55 dB, 1.12 dB and 2.97 dB respectively.
According to the objective tests, the two speech separation algorithms are in the mid-range of the well-known NMF-based audio and speech separation methods.