33 research outputs found

    Listening and grouping: an online autoregressive approach for monaural speech separation

    Get PDF
    This paper proposes an autoregressive approach to harness the power of deep learning for multi-speaker monaural speech separation. It exploits a causal temporal context in both mixture and past estimated separated signals and performs online separation that is compatible with real-time applications. The approach adopts a learned listening and grouping architecture motivated by computational auditory scene analysis, with a grouping stage that effectively addresses the label permutation problem at both frame and segment levels. Experimental results on the benchmark WSJ0-2mix dataset show that the new approach can outperform the majority of state-of-the-art methods in both closed-set and open-set conditions in terms of signal-to-distortion ratio (SDR) improvement and perceptual evaluation of speech quality (PESQ), even approaches that exploit whole-utterance statistics for separation, with relatively fewer model parameters

    Single-Microphone Speech Enhancement and Separation Using Deep Learning

    Get PDF
    The cocktail party problem comprises the challenging task of understanding a speech signal in a complex acoustic environment, where multiple speakers and background noise signals simultaneously interfere with the speech signal of interest. A signal processing algorithm that can effectively increase the speech intelligibility and quality of speech signals in such complicated acoustic situations is highly desirable. Especially for applications involving mobile communication devices and hearing assistive devices. Due to the re-emergence of machine learning techniques, today, known as deep learning, the challenges involved with such algorithms might be overcome. In this PhD thesis, we study and develop deep learning-based techniques for two sub-disciplines of the cocktail party problem: single-microphone speech enhancement and single-microphone multi-talker speech separation. Specifically, we conduct in-depth empirical analysis of the generalizability capability of modern deep learning-based single-microphone speech enhancement algorithms. We show that performance of such algorithms is closely linked to the training data, and good generalizability can be achieved with carefully designed training data. Furthermore, we propose uPIT, a deep learning-based algorithm for single-microphone speech separation and we report state-of-the-art results on a speaker-independent multi-talker speech separation task. Additionally, we show that uPIT works well for joint speech separation and enhancement without explicit prior knowledge about the noise type or number of speakers. Finally, we show that deep learning-based speech enhancement algorithms designed to minimize the classical short-time spectral amplitude mean squared error leads to enhanced speech signals which are essentially optimal in terms of STOI, a state-of-the-art speech intelligibility estimator.Comment: PhD Thesis. 233 page

    Single-Microphone Speech Enhancement and Separation Using Deep Learning

    Get PDF

    Proceedings of the Second International Mobile Satellite Conference (IMSC 1990)

    Get PDF
    Presented here are the proceedings of the Second International Mobile Satellite Conference (IMSC), held June 17-20, 1990 in Ottawa, Canada. Topics covered include future mobile satellite communications concepts, aeronautical applications, modulation and coding, propagation and experimental systems, mobile terminal equipment, network architecture and control, regulatory and policy considerations, vehicle antennas, and speech compression

    Speech dereverberation and speaker separation using microphone arrays in realistic environments

    Get PDF
    This thesis concentrates on comparing novel and existing dereverberation and speaker separation techniques using multiple corpora, including a new corpus collected using a microphone array. Many corpora currently used for these techniques are recorded using head-mounted microphones in anechoic chambers. This novel corpus contains recordings with noise and reverberation made in office and workshop environments. Novel algorithms present a different way of approximating the reverberation, producing results that are competitive with existing algorithms. Dereverberation is evaluated using seven correlation-based algorithms and applied to two different corpora. Three of these are novel algorithms (Hs NTF, Cauchy WPE and Cauchy MIMO WPE). Both non-learning and learning algorithms are tested, with the learning algorithms performing better. For single and multi-channel speaker separation, unsupervised non-negative matrix factorization (NMF) algorithms are compared using three cost functions combined with sparsity, convolution and direction of arrival. The results show that the choice of cost function is important for improving the separation result. Furthermore, six different supervised deep learning algorithms are applied to single channel speaker separation. Historic information improves the result. When comparing NMF to deep learning, NMF is able to converge faster to a solution and provides a better result for the corpora used in this thesis

    Detection and handling of overlapping speech for speaker diarization

    Get PDF
    For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings, compared to other domains, exhibit an increased complexity due to the spontaneity of speech, reverberation effects, and also due to the presence of overlapping speech. Overlapping speech refers to situations when two or more speakers are speaking simultaneously. In meeting data, a substantial portion of errors of the conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can eventually lead to corrupt single-speaker models and thus to a worse segmentation. This thesis concerns the detection of overlapping speech segments and its further application for the improvement of speaker diarization performance. We propose the use of three spatial cross-correlationbased parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or by a multi-layer perceptron. In addition, we also investigate the possibility of employing longterm prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps. Firstly, a ranking according to mRMR criterion is obtained, and then, a standard hill-climbing wrapper approach is applied in order to determine the optimal number of features. The novel spatial as well as prosodic parameters are used in combination with spectral-based features suggested previously in the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site. In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments are also discarded from the model training. The proposed overlap labeling technique is integrated in Viterbi decoding, a part of the diarization algorithm. During the system development it was discovered that it is favorable to do an independent optimization of overlap exclusion and labeling with respect to the overlap detection system. We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show DER improvement on the RT ¿09 meeting recordings as well. The addition of beamforming and TDOA feature stream into the baseline diarization system, which was aimed at improving the clustering process, results in a bit higher effectiveness of the overlap labeling algorithm. A more detailed analysis on the overlap exclusion behavior reveals big improvement contrasts between individual meeting recordings as well as between various settings of the overlap detection operation point. However, a high performance variability across different recordings is also typical of the baseline diarization system, without any overlap handling
    corecore