
    Exploiting correlogram structure for robust speech recognition with multiple speech sources

    This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture in the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage, sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located at the delays corresponding to multiples of the pitch period. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are passed to a `speech fragment decoder' which employs `missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches the model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments across different conditions, which results in significantly better recognition accuracy.
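    The correlogram processing described above can be illustrated with a small, hypothetical sketch (not the authors' implementation): it computes per-channel autocorrelations for a single frame of band-passed signals and reads the pitch period off the summary autocorrelation. The frame length, channel contents and lag search range below are assumptions.

```python
# Minimal correlogram sketch: per-channel short-time autocorrelation plus a
# summary-autocorrelation pitch estimate. Illustrative only; the paper's system
# additionally groups channels via the tree-like correlogram structure.
import numpy as np

def correlogram_frame(channels, max_lag):
    """channels: (n_channels, frame_len) array of band-passed signals."""
    n_ch, n = channels.shape
    acg = np.zeros((n_ch, max_lag))
    for c in range(n_ch):
        x = channels[c]
        # autocorrelation of this channel for lags 0 .. max_lag-1
        acg[c] = [np.dot(x[:n - lag], x[lag:]) for lag in range(max_lag)]
    return acg

def pitch_lag(acg, min_lag=32):
    """Summary autocorrelation: sum over channels, pick the dominant lag."""
    summary = acg.sum(axis=0)
    return min_lag + int(np.argmax(summary[min_lag:]))

# Toy usage: two channels carrying harmonics of a 100 Hz source at fs = 8 kHz
fs = 8000
t = np.arange(1024) / fs
chans = np.stack([np.sin(2 * np.pi * 100 * t), np.sin(2 * np.pi * 200 * t)])
acg = correlogram_frame(chans, max_lag=400)
print("estimated pitch:", fs / pitch_lag(acg), "Hz")  # ~100 Hz
```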

    Resynthesis of Acoustic Scenes Combining Sound Source Separation and WaveField Synthesis Techniques

    Source separation has been a subject of intense research in many signal processing applications, ranging from speech processing to medical image analysis. Applied to spatial audio systems, it can be used to overcome one fundamental limitation in 3D scene resynthesis: the need to have the independent signal for each source available. Wave-field Synthesis (WFS) is a spatial sound reproduction system that can synthesize an acoustic field by means of loudspeaker arrays and is also capable of positioning several sources in space. However, obtaining the individual signals corresponding to these sources is often a difficult problem. In this work, we propose the use of sound source separation techniques to obtain separate tracks from stereo and mono mixtures. Several separation methods have been implemented and tested, one of them developed by the author. Although existing algorithms are far from achieving high fidelity, subjective tests show that an optimal separation is not necessary to obtain acceptable results in 3D scene reproduction. Cobos Serrano, M. (2007). Resynthesis of Acoustic Scenes Combining Sound Source Separation and WaveField Synthesis Techniques. http://hdl.handle.net/10251/12515
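    As a generic illustration of how separate tracks can be pulled from a stereo mixture for later spatialisation, the sketch below keeps only the time-frequency bins whose left/right magnitude ratio matches a target panning position. This is a simple pan-based mask, not the specific algorithms implemented in the thesis; the window size, tolerance and toy signals are assumptions.

```python
# Pan-based stereo separation sketch (generic technique, not the thesis' methods).
import numpy as np
from scipy.signal import stft, istft

def pan_mask_separation(stereo, fs, target_ratio=1.0, tolerance=0.2, nperseg=2048):
    """stereo: (2, n_samples). target_ratio = |L| / |R| of the desired source."""
    _, _, Z = stft(stereo, fs=fs, nperseg=nperseg)           # shape (2, F, T)
    ratio = np.abs(Z[0]) / (np.abs(Z[1]) + 1e-12)
    mask = np.abs(ratio - target_ratio) < tolerance          # binary pan mask
    _, track = istft(Z * mask[np.newaxis], fs=fs, nperseg=nperseg)
    return track                                             # (2, n_samples')

# Toy usage: extract a centre-panned tone (ratio ~1) mixed with a hard-left tone
fs = 16000
t = np.arange(fs) / fs
centre = np.sin(2 * np.pi * 440 * t)
left_only = np.sin(2 * np.pi * 220 * t)
stereo = np.stack([centre + left_only, centre])
print(pan_mask_separation(stereo, fs).shape)
```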

    Single-Microphone Speech Separation: The use of Speech Models


    A computational framework for sound segregation in music signals

    Doctoral thesis. Electrical and Computer Engineering. Faculdade de Engenharia, Universidade do Porto. 200

    Binaural scene analysis: localization, detection and recognition of speakers in complex acoustic scenes

    The human auditory system has the striking ability to robustly localize and recognize a specific target source in complex acoustic environments while ignoring interfering sources. Surprisingly, this remarkable capability, referred to as auditory scene analysis, is achieved by analyzing only the waveforms reaching the two ears. Computers, however, are presently not able to compete with the performance achieved by the human auditory system, even when a computer algorithm based on binaural signals is confronted with a highly constrained version of auditory scene analysis, such as localizing a sound source in a reverberant environment or recognizing a speaker in the presence of interfering noise. In particular, the problem of focusing on an individual speech source in the presence of competing speakers, termed the cocktail party problem, has proven to be extremely challenging for computer algorithms. The primary objective of this thesis is the development of a binaural scene analyzer that is able to jointly localize, detect and recognize multiple speech sources in the presence of reverberation and interfering noise. The processing of the proposed system is divided into three main stages: a localization stage, detection of speech sources, and recognition of speaker identities. The only information assumed to be known a priori is the number of target speech sources present in the acoustic mixture. Furthermore, this work aims to reduce the performance gap between humans and machines by improving the performance of the individual building blocks of the binaural scene analyzer. First, a binaural front-end inspired by auditory processing is designed to robustly determine the azimuth of multiple, simultaneously active sound sources in the presence of reverberation. The localization model builds on the supervised learning of azimuth-dependent binaural cues, namely interaural time and level differences. Multi-conditional training is performed to incorporate the uncertainty of these binaural cues resulting from reverberation and the presence of competing sound sources. Second, a speech detection module that exploits the distinct spectral characteristics of speech and noise signals is developed to automatically select azimuthal positions that are likely to correspond to speech sources. Through the link between the localization stage and the recognition stage established by this speech detection module, the proposed binaural scene analyzer is able to selectively focus on a predefined number of speech sources positioned at unknown spatial locations, while ignoring interfering noise sources emerging from other spatial directions. Third, the speaker identities of all detected speech sources are recognized in the final stage of the model. To reduce the impact of environmental noise on speaker recognition performance, a missing-data classifier is combined with the adaptation of speaker models using a universal background model. This combination is particularly beneficial in non-stationary background noise.
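    A minimal sketch of the two binaural cues the localization front-end relies on, interaural time difference (ITD) estimated by cross-correlation and interaural level difference (ILD) in dB, is given below. The frame, the maximum plausible lag and the toy signals are assumptions; the thesis itself learns azimuth-dependent distributions of these cues with multi-conditional training rather than reading them off directly.

```python
# ITD/ILD extraction sketch for one binaural frame (illustrative assumptions).
import numpy as np

def itd_ild(left, right, fs, max_itd_s=1e-3):
    n = len(left)
    max_lag = int(max_itd_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    # cross-correlation restricted to physiologically plausible lags
    xcorr = np.array([np.dot(left[max(0, -l):n - max(0, l)],
                             right[max(0, l):n - max(0, -l)]) for l in lags])
    itd = lags[int(np.argmax(xcorr))] / fs     # positive: right ear lags
    ild = 10 * np.log10(np.sum(left ** 2) / (np.sum(right ** 2) + 1e-12))
    return itd, ild

# Toy usage: right channel delayed by 5 samples and attenuated by 6 dB
fs = 16000
sig = np.random.randn(4096)
left, right = sig, 0.5 * np.roll(sig, 5)
print(itd_ild(left, right, fs))  # ITD ~ +0.31 ms, ILD ~ +6 dB
```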

    Evaluation of the Importance of Time-Frequency Contributions to Speech Intelligibility in Noise

    Recent studies on binary masking techniques assume that each time-frequency (T-F) unit contributes an equal amount to the overall intelligibility of speech. The present study demonstrated that the importance of each T-F unit to speech intelligibility varies in accordance with speech content. Specifically, T-F units are categorized into two classes: speech-present T-F units and speech-absent T-F units. Results indicate that the importance of each speech-present T-F unit to speech intelligibility is highly related to the loudness of its target component, while the importance of each speech-absent T-F unit varies according to the loudness of its masker component. Two types of mask errors are also considered: miss errors and false alarm errors. Consistent with previous work, false alarm errors are shown to be more harmful to speech intelligibility than miss errors when the mixture signal-to-noise ratio (SNR) is below 0 dB. However, the relative importance of the two types of error depends on the SNR level of the input speech signal. Based on these observations, a mask-based objective measure, the loudness-weighted hit-false, is proposed for predicting speech intelligibility. The proposed objective measure shows significantly higher correlation with intelligibility than two existing mask-based objective measures.
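    The conventional mask comparison that the proposed measure refines can be sketched as follows: hits are correctly retained speech-present T-F units, false alarms are wrongly retained speech-absent units, and HIT minus FA summarizes mask quality. The per-unit loudness weighting proposed in the paper is not reproduced here; the optional weight array is only a hypothetical hook, and the random masks are assumptions.

```python
# Unweighted HIT - FA sketch over binary masks (illustrative; the paper's
# loudness-weighted variant would supply per-unit weights instead of ones).
import numpy as np

def hit_minus_fa(estimated_mask, ideal_mask, weights=None):
    """Masks: boolean arrays of shape (n_frames, n_channels)."""
    est = estimated_mask.astype(bool)
    ideal = ideal_mask.astype(bool)
    w = np.ones(est.shape) if weights is None else weights
    hit = np.sum(w * (est & ideal)) / max(np.sum(w * ideal), 1e-12)
    fa = np.sum(w * (est & ~ideal)) / max(np.sum(w * ~ideal), 1e-12)
    return hit - fa

# Toy usage: an ideal binary mask and an estimate with ~10% of units flipped
rng = np.random.default_rng(0)
ideal = rng.random((100, 64)) > 0.5
estimated = ideal ^ (rng.random((100, 64)) > 0.9)
print(hit_minus_fa(estimated, ideal))
```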

    Binaural Source Separation with Convolutional Neural Networks

    This work is a study of source separation techniques for binaural music mixtures. The chosen framework uses a Convolutional Neural Network (CNN) to estimate time-frequency soft masks. These masks are used to extract the different sources from the original two-channel mixture signal. The baseline single-channel architecture achieved state-of-the-art results on monaural music mixtures under low-latency conditions. It has been extended to perform separation on two-channel signals, making it the first two-channel CNN joint estimation architecture. This means that filters are learned for each source by taking into account the information in both channels. Furthermore, a specific binaural condition is included during the training stage: it uses Interaural Level Difference (ILD) information to improve the spatial images of the extracted sources. Concurrently, we present a novel tool for creating binaural scenes for testing purposes. Multiple binaural scenes are rendered from a music dataset of four instruments (voice, drums, bass and others). The CNN framework has been tested on these binaural scenes and compared with monaural and stereo results. The system showed a great degree of adaptability and good separation results in all scenarios. These results are used to evaluate the impact of spatial information on separation performance.
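    The masking step the abstract describes can be sketched independently of the network: an estimated soft mask (here just a placeholder random array standing in for the CNN output) is applied to both channels of the binaural mixture STFT and the source is resynthesised. The window and hop sizes are assumptions, and the ILD-based training condition is not reproduced.

```python
# Soft-mask resynthesis sketch for a two-channel mixture (placeholder mask).
import numpy as np
from scipy.signal import stft, istft

def apply_soft_mask(mixture, mask, fs, nperseg=1024, noverlap=768):
    """mixture: (2, n_samples) binaural signal; mask: (n_freq, n_frames) in [0, 1]."""
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg, noverlap=noverlap)  # (2, F, T)
    Z_src = Z * mask[np.newaxis, :, :]          # same soft mask on both channels
    _, source = istft(Z_src, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return source                               # (2, n_samples')

# Toy usage with a random mixture and a random mask standing in for the CNN
fs = 44100
mixture = np.random.randn(2, fs)
_, _, Z = stft(mixture, fs=fs, nperseg=1024, noverlap=768)
mask = np.random.rand(Z.shape[1], Z.shape[2])
separated = apply_soft_mask(mixture, mask, fs)
print(separated.shape)
```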