32 research outputs found

    Robust speaker diarization for meetings

    Get PDF
    Aquesta tesi doctoral mostra la recerca feta en l'脿rea de la diaritzaci贸 de locutor per a sales de reunions. En la present s'estudien els algorismes i la implementaci贸 d'un sistema en diferit de segmentaci贸 i aglomerat de locutor per a grabacions de reunions a on normalment es t茅 acc茅s a m茅s d'un micr貌fon per al processat. El bloc m茅s important de recerca s'ha fet durant una estada al International Computer Science Institute (ICSI, Berkeley, Caligornia) per un per铆ode de dos anys.La diaritzaci贸 de locutor s'ha estudiat for莽a per al domini de grabacions de r脿dio i televisi贸. La majoria dels sistemes proposats utilitzen algun tipus d'aglomerat jer脿rquic de les dades en grups ac煤stics a on de bon principi no se sap el n煤mero de locutors 貌ptim ni tampoc la seva identitat. Un m猫tode molt comunment utilitzat s'anomena "bottom-up clustering" (aglomerat de baix-a-dalt), amb el qual inicialment es defineixen molts grups ac煤stics de dades que es van ajuntant de manera iterativa fins a obtenir el nombre 貌ptim de grups tot i acomplint un criteri de parada. Tots aquests sistemes es basen en l'an脿lisi d'un canal d'entrada individual, el qual no permet la seva aplicaci贸 directa per a reunions. A m茅s a m茅s, molts d'aquests algorisms necessiten entrenar models o afinar els parameters del sistema usant dades externes, el qual dificulta l'aplicabilitat d'aquests sistemes per a dades diferents de les usades per a l'adaptaci贸.La implementaci贸 proposada en aquesta tesi es dirigeix a solventar els problemes mencionats anteriorment. Aquesta pren com a punt de partida el sistema existent al ICSI de diaritzaci贸 de locutor basat en l'aglomerat de "baix-a-dalt". Primer es processen els canals de grabaci贸 disponibles per a obtindre un sol canal d'audio de qualitat major, a m茅s d铆nformaci贸 sobre la posici贸 dels locutors existents. Aleshores s'implementa un sistema de detecci贸 de veu/silenci que no requereix de cap entrenament previ, i processa els segments de veu resultant amb una versi贸 millorada del sistema mono-canal de diaritzaci贸 de locutor. Aquest sistema ha estat modificat per a l'煤s de l'informaci贸 de posici贸 dels locutors (quan es tingui) i s'han adaptat i creat nous algorismes per a que el sistema obtingui tanta informaci贸 com sigui possible directament del senyal acustic, fent-lo menys depenent de les dades de desenvolupament. El sistema resultant 茅s flexible i es pot usar en qualsevol tipus de sala de reunions pel que fa al nombre de micr貌fons o la seva posici贸. El sistema, a m茅s, no requereix en absolute dades d麓entrenament, sent m茅s senzill adaptar-lo a diferents tipus de dades o dominis d'aplicaci贸. Finalment, fa un pas endavant en l'煤s de parametres que siguin mes robusts als canvis en les dades ac煤stiques. Dos versions del sistema es van presentar amb resultats excel.lents a les evaluacions de RT05s i RT06s del NIST en transcripci贸 rica per a reunions, a on aquests es van avaluar amb dades de dos subdominis diferents (conferencies i reunions). A m茅s a m茅s, es fan experiments utilitzant totes les dades disponibles de les evaluacions RT per a demostrar la viabilitat dels algorisms proposats en aquesta tasca.This thesis shows research performed into the topic of speaker diarization for meeting rooms. It looks into the algorithms and the implementation of an offline speaker segmentation and clustering system for a meeting recording where usually more than one microphone is available. The main research and system implementation has been done while visiting the International Computes Science Institute (ICSI, Berkeley, California) for a period of two years. Speaker diarization is a well studied topic on the domain of broadcast news recordings. Most of the proposed systems involve some sort of hierarchical clustering of the data into clusters, where the optimum number of speakers of their identities are unknown a priory. A very commonly used method is called bottom-up clustering, where multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. Such systems are based on a single channel input, not allowing a direct application for the meetings domain. Although some efforts have been done to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some sort of models training or parameter tuning using external data, which impedes its usability with data different from what they have been adapted to.The implementation proposed in this thesis works towards solving the aforementioned problems. Taking the existing hierarchical bottom-up mono-channel speaker diarization system from ICSI, it first uses a flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then implements a train-free speech/non-speech detection on such signal and processes the resulting speech segments with an improved version of the mono-channel speaker diarization system. Such system has been modified to use speaker location information (then available) and several algorithms have been adapted or created new to adapt the system behavior to each particular recording by obtaining information directly from the acoustics, making it less dependent on the development data.The resulting system is flexible to any meetings room layout regarding the number of microphones and their placement. It is train-free making it easy to adapt to different sorts of data and domains of application. Finally, it takes a step forward into the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted with excellent results in RT05s and RT06s NIST Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. Also, experiments using the RT datasets from all meetings evaluations were used to test the different proposed algorithms proving their suitability to the task.Postprint (published version

    Non-Intrusive Speech Intelligibility Prediction

    Get PDF

    Speech Intelligibility Prediction for Hearing Aid Systems

    Get PDF

    Comodulation masking release in bit-rate reduction systems

    Get PDF

    Single- and multi-microphone speech dereverberation using spectral enhancement

    Get PDF
    In speech communication systems, such as voice-controlled systems, hands-free mobile telephones, and hearing aids, the received microphone signals are degraded by room reverberation, background noise, and other interferences. This signal degradation may lead to total unintelligibility of the speech and decreases the performance of automatic speech recognition systems. In the context of this work reverberation is the process of multi-path propagation of an acoustic sound from its source to one or more microphones. The received microphone signal generally consists of a direct sound, reflections that arrive shortly after the direct sound (commonly called early reverberation), and reflections that arrive after the early reverberation (commonly called late reverberation). Reverberant speech can be described as sounding distant with noticeable echo and colouration. These detrimental perceptual effects are primarily caused by late reverberation, and generally increase with increasing distance between the source and microphone. Conversely, early reverberations tend to improve the intelligibility of speech. In combination with the direct sound it is sometimes referred to as the early speech component. Reduction of the detrimental effects of reflections is evidently of considerable practical importance, and is the focus of this dissertation. More specifically the dissertation deals with dereverberation techniques, i.e., signal processing techniques to reduce the detrimental effects of reflections. In the dissertation, novel single- and multimicrophone speech dereverberation algorithms are developed that aim at the suppression of late reverberation, i.e., at estimation of the early speech component. This is done via so-called spectral enhancement techniques that require a specific measure of the late reverberant signal. This measure, called spectral variance, can be estimated directly from the received (possibly noisy) reverberant signal(s) using a statistical reverberation model and a limited amount of a priori knowledge about the acoustic channel(s) between the source and the microphone(s). In our work an existing single-channel statistical reverberation model serves as a starting point. The model is characterized by one parameter that depends on the acoustic characteristics of the environment. We show that the spectral variance estimator that is based on this model, can only be used when the source-microphone distance is larger than the so-called critical distance. This is, crudely speaking, the distance where the direct sound power is equal to the total reflective power. A generalization of the statistical reverberation model in which the direct sound is incorporated is developed. This model requires one additional parameter that is related to the ratio between the direct sound energy and the sound energy of all reflections. The generalized model is used to derive a novel spectral variance estimator. When the novel estimator is used for dereverberation rather than the existing estimator, and the source-microphone distance is smaller than the critical distance, the dereverberation performance is significantly increased. Single-microphone systems only exploit the temporal and spectral diversity of the received signal. Reverberation, of course, also induces spatial diversity. To additionally exploit this diversity, multiple microphones must be used, and their outputs must be combined by a suitable spatial processor such as the so-called delay and sum beamformer. It is not a priori evident whether spectral enhancement is best done before or after the spatial processor. For this reason we investigate both possibilities, as well as a merge of the spatial processor and the spectral enhancement technique. An advantage of the latter option is that the spectral variance estimator can be further improved. Our experiments show that the use of multiple microphones affords a significant improvement of the perceptual speech quality. The applicability of the theory developed in this dissertation is demonstrated using a hands-free communication system. Since hands-free systems are often used in a noisy and reverberant environment, the received microphone signal does not only contain the desired signal but also interferences such as room reverberation that is caused by the desired source, background noise, and a far-end echo signal that results from a sound that is produced by the loudspeaker. Usually an acoustic echo canceller is used to cancel the far-end echo. Additionally a post-processor is used to suppress background noise and residual echo, i.e., echo which could not be cancelled by the echo canceller. In this work a novel structure and post-processor for an acoustic echo canceller are developed. The post-processor suppresses late reverberation caused by the desired source, residual echo, and background noise. The late reverberation and late residual echo are estimated using the generalized statistical reverberation model. Experimental results convincingly demonstrate the benefits of the proposed system for suppressing late reverberation, residual echo and background noise. The proposed structure and post-processor have a low computational complexity, a highly modular structure, can be seamlessly integrated into existing hands-free communication systems, and affords a significant increase of the listening comfort and speech intelligibility
    corecore