6 research outputs found

    Speaker Diarization For Multiple-Distant-Microphone Meetings Using Several Sources of Information

    Acoustic Beamforming for Speaker Diarization of Meetings

    Articulatory features for conversational speech recognition

    Robust speaker diarization for meetings

    This thesis presents research on speaker diarization for meeting rooms. It examines the algorithms and implementation of an offline speaker segmentation and clustering system for meeting recordings, where more than one microphone is usually available. The main research and system implementation were carried out during a two-year visit to the International Computer Science Institute (ICSI, Berkeley, California).

    Speaker diarization is a well-studied topic in the domain of broadcast news recordings. Most proposed systems involve some form of hierarchical clustering of the data, where the optimum number of speakers and their identities are unknown a priori. A very commonly used method is bottom-up clustering, in which multiple initial clusters are iteratively merged until the optimum number of clusters is reached according to some stopping criterion. Such systems are based on a single-channel input, which prevents their direct application to the meetings domain. Although some efforts had been made to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some form of model training or parameter tuning using external data, which limits their usability with data different from what they were adapted to.

    The implementation proposed in this thesis works towards solving these problems. Starting from the existing hierarchical bottom-up mono-channel speaker diarization system from ICSI, it first uses flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then applies train-free speech/non-speech detection to that signal and processes the resulting speech segments with an improved version of the mono-channel speaker diarization system. That system has been modified to use speaker location information (when available), and several algorithms have been adapted or newly created to adapt the system's behavior to each particular recording by obtaining information directly from the acoustics, making it less dependent on the development data.

    The resulting system is flexible with respect to meeting room layout, in both the number of microphones and their placement. It is train-free, which makes it easy to adapt to different sorts of data and domains of application. Finally, it takes a step forward in the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted with excellent results to the RT05s and RT06s NIST Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. In addition, experiments using the RT datasets from all meetings evaluations were used to test the proposed algorithms, proving their suitability for the task.
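The bottom-up clustering loop described in the abstract above can be illustrated with a minimal sketch. The abstract does not name the merge metric or the stopping criterion, so the centroid distance and the `stop_distance` threshold below are stand-in assumptions, and `bottom_up_cluster` is a hypothetical name, not part of the thesis system.

```python
from itertools import combinations
import math


def centroid(cluster):
    """Mean point of a cluster (a list of equal-length tuples)."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def bottom_up_cluster(points, stop_distance):
    """Merge the closest pair of clusters until no pair is within stop_distance.

    Assumption: centroid distance and a fixed threshold stand in for the
    merge metric and stopping criterion, which the abstract does not specify.
    """
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        (i, j), dist = min(
            (
                ((a, b), math.dist(centroid(clusters[a]), centroid(clusters[b])))
                for a, b in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > stop_distance:  # stopping criterion reached
            break
        clusters[i] = clusters[i] + clusters[j]  # merge j into i (i < j)
        del clusters[j]
    return clusters


if __name__ == "__main__":
    toy = [(0.0,), (0.1,), (5.0,), (5.2,), (10.0,)]
    print(bottom_up_cluster(toy, stop_distance=1.0))
```

The number of clusters is not fixed in advance; it falls out of the stopping rule, which is the property the abstract emphasizes.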

    Design transaction monitoring: understanding design reviews for extended knowledge capture

    EThOS - Electronic Theses Online Service, GB, United Kingdom

    From switchboard to meetings: Development of the 2004 icsi-sri-uw meeting recognition system

    We describe the ICSI-SRI-UW team's entry in the Spring 2004 NIST Meeting Recognition Evaluation. The system was derived from SRI's 5xRT Conversational Telephone Speech (CTS) recognizer by adapting CTS acoustic and language models to the Meeting domain, adding noise reduction and delay-sum array processing for far-field recognition, and postprocessing for cross-talk suppression. A modified MAP adaptation procedure was developed to make best use of discriminatively trained (MMIE) prior models. These meeting-specific changes yielded an overall 9% and 22% relative improvement as compared to the original CTS system, and 16% and 29% relative improvement as compared to our 2002 Meeting Evaluation system, for the individual-headset and multiple-distant microphones conditions, respectively.
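As a rough illustration of the delay-sum array processing mentioned above, the sketch below aligns each microphone channel by a known integer sample delay and averages the results. It assumes the delays are already known; the abstract does not say how they are estimated, and `delay_and_sum` is a hypothetical helper, not the evaluated system.

```python
def delay_and_sum(channels, delays):
    """Average channels after advancing each one by its integer sample delay.

    channels: list of equal-length lists of samples, one per microphone.
    delays:   per-channel delay in samples (non-negative ints), assumed known.
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    # Only keep samples that exist in every channel after shifting.
    length = len(channels[0]) - max(delays)
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


if __name__ == "__main__":
    source = [0, 1, 0, -1, 0, 1, 0, -1]
    mic_a = source                  # reference microphone, no delay
    mic_b = [0, 0] + source[:-2]    # same signal arriving 2 samples later
    print(delay_and_sum([mic_a, mic_b], [0, 2]))
```

Once the channels are aligned, the source signal adds coherently while uncorrelated noise tends to average down, which is why this step helps far-field recognition.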