7 research outputs found
Speech Enhancement for Automatic Analysis of Child-Centered Audio Recordings
Analysis of child-centred daylong naturalistic audio recordings has become a de facto research protocol in the scientific study of child language development. Researchers increasingly use these recordings to understand the linguistic environment a child encounters in her routine interactions with the world. The recordings are captured by a microphone that the child wears throughout the day. Being naturalistic, they contain many unwanted sounds from everyday life, which degrade the performance of speech analysis tasks. The purpose of this thesis is to investigate the utility of speech enhancement (SE) algorithms in the automatic analysis of such recordings. To this end, several classical signal processing and modern machine learning-based SE methods were employed 1) as a denoiser for speech corrupted with additive noise sampled from real-life child-centred daylong recordings and 2) as a front-end for the downstream speech processing tasks of addressee classification (infant- vs. adult-directed speech) and automatic syllable count estimation from speech. The downstream tasks were conducted on data derived from a set of geographically, culturally, and linguistically diverse child-centred daylong audio recordings. The performance of denoising was evaluated through objective quality metrics (spectral distortion and instrumental intelligibility) and through downstream task performance. Finally, the objective evaluation results were compared with the downstream task results to determine whether objective metrics can serve as a reasonable proxy for selecting an SE front-end for a downstream task. The results show that a recently proposed Long Short-Term Memory (LSTM)-based progressive learning architecture provides the largest performance gains in the downstream tasks compared with the other SE methods and the baseline results. Classical signal processing-based SE methods also lead to competitive performance.
The comparison of objective assessment and downstream task performance revealed no predictive relationship between task-independent objective metrics and the performance of the downstream tasks.
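The denoising condition described above (clean speech corrupted with additive noise) is commonly built by scaling a noise clip to a chosen signal-to-noise ratio before adding it to the speech. The following is a minimal sketch of that general idea, not the procedure used in the thesis: the function names `mix_at_snr` and `measured_snr_db`, the plain-list signal representation, and the 5 dB target are all assumptions made for illustration.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")
    # The gain g satisfies speech_power / (g**2 * noise_power) == 10 ** (snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixture)]
    speech_energy = sum(s * s for s in speech)
    noise_energy = sum(r * r for r in residual)
    return 10 * math.log10(speech_energy / noise_energy)


# Hypothetical toy signals standing in for a speech clip and a noise clip.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [((i * 37) % 11 - 5) / 5 for i in range(1000)]
mixture = mix_at_snr(speech, noise, 5.0)
```

Measuring the SNR of the resulting mixture against the clean signal is a quick sanity check that the scaling behaves as intended.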
Speaker Diarization
Thesis status: defended.
The thesis focuses on the topic of speaker diarization, a speech processing task that is commonly characterized by the question "Who speaks when?". It also addresses the related task of overlapping speech detection, which is very relevant for diarization.
The theoretical part of the thesis provides an overview of existing diarization approaches, both offline and online, and discusses some of the problematic areas that were identified in the early stages of the author's research. The thesis also includes an extensive comparison of existing diarization systems, with a focus on their reported performance. One chapter is dedicated to the topic of overlapping speech and the methods for detecting it.
The experimental part of the thesis presents the work done on speaker diarization, which focused mostly on a GMM-based online diarization system and an i-vector-based system with both offline and online variants. The final section also details a newly proposed approach for detecting overlapping speech using a convolutional neural network.
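To make the "Who speaks when?" framing concrete, a diarization result can be represented as a list of labelled time segments, and overlapping speech is then any interval in which at least two speakers are active. The sketch below derives those intervals from already labelled segments. It is not the CNN-based detector proposed in the thesis, which works from audio; the function name `overlapped_speech`, the `(start, end, speaker)` tuple layout, and the example times are assumptions for illustration. It also assumes that segments belonging to the same speaker do not overlap each other.

```python
def overlapped_speech(segments):
    """Return merged (start, end) intervals where two or more speakers are active.

    `segments` is a list of (start, end, speaker) tuples with times in seconds.
    Assumes the segments of any single speaker do not overlap one another.
    """
    events = []
    for start, end, _speaker in segments:
        events.append((start, 1))   # a speaker becomes active
        events.append((end, -1))    # a speaker stops
    # At equal times, ends (-1) sort before starts (+1), so segments that
    # merely touch are not reported as overlapping.
    events.sort()

    overlaps = []
    active = 0
    overlap_start = None
    for time, delta in events:
        previous = active
        active += delta
        if previous < 2 <= active:
            overlap_start = time
        elif active < 2 <= previous and time > overlap_start:
            overlaps.append((overlap_start, time))
    return overlaps


# Hypothetical diarization output for two speakers, A and B.
segments = [
    (0.0, 5.0, "A"),
    (4.0, 8.0, "B"),
    (7.5, 9.0, "A"),
    (9.0, 10.0, "B"),
]
overlaps = overlapped_speech(segments)
```

Intervals of this kind are what an overlap detector is expected to flag, which is why the two tasks are so closely related.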