80 research outputs found
Automated speech and audio analysis for semantic access to multimedia
The deployment and integration of audio processing tools can enhance the semantic annotation of multimedia content and, as a consequence, improve the effectiveness of conceptual access tools. This paper gives an overview of the various ways in which automatic speech and audio analysis can contribute to increased granularity of automatically extracted metadata. A number of techniques are presented, including the alignment of speech and text resources, large-vocabulary speech recognition, keyword spotting and speaker classification. The applicability of the techniques is discussed from a media-crossing perspective. The added value of the techniques and their potential contribution to the content value chain are illustrated by the description of two complementary demonstrators for browsing broadcast-news archives.
Processing and Linking Audio Events in Large Multimedia Archives: The EU inEvent Project
In the inEvent EU project [1], we aim at structuring, retrieving, and sharing large archives of networked, and dynamically changing, multimedia recordings, mainly consisting of meetings, videoconferences, and lectures. More specifically, we are developing an integrated system that performs audiovisual processing of multimedia recordings and labels them in terms of interconnected "hyper-events" (a notion inspired by hypertext). Each hyper-event is composed of simpler facets, including audio-video recordings and metadata, which are then easier to search, retrieve and share. In the present paper, we mainly cover the audio processing aspects of the system, including speech recognition, speaker diarization and linking (across recordings), the use of these features for hyper-event indexing and recommendation, and the search portal. We present initial results for feature extraction from lecture recordings using the TED talks. Index Terms: Networked multimedia events; audio processing; speech recognition; speaker diarization and linking; multimedia indexing and searching; hyper-events.
Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled
In this thesis, research on large vocabulary continuous speech recognition for unknown audio conditions is presented. For automatic speech recognition systems based on statistical methods, it is important that the conditions of the audio used for training the statistical models match the conditions of the audio to be processed. Any mismatch will decrease the accuracy of the recognition. If it is unpredictable what kind of data can be expected, or in other words if the conditions of the audio to be processed are unknown, it is impossible to tune the models. If the material consists of `surprise data' the output of the system is likely to be poor. In this thesis methods are presented for which no external training data is required for training models. These novel methods have been implemented in a large vocabulary continuous speech recognition system called SHoUT. This system consists of three subsystems: speech/non-speech classification, speaker diarization and automatic speech recognition. The speech/non-speech classification subsystem separates speech from silence and unknown audible non-speech events. The type of non-speech present in audio recordings can vary from paper shuffling in recordings of meetings to sound effects in television shows. Because it is unknown what type of non-speech needs to be detected, it is not possible to train high quality statistical models for each type of non-speech sound. The speech/non-speech classification subsystem, also called the speech activity detection subsystem, does not attempt to classify all audible non-speech in a single run. Instead, first a bootstrap speech/silence classification is obtained using a standard speech activity component. Next, the models for speech, silence and audible non-speech are trained on the target audio using the bootstrap classification. 
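The bootstrap retraining idea described above can be sketched as follows. This is a simplified illustration, not the SHoUT implementation: GMMs are re-estimated on the target recording itself, starting from a rough bootstrap speech/silence labeling, so no external training data is needed. All model sizes and the helper name are illustrative.

```python
# Sketch: retrain speech/non-speech models on the target audio itself,
# starting from a bootstrap labeling, then re-classify every frame.
import numpy as np
from sklearn.mixture import GaussianMixture

def bootstrap_sad(features, bootstrap_labels, n_iter=3):
    """features: (T, D) frame features; bootstrap_labels: (T,) class ids.

    Each iteration fits one GMM per class on the frames currently assigned
    to that class, then relabels all frames with the refreshed models.
    """
    labels = np.asarray(bootstrap_labels).copy()
    for _ in range(n_iter):
        models = {}
        for cls in np.unique(labels):
            gmm = GaussianMixture(n_components=4, covariance_type="diag",
                                  random_state=0)
            gmm.fit(features[labels == cls])
            models[cls] = gmm
        # Re-classify every frame with models trained on this recording only.
        classes = sorted(models)
        scores = np.stack([models[c].score_samples(features)
                           for c in classes], axis=1)
        labels = np.array(classes)[np.argmax(scores, axis=1)]
    return labels
```

Because the models are trained on the recording being processed, the approach adapts automatically to whatever non-speech sounds happen to be present, at the cost of depending on a reasonable bootstrap labeling.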
This approach makes it possible to classify speech and non-speech with high accuracy, without the need to know what kinds of sound are present in the audio recording. Once all non-speech is filtered out of the audio, it is the task of the speaker diarization subsystem to determine how many speakers occur in the recording and exactly when they are speaking. The speaker diarization subsystem applies agglomerative clustering to create clusters of speech fragments for each speaker in the recording. First, statistical speaker models are created on random chunks of the recording, and by iteratively realigning the data, retraining the models and merging models that represent the same speaker, accurate speaker models are obtained for speaker clustering. This method does not require any statistical models developed on a training set, which makes the diarization subsystem insensitive to variation in audio conditions. Unfortunately, because of the algorithm's computational complexity, this clustering method is slow for long recordings. Two variations of the subsystem are presented that reduce the needed computational effort, so that the subsystem is applicable to long audio recordings as well. The automatic speech recognition subsystem developed for this research is based on Viterbi decoding on a fixed pronunciation prefix tree. Using the fixed tree, a flexible modular decoder could be developed, but it was not straightforward to apply full language model look-ahead efficiently. In this thesis a novel method is discussed that makes it possible to apply language model look-ahead effectively on the fixed tree. Also, to obtain higher speech recognition accuracy on audio with unknown acoustic conditions, a selection of the numerous known methods for robust automatic speech recognition is applied and evaluated in this thesis. The three individual subsystems as well as the entire system have been successfully evaluated on three international benchmarks.
The diarization subsystem has been evaluated at the NIST RT06s benchmark, and the speech activity detection subsystem has been tested at RT07s. The entire system was evaluated at N-Best, the first automatic speech recognition benchmark for Dutch.
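The agglomerative clustering step described above can be sketched as follows. This is a simplified stand-in, not the thesis implementation: single full-covariance Gaussians replace the speaker GMMs, and a standard ΔBIC criterion decides whether two clusters belong to the same speaker; the penalty weight is illustrative.

```python
# Sketch of agglomerative speaker clustering: greedily merge the pair of
# clusters with the lowest delta-BIC until no pair looks like one speaker.
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Negative values suggest x and y come from the same speaker."""
    z = np.vstack([x, y])
    def nlogdet(d):  # 0.5 * n * log|sample covariance|
        cov = np.cov(d, rowvar=False) + 1e-6 * np.eye(d.shape[1])
        return 0.5 * len(d) * np.linalg.slogdet(cov)[1]
    dim = x.shape[1]
    penalty = 0.5 * lam * (dim + 0.5 * dim * (dim + 1)) * np.log(len(z))
    return nlogdet(z) - nlogdet(x) - nlogdet(y) - penalty

def agglomerative_diarization(segments):
    """segments: list of (n_frames, dim) feature arrays, one per segment."""
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        scores = [delta_bic(clusters[i], clusters[j]) for i, j in pairs]
        best = int(np.argmin(scores))
        if scores[best] >= 0:          # no remaining pair shares a speaker
            break
        i, j = pairs[best]
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```

The pairwise scoring over all clusters at every merge is what makes the naive algorithm expensive on long recordings, which motivates the faster variants mentioned in the abstract.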
Development of a Speaker Diarization System for Speaker Tracking in Audio Broadcast News: a Case Study
A system for speaker tracking in broadcast-news audio data is presented, and the impact of the main components of the system on the overall speaker-tracking performance is evaluated. The process of speaker tracking in continuous audio streams involves several processing tasks and is therefore treated as a multistage process. The main building blocks of such a system include components for audio segmentation, speech detection, speaker clustering and speaker identification. The aim of the first three processes is to find homogeneous regions in continuous audio streams that belong to one speaker and to join the regions of the same speaker together. The task of organizing the audio data in this way is known as speaker diarization and plays an important role in various speech-processing applications. In our case the impact of speaker diarization was assessed in a speaker-tracking system by performing a comparative study of how each of the components influenced the overall speaker-detection results. The evaluation experiments were performed on broadcast-news audio data with a speaker-tracking system capable of detecting 41 target speakers. We implemented several different approaches in each component of the system and compared their performance by inspecting the final speaker-tracking results. The evaluation results indicate the importance of the audio-segmentation and speech-detection components, while no significant improvement of the overall results was achieved by additionally including a speaker-clustering component in the speaker-tracking system.
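The multistage structure described above can be captured as a pipeline of pluggable stages, which is what makes the component-wise comparison possible: each stage is swapped while the rest stays fixed. The skeleton below is a hypothetical illustration of that design, not the paper's system; all names and signatures are assumptions.

```python
# Skeleton of a multistage speaker-tracking pipeline: segmentation ->
# speech detection -> clustering -> identification. Each stage is a
# callable so alternative implementations can be compared in isolation.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds

@dataclass
class TrackingPipeline:
    segment: Callable[[object], List[Segment]]            # homogeneous regions
    detect_speech: Callable[[object, Segment], bool]      # drop non-speech
    cluster: Callable[[List[Segment]], List[int]]         # per-segment cluster id
    identify: Callable[[object, List[Segment], List[int]], List[str]]

    def run(self, audio):
        segs = [s for s in self.segment(audio) if self.detect_speech(audio, s)]
        cluster_ids = self.cluster(segs)
        names = self.identify(audio, segs, cluster_ids)
        return list(zip(segs, names))   # (segment, target-speaker label) pairs
```

Swapping, say, two different `detect_speech` implementations while holding the other stages fixed is exactly the kind of comparative study the abstract reports.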
Adaptation of speech recognition systems to selected real-world deployment conditions
This habilitation thesis deals with adaptation of automatic speech
recognition (ASR) systems to selected real-world deployment conditions.
It is presented as a collection of twelve articles dealing with this task, of which I am the main author or a co-author. They were written in the course of several consecutive research projects, in which I participated both as a member of the research team and as the principal investigator or a co-investigator.
These articles can be divided into three main groups according to
their topics. They have in common the effort to adapt a particular
ASR system to a specific factor or deployment condition that affects
its function or accuracy.
The first group of articles is focused on an unsupervised speaker
adaptation task, where the ASR system adapts its parameters to
the specific voice characteristics of one particular speaker. The second
part deals with a) methods allowing the system to identify
non-speech events on the input, and b) the related task of recognition
of speech with non-speech events, particularly music, in the
background. Finally, the third part is devoted to the methods
that allow the transcription of an audio signal containing multilingual
utterances. It includes a) approaches for adapting the existing
recognition system to a new language and b) methods for identification
of the language from the audio signal.
Both of the above identification tasks are investigated primarily in the demanding and less explored frame-wise scenario, which is the only one suitable for on-line processing, e.g. of streamed data.
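The frame-wise scenario mentioned above means a decision must be available at every incoming frame, without waiting for the end of the utterance. A minimal sketch of such an online loop, assuming some per-frame posterior estimator and using simple exponential smoothing to stabilize the decision (both the class names and the smoothing scheme are illustrative):

```python
# Illustrative frame-wise (online) identification loop: each incoming frame
# is classified immediately; posteriors are exponentially smoothed so the
# running decision does not flip on every noisy frame.
import numpy as np

class FrameWiseIdentifier:
    def __init__(self, frame_posterior, n_classes, alpha=0.9):
        self.frame_posterior = frame_posterior    # frame -> posterior vector
        self.state = np.full(n_classes, 1.0 / n_classes)
        self.alpha = alpha                        # smoothing factor in [0, 1)

    def push(self, frame):
        p = self.frame_posterior(frame)
        self.state = self.alpha * self.state + (1 - self.alpha) * p
        return int(np.argmax(self.state))         # current best hypothesis
```

The trade-off controlled by `alpha` is latency versus stability: a value near 1 smooths heavily but reacts slowly to a genuine change of language or acoustic condition mid-stream.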
FEARLESS STEPS Challenge (FS-2): Supervised Learning with Massive Naturalistic Apollo Data
The Fearless Steps Initiative by UTDallas-CRSS led to the digitization,
recovery, and diarization of 19,000 hours of original analog audio data, as
well as the development of algorithms to extract meaningful information from
this multi-channel naturalistic data resource. The 2020 FEARLESS STEPS (FS-2)
Challenge is the second annual challenge held for the Speech and Language
Technology community to motivate supervised learning algorithm development for
multi-party and multi-stream naturalistic audio. In this paper, we present an
overview of the challenge sub-tasks, data, performance metrics, and lessons
learned from Phase-2 of the Fearless Steps Challenge (FS-2). We present
advancements made in FS-2 through extensive community outreach and feedback. We
describe innovations in the challenge corpus development, and present revised
baseline results. We finally discuss the challenge outcome and general trends
in system development across both phases (Phase FS-1 Unsupervised, and Phase
FS-2 Supervised) of the challenge, and its continuation into multi-channel
challenge tasks for the upcoming Fearless Steps Challenge Phase-3.
Comment: Paper accepted at the Interspeech 2020 conference
Speaker Diarization with Lexical Information
This work presents a novel approach for speaker diarization to leverage
lexical information provided by automatic speech recognition. We propose a
speaker diarization system that can incorporate word-level speaker turn
probabilities with speaker embeddings into a speaker clustering process to
improve the overall diarization accuracy. To integrate lexical and acoustic
information in a comprehensive way during clustering, we introduce an adjacency
matrix integration for spectral clustering. Since words and word boundary
information for word-level speaker turn probability estimation are provided by
a speech recognition system, our proposed method works without any human
intervention for manual transcriptions. We show that the proposed method
improves diarization performance on various evaluation datasets compared to the
baseline diarization system using acoustic information only in speaker
embeddings.
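The adjacency-matrix integration idea can be sketched as follows: an acoustic affinity matrix over segments is blended with a lexical one derived from word-level speaker-turn probabilities, and the combined matrix drives spectral clustering. The blending weight and matrix construction here are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch: blend acoustic and lexical affinity matrices, then run spectral
# clustering on the combined adjacency matrix.
import numpy as np
from sklearn.cluster import SpectralClustering

def integrate_and_cluster(acoustic_aff, lexical_aff, n_speakers, weight=0.5):
    """Both affinity matrices are (N, N), symmetric, with values in [0, 1]."""
    combined = (1 - weight) * acoustic_aff + weight * lexical_aff
    combined = 0.5 * (combined + combined.T)   # keep symmetry after blending
    sc = SpectralClustering(n_clusters=n_speakers, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(combined)            # per-segment speaker labels
```

With `weight=0`, this reduces to the acoustic-only baseline; raising it lets word-boundary evidence (e.g. a likely speaker turn between two segments) lower the affinity across that boundary.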